cuBLASMp C API#
Library Management#
cublasMpCreate#
cublasMpStatus_t cublasMpCreate(
cublasMpHandle_t *handle,
cudaStream_t stream);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | Out | cuBLASMp library handle. |
| stream | Host | In | Stream that will be assigned to the handle. |
cublasMpDestroy#
cublasMpStatus_t cublasMpDestroy(
cublasMpHandle_t handle);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In/Out | cuBLASMp library handle to destroy. |
cublasMpSetStream#
cublasMpStatus_t cublasMpSetStream(
cublasMpHandle_t handle,
cudaStream_t stream);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| stream | Host | In | CUDA stream to assign to the handle. The stream must belong to the same CUDA context as the current handle stream. |
Note
In cuBLASMp v0.7.0 and below, this API was named cublasMpStreamSet.
cublasMpGetStream#
cublasMpStatus_t cublasMpGetStream(
cublasMpHandle_t handle,
cudaStream_t* stream);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| stream | Host | Out | Pointer to receive the current CUDA stream. |
Note
In cuBLASMp v0.7.0 and below, this API was named cublasMpStreamGet.
cublasMpSetEmulationStrategy#
cublasMpStatus_t cublasMpSetEmulationStrategy(
cublasMpHandle_t handle,
cublasMpEmulationStrategy_t emulationStrategy);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| emulationStrategy | Host | In | Emulation strategy to use. See cublasMpEmulationStrategy_t. |
cublasMpGetEmulationStrategy#
cublasMpStatus_t cublasMpGetEmulationStrategy(
cublasMpHandle_t handle,
cublasMpEmulationStrategy_t* emulationStrategy);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| emulationStrategy | Host | Out | Pointer to receive the current emulation strategy. See cublasMpEmulationStrategy_t. |
Note
The fixed-point emulation control APIs below only affect operations that use CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT.
The entry points that use cudaEmulationMantissaControl_t or cudaEmulationSpecialValuesSupport_t are declared only when compiling against CUDA Toolkit 13.0.2 or later because those CUDA enum types are introduced there.
The remaining fixed-point emulation control APIs are always declared, but older toolkit builds return CUBLASMP_STATUS_NOT_SUPPORTED.
cublasMpSetFixedPointEmulationMantissaControl#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpSetFixedPointEmulationMantissaControl(
cublasMpHandle_t handle,
cudaEmulationMantissaControl_t mantissaControl);
Sets the mantissa control mode used for FP64 fixed-point emulation:
- With CUDA_EMULATION_MANTISSA_CONTROL_DYNAMIC, the library computes the required bit count at runtime to maintain accuracy equal to or better than native FP64.
- With CUDA_EMULATION_MANTISSA_CONTROL_FIXED, a user-provided maximum bit count is used (see cublasMpSetFixedPointEmulationMaxMantissaBitCount).
Note
FP64 fixed-point emulation is activated by passing CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT as the compute
type to cublasMpMatmul, cublasMpGemm,
cublasMpSyrk, cublasMpSyr2k,
cublasMpSyrkx, cublasMpSymm,
cublasMpHerk, cublasMpHer2k,
cublasMpHerkx, cublasMpHemm,
cublasMpTrmm, or cublasMpTrsm. The handle-level
settings (mantissa control, bit count, bit offset, special values support) configure how emulation works
but are only applied when the emulated compute type is used. The emulation strategy
(cublasMpSetEmulationStrategy) controls when emulation is
used: PERFORMANT (default) may skip emulation on GPUs with fast native FP64, while EAGER forces
emulation whenever possible. Requires CUDA Toolkit 13.0.2 or later.
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| mantissaControl | Host | In | Mantissa control mode (CUDA_EMULATION_MANTISSA_CONTROL_DYNAMIC or CUDA_EMULATION_MANTISSA_CONTROL_FIXED). |
cublasMpGetFixedPointEmulationMantissaControl#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpGetFixedPointEmulationMantissaControl(
cublasMpHandle_t handle,
cudaEmulationMantissaControl_t* mantissaControl);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| mantissaControl | Host | Out | Pointer to receive the current mantissa control mode. |
cublasMpSetFixedPointEmulationMaxMantissaBitCount#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpSetFixedPointEmulationMaxMantissaBitCount(
cublasMpHandle_t handle,
int bitCount);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| bitCount | Host | In | Maximum number of mantissa bits. |
cublasMpGetFixedPointEmulationMaxMantissaBitCount#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpGetFixedPointEmulationMaxMantissaBitCount(
cublasMpHandle_t handle,
int* bitCount);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| bitCount | Host | Out | Pointer to receive the current maximum mantissa bit count. |
cublasMpSetFixedPointEmulationMantissaBitOffset#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpSetFixedPointEmulationMantissaBitOffset(
cublasMpHandle_t handle,
int bitOffset);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| bitOffset | Host | In | Mantissa bit offset to apply (default: 0). |
cublasMpGetFixedPointEmulationMantissaBitOffset#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpGetFixedPointEmulationMantissaBitOffset(
cublasMpHandle_t handle,
int* bitOffset);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| bitOffset | Host | Out | Pointer to receive the current mantissa bit offset. |
cublasMpSetEmulationSpecialValuesSupport#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpSetEmulationSpecialValuesSupport(
cublasMpHandle_t handle,
cudaEmulationSpecialValuesSupport_t mask);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| mask | Host | In | Bitmask of cudaEmulationSpecialValuesSupport_t values. |
cublasMpGetEmulationSpecialValuesSupport#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpGetEmulationSpecialValuesSupport(
cublasMpHandle_t handle,
cudaEmulationSpecialValuesSupport_t* mask);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| mask | Host | Out | Pointer to receive the current special values support bitmask. |
cublasMpSetFixedPointEmulationMantissaBitCountPointer#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpSetFixedPointEmulationMantissaBitCountPointer(
cublasMpHandle_t handle,
int* ptr);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| ptr | Device | In | Device pointer to an int. |
cublasMpGetFixedPointEmulationMantissaBitCountPointer#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpGetFixedPointEmulationMantissaBitCountPointer(
cublasMpHandle_t handle,
int** ptr);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| ptr | Host | Out | Pointer to receive the current device pointer ( |
cublasMpGetVersion#
cublasMpStatus_t cublasMpGetVersion(
int *version);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| version | Host | Out | cuBLASMp library version. Value is |
cublasMpGetStatusString#
Added in cuBLASMp 0.8.0
const char* cublasMpGetStatusString(
cublasMpStatus_t status);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| status | Host | In | The cublasMpStatus_t status code to convert to a string. |
Note
The returned string is a static string and remains valid for the lifetime of the application.
For unknown status codes, the function returns a string indicating an unknown status.
Grid Management#
cublasMpGridCreate#
cublasMpStatus_t cublasMpGridCreate(
int64_t nprow,
int64_t npcol,
cublasMpGridLayout_t layout,
ncclComm_t comm,
cublasMpGrid_t* grid);
comm must contain exactly nprow * npcol participating ranks.
All participating ranks must pass the same nprow, npcol, and layout values.
Note
Releases prior to cuBLASMp 0.8.0 used NVSHMEM during grid setup. Current releases use NCCL symmetric memory instead; refer to the release notes only if you are maintaining an older branch.
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| nprow | Host | In | Number of row processes in the grid. |
| npcol | Host | In | Number of column processes in the grid. |
| layout | Host | In | Grid’s layout (cublasMpGridLayout_t). |
| comm | Host | In | NCCL communicator associated with the grid. Its size must equal nprow * npcol. |
| grid | Host | In/Out | Pointer to a grid object. |
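As an illustration of how nprow, npcol, and layout relate a rank in comm to a position in the grid, the sketch below maps a rank to zero-based (row, column) coordinates. It assumes the conventional mapping in which a column-major layout numbers ranks down each column and a row-major layout numbers them along each row. The GridLayout enum, GridCoord struct, and rank_to_coord function are hypothetical names for this example and are not part of the cuBLASMp API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two grid layouts (not the cuBLASMp enum).
enum class GridLayout { ColMajor, RowMajor };

// Zero-based position of a process in the nprow x npcol grid.
struct GridCoord {
    int64_t row;
    int64_t col;
};

// Map a rank in [0, nprow * npcol) to grid coordinates.
// Assumption: column-major numbers ranks down each column first,
// row-major numbers them along each row first.
inline GridCoord rank_to_coord(int64_t rank, int64_t nprow, int64_t npcol,
                               GridLayout layout) {
    assert(nprow > 0 && npcol > 0);
    assert(rank >= 0 && rank < nprow * npcol);
    if (layout == GridLayout::ColMajor) {
        return {rank % nprow, rank / nprow};
    }
    return {rank / npcol, rank % npcol};
}
```

Under these assumptions, rank 4 in a 2 x 3 grid sits at (0, 2) for the column-major case and at (1, 1) for the row-major case.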
cublasMpGridDestroy#
cublasMpStatus_t cublasMpGridDestroy(
cublasMpGrid_t grid);
Destroys the grid object.
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| grid | Host | In/Out | Grid object to destroy. |
Memory Management#
cuBLASMp uses NCCL Symmetric Memory to enable the high-performance AllGather+GEMM, GEMM+ReduceScatter, and GEMM+AllReduce
algorithms as part of cublasMpMatmul. When the device workspace (d_work) is
allocated with symmetric memory and registered with the output process grid, the library can use
these specialized algorithms for better communication-computation overlap. Without a registered
symmetric workspace, cublasMpMatmul falls back to the
no_overlap algorithms or returns CUBLASMP_STATUS_NOT_SUPPORTED. Other operations (like cublasMpGemm,
cublasMpTrsm, cublasMpSyrk, etc.) do not
use symmetric memory and their workspaces can be allocated with standard cudaMalloc, or any other CUDA memory allocator.
Use cublasMpMalloc (recommended) to allocate and register the
workspace in one call, or allocate with ncclMemAlloc / CUDA VMM APIs and register using
the cublasMpBufferRegister API.
// Allocate workspace with symmetric memory to enable AG+GEMM, GEMM+RS, GEMM+AR
size_t workspaceInBytesOnDevice = 0, workspaceInBytesOnHost = 0;
cublasMpMatmul_bufferSize(handle, matmulDesc, m, n, k, &alpha,
d_A, 1, 1, descA, d_B, 1, 1, descB, &beta,
d_C, 1, 1, descC, d_D, 1, 1, descD,
&workspaceInBytesOnDevice, &workspaceInBytesOnHost);
void* d_work = nullptr;
cublasMpMalloc(grid, &d_work, workspaceInBytesOnDevice); // collective
// ... call cublasMpMatmul ...
cublasMpFree(grid, d_work); // collective
Input and output matrices (A, B, C, D and their associated scale tensors)
can be allocated with any standard CUDA allocation function (cudaMalloc,
cudaMallocAsync, CUDA VMM, etc.). Registration is not required.
As an optional performance optimization for the AllGather+GEMM algorithm: allocating
matrix B with symmetric memory (cublasMpMalloc or ncclMemAlloc) allows the library
to skip copying B into the AllGather workspace.
Note
All registration and allocation functions in this section are collective: every process in the grid must call them together.
Symmetric memory requires NVLink/NVSwitch interconnects and NCCL 2.29.2+. On systems where NCCL does not support it,
cublasMpMalloc and cublasMpBufferRegister return CUBLASMP_STATUS_NOT_SUPPORTED, and cublasMpMatmul automatically falls back to the no_overlap algorithm when using the DEFAULT algorithm type, or returns CUBLASMP_STATUS_NOT_SUPPORTED for explicit pipelined algorithm requests.
cublasMpBufferRegister#
Added in cuBLASMp 0.8.0
cublasMpStatus_t cublasMpBufferRegister(
cublasMpGrid_t grid,
void* ptr,
size_t size);
Registers a device buffer with the process grid. Registering the workspace (d_work) with the process grid enables cublasMpMatmul to use the AllGather+GEMM, GEMM+ReduceScatter, and GEMM+AllReduce algorithms. Without registration, the library falls back to the no_overlap algorithms or returns CUBLASMP_STATUS_NOT_SUPPORTED. Input/output matrix buffers do not need to be registered; registration is optional and only improves performance for the B matrix in the AllGather+GEMM algorithm.
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| grid | Host | In | Process grid to register the buffer with. |
| ptr | Device | In | Pointer to device memory buffer to register. |
| size | Host | In | Size of the buffer in bytes. |
Note
- The buffer must be allocated using one of the following compatible methods:
  - ncclMemAlloc: NCCL symmetric memory allocation
  - CUDA Virtual Memory Management (VMM) APIs (cuMemCreate, cuMemMap, etc.)
  - Other CUDA memory allocation functions that produce VMM-compatible allocations
- The buffer must be registered on all processes in the grid.
- If NCCL symmetric memory is not supported on the system, this function returns CUBLASMP_STATUS_NOT_SUPPORTED.
- Alternatively, use the cublasMpMalloc convenience function that combines allocation and registration.
cublasMpBufferDeregister#
Added in cuBLASMp 0.8.0
cublasMpStatus_t cublasMpBufferDeregister(
cublasMpGrid_t grid,
void* ptr);
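Deregisters a buffer from the process grid. Call it before freeing the underlying memory, for example with ncclMemFree.
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| grid | Host | In | Process grid the buffer was registered with. |
| ptr | Device | In | Pointer to device memory buffer to deregister. |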
Note
The buffer must have been previously registered with cublasMpBufferRegister.
Deregister on all processes in the grid before freeing the memory.
For buffers allocated with cublasMpMalloc, use cublasMpFree instead, which combines deregistration and freeing.
cublasMpMalloc#
Added in cuBLASMp 0.8.0
cublasMpStatus_t cublasMpMalloc(
cublasMpGrid_t grid,
void** ptr,
size_t size);
This is the recommended way to allocate the device workspace (d_work) for cublasMpMatmul. It allocates device memory using NCCL symmetric memory and automatically registers it with the specified process grid, combining ncclMemAlloc and cublasMpBufferRegister into a single call. When the workspace is registered, the library may use the AllGather+GEMM, GEMM+ReduceScatter, and GEMM+AllReduce algorithms; without registration, it falls back to no_overlap or returns CUBLASMP_STATUS_NOT_SUPPORTED.
Alternatively, allocate the buffer with ncclMemAlloc, CUDA Virtual Memory Management (VMM) APIs (cuMemCreate, cuMemMap, etc.), or other VMM-compatible CUDA memory allocation functions, followed by manual registration with cublasMpBufferRegister.
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| grid | Host | In | Process grid to allocate and register the buffer with. |
| ptr | Host | Out | Pointer to receive the allocated device memory pointer. |
| size | Host | In | Size of the buffer to allocate in bytes. |
Note
- Enables split algorithms: Allocating the workspace with this function allows cublasMpMatmul to use the AllGather+GEMM, GEMM+ReduceScatter, and GEMM+AllReduce algorithms. Without a registered workspace, the library falls back to no_overlap. For cublasMpGemm, cublasMpTrsm, and cublasMpSyrk, standard cudaMalloc is always sufficient.
- Alternative approaches: Users can instead use:
  - ncclMemAlloc + cublasMpBufferRegister
  - CUDA Virtual Memory Management (VMM) APIs + cublasMpBufferRegister
- Use cublasMpFree to free buffers allocated with this function.
- If NCCL symmetric memory is not supported, this function returns CUBLASMP_STATUS_NOT_SUPPORTED.
cublasMpFree#
Added in cuBLASMp 0.8.0
cublasMpStatus_t cublasMpFree(
cublasMpGrid_t grid,
void* ptr);
Deregisters and frees a buffer allocated with cublasMpMalloc, combining cublasMpBufferDeregister and ncclMemFree into a single call.
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| grid | Host | In | Process grid the buffer was allocated with. |
| ptr | Device | In | Pointer to device memory buffer to free. |
Note
This function should only be used for buffers allocated with cublasMpMalloc.
For buffers allocated manually with
ncclMemAlloc or CUDA VMM, use cublasMpBufferDeregister followed by the appropriate free function.
Matrix Management#
cublasMpMatrixDescriptorCreate#
cublasMpStatus_t cublasMpMatrixDescriptorCreate(
int64_t m,
int64_t n,
int64_t mb,
int64_t nb,
int64_t rsrc,
int64_t csrc,
int64_t lld,
cudaDataType_t type,
cublasMpGrid_t grid,
cublasMpMatrixDescriptor_t* desc);
- grid must be a valid grid created with cublasMpGridCreate and must outlive the descriptor.
- Use grid == NULL on ranks that do not participate in the source or destination matrix grid.
- mb, nb, and lld are expressed in elements and must be positive.
- When grid is not NULL, rsrc must be in [0, nprow) and csrc in [0, npcol) for the associated grid.
- lld must describe the leading dimension of the local allocation on the calling rank. For a full local matrix without extra padding, a common minimal choice is max(1, cublasMpNumroc(m, mb, myprow, rsrc, nprow)).

| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| m | Host | In | Number of rows in the global matrix. |
| n | Host | In | Number of columns in the global matrix. |
| mb | Host | In | Positive row blocking factor used to distribute the rows of the global matrix. |
| nb | Host | In | Positive column blocking factor used to distribute the columns of the global matrix. |
| rsrc | Host | In | Row rank of the process that owns the first row block of the global matrix. When grid is not NULL, it must be in [0, nprow). |
| csrc | Host | In | Column rank of the process that owns the first column block of the global matrix. When grid is not NULL, it must be in [0, npcol). |
| lld | Host | In | Positive leading dimension of the local matrix. Typically at least max(1, cublasMpNumroc(m, mb, myprow, rsrc, nprow)). |
| type | Host | In | Data type of the matrix. |
| grid | Host | In | Grid object associated with the matrix descriptor, or NULL on redistribution-only ranks that do not participate in this matrix’s grid. |
| desc | Host | Out | Matrix descriptor object initialized by this function. |
The supported values of the type argument are listed below:

| Data Type | Description |
|---|---|
| CUDA_R_8I | 8-bit real signed integer. |
| CUDA_R_32I | 32-bit real signed integer. |
| CUDA_R_4F_E2M1 | 4-bit real floating point in E2M1 format. |
| CUDA_R_8F_E4M3 | 8-bit real floating point in E4M3 format. |
| CUDA_R_8F_E5M2 | 8-bit real floating point in E5M2 format. |
| CUDA_R_16F | 16-bit real half precision floating-point. |
| CUDA_R_16BF | 16-bit real bfloat16 floating-point. |
| CUDA_R_32F | 32-bit real single precision floating-point. |
| CUDA_R_64F | 64-bit real double precision floating-point. |
| CUDA_C_32F | 64-bit structure comprised of two single precision floating-points representing a complex number. |
| CUDA_C_64F | 128-bit structure comprised of two double precision floating-points representing a complex number. |
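The lld guidance for this function can be sketched as follows. The helper takes the local row and column counts (for example, as returned by cublasMpNumroc for the row and column dimensions) and derives a minimal lld together with the byte size of the local allocation. It assumes column-major local storage without padding, as in ScaLAPACK-style block-cyclic layouts, and whole-byte element types. LocalExtent and local_extent are hypothetical names, not library API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: minimal leading dimension and allocation size.
struct LocalExtent {
    int64_t lld;        // leading dimension, in elements
    std::size_t bytes;  // bytes needed for the local matrix
};

// localRows / localCols: rows and columns owned by the calling rank.
// elemSize: bytes per element (e.g. 4 for CUDA_R_32F, 8 for CUDA_R_64F).
// Assumption: the local matrix is stored column-major with no extra padding.
inline LocalExtent local_extent(int64_t localRows, int64_t localCols,
                                std::size_t elemSize) {
    assert(localRows >= 0 && localCols >= 0 && elemSize > 0);
    // lld must stay positive even when the rank owns no rows.
    const int64_t lld = std::max<int64_t>(1, localRows);
    const std::size_t elems =
        static_cast<std::size_t>(lld) * static_cast<std::size_t>(localCols);
    return {lld, elems * elemSize};
}
```

Clamping lld to at least 1 keeps the descriptor valid on ranks that own no rows of the matrix.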
cublasMpMatrixDescriptorDestroy#
cublasMpStatus_t cublasMpMatrixDescriptorDestroy(
cublasMpMatrixDescriptor_t desc);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| desc | Host | In/Out | Matrix descriptor object to destroy. |
cublasMpMatrixDescriptorInit#
cublasMpStatus_t cublasMpMatrixDescriptorInit(
int64_t m,
int64_t n,
int64_t mb,
int64_t nb,
int64_t rsrc,
int64_t csrc,
int64_t lld,
cudaDataType_t type,
cublasMpGrid_t grid,
cublasMpMatrixDescriptor_t desc);
- mb/nb/lld must be positive and, when grid is not NULL, rsrc/csrc must be in range for that grid.
- Use grid == NULL on ranks that do not participate in the source or destination matrix grid.

| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| m | Host | In | Number of rows in the global matrix. |
| n | Host | In | Number of columns in the global matrix. |
| mb | Host | In | Positive row blocking factor used to distribute the rows of the global matrix. |
| nb | Host | In | Positive column blocking factor used to distribute the columns of the global matrix. |
| rsrc | Host | In | Row rank of the process that owns the first row block of the global matrix. When grid is not NULL, it must be in range for that grid. |
| csrc | Host | In | Column rank of the process that owns the first column block of the global matrix. When grid is not NULL, it must be in range for that grid. |
| lld | Host | In | Positive leading dimension of the local matrix. |
| type | Host | In | Data type of the matrix. |
| grid | Host | In | Grid object associated with the matrix descriptor, or NULL on redistribution-only ranks that do not participate in this matrix’s grid. |
| desc | Host | In/Out | Matrix descriptor object initialized by this function. |
The supported values of the type argument are listed below:

| Data Type | Description |
|---|---|
| CUDA_R_8I | 8-bit real signed integer. |
| CUDA_R_32I | 32-bit real signed integer. |
| CUDA_R_4F_E2M1 | 4-bit real floating point in E2M1 format. |
| CUDA_R_8F_E4M3 | 8-bit real floating point in E4M3 format. |
| CUDA_R_8F_E5M2 | 8-bit real floating point in E5M2 format. |
| CUDA_R_16F | 16-bit real half precision floating-point. |
| CUDA_R_16BF | 16-bit real bfloat16 floating-point. |
| CUDA_R_32F | 32-bit real single precision floating-point. |
| CUDA_R_64F | 64-bit real double precision floating-point. |
| CUDA_C_32F | 64-bit structure comprised of two single precision floating-points representing a complex number. |
| CUDA_C_64F | 128-bit structure comprised of two double precision floating-points representing a complex number. |
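To make the roles of mb and rsrc concrete, the sketch below maps a zero-based global row index to the process row that owns it and to the row's local index. It assumes the standard 2D block-cyclic distribution used by ScaLAPACK-style libraries; the same formula applies to columns with nb, csrc, and npcol. RowOwner and locate_row are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: owner process row and local row index.
struct RowOwner {
    int64_t prow;      // process row that owns the global row
    int64_t localRow;  // zero-based row index inside that process
};

// Locate zero-based global row i under a block-cyclic distribution with
// row blocking factor mb, starting process row rsrc, and nprow process rows.
inline RowOwner locate_row(int64_t i, int64_t mb, int64_t rsrc, int64_t nprow) {
    assert(i >= 0 && mb > 0 && nprow > 0);
    assert(rsrc >= 0 && rsrc < nprow);
    const int64_t block = i / mb;                 // global block index
    const int64_t prow = (block + rsrc) % nprow;  // blocks are dealt round-robin from rsrc
    const int64_t localBlock = block / nprow;     // full rounds completed before this block
    return {prow, localBlock * mb + i % mb};
}
```

With mb = 2, rsrc = 1, and nprow = 3, global row 6 falls in block 3, which wraps around to process row 1 and becomes local row 2 there.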
Matmul Properties#
cublasMpMatmulDescriptorCreate#
cublasMpStatus_t cublasMpMatmulDescriptorCreate(
cublasMpMatmulDescriptor_t* matmulDesc,
cublasComputeType_t computeType);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| matmulDesc | Host | In/Out | Pointer to a cublasMpMatmulDescriptor object to initialize. |
| computeType | Host | In | cuBLAS compute type used for computations. See table below for supported combinations. |
The supported values of the computeType argument are listed below:

| Compute Types |
|---|
| CUBLAS_COMPUTE_32I |
| CUBLAS_COMPUTE_32I_PEDANTIC |
| CUBLAS_COMPUTE_16F |
| CUBLAS_COMPUTE_16F_PEDANTIC |
| CUBLAS_COMPUTE_32F |
| CUBLAS_COMPUTE_32F_PEDANTIC |
| CUBLAS_COMPUTE_32F_FAST_16F |
| CUBLAS_COMPUTE_32F_FAST_16BF |
| CUBLAS_COMPUTE_32F_FAST_TF32 |
| CUBLAS_COMPUTE_32F_EMULATED_16BFX9 |
| CUBLAS_COMPUTE_64F |
| CUBLAS_COMPUTE_64F_PEDANTIC |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (requires CUDA Toolkit 13.0.2+) |
cublasMpMatmulDescriptorDestroy#
cublasMpStatus_t cublasMpMatmulDescriptorDestroy(
cublasMpMatmulDescriptor_t matmulDesc);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| matmulDesc | Host | In/Out | Matmul descriptor object to destroy. |
cublasMpMatmulDescriptorInit#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpMatmulDescriptorInit(
cublasMpMatmulDescriptor_t matmulDesc,
cublasComputeType_t computeType);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| matmulDesc | Host | In/Out | Matmul descriptor object to reinitialize. |
| computeType | Host | In | cuBLAS compute type used for computations. |
cublasMpMatmulDescriptorSetAttribute#
cublasMpStatus_t cublasMpMatmulDescriptorSetAttribute(
cublasMpMatmulDescriptor_t matmulDesc,
cublasMpMatmulDescriptorAttribute_t attr,
const void* buf,
size_t sizeInBytes);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| matmulDesc | Host | In | Matmul descriptor object whose attribute is set. |
| attr | Host | In | Matmul descriptor attribute to set. |
| buf | Host | In | Attribute value to set. |
| sizeInBytes | Host | In | Attribute buffer size in bytes. |
Note
In cuBLASMp v0.7.0 and below, this API was named cublasMpMatmulDescriptorAttributeSet.
cublasMpMatmulDescriptorGetAttribute#
cublasMpStatus_t cublasMpMatmulDescriptorGetAttribute(
cublasMpMatmulDescriptor_t matmulDesc,
cublasMpMatmulDescriptorAttribute_t attr,
void* buf,
size_t sizeInBytes,
size_t* sizeWritten);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| matmulDesc | Host | In | Matmul descriptor object to query. |
| attr | Host | In | Matmul descriptor attribute to query. |
| buf | Host | Out | Buffer to receive the attribute value. |
| sizeInBytes | Host | In | Attribute buffer size in bytes. |
| sizeWritten | Host | Out | Size of the attribute written into buf. |
Note
In cuBLASMp v0.7.0 and below, this API was named cublasMpMatmulDescriptorAttributeGet.
Utility#
cublasMpNumroc#
int64_t cublasMpNumroc(
int64_t n,
int64_t nb,
uint32_t iproc,
uint32_t isrcproc,
uint32_t nprocs);
Computes the number of rows or columns of a distributed matrix owned by the process indicated by the iproc argument.
- iproc and isrcproc are zero-based process coordinates and must be in the range [0, nprocs).
- nb and nprocs must be positive.

| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| n | Host | In | Number of rows or columns in the global distributed matrix. |
| nb | Host | In | Row or column blocking size of the global matrix. |
| iproc | Host | In | The coordinate of the process whose local array row or column is to be determined. |
| isrcproc | Host | In | The coordinate of the process that owns the first row or column of the distributed matrix. |
| nprocs | Host | In | The total number of row or column processes over which the matrix is distributed. |
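For intuition, the following host-side sketch reproduces the classic ScaLAPACK NUMROC computation with the same parameter list. It is an illustration under the assumption that cublasMpNumroc follows the ScaLAPACK definition; numroc_sketch is a hypothetical name and is not the library's implementation.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the ScaLAPACK NUMROC formula: number of rows or columns of an
// n-long dimension, split into blocks of nb, that land on process iproc when
// blocks are dealt round-robin starting at process isrcproc.
inline int64_t numroc_sketch(int64_t n, int64_t nb, uint32_t iproc,
                             uint32_t isrcproc, uint32_t nprocs) {
    assert(n >= 0 && nb > 0 && nprocs > 0);
    assert(iproc < nprocs && isrcproc < nprocs);
    const int64_t p = nprocs, me = iproc, src = isrcproc;
    const int64_t dist = (p + me - src) % p;  // distance from the source process
    const int64_t nblocks = n / nb;           // number of full blocks
    int64_t count = (nblocks / p) * nb;       // full rounds: every process gets these
    const int64_t extra = nblocks % p;        // leftover full blocks
    if (dist < extra) {
        count += nb;                          // this process gets one more full block
    } else if (dist == extra) {
        count += n % nb;                      // this process gets the trailing partial block
    }
    return count;
}
```

For n = 10 and nb = 3 on two processes with isrcproc = 0, the sketch assigns 6 entries to process 0 and 4 to process 1.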
cublasMpGemr2D#
cublasMpStatus_t cublasMpGemr2D(
cublasMpHandle_t handle,
int64_t m,
int64_t n,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
void* d_work,
size_t workspaceSizeInBytesOnDevice,
void* h_work,
size_t workspaceSizeInBytesOnHost,
ncclComm_t global_comm);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| m | Host | In | Number of rows of sub(A) and sub(B). |
| n | Host | In | Number of columns of sub(A) and sub(B). |
| a | Device | In | Pointer to the first entry of the local portion of the global matrix A. |
| ia | Host | In | Row index of the first row of sub(A). |
| ja | Host | In | Column index of the first column of sub(A). |
| descA | Host | In | Matrix descriptor associated with the global matrix A. descA’s grid value must be set to null in processes that are not part of the grid of A. |
| b | Device | Out | Pointer to the first entry of the local portion of the global matrix B. |
| ib | Host | In | Row index of the first row of sub(B). |
| jb | Host | In | Column index of the first column of sub(B). |
| descB | Host | In | Matrix descriptor associated with the global matrix B. descB’s grid value must be set to null in processes that are not part of the grid of B. |
| d_work | Device | Out | Device workspace of size workspaceSizeInBytesOnDevice. |
| workspaceSizeInBytesOnDevice | Host | In | The size in bytes of the local device workspace needed by the routine as provided by cublasMpGemr2D_bufferSize(). |
| h_work | Host | Out | Host workspace of size workspaceSizeInBytesOnHost. |
| workspaceSizeInBytesOnHost | Host | In | The size in bytes of the local host workspace needed by the routine as provided by cublasMpGemr2D_bufferSize(). |
| global_comm | Host | In | A communicator containing at least the union of all processes in the communicators of A and B. All processes in the communicator must call this function, even if they do not own a piece of either matrix. |
cublasMpGemr2D_bufferSize#
cublasMpStatus_t cublasMpGemr2D_bufferSize(
cublasMpHandle_t handle,
int64_t m,
int64_t n,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
size_t* workspaceSizeInBytesOnDevice,
size_t* workspaceSizeInBytesOnHost,
ncclComm_t global_comm);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| m | Host | In | Number of rows of sub(A) and sub(B). |
| n | Host | In | Number of columns of sub(A) and sub(B). |
| a | Device | In | Pointer to the first entry of the local portion of the global matrix A. |
| ia | Host | In | Row index of the first row of sub(A). |
| ja | Host | In | Column index of the first column of sub(A). |
| descA | Host | In | Matrix descriptor associated with the global matrix A. descA’s grid value must be set to null in processes that are not part of the grid of A. |
| b | Device | In | Pointer to the first entry of the local portion of the global matrix B. |
| ib | Host | In | Row index of the first row of sub(B). |
| jb | Host | In | Column index of the first column of sub(B). |
| descB | Host | In | Matrix descriptor associated with the global matrix B. descB’s grid value must be set to null in processes that are not part of the grid of B. |
| workspaceSizeInBytesOnDevice | Host | Out | On output, contains the size in bytes of the local device workspace needed by cublasMpGemr2D(). |
| workspaceSizeInBytesOnHost | Host | Out | On output, contains the size in bytes of the local host workspace needed by cublasMpGemr2D(). |
| global_comm | Host | In | A communicator containing at least the union of all processes in the communicators of A and B. All processes in the communicator must call this function, even if they do not own a piece of either matrix. |
cublasMpTrmr2D#
cublasMpStatus_t cublasMpTrmr2D(
cublasMpHandle_t handle,
cublasFillMode_t uplo,
cublasDiagType_t diag,
int64_t m,
int64_t n,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
void* d_work,
size_t workspaceSizeInBytesOnDevice,
void* h_work,
size_t workspaceSizeInBytesOnHost,
ncclComm_t global_comm);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| uplo | Host | In | Indicates whether the lower or upper part of matrix A is stored; the other part is not referenced and is inferred from the stored elements. |
| diag | Host | In | Indicates whether the elements on the main diagonal of matrix A are unity and should not be accessed. |
| m | Host | In | Number of rows of sub(A) and sub(B). |
| n | Host | In | Number of columns of sub(A) and sub(B). |
| a | Device | In | Pointer to the first entry of the local portion of the global matrix A. |
| ia | Host | In | Row index of the first row of sub(A). |
| ja | Host | In | Column index of the first column of sub(A). |
| descA | Host | In | Matrix descriptor associated with the global matrix A. descA’s grid value must be set to null in processes that are not part of the grid of A. |
| b | Device | Out | Pointer to the first entry of the local portion of the global matrix B. |
| ib | Host | In | Row index of the first row of sub(B). |
| jb | Host | In | Column index of the first column of sub(B). |
| descB | Host | In | Matrix descriptor associated with the global matrix B. descB’s grid value must be set to null in processes that are not part of the grid of B. |
| d_work | Device | Out | Device workspace of size workspaceSizeInBytesOnDevice. |
| workspaceSizeInBytesOnDevice | Host | In | The size in bytes of the local device workspace needed by the routine as provided by cublasMpTrmr2D_bufferSize(). |
| h_work | Host | Out | Host workspace of size workspaceSizeInBytesOnHost. |
| workspaceSizeInBytesOnHost | Host | In | The size in bytes of the local host workspace needed by the routine as provided by cublasMpTrmr2D_bufferSize(). |
| global_comm | Host | In | A communicator containing at least the union of all processes in the communicators of A and B. All processes in the communicator must call this function, even if they do not own a piece of either matrix. |
cublasMpTrmr2D_bufferSize#
cublasMpStatus_t cublasMpTrmr2D_bufferSize(
cublasMpHandle_t handle,
cublasFillMode_t uplo,
cublasDiagType_t diag,
int64_t m,
int64_t n,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
size_t* workspaceSizeInBytesOnDevice,
size_t* workspaceSizeInBytesOnHost,
ncclComm_t global_comm);
| Parameter | Memory | In/Out | Description |
|---|---|---|---|
| handle | Host | In | cuBLASMp library handle. |
| uplo | Host | In | Indicates whether the lower or upper part of matrix A is stored; the other part is not referenced and is inferred from the stored elements. |
| diag | Host | In | Indicates whether the elements on the main diagonal of matrix A are unity and should not be accessed. |
| m | Host | In | Number of rows of sub(A) and sub(B). |
| n | Host | In | Number of columns of sub(A) and sub(B). |
| a | Device | In | Pointer to the first entry of the local portion of the global matrix A. |
| ia | Host | In | Row index of the first row of sub(A). |
| ja | Host | In | Column index of the first column of sub(A). |
| descA | Host | In | Matrix descriptor associated with the global matrix A. descA’s grid value must be set to null in processes that are not part of the grid of A. |
| b | Device | In | Pointer to the first entry of the local portion of the global matrix B. |
| ib | Host | In | Row index of the first row of sub(B). |
| jb | Host | In | Column index of the first column of sub(B). |
| descB | Host | In | Matrix descriptor associated with the global matrix B. descB’s grid value must be set to null in processes that are not part of the grid of B. |
| workspaceSizeInBytesOnDevice | Host | Out | On output, contains the size in bytes of the local device workspace needed by cublasMpTrmr2D(). |
| workspaceSizeInBytesOnHost | Host | Out | On output, contains the size in bytes of the local host workspace needed by cublasMpTrmr2D(). |
| global_comm | Host | In | A communicator containing at least the union of all processes in the communicators of A and B. All processes in the communicator must call this function, even if they do not own a piece of either matrix. |
Logging#
cublasMpLoggerSetCallback#
cublasMpStatus_t cublasMpLoggerSetCallback(
cublasMpLoggerCallback_t callback);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
callback |
Host |
In |
Pointer to a callback function. See cublasMpLoggerCallback_t. |
Warning
This is an experimental feature.
cublasMpLoggerSetFile#
cublasMpStatus_t cublasMpLoggerSetFile(
FILE *file);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
file |
Host |
In |
Pointer to an open file. File should have write permission. |
Warning
This is an experimental feature.
cublasMpLoggerOpenFile#
cublasMpStatus_t cublasMpLoggerOpenFile(
const char* logFile);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
logFile |
Host |
In |
Path of the logging output file. |
Warning
This is an experimental feature.
cublasMpLoggerSetLevel#
cublasMpStatus_t cublasMpLoggerSetLevel(
int level);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
level |
Host |
In |
Value of the logging level. See cuBLASMp Logging. |
Warning
This is an experimental feature.
cublasMpLoggerSetMask#
cublasMpStatus_t cublasMpLoggerSetMask(
int mask);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
mask |
Host |
In |
Value of the logging mask. See cuBLASMp Logging. |
Warning
This is an experimental feature.
cublasMpLoggerForceDisable#
cublasMpStatus_t cublasMpLoggerForceDisable();
Warning
This is an experimental feature.
Dense Linear Algebra APIs#
cublasMpTrsm#
cublasMpStatus_t cublasMpTrsm(
cublasMpHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
cublasDiagType_t diag,
int64_t m,
int64_t n,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
cublasComputeType_t computeType,
void* d_work,
size_t workspaceSizeInBytesOnDevice,
void* h_work,
size_t workspaceSizeInBytesOnHost);
\(\left\{ \begin{matrix} {\text{op}(A)X = \alpha B} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\ {X\text{op}(A) = \alpha B} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\ \end{matrix} \right.\)
\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.\)
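As an illustration of the first equation above, the following is a minimal single-process sketch in plain C, not the distributed cuBLASMp implementation. It covers only the case side == CUBLAS_SIDE_LEFT and trans == CUBLAS_OP_N on real double-precision, column-major data. The function name `ref_trsm_left_notrans`, the `IDX` macro, and the integer flags `lower` and `unit_diag` (standing in for `uplo` and `diag`) are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of entry (i, j) in a matrix with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Hypothetical reference: overwrites the m x n matrix B with X such that
 * A * X = alpha * B, where A is an m x m triangular matrix.
 * lower != 0     -> only the lower triangle of A is read (else the upper).
 * unit_diag != 0 -> the diagonal of A is treated as ones and never read. */
static void ref_trsm_left_notrans(int lower, int unit_diag, int m, int n,
                                  double alpha, const double *A, int lda,
                                  double *B, int ldb)
{
    assert(m >= 0 && n >= 0 && lda >= m && ldb >= m);

    for (int j = 0; j < n; ++j) {
        /* Scale the right-hand-side column by alpha. */
        for (int i = 0; i < m; ++i)
            B[IDX(i, j, ldb)] *= alpha;

        if (lower) {
            /* Forward substitution, top row first. */
            for (int i = 0; i < m; ++i) {
                double s = B[IDX(i, j, ldb)];
                for (int k = 0; k < i; ++k)
                    s -= A[IDX(i, k, lda)] * B[IDX(k, j, ldb)];
                B[IDX(i, j, ldb)] = unit_diag ? s : s / A[IDX(i, i, lda)];
            }
        } else {
            /* Backward substitution, bottom row first. */
            for (int i = m - 1; i >= 0; --i) {
                double s = B[IDX(i, j, ldb)];
                for (int k = i + 1; k < m; ++k)
                    s -= A[IDX(i, k, lda)] * B[IDX(k, j, ldb)];
                B[IDX(i, j, ldb)] = unit_diag ? s : s / A[IDX(i, i, lda)];
            }
        }
    }
}
```

As in the routine described here, the solution overwrites B in place, which is why `b` is an In/Out parameter.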
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
side |
Host |
In |
Indicates if matrix A is on the left or right of X. |
uplo |
Host |
In |
Indicates whether the lower or upper part of matrix A is stored; the other part is not referenced and is inferred from the stored elements. |
trans |
Host |
In |
Operation op(A): non-transpose, transpose, or conjugate transpose. |
diag |
Host |
In |
Indicates if the elements on the main diagonal of matrix A are unity and should not be accessed. |
m |
Host |
In |
Number of rows of matrix sub(B), with matrix sub(A) sized accordingly. |
n |
Host |
In |
Number of columns of matrix sub(B), with matrix sub(A) sized accordingly. |
alpha |
Host |
In |
Scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In/Out |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
computeType |
Host |
In |
cuBLAS compute type used for computations. See table below for supported combinations. |
d_work |
Device |
Out |
Device workspace of size workspaceSizeInBytesOnDevice. |
workspaceSizeInBytesOnDevice |
Host |
In |
The size in bytes of the local device workspace needed by the routine as provided by cublasMpTrsm_bufferSize(). |
h_work |
Host |
Out |
Host workspace of size workspaceSizeInBytesOnHost. |
workspaceSizeInBytesOnHost |
Host |
In |
The size in bytes of the local host workspace needed by the routine as provided by cublasMpTrsm_bufferSize(). |
| Compute Type | Scale Type (alpha and beta) | Atype/Btype | Ctype |
|---|---|---|---|
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, or CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, or CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
cublasMpTrsm_bufferSize#
cublasMpStatus_t cublasMpTrsm_bufferSize(
cublasMpHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
cublasDiagType_t diag,
int64_t m,
int64_t n,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
cublasComputeType_t computeType,
size_t* workspaceSizeInBytesOnDevice,
size_t* workspaceSizeInBytesOnHost);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
side |
Host |
In |
Indicates if matrix A is on the left or right of X. |
uplo |
Host |
In |
Indicates whether the lower or upper part of matrix A is stored; the other part is not referenced and is inferred from the stored elements. |
trans |
Host |
In |
Operation op(A): non-transpose, transpose, or conjugate transpose. |
diag |
Host |
In |
Indicates if the elements on the main diagonal of matrix A are unity and should not be accessed. |
m |
Host |
In |
Number of rows of matrix sub(B), with matrix sub(A) sized accordingly. |
n |
Host |
In |
Number of columns of matrix sub(B), with matrix sub(A) sized accordingly. |
alpha |
Host |
In |
Scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
computeType |
Host |
In |
cuBLAS compute type used for computations. See table below for supported combinations. |
workspaceSizeInBytesOnDevice |
Host |
Out |
On output, contains the size in bytes of the local device workspace needed by cublasMpTrsm(). |
workspaceSizeInBytesOnHost |
Host |
Out |
On output, contains the size in bytes of the local host workspace needed by cublasMpTrsm(). |
| Compute Type | Scale Type (alpha and beta) | Atype/Btype | Ctype |
|---|---|---|---|
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, or CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, or CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
cublasMpGemm#
cublasMpStatus_t cublasMpGemm(
cublasMpHandle_t handle,
cublasOperation_t transA,
cublasOperation_t transB,
int64_t m,
int64_t n,
int64_t k,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
void* d_work,
size_t workspaceSizeInBytesOnDevice,
void* h_work,
size_t workspaceSizeInBytesOnHost);
\(C = \alpha\text{op}(A)\text{op}(B) + \beta C\)
\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{transA == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{transA == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{transA == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.\)
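The update formula above can be sketched in a few lines of plain C. This is a hedged single-process reference on real double-precision, column-major data, not the distributed algorithm used by the library. The names `ref_gemm` and `op_at` and the integer transpose flags (0 for CUBLAS_OP_N, nonzero for CUBLAS_OP_T) are hypothetical; for real data, CUBLAS_OP_C coincides with CUBLAS_OP_T.

```c
#include <assert.h>
#include <stddef.h>

/* Entry (i, j) of op(M) for a column-major matrix M with leading dimension ld.
 * trans != 0 selects op(M) = M^T, so the stored entry (j, i) is returned. */
static double op_at(const double *M, int ld, int trans, int i, int j)
{
    return trans ? M[(size_t)i * (size_t)ld + (size_t)j]
                 : M[(size_t)j * (size_t)ld + (size_t)i];
}

/* Hypothetical reference for C = alpha * op(A) * op(B) + beta * C, where
 * op(A) is m x k, op(B) is k x n, and C is m x n (all column-major). */
static void ref_gemm(int transA, int transB, int m, int n, int k,
                     double alpha, const double *A, int lda,
                     const double *B, int ldb,
                     double beta, double *C, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0 && ldc >= m);

    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            /* Dot product of row i of op(A) with column j of op(B). */
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += op_at(A, lda, transA, i, p) * op_at(B, ldb, transB, p, j);

            size_t c = (size_t)j * (size_t)ldc + (size_t)i;
            C[c] = alpha * acc + beta * C[c];
        }
    }
}
```

The sketch makes the roles of m, n, and k explicit: m and n size the output, and k is the shared inner dimension of op(A) and op(B).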
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
transA |
Host |
In |
Operation op(A): non-transpose, transpose, or conjugate transpose. |
transB |
Host |
In |
Operation op(B): non-transpose, transpose, or conjugate transpose. |
m |
Host |
In |
Number of rows of sub(A) and sub(C). |
n |
Host |
In |
Number of columns of sub(B) and sub(C). |
k |
Host |
In |
Number of columns of sub(A) and rows of sub(B). |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
beta |
Host |
In |
<type> scalar used for multiplication. |
c |
Device |
In/Out |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. See table below for supported combinations. |
d_work |
Device |
Out |
Device workspace of size workspaceSizeInBytesOnDevice. |
workspaceSizeInBytesOnDevice |
Host |
In |
The size in bytes of the local device workspace needed by the routine as provided by cublasMpGemm_bufferSize(). |
h_work |
Host |
Out |
Host workspace of size workspaceSizeInBytesOnHost. |
workspaceSizeInBytesOnHost |
Host |
In |
The size in bytes of the local host workspace needed by the routine as provided by cublasMpGemm_bufferSize(). |
Note
This routine will internally call cublasMpMatmul() with d == c.
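| Compute Type | Scale Type (alpha and beta) | Atype/Btype | Ctype |
|---|---|---|---|
| CUBLAS_COMPUTE_16F or CUBLAS_COMPUTE_16F_PEDANTIC | CUDA_R_16F | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16BF | CUDA_R_16BF |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_8I | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16BF | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_C_32F | CUDA_C_8I | CUDA_C_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_32F_FAST_16F or CUBLAS_COMPUTE_32F_FAST_16BF or CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F_FAST_16F or CUBLAS_COMPUTE_32F_FAST_16BF or CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |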
computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
cublasMpGemm_bufferSize#
cublasMpStatus_t cublasMpGemm_bufferSize(
cublasMpHandle_t handle,
cublasOperation_t transA,
cublasOperation_t transB,
int64_t m,
int64_t n,
int64_t k,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
size_t* workspaceSizeInBytesOnDevice,
size_t* workspaceSizeInBytesOnHost);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
transA |
Host |
In |
Operation op(A): non-transpose, transpose, or conjugate transpose. |
transB |
Host |
In |
Operation op(B): non-transpose, transpose, or conjugate transpose. |
m |
Host |
In |
Number of rows of sub(A) and sub(C). |
n |
Host |
In |
Number of columns of sub(B) and sub(C). |
k |
Host |
In |
Number of columns of sub(A) and rows of sub(B). |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
beta |
Host |
In |
<type> scalar used for multiplication. |
c |
Device |
In |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. See table below for supported combinations. |
workspaceSizeInBytesOnDevice |
Host |
Out |
On output, contains the size in bytes of the local device workspace needed by cublasMpGemm(). |
workspaceSizeInBytesOnHost |
Host |
Out |
On output, contains the size in bytes of the local host workspace needed by cublasMpGemm(). |
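| Compute Type | Scale Type (alpha and beta) | Atype/Btype | Ctype |
|---|---|---|---|
| CUBLAS_COMPUTE_16F or CUBLAS_COMPUTE_16F_PEDANTIC | CUDA_R_16F | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16BF | CUDA_R_16BF |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_8I | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16BF | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_C_32F | CUDA_C_8I | CUDA_C_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_32F_FAST_16F or CUBLAS_COMPUTE_32F_FAST_16BF or CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F_FAST_16F or CUBLAS_COMPUTE_32F_FAST_16BF or CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |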
computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
cublasMpMatmul#
cublasMpStatus_t cublasMpMatmul(
cublasMpHandle_t handle,
cublasMpMatmulDescriptor_t matmulDesc,
int64_t m,
int64_t n,
int64_t k,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
const void* beta,
const void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
void* d,
int64_t id,
int64_t jd,
cublasMpMatrixDescriptor_t descD,
void* d_work,
size_t workspaceSizeInBytesOnDevice,
void* h_work,
size_t workspaceSizeInBytesOnHost);
\(D = \alpha\text{op}(A)\text{op}(B) + \beta C\)
\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{TRANSA == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{TRANSA == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{TRANSA == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.\)
FP8 Support:
FP8 matrix multiplication is only supported for the TN format, i.e. \(A^T * B\), with CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_TRANSA == CUBLAS_OP_T and CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_TRANSB == CUBLAS_OP_N, on Compute Capability 9.0+ GPUs.
To use tensor-scaled FP8 kernels, the following set of requirements must be satisfied:
All matrix dimensions must meet the optimal requirements listed in Tensor Core Usage (i.e. pointers and matrix dimensions must support 16-byte alignment).
The compute type must be CUBLAS_COMPUTE_32F.
The scale type must be CUDA_R_32F.
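The FP8 conditions listed above can be checked on the host before calling the routine. The following is a minimal sketch under stated assumptions: the struct `sketch_fp8_request`, the enum `sketch_op_t`, and both functions are hypothetical helpers written for this example and are not part of cuBLASMp. It also assumes that "16-byte alignment" of a dimension means a multiple of 16 elements, which holds for 1-byte FP8 operands.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the cublasOperation_t values named in the text. */
typedef enum { SKETCH_OP_N, SKETCH_OP_T, SKETCH_OP_C } sketch_op_t;

/* Hypothetical summary of one FP8 matmul request; not a cuBLASMp type. */
typedef struct {
    sketch_op_t transa, transb;  /* values of the TRANSA / TRANSB attributes */
    int cc_major;                /* GPU compute capability, major version */
    bool compute_is_32f;         /* compute type is CUBLAS_COMPUTE_32F */
    bool scale_is_r32f;          /* scale type is CUDA_R_32F */
    const void *a, *b;           /* local FP8 operand pointers */
    int64_t m, n, k;             /* matrix dimensions in elements */
} sketch_fp8_request;

static bool ptr_aligned16(const void *p) { return ((uintptr_t)p % 16u) == 0u; }
static bool dim_multiple_of_16(int64_t d) { return d % 16 == 0; }

/* True when the request uses the TN format on a CC 9.0+ GPU. */
static bool fp8_supported(const sketch_fp8_request *r)
{
    assert(r != NULL);
    return r->transa == SKETCH_OP_T && r->transb == SKETCH_OP_N
        && r->cc_major >= 9;
}

/* True when the additional tensor-scaled FP8 requirements also hold. */
static bool fp8_tensor_scaled_ok(const sketch_fp8_request *r)
{
    return fp8_supported(r)
        && ptr_aligned16(r->a) && ptr_aligned16(r->b)
        && dim_multiple_of_16(r->m) && dim_multiple_of_16(r->n)
        && dim_multiple_of_16(r->k)
        && r->compute_is_32f && r->scale_is_r32f;
}
```

A caller could run such a check once per problem shape and fall back to a non-FP8 data type combination when it fails.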
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
matmulDesc |
Host |
In |
Descriptor of the operation to perform, created with cublasMpMatmulDescriptorCreate(). |
m |
Host |
In |
Number of rows of sub(A) and sub(C). |
n |
Host |
In |
Number of columns of sub(B) and sub(C). |
k |
Host |
In |
Number of columns of sub(A) and rows of sub(B). |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
beta |
Host |
In |
<type> scalar used for multiplication. |
c |
Device |
In |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. May be |
d |
Device |
Out |
Pointer to the first entry of the local portion of the global matrix D. |
id |
Host |
In |
Row index of the first row of the sub(D). |
jd |
Host |
In |
Column index of the first column of the sub(D). |
descD |
Host |
In |
Matrix descriptor associated to the global matrix D. |
d_work |
Device |
Out |
Device workspace of size workspaceSizeInBytesOnDevice. |
workspaceSizeInBytesOnDevice |
Host |
In |
The size in bytes of the local device workspace needed by the routine as provided by cublasMpMatmul_bufferSize(). |
h_work |
Host |
Out |
Host workspace of size workspaceSizeInBytesOnHost. |
workspaceSizeInBytesOnHost |
Host |
In |
The size in bytes of the local host workspace needed by the routine as provided by cublasMpMatmul_bufferSize(). |
| Compute Type | Scale Type (alpha and beta) | Atype | Btype | Ctype | Dtype |
|---|---|---|---|---|---|
| CUBLAS_COMPUTE_16F or CUBLAS_COMPUTE_16F_PEDANTIC | CUDA_R_16F | CUDA_R_16F | CUDA_R_16F | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16F | CUDA_R_16F | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_8I | CUDA_R_8I | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16F | CUDA_R_16F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_C_32F | CUDA_C_8I | CUDA_C_8I | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E4M3 | CUDA_R_16BF | CUDA_R_16BF |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E4M3 | CUDA_R_16BF | CUDA_R_8F_E4M3 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E4M3 | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E4M3 | CUDA_R_16F | CUDA_R_8F_E4M3 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E4M3 | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E5M2 | CUDA_R_16BF | CUDA_R_16BF |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E5M2 | CUDA_R_16BF | CUDA_R_8F_E4M3 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E5M2 | CUDA_R_16BF | CUDA_R_8F_E5M2 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E5M2 | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E5M2 | CUDA_R_16F | CUDA_R_8F_E4M3 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E5M2 | CUDA_R_16F | CUDA_R_8F_E5M2 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E5M2 | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E5M2 | CUDA_R_8F_E4M3 | CUDA_R_16BF | CUDA_R_16BF |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E5M2 | CUDA_R_8F_E4M3 | CUDA_R_16BF | CUDA_R_8F_E4M3 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E5M2 | CUDA_R_8F_E4M3 | CUDA_R_16BF | CUDA_R_8F_E5M2 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E5M2 | CUDA_R_8F_E4M3 | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E5M2 | CUDA_R_8F_E4M3 | CUDA_R_16F | CUDA_R_8F_E4M3 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E5M2 | CUDA_R_8F_E4M3 | CUDA_R_16F | CUDA_R_8F_E5M2 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E5M2 | CUDA_R_8F_E4M3 | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_4F_E2M1 | CUDA_R_4F_E2M1 | CUDA_R_16BF | CUDA_R_16BF |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_4F_E2M1 | CUDA_R_4F_E2M1 | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_4F_E2M1 | CUDA_R_4F_E2M1 | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_4F_E2M1 | CUDA_R_4F_E2M1 | CUDA_R_16BF | CUDA_R_4F_E2M1 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_4F_E2M1 | CUDA_R_4F_E2M1 | CUDA_R_16F | CUDA_R_4F_E2M1 |
| CUBLAS_COMPUTE_32F_FAST_16F or CUBLAS_COMPUTE_32F_FAST_16BF or CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F_FAST_16F or CUBLAS_COMPUTE_32F_FAST_16BF or CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
NVFP4 requirements#
NVFP4 requires Compute Capability 10.0 and above.
Compute Type must be CUBLAS_COMPUTE_32F.
Scale Type must be CUDA_R_32F.
Scaling mode must be CUBLASMP_MATMUL_MATRIX_SCALE_VEC16_UE4M3.
computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
cublasMpMatmul_bufferSize#
cublasMpStatus_t cublasMpMatmul_bufferSize(
cublasMpHandle_t handle,
cublasMpMatmulDescriptor_t matmulDesc,
int64_t m,
int64_t n,
int64_t k,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
const void* beta,
const void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
void* d,
int64_t id,
int64_t jd,
cublasMpMatrixDescriptor_t descD,
size_t* workspaceSizeInBytesOnDevice,
size_t* workspaceSizeInBytesOnHost);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
matmulDesc |
Host |
In |
Descriptor of the operation to perform, created with cublasMpMatmulDescriptorCreate(). |
m |
Host |
In |
Number of rows of sub(A) and sub(C). |
n |
Host |
In |
Number of columns of sub(B) and sub(C). |
k |
Host |
In |
Number of columns of sub(A) and rows of sub(B). |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
beta |
Host |
In |
<type> scalar used for multiplication. |
c |
Device |
In |
Pointer to the first entry of the local portion of the global matrix C. May be |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. May be |
d |
Device |
Out |
Pointer to the first entry of the local portion of the global matrix D. |
id |
Host |
In |
Row index of the first row of the sub(D). |
jd |
Host |
In |
Column index of the first column of the sub(D). |
descD |
Host |
In |
Matrix descriptor associated to the global matrix D. |
workspaceSizeInBytesOnDevice |
Host |
Out |
On output, contains the size in bytes of the local device workspace needed by cublasMpMatmul(). |
workspaceSizeInBytesOnHost |
Host |
Out |
On output, contains the size in bytes of the local host workspace needed by cublasMpMatmul(). |
| Compute Type | Scale Type (alpha and beta) | Atype | Btype | Ctype | Dtype |
|---|---|---|---|---|---|
| CUBLAS_COMPUTE_16F or CUBLAS_COMPUTE_16F_PEDANTIC | CUDA_R_16F | CUDA_R_16F | CUDA_R_16F | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16F | CUDA_R_16F | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_8I | CUDA_R_8I | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_16F | CUDA_R_16F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_C_32F | CUDA_C_8I | CUDA_C_8I | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_32F or CUBLAS_COMPUTE_32F_PEDANTIC | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E4M3 | CUDA_R_16BF | CUDA_R_16BF |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E4M3 | CUDA_R_16BF | CUDA_R_8F_E4M3 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E4M3 | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E4M3 | CUDA_R_16F | CUDA_R_8F_E4M3 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E4M3 | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E5M2 | CUDA_R_16BF | CUDA_R_16BF |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E5M2 | CUDA_R_16BF | CUDA_R_8F_E4M3 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E5M2 | CUDA_R_16BF | CUDA_R_8F_E5M2 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E5M2 | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E5M2 | CUDA_R_16F | CUDA_R_8F_E4M3 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E5M2 | CUDA_R_16F | CUDA_R_8F_E5M2 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E4M3 | CUDA_R_8F_E5M2 | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E5M2 | CUDA_R_8F_E4M3 | CUDA_R_16BF | CUDA_R_16BF |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E5M2 | CUDA_R_8F_E4M3 | CUDA_R_16BF | CUDA_R_8F_E4M3 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E5M2 | CUDA_R_8F_E4M3 | CUDA_R_16BF | CUDA_R_8F_E5M2 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E5M2 | CUDA_R_8F_E4M3 | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E5M2 | CUDA_R_8F_E4M3 | CUDA_R_16F | CUDA_R_8F_E4M3 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E5M2 | CUDA_R_8F_E4M3 | CUDA_R_16F | CUDA_R_8F_E5M2 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_8F_E5M2 | CUDA_R_8F_E4M3 | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_4F_E2M1 | CUDA_R_4F_E2M1 | CUDA_R_16BF | CUDA_R_16BF |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_4F_E2M1 | CUDA_R_4F_E2M1 | CUDA_R_16F | CUDA_R_16F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_4F_E2M1 | CUDA_R_4F_E2M1 | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_4F_E2M1 | CUDA_R_4F_E2M1 | CUDA_R_16BF | CUDA_R_4F_E2M1 |
| CUBLAS_COMPUTE_32F | CUDA_R_32F | CUDA_R_4F_E2M1 | CUDA_R_4F_E2M1 | CUDA_R_16F | CUDA_R_4F_E2M1 |
| CUBLAS_COMPUTE_32F_FAST_16F or CUBLAS_COMPUTE_32F_FAST_16BF or CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F_FAST_16F or CUBLAS_COMPUTE_32F_FAST_16BF or CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
NVFP4 requirements#
NVFP4 requires Compute Capability 10.0 and above.
Compute Type must be CUBLAS_COMPUTE_32F.
Scale Type must be CUDA_R_32F.
Scaling mode must be CUBLASMP_MATMUL_MATRIX_SCALE_VEC16_UE4M3.
computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
cublasMpSyrk#
cublasMpStatus_t cublasMpSyrk(
cublasMpHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int64_t n,
int64_t k,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
void* d_work,
size_t workspaceSizeInBytesOnDevice,
void* h_work,
size_t workspaceSizeInBytesOnHost);
\(C = \alpha\text{op}(A)\text{op}(A)^{T} + \beta C\)
\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ \end{matrix} \right.\)
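To make the rank-k update above concrete, here is a hedged single-process sketch in plain C on real double-precision, column-major data. It is not the distributed implementation. The name `ref_syrk` and the integer flags `lower` (standing in for `uplo`) and `trans` (0 for CUBLAS_OP_N, nonzero for CUBLAS_OP_T) are hypothetical. Only the triangle of C selected by `lower` is read and written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference for C = alpha * op(A) * op(A)^T + beta * C, where
 * op(A) is n x k and C is n x n, both column-major.
 * lower != 0 -> update only entries with i >= j; otherwise only i <= j. */
static void ref_syrk(int lower, int trans, int n, int k, double alpha,
                     const double *A, int lda,
                     double beta, double *C, int ldc)
{
    assert(n >= 0 && k >= 0 && ldc >= n);

    for (int j = 0; j < n; ++j) {
        int i_begin = lower ? j : 0;
        int i_end   = lower ? n : j + 1;

        for (int i = i_begin; i < i_end; ++i) {
            /* Dot product of rows i and j of op(A). */
            double acc = 0.0;
            for (int p = 0; p < k; ++p) {
                double a_ip = trans ? A[(size_t)i * (size_t)lda + (size_t)p]
                                    : A[(size_t)p * (size_t)lda + (size_t)i];
                double a_jp = trans ? A[(size_t)j * (size_t)lda + (size_t)p]
                                    : A[(size_t)p * (size_t)lda + (size_t)j];
                acc += a_ip * a_jp;
            }
            size_t c = (size_t)j * (size_t)ldc + (size_t)i;
            C[c] = alpha * acc + beta * C[c];
        }
    }
}
```

Because the result is symmetric, computing one triangle is enough; entries in the other triangle are left untouched.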
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
uplo |
Host |
In |
Indicates whether the lower or upper part of matrix C is stored; the other symmetric part is not referenced and is inferred from the stored elements. |
trans |
Host |
In |
Operation op(A): non-transpose or transpose. |
n |
Host |
In |
Number of rows of sub(A) and sub(C). |
k |
Host |
In |
Number of columns of sub(A). |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
beta |
Host |
In |
<type> scalar used for multiplication. |
c |
Device |
In/Out |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. See table below for supported combinations. |
d_work |
Device |
Out |
Device workspace of size workspaceSizeInBytesOnDevice. |
workspaceSizeInBytesOnDevice |
Host |
In |
The size in bytes of the local device workspace needed by the routine as provided by cublasMpSyrk_bufferSize(). |
h_work |
Host |
Out |
Host workspace of size workspaceSizeInBytesOnHost. |
workspaceSizeInBytesOnHost |
Host |
In |
The size in bytes of the local host workspace needed by the routine as provided by cublasMpSyrk_bufferSize(). |
| Compute Type | Scale Type (alpha and beta) | Atype/Btype | Ctype |
|---|---|---|---|
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, or CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, or CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F or CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
cublasMpSyrk_bufferSize#
cublasMpStatus_t cublasMpSyrk_bufferSize(
cublasMpHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int64_t n,
int64_t k,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
size_t* workspaceSizeInBytesOnDevice,
size_t* workspaceSizeInBytesOnHost);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
uplo |
Host |
In |
Indicates whether the lower or upper part of matrix C is stored; the other symmetric part is not referenced and is inferred from the stored elements. |
trans |
Host |
In |
Operation op(A): non-transpose or transpose. |
n |
Host |
In |
Number of rows of sub(A) and sub(C). |
k |
Host |
In |
Number of columns of sub(A). |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
beta |
Host |
In |
<type> scalar used for multiplication. |
c |
Device |
In |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. See table below for supported combinations. |
workspaceSizeInBytesOnDevice |
Host |
Out |
On output, contains the size in bytes of the local device workspace needed by cublasMpSyrk(). |
workspaceSizeInBytesOnHost |
Host |
Out |
On output, contains the size in bytes of the local host workspace needed by cublasMpSyrk(). |
| Compute Type | Scale Type (alpha and beta) | Atype/Btype | Ctype |
|---|---|---|---|
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F, CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F, CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
cublasMpSyr2k#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpSyr2k(
cublasMpHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int64_t n,
int64_t k,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
void* d_work,
size_t workspaceSizeInBytesOnDevice,
void* h_work,
size_t workspaceSizeInBytesOnHost);
\(C = \alpha\left(\text{op}(A)\text{op}(B)^{T} + \text{op}(B)\text{op}(A)^{T}\right) + \beta C\)
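To make the update above concrete, here is a minimal single-process sketch in plain C. It assumes real double-precision data, column-major storage, trans == CUBLAS_OP_N, and that the lower triangle of C is stored; for real types both rank-k products are scaled by the same alpha. The name ref_syr2k_lower and its leading-dimension arguments are hypothetical and are not part of cuBLASMp, and the sketch says nothing about how the distributed routine is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference, not a cuBLASMp API.
 * Updates the lower triangle of C = alpha * (A*B^T + B*A^T) + beta * C
 * for real, column-major A (n x k), B (n x k) and C (n x n). */
static void ref_syr2k_lower(size_t n, size_t k, double alpha,
                            const double *a, size_t lda,
                            const double *b, size_t ldb,
                            double beta, double *c, size_t ldc)
{
    assert(lda >= n && ldb >= n && ldc >= n);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = j; i < n; ++i) {      /* lower triangle only: i >= j */
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p) {
                acc += a[i + p * lda] * b[j + p * ldb]   /* (A * B^T)(i,j) */
                     + b[i + p * ldb] * a[j + p * lda];  /* (B * A^T)(i,j) */
            }
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}
```

Entries above the diagonal of C are neither read nor written here, which mirrors the uplo description: the other triangle is implied by symmetry.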
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
uplo |
Host |
In |
Indicates if matrix C lower or upper part is stored, the other symmetric part is not referenced and is inferred from the stored elements. |
trans |
Host |
In |
Operation op(A) that is non- or transpose. |
n |
Host |
In |
Number of rows of op(A), op(B), and C. |
k |
Host |
In |
Number of columns of op(A) and op(B). |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
beta |
Host |
In |
<type> scalar used for multiplication. |
c |
Device |
In/Out |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. |
d_work |
Device |
Out |
Device workspace of size workspaceSizeInBytesOnDevice. |
workspaceSizeInBytesOnDevice |
Host |
In |
The size in bytes of the local device workspace needed by the routine as provided by cublasMpSyr2k_bufferSize(). |
h_work |
Host |
Out |
Host workspace of size workspaceSizeInBytesOnHost. |
workspaceSizeInBytesOnHost |
Host |
In |
The size in bytes of the local host workspace needed by the routine as provided by cublasMpSyr2k_bufferSize(). |
| Compute Type | Scale Type (alpha and beta) | Atype/Btype | Ctype |
|---|---|---|---|
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F, CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F, CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
cublasMpSyr2k_bufferSize#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpSyr2k_bufferSize(
cublasMpHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int64_t n,
int64_t k,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
size_t* workspaceSizeInBytesOnDevice,
size_t* workspaceSizeInBytesOnHost);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
uplo |
Host |
In |
Indicates if matrix C lower or upper part is stored. |
trans |
Host |
In |
Operation op(A) that is non- or transpose. |
n |
Host |
In |
Number of rows of op(A), op(B), and C. |
k |
Host |
In |
Number of columns of op(A) and op(B). |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
beta |
Host |
In |
<type> scalar used for multiplication. |
c |
Device |
In |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. |
workspaceSizeInBytesOnDevice |
Host |
Out |
On output, contains the size in bytes of the local device workspace needed by cublasMpSyr2k(). |
workspaceSizeInBytesOnHost |
Host |
Out |
On output, contains the size in bytes of the local host workspace needed by cublasMpSyr2k(). |
cublasMpSyrkx#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpSyrkx(
cublasMpHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int64_t n,
int64_t k,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
void* d_work,
size_t workspaceSizeInBytesOnDevice,
void* h_work,
size_t workspaceSizeInBytesOnHost);
\(C = \alpha\text{op}(A)\text{op}(B)^{T} + \beta C\)
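The update above can be sketched on a single process as follows. This is only an illustrative reference in plain C, assuming real double-precision data, column-major storage, trans == CUBLAS_OP_N and a lower fill mode; the name ref_syrkx_lower is hypothetical and not part of cuBLASMp. As with the cuBLAS routine of the same name, the caller is assumed to supply A and B such that the product is symmetric (for example, B equal to A scaled by a diagonal matrix), because only one triangle of C is written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference, not a cuBLASMp API.
 * Updates the lower triangle of C = alpha * A * B^T + beta * C for real,
 * column-major A (n x k), B (n x k) and C (n x n). The caller is assumed
 * to guarantee that A * B^T is symmetric. */
static void ref_syrkx_lower(size_t n, size_t k, double alpha,
                            const double *a, size_t lda,
                            const double *b, size_t ldb,
                            double beta, double *c, size_t ldc)
{
    assert(lda >= n && ldb >= n && ldc >= n);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = j; i < n; ++i) {      /* lower triangle only: i >= j */
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p) {
                acc += a[i + p * lda] * b[j + p * ldb];  /* (A * B^T)(i,j) */
            }
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}
```

Compared with a rank-2k update, only a single product is accumulated, so the symmetry of the result depends entirely on the relationship between A and B.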
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
uplo |
Host |
In |
Indicates if matrix C lower or upper part is stored, the other symmetric part is not referenced and is inferred from the stored elements. |
trans |
Host |
In |
Operation op(A) that is non- or transpose. |
n |
Host |
In |
Number of rows of op(A), op(B), and C. |
k |
Host |
In |
Number of columns of op(A) and op(B). |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
beta |
Host |
In |
<type> scalar used for multiplication. |
c |
Device |
In/Out |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. |
d_work |
Device |
Out |
Device workspace of size workspaceSizeInBytesOnDevice. |
workspaceSizeInBytesOnDevice |
Host |
In |
The size in bytes of the local device workspace needed by the routine as provided by cublasMpSyrkx_bufferSize(). |
h_work |
Host |
Out |
Host workspace of size workspaceSizeInBytesOnHost. |
workspaceSizeInBytesOnHost |
Host |
In |
The size in bytes of the local host workspace needed by the routine as provided by cublasMpSyrkx_bufferSize(). |
| Compute Type | Scale Type (alpha and beta) | Atype/Btype | Ctype |
|---|---|---|---|
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F, CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F, CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
cublasMpSyrkx_bufferSize#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpSyrkx_bufferSize(
cublasMpHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int64_t n,
int64_t k,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
size_t* workspaceSizeInBytesOnDevice,
size_t* workspaceSizeInBytesOnHost);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
uplo |
Host |
In |
Indicates if matrix C lower or upper part is stored. |
trans |
Host |
In |
Operation op(A) that is non- or transpose. |
n |
Host |
In |
Number of rows of op(A), op(B), and C. |
k |
Host |
In |
Number of columns of op(A) and op(B). |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
beta |
Host |
In |
<type> scalar used for multiplication. |
c |
Device |
In |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. |
workspaceSizeInBytesOnDevice |
Host |
Out |
On output, contains the size in bytes of the local device workspace needed by cublasMpSyrkx(). |
workspaceSizeInBytesOnHost |
Host |
Out |
On output, contains the size in bytes of the local host workspace needed by cublasMpSyrkx(). |
cublasMpSymm#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpSymm(
cublasMpHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
int64_t m,
int64_t n,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
void* d_work,
size_t workspaceSizeInBytesOnDevice,
void* h_work,
size_t workspaceSizeInBytesOnHost);
\(C = \alpha A B + \beta C \quad \text{(side = LEFT)}\)
\(C = \alpha B A + \beta C \quad \text{(side = RIGHT)}\)
where \(A\) is a symmetric matrix of dimension \(m \times m\) when side == LEFT and \(n \times n\) when side == RIGHT. For complex types, this computes the Hermitian matrix multiply (HEMM).
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
side |
Host |
In |
Indicates if the symmetric matrix A appears on the left or right of B. |
uplo |
Host |
In |
Indicates if matrix A lower or upper part is stored, the other symmetric part is not referenced and is inferred from the stored elements. |
m |
Host |
In |
Number of rows of B and C. |
n |
Host |
In |
Number of columns of B and C. |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global symmetric matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
beta |
Host |
In |
<type> scalar used for multiplication. |
c |
Device |
In/Out |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. |
d_work |
Device |
Out |
Device workspace of size workspaceSizeInBytesOnDevice. |
workspaceSizeInBytesOnDevice |
Host |
In |
The size in bytes of the local device workspace needed by the routine as provided by cublasMpSymm_bufferSize(). |
h_work |
Host |
Out |
Host workspace of size workspaceSizeInBytesOnHost. |
workspaceSizeInBytesOnHost |
Host |
In |
The size in bytes of the local host workspace needed by the routine as provided by cublasMpSymm_bufferSize(). |
| Compute Type | Scale Type (alpha and beta) | Atype/Btype | Ctype |
|---|---|---|---|
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F, CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F, CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
cublasMpSymm_bufferSize#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpSymm_bufferSize(
cublasMpHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
int64_t m,
int64_t n,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
size_t* workspaceSizeInBytesOnDevice,
size_t* workspaceSizeInBytesOnHost);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
side |
Host |
In |
Indicates if the symmetric matrix A appears on the left or right of B. |
uplo |
Host |
In |
Indicates if matrix A lower or upper part is stored. |
m |
Host |
In |
Number of rows of B and C. |
n |
Host |
In |
Number of columns of B and C. |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global symmetric matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
beta |
Host |
In |
<type> scalar used for multiplication. |
c |
Device |
In |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. |
workspaceSizeInBytesOnDevice |
Host |
Out |
On output, contains the size in bytes of the local device workspace needed by cublasMpSymm(). |
workspaceSizeInBytesOnHost |
Host |
Out |
On output, contains the size in bytes of the local host workspace needed by cublasMpSymm(). |
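For reference, the side == LEFT case of the cublasMpSymm operation, \(C = \alpha A B + \beta C\), can be sketched on a single process in plain C. The sketch assumes real double-precision data, column-major storage and a lower fill mode for A; the names sym_at_lower and ref_symm_left_lower are hypothetical and not part of cuBLASMp. Its main purpose is to show that only the stored triangle of A is ever read.

```c
#include <assert.h>
#include <stddef.h>

/* Read A(i,j) of a symmetric matrix while touching only the stored
 * lower triangle (hypothetical helper, not a cuBLASMp API). */
static double sym_at_lower(const double *a, size_t lda, size_t i, size_t j)
{
    return (i >= j) ? a[i + j * lda] : a[j + i * lda];
}

/* Hypothetical host-side reference for C = alpha * A * B + beta * C with
 * A symmetric (m x m, lower stored), B and C of size m x n, column-major. */
static void ref_symm_left_lower(size_t m, size_t n, double alpha,
                                const double *a, size_t lda,
                                const double *b, size_t ldb,
                                double beta, double *c, size_t ldc)
{
    assert(lda >= m && ldb >= m && ldc >= m);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < m; ++p) {
                acc += sym_at_lower(a, lda, i, p) * b[p + j * ldb];
            }
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}
```

The unreferenced triangle of A may hold arbitrary values; in this sketch it has no influence on the result.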
cublasMpTrmm#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpTrmm(
cublasMpHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
cublasDiagType_t diag,
int64_t m,
int64_t n,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
void* d_work,
size_t workspaceSizeInBytesOnDevice,
void* h_work,
size_t workspaceSizeInBytesOnHost);
\(C = \alpha\text{op}(A) B \quad \text{(side = LEFT)}\)
\(C = \alpha B\text{op}(A) \quad \text{(side = RIGHT)}\)
where \(A\) is a triangular matrix of dimension \(m \times m\) when side == LEFT and \(n \times n\) when side == RIGHT. Also, for matrix \(A\):
\(\text{op}(A) = \begin{cases} A & \text{if trans == CUBLAS\_OP\_N} \\ A^{T} & \text{if trans == CUBLAS\_OP\_T} \\ A^{H} & \text{if trans == CUBLAS\_OP\_C} \end{cases}\)
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
side |
Host |
In |
Indicates if the triangular matrix A appears on the left or right of B. |
uplo |
Host |
In |
Indicates if matrix A lower or upper part is stored, the other part is not referenced. |
trans |
Host |
In |
Operation op(A) that is non- or (conj.) transpose. |
diag |
Host |
In |
Indicates if diagonal elements of A are unity. |
m |
Host |
In |
Number of rows of B and C. |
n |
Host |
In |
Number of columns of B and C. |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global triangular matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
c |
Device |
Out |
Pointer to the first entry of the local portion of the global output matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. |
d_work |
Device |
Out |
Device workspace of size workspaceSizeInBytesOnDevice. |
workspaceSizeInBytesOnDevice |
Host |
In |
The size in bytes of the local device workspace needed by the routine as provided by cublasMpTrmm_bufferSize(). |
h_work |
Host |
Out |
Host workspace of size workspaceSizeInBytesOnHost. |
workspaceSizeInBytesOnHost |
Host |
In |
The size in bytes of the local host workspace needed by the routine as provided by cublasMpTrmm_bufferSize(). |
| Compute Type | Scale Type (alpha) | Atype/Btype | Ctype |
|---|---|---|---|
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F, CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F, CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
cublasMpTrmm_bufferSize#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpTrmm_bufferSize(
cublasMpHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
cublasDiagType_t diag,
int64_t m,
int64_t n,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
size_t* workspaceSizeInBytesOnDevice,
size_t* workspaceSizeInBytesOnHost);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
side |
Host |
In |
Indicates if the triangular matrix A appears on the left or right of B. |
uplo |
Host |
In |
Indicates if matrix A lower or upper part is stored. |
trans |
Host |
In |
Operation op(A) that is non- or (conj.) transpose. |
diag |
Host |
In |
Indicates if diagonal elements of A are unity. |
m |
Host |
In |
Number of rows of B and C. |
n |
Host |
In |
Number of columns of B and C. |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global triangular matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
c |
Device |
In |
Pointer to the first entry of the local portion of the global output matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. |
workspaceSizeInBytesOnDevice |
Host |
Out |
On output, contains the size in bytes of the local device workspace needed by cublasMpTrmm(). |
workspaceSizeInBytesOnHost |
Host |
Out |
On output, contains the size in bytes of the local host workspace needed by cublasMpTrmm(). |
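For reference, the side == LEFT case of the cublasMpTrmm operation, \(C = \alpha\,\text{op}(A) B\), can be sketched on a single process in plain C. The sketch assumes real double-precision data, column-major storage, a lower fill mode and trans == CUBLAS_OP_N; the name ref_trmm_left_lower and the unit_diag flag (standing in for the diag parameter) are hypothetical and not part of cuBLASMp. Note that the result goes to a separate output matrix C, as in the signature above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference, not a cuBLASMp API.
 * Computes C = alpha * A * B with A lower triangular (m x m) and
 * B, C of size m x n, all column-major. When unit_diag is nonzero the
 * diagonal of A is taken to be 1 and the stored diagonal is not read. */
static void ref_trmm_left_lower(size_t m, size_t n, int unit_diag,
                                double alpha,
                                const double *a, size_t lda,
                                const double *b, size_t ldb,
                                double *c, size_t ldc)
{
    assert(lda >= m && ldb >= m && ldc >= m);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            /* Diagonal term of row i. */
            double acc = (unit_diag ? 1.0 : a[i + i * lda]) * b[i + j * ldb];
            for (size_t p = 0; p < i; ++p) {  /* strictly lower part of row i */
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = alpha * acc;
        }
    }
}
```

Entries of A above the diagonal are never read, which matches the uplo description: the other part is not referenced.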
cublasMpHerk#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpHerk(
cublasMpHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int64_t n,
int64_t k,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
void* d_work,
size_t workspaceSizeInBytesOnDevice,
void* h_work,
size_t workspaceSizeInBytesOnHost);
\(C = \alpha\text{op}(A)\text{op}(A)^{H} + \beta C\)
where \(\text{op}(A)\) is \(A\) if trans == CUBLAS_OP_N, or \(A^{H}\) if trans == CUBLAS_OP_C.
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
uplo |
Host |
In |
Indicates if matrix C lower or upper part is stored, the other Hermitian part is not referenced and is inferred from the stored elements. |
trans |
Host |
In |
Operation op(A) that is non- or conjugate-transpose. |
n |
Host |
In |
Number of rows of op(A) and C. |
k |
Host |
In |
Number of columns of op(A). |
alpha |
Host |
In |
Real scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
beta |
Host |
In |
Real scalar used for multiplication. |
c |
Device |
In/Out |
Pointer to the first entry of the local portion of the global Hermitian matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. |
d_work |
Device |
Out |
Device workspace of size workspaceSizeInBytesOnDevice. |
workspaceSizeInBytesOnDevice |
Host |
In |
The size in bytes of the local device workspace needed by the routine as provided by cublasMpHerk_bufferSize(). |
h_work |
Host |
Out |
Host workspace of size workspaceSizeInBytesOnHost. |
workspaceSizeInBytesOnHost |
Host |
In |
The size in bytes of the local host workspace needed by the routine as provided by cublasMpHerk_bufferSize(). |
| Compute Type | Scale Type (alpha and beta) | Atype | Ctype |
|---|---|---|---|
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_R_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F, CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_R_64F | CUDA_C_64F | CUDA_C_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_R_64F | CUDA_C_64F | CUDA_C_64F |
The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
cublasMpHerk_bufferSize#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpHerk_bufferSize(
cublasMpHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int64_t n,
int64_t k,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
size_t* workspaceSizeInBytesOnDevice,
size_t* workspaceSizeInBytesOnHost);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
uplo |
Host |
In |
Indicates if matrix C lower or upper part is stored. |
trans |
Host |
In |
Operation op(A) that is non- or conjugate-transpose. |
n |
Host |
In |
Number of rows of op(A) and C. |
k |
Host |
In |
Number of columns of op(A). |
alpha |
Host |
In |
Real scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
beta |
Host |
In |
Real scalar used for multiplication. |
c |
Device |
In |
Pointer to the first entry of the local portion of the global Hermitian matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. |
workspaceSizeInBytesOnDevice |
Host |
Out |
On output, contains the size in bytes of the local device workspace needed by cublasMpHerk(). |
workspaceSizeInBytesOnHost |
Host |
Out |
On output, contains the size in bytes of the local host workspace needed by cublasMpHerk(). |
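For reference, the cublasMpHerk operation, \(C = \alpha\,\text{op}(A)\text{op}(A)^{H} + \beta C\) with real alpha and beta, can be sketched on a single process in plain C. The sketch assumes column-major storage, trans == CUBLAS_OP_N and a lower fill mode; the zcomplex struct and the name ref_herk_lower are hypothetical and not part of cuBLASMp. Forcing the imaginary part of the diagonal to zero follows the reference BLAS convention and is an assumption here, not something this page states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one double-precision complex element. */
typedef struct { double re, im; } zcomplex;

/* Hypothetical host-side reference, not a cuBLASMp API.
 * Updates the lower triangle of C = alpha * A * A^H + beta * C for
 * column-major complex A (n x k) and Hermitian C (n x n). */
static void ref_herk_lower(size_t n, size_t k, double alpha,
                           const zcomplex *a, size_t lda,
                           double beta, zcomplex *c, size_t ldc)
{
    assert(lda >= n && ldc >= n);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = j; i < n; ++i) {      /* lower triangle only: i >= j */
            double re = 0.0, im = 0.0;
            for (size_t p = 0; p < k; ++p) {
                const zcomplex x = a[i + p * lda];
                const zcomplex y = a[j + p * lda];
                re += x.re * y.re + x.im * y.im;   /* Re(x * conj(y)) */
                im += x.im * y.re - x.re * y.im;   /* Im(x * conj(y)) */
            }
            zcomplex *cij = &c[i + j * ldc];
            cij->re = alpha * re + beta * cij->re;
            /* Diagonal of a Hermitian matrix is real (assumed convention). */
            cij->im = (i == j) ? 0.0 : alpha * im + beta * cij->im;
        }
    }
}
```

Because alpha and beta are real, the update preserves the Hermitian structure of C, which is why the scale type in the table above is a real type while A and C are complex.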
cublasMpHer2k#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpHer2k(
cublasMpHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int64_t n,
int64_t k,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
void* d_work,
size_t workspaceSizeInBytesOnDevice,
void* h_work,
size_t workspaceSizeInBytesOnHost);
\(C = \alpha\text{op}(A)\text{op}(B)^{H} + \overline{\alpha}\text{op}(B)\text{op}(A)^{H} + \beta C\)
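The update above can be sketched on a single process in plain C. The sketch assumes column-major storage, trans == CUBLAS_OP_N and a lower fill mode; the zcomplex struct, its helper functions and the name ref_her2k_lower are hypothetical and not part of cuBLASMp. Setting the imaginary part of the diagonal to zero follows the reference BLAS convention and is an assumption here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one double-precision complex element. */
typedef struct { double re, im; } zcomplex;

/* x * y */
static zcomplex zmul(zcomplex x, zcomplex y)
{
    zcomplex r = { x.re * y.re - x.im * y.im, x.re * y.im + x.im * y.re };
    return r;
}

/* x * conj(y) */
static zcomplex zmul_conj(zcomplex x, zcomplex y)
{
    zcomplex r = { x.re * y.re + x.im * y.im, x.im * y.re - x.re * y.im };
    return r;
}

/* Hypothetical host-side reference, not a cuBLASMp API. Updates the lower
 * triangle of C = alpha*A*B^H + conj(alpha)*B*A^H + beta*C for column-major
 * complex A, B (n x k) and Hermitian C (n x n); beta is real. */
static void ref_her2k_lower(size_t n, size_t k, zcomplex alpha,
                            const zcomplex *a, size_t lda,
                            const zcomplex *b, size_t ldb,
                            double beta, zcomplex *c, size_t ldc)
{
    assert(lda >= n && ldb >= n && ldc >= n);
    const zcomplex alpha_conj = { alpha.re, -alpha.im };
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = j; i < n; ++i) {      /* lower triangle only: i >= j */
            zcomplex ab = { 0.0, 0.0 };       /* (A * B^H)(i,j) */
            zcomplex ba = { 0.0, 0.0 };       /* (B * A^H)(i,j) */
            for (size_t p = 0; p < k; ++p) {
                const zcomplex t1 = zmul_conj(a[i + p * lda], b[j + p * ldb]);
                const zcomplex t2 = zmul_conj(b[i + p * ldb], a[j + p * lda]);
                ab.re += t1.re; ab.im += t1.im;
                ba.re += t2.re; ba.im += t2.im;
            }
            const zcomplex u = zmul(alpha, ab);
            const zcomplex v = zmul(alpha_conj, ba);
            zcomplex *cij = &c[i + j * ldc];
            cij->re = u.re + v.re + beta * cij->re;
            /* Diagonal of a Hermitian matrix is real (assumed convention). */
            cij->im = (i == j) ? 0.0 : u.im + v.im + beta * cij->im;
        }
    }
}
```

The conjugated alpha on the second product is what keeps the sum Hermitian when alpha is complex, which is why alpha is a complex scalar and beta a real one in this routine.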
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
uplo |
Host |
In |
Indicates if matrix C lower or upper part is stored, the other Hermitian part is not referenced and is inferred from the stored elements. |
trans |
Host |
In |
Operation op(A) that is non- or conjugate-transpose. |
n |
Host |
In |
Number of rows of op(A), op(B), and C. |
k |
Host |
In |
Number of columns of op(A) and op(B). |
alpha |
Host |
In |
Complex scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
beta |
Host |
In |
Real scalar used for multiplication. |
c |
Device |
In/Out |
Pointer to the first entry of the local portion of the global Hermitian matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. |
d_work |
Device |
Out |
Device workspace of size workspaceSizeInBytesOnDevice. |
workspaceSizeInBytesOnDevice |
Host |
In |
The size in bytes of the local device workspace needed by the routine as provided by cublasMpHer2k_bufferSize(). |
h_work |
Host |
Out |
Host workspace of size workspaceSizeInBytesOnHost. |
workspaceSizeInBytesOnHost |
Host |
In |
The size in bytes of the local host workspace needed by the routine as provided by cublasMpHer2k_bufferSize(). |
| Compute Type | Scale Type (alpha and beta) | Atype/Btype | Ctype |
|---|---|---|---|
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_C_32F (alpha), CUDA_R_32F (beta) | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F, CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_C_64F (alpha), CUDA_R_64F (beta) | CUDA_C_64F | CUDA_C_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_C_64F (alpha), CUDA_R_64F (beta) | CUDA_C_64F | CUDA_C_64F |
The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
cublasMpHer2k_bufferSize#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpHer2k_bufferSize(
cublasMpHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int64_t n,
int64_t k,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
size_t* workspaceSizeInBytesOnDevice,
size_t* workspaceSizeInBytesOnHost);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
uplo |
Host |
In |
Indicates whether the lower or upper part of matrix C is stored. |
trans |
Host |
In |
Operation op(A) that is non- or conjugate-transpose. |
n |
Host |
In |
Number of rows of op(A), op(B), and C. |
k |
Host |
In |
Number of columns of op(A) and op(B). |
alpha |
Host |
In |
Complex scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
beta |
Host |
In |
Real scalar used for multiplication. |
c |
Device |
In |
Pointer to the first entry of the local portion of the global Hermitian matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. |
workspaceSizeInBytesOnDevice |
Host |
Out |
On output, contains the size in bytes of the local device workspace needed by cublasMpHer2k(). |
workspaceSizeInBytesOnHost |
Host |
Out |
On output, contains the size in bytes of the local host workspace needed by cublasMpHer2k(). |
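cublasMpHer2k_bufferSize() takes the same problem description as cublasMpHer2k(). As a single-process point of comparison, the sketch below assumes the routine follows the standard BLAS her2k update, \(C = \alpha\text{op}(A)\text{op}(B)^{H} + \bar{\alpha}\text{op}(B)\text{op}(A)^{H} + \beta C\), and covers only trans = CUBLAS_OP_N with uplo = CUBLAS_FILL_MODE_LOWER. her2k_lower_ref is a hypothetical helper in plain C on dense column-major arrays; it is not part of cuBLASMp and ignores process grids, descriptors, and workspaces.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical single-process reference, not a cuBLASMp function.
 * Assumes the standard BLAS her2k update with trans = N:
 *   C = alpha*A*B^H + conj(alpha)*B*A^H + beta*C
 * A and B are n x k, C is n x n, all column-major.
 * Only the lower triangle of C (i >= j) is read and written. */
static void her2k_lower_ref(size_t n, size_t k, double complex alpha,
                            const double complex *a, size_t lda,
                            const double complex *b, size_t ldb,
                            double beta, double complex *c, size_t ldc)
{
    assert(lda >= n && ldb >= n && ldc >= n);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = j; i < n; ++i) {
            double complex acc = 0.0;
            for (size_t p = 0; p < k; ++p) {
                acc += alpha * a[i + p * lda] * conj(b[j + p * ldb]);
                acc += conj(alpha) * b[i + p * ldb] * conj(a[j + p * lda]);
            }
            c[i + j * ldc] = acc + beta * c[i + j * ldc];
        }
    }
}
```

The two accumulated products are conjugate transposes of each other, so with a real beta and a real input diagonal the diagonal of C stays real. This is consistent with the scale types listed for cublasMpHer2k(): a complex alpha and a real beta.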
cublasMpHerkx#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpHerkx(
cublasMpHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int64_t n,
int64_t k,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
void* d_work,
size_t workspaceSizeInBytesOnDevice,
void* h_work,
size_t workspaceSizeInBytesOnHost);
\(C = \alpha\text{op}(A)\text{op}(B)^{H} + \beta C\)
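The update above can be checked on a single process with a plain C reference. The sketch below covers only trans = CUBLAS_OP_N and uplo = CUBLAS_FILL_MODE_LOWER on dense column-major arrays. herkx_lower_ref is a hypothetical helper, not a cuBLASMp function, and it ignores process grids, descriptors, and workspaces.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical single-process reference, not a cuBLASMp function.
 * Computes C = alpha * A * B^H + beta * C for trans = N.
 * A and B are n x k, C is n x n, all column-major.
 * Only the lower triangle of C (i >= j) is read and written. */
static void herkx_lower_ref(size_t n, size_t k, double complex alpha,
                            const double complex *a, size_t lda,
                            const double complex *b, size_t ldb,
                            double beta, double complex *c, size_t ldc)
{
    assert(lda >= n && ldb >= n && ldc >= n);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = j; i < n; ++i) {
            double complex acc = 0.0;
            /* Row i of A times the conjugate of row j of B. */
            for (size_t p = 0; p < k; ++p)
                acc += a[i + p * lda] * conj(b[j + p * ldb]);
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}
```

Comparing a small gathered result against such a reference is one way to sanity-check descriptor and offset arguments. Entries above the diagonal are left untouched, in line with the uplo parameter selecting which part of C is stored.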
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
uplo |
Host |
In |
Indicates whether the lower or upper part of matrix C is stored; the other Hermitian part is not referenced and is inferred from the stored elements. |
trans |
Host |
In |
Operation op(A) that is non- or conjugate-transpose. |
n |
Host |
In |
Number of rows of op(A), op(B), and C. |
k |
Host |
In |
Number of columns of op(A) and op(B). |
alpha |
Host |
In |
Complex scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
beta |
Host |
In |
Real scalar used for multiplication. |
c |
Device |
In/Out |
Pointer to the first entry of the local portion of the global Hermitian matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. |
d_work |
Device |
Out |
Device workspace of size workspaceSizeInBytesOnDevice. |
workspaceSizeInBytesOnDevice |
Host |
In |
The size in bytes of the local device workspace needed by the routine as provided by cublasMpHerkx_bufferSize(). |
h_work |
Host |
Out |
Host workspace of size workspaceSizeInBytesOnHost. |
workspaceSizeInBytesOnHost |
Host |
In |
The size in bytes of the local host workspace needed by the routine as provided by cublasMpHerkx_bufferSize(). |
| Compute Type | Scale Type (alpha and beta) | Atype/Btype | Ctype |
|---|---|---|---|
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_C_32F (alpha), CUDA_R_32F (beta) | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F, CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_C_64F (alpha), CUDA_R_64F (beta) | CUDA_C_64F | CUDA_C_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_C_64F (alpha), CUDA_R_64F (beta) | CUDA_C_64F | CUDA_C_64F |
The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
cublasMpHerkx_bufferSize#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpHerkx_bufferSize(
cublasMpHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int64_t n,
int64_t k,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
size_t* workspaceSizeInBytesOnDevice,
size_t* workspaceSizeInBytesOnHost);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
uplo |
Host |
In |
Indicates whether the lower or upper part of matrix C is stored. |
trans |
Host |
In |
Operation op(A) that is non- or conjugate-transpose. |
n |
Host |
In |
Number of rows of op(A), op(B), and C. |
k |
Host |
In |
Number of columns of op(A) and op(B). |
alpha |
Host |
In |
Complex scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
beta |
Host |
In |
Real scalar used for multiplication. |
c |
Device |
In |
Pointer to the first entry of the local portion of the global Hermitian matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. |
workspaceSizeInBytesOnDevice |
Host |
Out |
On output, contains the size in bytes of the local device workspace needed by cublasMpHerkx(). |
workspaceSizeInBytesOnHost |
Host |
Out |
On output, contains the size in bytes of the local host workspace needed by cublasMpHerkx(). |
cublasMpHemm#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpHemm(
cublasMpHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
int64_t m,
int64_t n,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
void* d_work,
size_t workspaceSizeInBytesOnDevice,
void* h_work,
size_t workspaceSizeInBytesOnHost);
\(C = \alpha A B + \beta C \quad \text{(side = LEFT)}\)
\(C = \alpha B A + \beta C \quad \text{(side = RIGHT)}\)
where A is a Hermitian matrix of size \(m \times m\) when side == LEFT and \(n \times n\) when side == RIGHT.
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
side |
Host |
In |
Indicates if the Hermitian matrix A appears on the left or right of B. |
uplo |
Host |
In |
Indicates whether the lower or upper part of matrix A is stored; the other Hermitian part is not referenced and is inferred from the stored elements. |
m |
Host |
In |
Number of rows of B and C. |
n |
Host |
In |
Number of columns of B and C. |
alpha |
Host |
In |
Complex scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global Hermitian matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
beta |
Host |
In |
Complex scalar used for multiplication. |
c |
Device |
In/Out |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. |
d_work |
Device |
Out |
Device workspace of size workspaceSizeInBytesOnDevice. |
workspaceSizeInBytesOnDevice |
Host |
In |
The size in bytes of the local device workspace needed by the routine as provided by cublasMpHemm_bufferSize(). |
h_work |
Host |
Out |
Host workspace of size workspaceSizeInBytesOnHost. |
workspaceSizeInBytesOnHost |
Host |
In |
The size in bytes of the local host workspace needed by the routine as provided by cublasMpHemm_bufferSize(). |
| Compute Type | Scale Type (alpha and beta) | Atype/Btype | Ctype |
|---|---|---|---|
| CUBLAS_COMPUTE_32F, CUBLAS_COMPUTE_32F_PEDANTIC, CUBLAS_COMPUTE_32F_FAST_16F, CUBLAS_COMPUTE_32F_FAST_16BF, CUBLAS_COMPUTE_32F_FAST_TF32 | CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUBLAS_COMPUTE_64F, CUBLAS_COMPUTE_64F_PEDANTIC | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
| CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT (CTK 13.0.2+) | CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
The computeType parameter provided to this function is used only for internal matrix-matrix multiplications.
cublasMpHemm_bufferSize#
Added in cuBLASMp 0.9.0
cublasMpStatus_t cublasMpHemm_bufferSize(
cublasMpHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
int64_t m,
int64_t n,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* b,
int64_t ib,
int64_t jb,
cublasMpMatrixDescriptor_t descB,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
cublasComputeType_t computeType,
size_t* workspaceSizeInBytesOnDevice,
size_t* workspaceSizeInBytesOnHost);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
side |
Host |
In |
Indicates if the Hermitian matrix A appears on the left or right of B. |
uplo |
Host |
In |
Indicates whether the lower or upper part of matrix A is stored. |
m |
Host |
In |
Number of rows of B and C. |
n |
Host |
In |
Number of columns of B and C. |
alpha |
Host |
In |
Complex scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global Hermitian matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
b |
Device |
In |
Pointer to the first entry of the local portion of the global matrix B. |
ib |
Host |
In |
Row index of the first row of the sub(B). |
jb |
Host |
In |
Column index of the first column of the sub(B). |
descB |
Host |
In |
Matrix descriptor associated to the global matrix B. |
beta |
Host |
In |
Complex scalar used for multiplication. |
c |
Device |
In |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
computeType |
Host |
In |
cuBLAS compute type used for computations. |
workspaceSizeInBytesOnDevice |
Host |
Out |
On output, contains the size in bytes of the local device workspace needed by cublasMpHemm(). |
workspaceSizeInBytesOnHost |
Host |
Out |
On output, contains the size in bytes of the local host workspace needed by cublasMpHemm(). |
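For reference, the operation that cublasMpHemm() performs, and for which cublasMpHemm_bufferSize() sizes the workspaces, can be written on a single process in plain C. The sketch below covers only side == LEFT with the lower part of A stored, and reads the upper part of A by conjugating the stored entries. herm_lower_at and hemm_left_lower_ref are hypothetical helpers on dense column-major arrays; they are not cuBLASMp functions and ignore process grids, descriptors, and workspaces.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical helper: element (i, j) of a Hermitian matrix whose
 * lower triangle is stored column-major. Entries with i < j are
 * inferred as the conjugate of the stored element (j, i). */
static double complex herm_lower_at(const double complex *a, size_t lda,
                                    size_t i, size_t j)
{
    return (i >= j) ? a[i + j * lda] : conj(a[j + i * lda]);
}

/* Hypothetical single-process reference, not a cuBLASMp function.
 * Computes C = alpha * A * B + beta * C for side = LEFT, where A is
 * m x m Hermitian (lower part stored) and B, C are m x n.
 * B and C must not overlap. */
static void hemm_left_lower_ref(size_t m, size_t n, double complex alpha,
                                const double complex *a, size_t lda,
                                const double complex *b, size_t ldb,
                                double complex beta,
                                double complex *c, size_t ldc)
{
    assert(lda >= m && ldb >= m && ldc >= m);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            double complex acc = 0.0;
            for (size_t p = 0; p < m; ++p)
                acc += herm_lower_at(a, lda, i, p) * b[p + j * ldb];
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}
```

The strictly upper entries of the array holding A are never read, so they may contain arbitrary values. Unlike the Hermitian rank-k routines, beta is complex here because C is a general matrix.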
cublasMpGeadd#
cublasMpStatus_t cublasMpGeadd(
cublasMpHandle_t handle,
cublasOperation_t trans,
int64_t m,
int64_t n,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
void* d_work,
size_t workspaceSizeInBytesOnDevice,
void* h_work,
size_t workspaceSizeInBytesOnHost);
\(C = \alpha\text{op}(A) + \beta C\)
\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.\)
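A single-process reference for this update, restricted to real double-precision data, can be sketched in plain C as follows. It assumes the PBLAS convention that C is m x n and that A is n x m when op(A) is a transpose. geadd_ref is a hypothetical helper on dense column-major arrays, not a cuBLASMp function, and it ignores process grids, descriptors, and workspaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-process reference, not a cuBLASMp function.
 * Computes C = alpha * op(A) + beta * C for real data, where C is
 * m x n and op(A) is A (m x n) or A^T (A is n x m), column-major. */
static void geadd_ref(bool transpose, size_t m, size_t n, double alpha,
                      const double *a, size_t lda,
                      double beta, double *c, size_t ldc)
{
    assert(ldc >= m);
    assert(lda >= (transpose ? n : m));
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            /* op(A)(i, j) is A(j, i) when transposing. */
            double aij = transpose ? a[j + i * lda] : a[i + j * lda];
            c[i + j * ldc] = alpha * aij + beta * c[i + j * ldc];
        }
    }
}
```

With alpha = 1 and beta = 0 this reduces to an out-of-place copy or transpose of A into C, which is a common use of the routine.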
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
trans |
Host |
In |
Operation op(A) that is non- or (conj.) transpose. |
m |
Host |
In |
Number of rows of sub(A) and sub(C). |
n |
Host |
In |
Number of columns of sub(A) and sub(C). |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
beta |
Host |
In |
<type> scalar used for multiplication. |
c |
Device |
In/Out |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
d_work |
Device |
Out |
Device workspace of size workspaceSizeInBytesOnDevice. |
workspaceSizeInBytesOnDevice |
Host |
In |
The size in bytes of the local device workspace needed by the routine as provided by cublasMpGeadd_bufferSize(). |
h_work |
Host |
Out |
Host workspace of size workspaceSizeInBytesOnHost. |
workspaceSizeInBytesOnHost |
Host |
In |
The size in bytes of the local host workspace needed by the routine as provided by cublasMpGeadd_bufferSize(). |
| Data Type of A | computeType | Output Data Type |
|---|---|---|
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
cublasMpGeadd_bufferSize#
cublasMpStatus_t cublasMpGeadd_bufferSize(
cublasMpHandle_t handle,
cublasOperation_t trans,
int64_t m,
int64_t n,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
size_t* workspaceSizeInBytesOnDevice,
size_t* workspaceSizeInBytesOnHost);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
trans |
Host |
In |
Operation op(A) that is non- or (conj.) transpose. |
m |
Host |
In |
Number of rows of sub(A) and sub(C). |
n |
Host |
In |
Number of columns of sub(A) and sub(C). |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
beta |
Host |
In |
<type> scalar used for multiplication. |
c |
Device |
In |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
workspaceSizeInBytesOnDevice |
Host |
Out |
On output, contains the size in bytes of the local device workspace needed by cublasMpGeadd(). |
workspaceSizeInBytesOnHost |
Host |
Out |
On output, contains the size in bytes of the local host workspace needed by cublasMpGeadd(). |
| Data Type of A | computeType | Output Data Type |
|---|---|---|
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
cublasMpTradd#
cublasMpStatus_t cublasMpTradd(
cublasMpHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int64_t m,
int64_t n,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
void* d_work,
size_t workspaceSizeInBytesOnDevice,
void* h_work,
size_t workspaceSizeInBytesOnHost);
\(C = \alpha\text{op}(A) + \beta C\)
\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\ A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\ A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\ \end{matrix} \right.\)
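The difference from cublasMpGeadd is that only the part of C selected by uplo takes part in the update. The plain C sketch below illustrates this for real double-precision data with trans = CUBLAS_OP_N, assuming (as in the PBLAS trapezoidal add) that entries outside the selected part are left unchanged. tradd_ref is a hypothetical helper on dense column-major arrays, not a cuBLASMp function, and it ignores process grids, descriptors, and workspaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-process reference, not a cuBLASMp function.
 * Computes C = alpha * A + beta * C on the lower (i >= j) or upper
 * (i <= j) part of the m x n column-major matrices A and C.
 * Entries outside the selected part are neither read nor written. */
static void tradd_ref(bool lower, size_t m, size_t n, double alpha,
                      const double *a, size_t lda,
                      double beta, double *c, size_t ldc)
{
    assert(lda >= m && ldc >= m);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            bool selected = lower ? (i >= j) : (i <= j);
            if (!selected)
                continue;
            c[i + j * ldc] = alpha * a[i + j * lda] + beta * c[i + j * ldc];
        }
    }
}
```

For rectangular inputs the selected region is trapezoidal rather than triangular; the same index test covers both cases.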
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
uplo |
Host |
In |
Indicates whether the lower or upper part of matrix C is stored; the other part is not referenced and is inferred from the stored elements. |
trans |
Host |
In |
Operation op(A) that is non- or (conj.) transpose. |
m |
Host |
In |
Number of rows of sub(A) and sub(C). |
n |
Host |
In |
Number of columns of sub(A) and sub(C). |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
beta |
Host |
In |
<type> scalar used for multiplication. |
c |
Device |
In/Out |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
d_work |
Device |
Out |
Device workspace of size workspaceSizeInBytesOnDevice. |
workspaceSizeInBytesOnDevice |
Host |
In |
The size in bytes of the local device workspace needed by the routine as provided by cublasMpTradd_bufferSize(). |
h_work |
Host |
Out |
Host workspace of size workspaceSizeInBytesOnHost. |
workspaceSizeInBytesOnHost |
Host |
In |
The size in bytes of the local host workspace needed by the routine as provided by cublasMpTradd_bufferSize(). |
| Data Type of A | computeType | Output Data Type |
|---|---|---|
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |
cublasMpTradd_bufferSize#
cublasMpStatus_t cublasMpTradd_bufferSize(
cublasMpHandle_t handle,
cublasFillMode_t uplo,
cublasOperation_t trans,
int64_t m,
int64_t n,
const void* alpha,
const void* a,
int64_t ia,
int64_t ja,
cublasMpMatrixDescriptor_t descA,
const void* beta,
void* c,
int64_t ic,
int64_t jc,
cublasMpMatrixDescriptor_t descC,
size_t* workspaceSizeInBytesOnDevice,
size_t* workspaceSizeInBytesOnHost);
Parameter |
Memory |
In/Out |
Description |
|---|---|---|---|
handle |
Host |
In |
cuBLASMp library handle. |
uplo |
Host |
In |
Indicates whether the lower or upper part of matrix C is stored; the other part is not referenced and is inferred from the stored elements. |
trans |
Host |
In |
Operation op(A) that is non- or (conj.) transpose. |
m |
Host |
In |
Number of rows of sub(A) and sub(C). |
n |
Host |
In |
Number of columns of sub(A) and sub(C). |
alpha |
Host |
In |
<type> scalar used for multiplication. |
a |
Device |
In |
Pointer to the first entry of the local portion of the global matrix A. |
ia |
Host |
In |
Row index of the first row of the sub(A). |
ja |
Host |
In |
Column index of the first column of the sub(A). |
descA |
Host |
In |
Matrix descriptor associated to the global matrix A. |
beta |
Host |
In |
<type> scalar used for multiplication. |
c |
Device |
In |
Pointer to the first entry of the local portion of the global matrix C. |
ic |
Host |
In |
Row index of the first row of the sub(C). |
jc |
Host |
In |
Column index of the first column of the sub(C). |
descC |
Host |
In |
Matrix descriptor associated to the global matrix C. |
workspaceSizeInBytesOnDevice |
Host |
Out |
On output, contains the size in bytes of the local device workspace needed by cublasMpTradd(). |
workspaceSizeInBytesOnHost |
Host |
Out |
On output, contains the size in bytes of the local host workspace needed by cublasMpTradd(). |
| Data Type of A | computeType | Output Data Type |
|---|---|---|
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_64F | CUDA_R_64F | CUDA_R_64F |
| CUDA_C_32F | CUDA_C_32F | CUDA_C_32F |
| CUDA_C_64F | CUDA_C_64F | CUDA_C_64F |