cuSOLVERMp C API#

Library Management#

cusolverMpCreate#

cusolverStatus_t cusolverMpCreate(
        cusolverMpHandle_t *handle,
        int device,
        cudaStream_t stream)
This function initializes the cuSOLVERMp library handle (cusolverMpHandle_t) which holds the cuSOLVERMp library context. It allocates light hardware resources on the host, and must be called prior to making any other cuSOLVERMp library calls.
Calling any cuSOLVERMp function which uses cusolverMpHandle_t without a previous call of cusolverMpCreate() will return an error.
The cuSOLVERMp library context is tied to the CUDA device provided by device and the CUDA stream stream.
The stream may be either the CUDA default stream or a user-created stream.
Only one handle per process and per GPU supported. Sharing a device with multiple processes will result in undefined behavior.

Parameter

Memory

In/Out

Description

handle

Host

Out

cuSOLVERMp library handle.

device

Host

In

Device that will be assigned to the handle.

stream

Host

In

Stream that will be assigned to the handle.

See cusolverStatus_t for the description of the return status.

cusolverMpDestroy#

cusolverStatus_t cusolverMpDestroy(
        cusolverMpHandle_t handle)
This function destroys the cuSOLVERMp library handle (cusolverMpHandle_t) which holds the cuSOLVERMp library context.
The cuSOLVERMp library context is tied to the CUDA device provided by device. Only one handle per process and per GPU supported.

Parameter

Memory

In/Out

Description

handle

Host

In/Out

cuSOLVERMp library handle.

See cusolverStatus_t for the description of the return status.

cusolverMpSetStream#

cusolverStatus_t cusolverMpSetStream(
        cusolverMpHandle_t handle,
        cudaStream_t stream)
This function updates the stream associated to the handle.
The stream may be either the CUDA default stream or a user-created stream.
The new stream must belong to the same CUDA context as the handle.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

stream

Host

In

New stream associated with the handle.

See cusolverStatus_t for the description of the return status.

cusolverMpGetStream#

cusolverStatus_t cusolverMpGetStream(
        cusolverMpHandle_t handle,
        cudaStream_t *stream)
This function returns the stream associated to the handle.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

stream

Host

Out

Stream associated with the handle.

See cusolverStatus_t for the description of the return status.

cusolverMpGetVersion#

cusolverStatus_t cusolverMpGetVersion(
        cusolverMpHandle_t handle,
        int *version)
This function returns the version number of the cuSOLVERMp library.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

version

Host

Out

cuSOLVERMp library version. Value is CUSOLVERMP_VER_MAJOR * 1000 + CUSOLVERMP_VER_MINOR * 100 + CUSOLVERMP_VER_PATCH.

See cusolverStatus_t for the description of the return status.

cusolverMpSetMathMode#

cusolverStatus_t cusolverMpSetMathMode(
        cusolverMpHandle_t handle,
        cusolverMathMode_t mode)
This function sets the math mode for the cuSOLVERMp library handle. It will be propagated to the internal cuBLAS and cuSOLVER handles. The math mode controls the behavior of mathematical operations performed by the library.

Note

Please note that the workspace sizes returned by *_bufferSize APIs may depend on the math mode.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

mode

Host

In

Math mode to be set for the handle. Available options are CUSOLVER_DEFAULT_MATH and CUSOLVER_FP32_EMULATED_BF16X9_MATH. See cusolverMathMode_t for details.

See cusolverStatus_t for the description of the return status.

cusolverMpGetMathMode#

cusolverStatus_t cusolverMpGetMathMode(
        cusolverMpHandle_t handle,
        cusolverMathMode_t *mode)
This function retrieves the current math mode from the cuSOLVERMp library handle.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

mode

Host

Out

Current math mode of the handle.

See cusolverStatus_t for the description of the return status.

cusolverMpSetEmulationStrategy#

cusolverStatus_t cusolverMpSetEmulationStrategy(
        cusolverMpHandle_t handle,
        cudaEmulationStrategy_t strategy)
This function sets the emulation strategy for the cuSOLVERMp library handle. It will be propagated to the internal cuBLAS and cuSOLVER handles.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

strategy

Host

In

Emulation strategy to be set for the handle. Available options are CUDA_EMULATION_STRATEGY_DEFAULT, CUDA_EMULATION_STRATEGY_PERFORMANT, and CUDA_EMULATION_STRATEGY_EAGER. For more information about the effects of the corresponding strategies, please refer to the analogous definition of cublasEmulationStrategy_t.

The emulation strategy only has an effect if the math mode is set to CUSOLVER_FP32_EMULATED_BF16X9_MATH.
See cusolverStatus_t for the description of the return status.

cusolverMpGetEmulationStrategy#

cusolverStatus_t cusolverMpGetEmulationStrategy(
        cusolverMpHandle_t handle,
        cudaEmulationStrategy_t *strategy)
This function retrieves the current emulation strategy from the cuSOLVERMp library handle.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

strategy

Host

Out

Current emulation strategy of the handle.

See cusolverStatus_t for the description of the return status.

Grid Management#

cusolverMpCreateDeviceGrid#

cusolverStatus_t cusolverMpCreateDeviceGrid(
        cusolverMpHandle_t handle,
        cusolverMpGrid_t *grid,
        ncclComm_t comm,
        int32_t numRowDevices,
        int32_t numColDevices,
        cusolverMpGridMapping_t mapping)
This function initializes the grid opaque data structure. It maps the given resources (communicator, grid dimensions and grid layout) to a grid object.
All the processes defined to be in this grid must enter this function.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

grid

Host

Out

Grid object to be initialized.

comm

Host

In

Communicator that will be associated with the grid.

numRowDevices

Host

In

How many process rows the grid will contain.

numColDevices

Host

In

How many process columns the grid will contain.

mapping

Host

In

How to map processes to the grid. See description of cusolverMpGrid_t for further details.

See cusolverStatus_t for the description of the return status.

cusolverMpDestroyGrid#

cusolverStatus_t cusolverMpDestroyGrid(
        cusolverMpGrid_t grid)
This function destroys the given grid object.
All the processes defined to be in this grid must enter this function.

Parameter

Memory

In/Out

Description

grid

Host

In/Out

Grid object to be destroyed.

See cusolverStatus_t for the description of the return status.

Memory Management#

cusolverMpBufferRegister#

cusolverStatus_t cusolverMpBufferRegister(
        cusolverMpGrid_t grid,
        void *ptr,
        size_t size)
Registers a device buffer on the grid’s global communicator using ncclCommWindowRegister with NCCL_WIN_COLL_SYMMETRIC. Pre-registering workspace buffers can improve the performance of NCCL collective operations by enabling hardware-accelerated communication paths.

The buffer must be compatible with NCCL symmetric memory registration, for example when allocated with ncclMemAlloc. Registration is idempotent: re-registering the same pointer with the same size is a no-op.

This is a collective operation: all ranks in the grid must call this function together with the same ptr and size values. Re-registering the same pointer with a different size is invalid.

Currently, buffer registration is expected to yield performance benefits when the registered buffer is used as the device workspace for cusolverMpNewtonSchulz().

If NCCL symmetric memory is not supported by the platform, or if ptr is not compatible with NCCL symmetric memory registration, this function returns CUSOLVER_STATUS_NOT_SUPPORTED.

Parameter

Memory

In/Out

Description

grid

Host

In

Grid object (provides the NCCL communicator).

ptr

Device

In

Device buffer compatible with NCCL symmetric memory registration.

size

Host

In

Buffer size in bytes.

See cusolverStatus_t for the description of the return status.

cusolverMpBufferDeregister#

cusolverStatus_t cusolverMpBufferDeregister(
        cusolverMpGrid_t grid,
        void *ptr)
Deregisters a device buffer previously registered with cusolverMpBufferRegister().

This is a collective operation: all ranks in the grid must call this function together with the same ptr value. Should be called before freeing the underlying allocation; otherwise behavior is undefined.

Deregistering a pointer that was not previously registered returns CUSOLVER_STATUS_INVALID_VALUE.

Parameter

Memory

In/Out

Description

grid

Host

In

Grid object (provides the NCCL communicator).

ptr

Device

In

Device buffer to deregister.

See cusolverStatus_t for the description of the return status.

Matrix Management#

cusolverMpCreateMatrixDesc#

cusolverStatus_t cusolverMpCreateMatrixDesc(
        cusolverMpMatrixDescriptor_t *desc,
        cusolverMpGrid_t grid,
        cudaDataType dataType,
        int64_t M_A,
        int64_t N_A,
        int64_t MB_A,
        int64_t NB_A,
        uint32_t RSRC_A,
        uint32_t CSRC_A,
        int64_t LLD_A)
This function initializes cusolverMpMatrixDescriptor_t object.

Parameter

Memory

In/Out

Description

desc

Host

Out

Matrix descriptor object initialized by this function.

grid

Host

In

Grid object associated with the global matrix A.

dataType

Host

In

Data type of the matrix A.

M_A

Host

In

Number of rows in the global matrix A.

N_A

Host

In

Number of columns in the global matrix A.

MB_A

Host

In

Blocking factor used to distribute the rows of the global matrix A.

NB_A

Host

In

Blocking factor used to distribute the columns of the global matrix A.

RSRC_A

Host

In

Process row over which the first row of the matrix A is distributed. Only the value of 0 is currently supported.

CSRC_A

Host

In

Process column over which the first column of the matrix A is distributed. Only the value of 0 is currently supported.

LLD_A

Host

In

Leading dimension of the local matrix.

Supported values for dataType argument are listed below:

Data Type of A

Description

CUDA_R_16F

Half precision real values.

CUDA_R_16BF

bfloat16 real values.

CUDA_R_32I

32-bit integer values.

CUDA_R_64I

64-bit integer values.

CUDA_R_32F

Single precision real values.

CUDA_R_64F

Double precision real values.

CUDA_C_32F

Single precision complex values.

CUDA_C_64F

Double precision complex values.

See cusolverStatus_t for the description of the return status.

cusolverMpDestroyMatrixDesc#

cusolverStatus_t cusolverMpDestroyMatrixDesc(
        cusolverMpMatrixDescriptor_t desc)
This function destroys cusolverMpMatrixDescriptor_t object.

Parameter

Memory

In/Out

Description

desc

Host

In/Out

Matrix descriptor object destroyed by this function.

See cusolverStatus_t for the description of the return status.

Newton-Schulz Properties#

cusolverMpNewtonSchulzDescriptorCreate#

cusolverStatus_t cusolverMpNewtonSchulzDescriptorCreate(
        cusolverMpNewtonSchulzDescriptor_t *nsDesc)
Creates a Newton-Schulz descriptor that controls the behavior of cusolverMpNewtonSchulz().
The descriptor is initialized with default values (normalization enabled, reduction via compute type disabled).

Parameter

Memory

In/Out

Description

nsDesc

Host

Out

cusolverMpNewtonSchulzDescriptor_t descriptor to be created.

See cusolverStatus_t for the description of the return status.

cusolverMpNewtonSchulzDescriptorDestroy#

cusolverStatus_t cusolverMpNewtonSchulzDescriptorDestroy(
        cusolverMpNewtonSchulzDescriptor_t nsDesc)
Destroys a Newton-Schulz descriptor previously created with cusolverMpNewtonSchulzDescriptorCreate().

Parameter

Memory

In/Out

Description

nsDesc

Host

In/Out

Newton-Schulz descriptor to be destroyed.

See cusolverStatus_t for the description of the return status.

cusolverMpNewtonSchulzDescriptorSetAttribute#

cusolverStatus_t cusolverMpNewtonSchulzDescriptorSetAttribute(
        cusolverMpNewtonSchulzDescriptor_t nsDesc,
        cusolverMpNewtonSchulzDescriptorAttribute_t attr,
        const void *buf,
        size_t sizeInBytes)
The following attributes are supported:
  • CUSOLVERMP_NEWTON_SCHULZ_DESCRIPTOR_ATTRIBUTE_NORMALIZE (int, default 1): When set to 1, the input matrix is normalized by its Frobenius norm before the iterations begin. Normalization is required for convergence. Set to 0 only when the input is already normalized (e.g., to avoid redundant normalization in a pipeline that pre-normalizes the matrix).

  • CUSOLVERMP_NEWTON_SCHULZ_DESCRIPTOR_ATTRIBUTE_REDUCE_VIA_COMPUTE_TYPE (int, default 0): When set to 1, the distributed Gram-matrix reduction path may communicate/reduce intermediate X^T X data using the compute type when the value type differs from the compute type.

Parameter

Memory

In/Out

Description

nsDesc

Host

In/Out

Newton-Schulz descriptor.

attr

Host

In

cusolverMpNewtonSchulzDescriptorAttribute_t attribute to set.

buf

Host

In

Pointer to the attribute value.

sizeInBytes

Host

In

Size of the attribute value in bytes.

See cusolverStatus_t for the description of the return status.

cusolverMpNewtonSchulzDescriptorGetAttribute#

cusolverStatus_t cusolverMpNewtonSchulzDescriptorGetAttribute(
        cusolverMpNewtonSchulzDescriptor_t nsDesc,
        cusolverMpNewtonSchulzDescriptorAttribute_t attr,
        void *buf,
        size_t sizeInBytes,
        size_t *sizeInBytesWritten)
Gets an attribute from a Newton-Schulz descriptor.

Parameter

Memory

In/Out

Description

nsDesc

Host

In

Newton-Schulz descriptor.

attr

Host

In

cusolverMpNewtonSchulzDescriptorAttribute_t attribute to query.

buf

Host

Out

Buffer to receive the attribute value.

sizeInBytes

Host

In

Size of the output buffer in bytes.

sizeInBytesWritten

Host

Out

Number of bytes actually written to buf.

See cusolverStatus_t for the description of the return status.

Utility#

cusolverMpNUMROC#

int64_t cusolverMpNUMROC(
        int64_t n,
        int64_t nb,
        uint32_t iproc,
        uint32_t isrcproc,
        uint32_t nprocs)
Computes the number of rows or columns of a distributed matrix owned by the process indicated by iproc argument.

Parameter

Memory

In/Out

Description

n

Host

In

Number of rows or columns in the global distributed matrix.

nb

Host

In

Row or column blocking size of the global matrix.

iproc

Host

In

The coordinate of the process whose local array row or column is to be determined.

isrcproc

Host

In

The coordinate of the process that owns the first row or column of the distributed matrix.

nprocs

Host

In

The total number of row or column processes over which the matrix is distributed.

Returns the number of rows or columns of a distributed matrix owned by the process indicated by iproc argument.

cusolverMpMatrixGatherD2H#

cusolverStatus_t cusolverMpMatrixGatherD2H(
        cusolverMpHandle_t handle,
        int64_t M,
        int64_t N,
        const void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        int root,
        void *h_dst,
        int64_t h_lddst)
Gathers the global distributed matrix A on a buffer provided on process root. The input matrix A is originally distributed using 2D block-cyclic format, on output h_dst contains the matrix in column-major format.
Notice that, for this function, the input data is on the device and the output is stored on host memory.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

M

Host

In

Number of rows of the global distributed matrix A.

N

Host

In

Number of columns of the global distributed matrix A.

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). On entry, this array contains the local pieces of the M-by-N distributed matrix sub(A).

IA

Host

In

Row index in the global matrix A indicating the first row of sub(A). This function does not make any assumptions on the alignment of IA.

JA

Host

In

Column index in the global matrix A indicating the first column of sub(A). This function does not make any assumptions on the alignment of JA.

descA

Host

In

Matrix descriptor of the global matrix A.

root

Host

In

Process ID on which the matrix A will be gathered.

h_dst

Host

Out

Destination host buffer on root process. On output it contains the global matrix A stored in column-major format. Total size must be at least M*N words.

h_lddst

Host

In

Leading dimension of the h_dst on root process. Must be larger than M.

Return Status:
See cusolverStatus_t for the description of the return status.

Warning

This function is meant as a utility function to verify correctness of the data layouts and it is not intended to achieve high performance on large inputs.


cusolverMpMatrixScatterH2D#

cusolverStatus_t cusolverMpMatrixScatterH2D(
        cusolverMpHandle_t handle,
        int64_t M,
        int64_t N,
        void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        int root,
        const void *h_src,
        int64_t h_ldsrc)
Scatters the matrix stored in the local buffer h_src from root process to a distributed global matrix A.
The input matrix h_src is stored in column-major format. On output, d_A contains the local portions of the global matrix A distributed in 2D block-cyclic format.
Notice that, for this function, the input data is on the host and the output is stored on device memory.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

M

Host

In

Number of rows of the global distributed matrix A.

N

Host

In

Number of columns of the global distributed matrix A.

d_A

Device

Out

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). On output, this array contains the local pieces of the M-by-N distributed matrix sub(A).

IA

Host

In

Row index in the global matrix A indicating the first row of sub(A). This function does not make any assumptions on the alignment of IA.

JA

Host

In

Column index in the global matrix A indicating the first column of sub(A). This function does not make any assumptions on the alignment of JA.

descA

Host

In

Matrix descriptor of the global matrix A.

root

Host

In

Process ID which the matrix A will be scattered from.

h_src

Host

In

Source buffer on root process. On input it contains the global M by N matrix A stored in column-major format.

h_ldsrc

Host

In

Leading dimension of the h_src on root process. Must be larger than M.

See cusolverStatus_t for the description of the return status.

Warning

This function is meant as a utility function to verify correctness of the data layouts and it is not intended to achieve high performance on large inputs.


Logging#

cusolverMpLoggerSetCallback#

cusolverStatus_t cusolverMpLoggerSetCallback(
        cusolverMpLoggerCallback_t callback)
This function sets the logging callback function.

Parameter

Memory

In/Out

Description

callback

Host

In

Pointer to a callback function. See cusolverMpLoggerCallback_t.

See cusolverStatus_t for the description of the return status.

Warning

This is an experimental feature.


cusolverMpLoggerSetFile#

cusolverStatus_t cusolverMpLoggerSetFile(
        FILE *file)
This function sets the logging output file. Note: once registered using this function call, the provided file handle must not be closed unless the function is called again to switch to a different file handle.

Parameter

Memory

In/Out

Description

file

Host

In

Pointer to an open file. File should have write permission.

See cusolverStatus_t for the description of the return status.

Warning

This is an experimental feature.


cusolverMpLoggerOpenFile#

cusolverStatus_t cusolverMpLoggerOpenFile(
        const char* logFile)
This function opens a logging output file in the given path.

Parameter

Memory

In/Out

Description

logFile

Host

In

Path of the logging output file.

See cusolverStatus_t for the description of the return status.

Warning

This is an experimental feature.


cusolverMpLoggerSetLevel#

cusolverStatus_t cusolverMpLoggerSetLevel(
        int level)
This function sets the logging level for cuSOLVERMp library. The logging level controls the verbosity of the logging output.

Parameter

Memory

In/Out

Description

level

Host

In

Value of the logging level. See cuSOLVERMp Logging.

See cusolverStatus_t for the description of the return status.

Warning

This is an experimental feature.


cusolverMpLoggerSetMask#

cusolverStatus_t cusolverMpLoggerSetMask(
        int mask)
This function sets the value of the logging mask.

Parameter

Memory

In/Out

Description

mask

Host

In

Value of the logging mask. See cuSOLVERMp Logging.

See cusolverStatus_t for the description of the return status.

Warning

This is an experimental feature.


cusolverMpLoggerForceDisable#

cusolverStatus_t cusolverMpLoggerForceDisable()
This function disables logging for the entire run.
See cusolverStatus_t for the description of the return status.

Warning

This is an experimental feature.


Dense Linear Algebra APIs#

cusolverMpGetrf#

cusolverStatus_t cusolverMpGetrf(
        cusolverMpHandle_t handle,
        int64_t M,
        int64_t N,
        void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        int64_t  *d_ipiv,
        cudaDataType_t computeType,
        void *d_work,
        size_t workspaceInBytesOnDevice,
        void *h_work,
        size_t workspaceInBytesOnHost,
        int *info)
This routine computes an LU factorization of a general M-by-N distributed matrix sub(A) using partial pivoting. The user can also disable pivoting by setting d_ipiv=NULL.
The factorization has the form:
\[sub(A) = P \cdot L \cdot U\]
where \(P\) is a permutation matrix, \(L\) is lower triangular with unit diagonal elements (lower trapezoidal if \(m > n\)), and \(U\) is upper triangular (upper trapezoidal if \(m < n\)). \(L\) and \(U\) are stored in sub(A).
The user can combine cusolverMpGetrf() and cusolverMpGetrs() to solve a system of linear equations.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

M

Host

In

Number of rows of sub(A).

N

Host

In

Number of columns of sub(A).

d_A

Device

In/Out

Pointer to the first entry of the local portion of the global matrix A. On output, the sub(A) is overwritten with the L and U factors.

IA

Host

In

Row index of the first row of the sub(A). This function does not make any assumptions on the alignment of IA.

JA

Host

In

Column index of the first column of the sub(A). This function does not make any assumptions on the alignment of JA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_ipiv

Device

Out

Local array of dimension (LOCr(M_A)+MB_A). If the user set d_ipiv != NULL, on output, this array contains the pivoting information. d_ipiv[i] indicates the global row local row i was swapped with. This array is tied to the distributed matrix A.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cusolverMpGetrf_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cusolverMpGetrf_bufferSize().

info

Device

Out

info < 0 indicates an incorrect value of the i-th argument of the function. info > 0 indicates the index of the leading minor in the case of a singular matrix.

This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpGetrf_bufferSize#

cusolverStatus_t cusolverMpGetrf_bufferSize(
        cusolverMpHandle_t handle,
        int64_t M,
        int64_t N,
        void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        int64_t *d_ipiv,
        cudaDataType_t computeType,
        size_t *workspaceInBytesOnDevice,
        size_t *workspaceInBytesOnHost)
Computes the size in bytes of the host and device working buffers required by cusolverMpGetrf().
The user can set d_ipiv=NULL so cusolverMpGetrf() will compute the LU factorization of the input matrix A without pivoting.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

M

Host

In

Number of rows of sub(A).

N

Host

In

Number of columns of sub(A).

d_A

Device

In

Pointer to the first entry of the local portion of the global matrix A.

IA

Host

In

Row index of the first row of the sub(A). This function does not make any assumptions on the alignment of IA.

JA

Host

In

Column index of the first column of the sub(A). This function does not make any assumptions on the alignment of JA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_ipiv

Device

In

Indicates a pointer to a distributed integer array. When it is not null, workspace for pivoting is accounted.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cusolverMpGetrf().

workspaceInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cusolverMpGetrf().

This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpGetrs#

cusolverStatus_t cusolverMpGetrs(
        cusolverMpHandle_t handle,
        cublasOperation_t trans,
        int64_t N,
        int64_t NRHS,
        const void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        const int64_t *d_ipiv,
        void *d_B,
        int64_t IB,
        int64_t JB,
        cusolverMpMatrixDescriptor_t descB,
        cudaDataType_t computeType,
        void *d_work,
        size_t workspaceInBytesOnDevice,
        void *h_work,
        size_t workspaceInBytesOnHost,
        int *d_info)
This routine solves a system of distributed linear equations
\[op(sub(A)) \cdot X = sub(B)\]
with a general N-by-N distributed matrix sub(A) using the LU factorization computed by cusolverMpGetrf().
Where \(op\) is defined by the argument trans, which allows to solve linear systems of the form:

trans

Form of the linear system

CUBLAS_OP_N

\(sub(A) \cdot X = sub(B)\)

CUBLAS_OP_T

\(sub(A)^T \cdot X = sub(B)\)

CUBLAS_OP_C

\(sub(A)^H \cdot X = sub(B)\)

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

trans

Host

In

Specifies the form of the linear system. Only CUBLAS_OP_N is currently supported.

N

Host

In

Number of rows of sub(A).

NRHS

Host

In

Number of columns of sub(B).

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). On entry, this array contains the local pieces of the M-by-N distributed L and U factors of sub(A) as computed by cusolverMpGetrf().

IA

Host

In

Row index of the first row of the sub(A). This function does not make any assumptions on the alignment of IA.

JA

Host

In

Column index of the first column of the sub(A). This function does not make any assumptions on the alignment of JA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_ipiv

Device

In

Local array of dimension (LOCr(M_A)+MB_A) containing the pivoting information as computed by cusolverMpGetrf().

d_B

Device

In/Out

Pointer into the local memory to an array of dimension (LLD_B,LOCc(JB+NRHS-1)). On entry, the right hand sides sub(B). On exit, sub(B) is overwritten by the solution distributed matrix X.

IB

Host

In

Row index of the first row of the sub(B). This function does not make any assumptions on the alignment of IB.

JB

Host

In

Column index of the first column of the sub(B). This function does not make any assumptions on the alignment of JB.

descB

Host

In

Matrix descriptor associated to the global matrix B.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cusolverMpGetrs_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cusolverMpGetrs_bufferSize().

info

Device

Out

info < 0 indicates an incorrect value of the i-th argument of the function. info > 0 indicates the index of the leading minor in the case of a singular matrix.

This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpGetrs_bufferSize#

cusolverStatus_t cusolverMpGetrs_bufferSize(
        cusolverMpHandle_t handle,
        cublasOperation_t trans,
        int64_t N,
        int64_t NRHS,
        const void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        const int64_t *d_ipiv,
        void *d_B,
        int64_t IB,
        int64_t JB,
        cusolverMpMatrixDescriptor_t descB,
        cudaDataType_t computeType,
        size_t *workspaceInBytesOnDevice,
        size_t *workspaceInBytesOnHost)
Computes the size in bytes of the host and device working buffers required by cusolverMpGetrs().
If pivoting was disabled during cusolverMpGetrf(), the user must set d_ipiv=NULL.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

trans

Host

In

Specifies the form of the linear system. Only CUBLAS_OP_N is currently supported.

N

Host

In

Number of rows of sub(A).

NRHS

Host

In

Number of columns of sub(B).

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). On entry, this array contains the local pieces of the M-by-N distributed L and U factors of sub(A) as computed by cusolverMpGetrf().

IA

Host

In

Row index of the first row of the sub(A). This function does not make any assumptions on the alignment of IA.

JA

Host

In

Column index of the first column of the sub(A). This function does not make any assumptions on the alignment of JA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_ipiv

Device

In

Local array of dimension (LOCr(M_A)+MB_A) containing the pivoting information as computed by cusolverMpGetrf().

d_B

Device

In

Pointer to the first entry of the local portion of the global matrix B. On output, B is overwritten the solution of the linear system.

IB

Host

In

Row index of the first row of the sub(B). This function does not make any assumptions on the alignment of IB.

JB

Host

In

Column index of the first column of the sub(B). This function does not make any assumptions on the alignment of JB.

descB

Host

In

Matrix descriptor associated to the global matrix B.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cusolverMpGetrs().

workspaceInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cusolverMpGetrs().

This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpPotrf#

cusolverStatus_t cusolverMpPotrf(
        cusolverMpHandle_t handle,
        cublasFillMode_t uplo,
        int64_t N,
        void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        cudaDataType_t computeType,
        void *d_work,
        size_t workspaceInBytesOnDevice,
        void *h_work,
        size_t workspaceInBytesOnHost,
        int *info)
Computes the Cholesky factorization of an N-by-N real symmetric or a complex hermitian positive definite distributed matrix sub(A) denoting A(IA:IA+N-1, JA:JA+N-1).
If A is upper triangular and uplo=CUBLAS_FILL_MODE_UPPER, the factorization has the form
\[sub(A) = U^H \cdot U\]
where U is upper triangular.
If the matrix is lower triangular and uplo is set to CUBLAS_FILL_MODE_LOWER, the factorization has the form
\[sub(A) = L \cdot L^H\]
where L is lower triangular.
The user can combine cusolverMpPotrf() and cusolverMpPotrs() to solve a system of linear equations.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

uplo

Host

In

Specifies if A is upper (CUBLAS_FILL_MODE_UPPER) or lower triangular matrix (CUBLAS_FILL_MODE_LOWER).

N

Host

In

Number of rows and columns of sub(A).

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). On entry, this array contains the local pieces of the N-by-N distributed matrix sub(A). On output, this array contains the L or U factors of A, depending on the value of uplo.

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension MB_A.

JA

Host

In

Column index of the first column of the sub(A).`JA` must be a multiple of the column blocking dimension NB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cusolverMpPotrf_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cusolverMpPotrf_bufferSize().

info

Device

Out

info < 0 indicates an incorrect value of the i-th argument of the function. info > 0 indicates the index of the leading minor in the case of a singular matrix.

This function requires square block size (MB_A == NB_A).
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpPotrf_bufferSize#

cusolverStatus_t cusolverMpPotrf_bufferSize(
        cusolverMpHandle_t handle,
        cublasFillMode_t uplo,
        int64_t N,
        const void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        cudaDataType_t computeType,
        size_t* workspaceInBytesOnDevice,
        size_t* workspaceInBytesOnHost)
Computes the size in bytes of the host and device working buffers required by cusolverMpPotrf().

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

uplo

Host

In

Specifies if A is upper (CUBLAS_FILL_MODE_UPPER) or lower triangular matrix (CUBLAS_FILL_MODE_LOWER).

N

Host

In

Number of rows and columns of sub(A).

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). On entry, this array contains the local pieces of the N-by-N distributed matrix sub(A). On output, this array contains the L or U factors of A, depending on the value of uplo.

IA

Host

In

Row index of the first row of the sub(A). This function does not make any assumptions on the alignment of IA.

JA

Host

In

Column index of the first column of the sub(A). This function does not make any assumptions on the alignment of JA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cusolverMpPotrf().

workspaceInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cusolverMpPotrf().

This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpPotrs#

cusolverStatus_t cusolverMpPotrs(
        cusolverMpHandle_t handle,
        cublasFillMode_t uplo,
        int64_t N,
        int64_t NRHS,
        const void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        void *d_B,
        int64_t IB,
        int64_t JB,
        cusolverMpMatrixDescriptor_t descB,
        cudaDataType_t computeType,
        void *d_work,
        size_t workspaceInBytesOnDevice,
        void *h_work,
        size_t workspaceInBytesOnHost,
        int *info)
Solves a system of linear equations
\[sub(A) \cdot X = sub(B)\]
where sub(A) denotes A(IA:IA+N-1,JA:JA+N-1) and is a N-by-N symmetric or hermitian positive definite distributed matrix using the Cholesky factorization:
\[sub(A) = U^H \cdot U\]
or
\[sub(A) = L \cdot L^H\]
computed by cusolverMpPotrf() and sub(B) denotes the distributed matrix B(IB:IB+N-1,JB:JB+NRHS-1).

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

uplo

Host

In

Specifies if A is upper (CUBLAS_FILL_MODE_UPPER) or lower triangular matrix (CUBLAS_FILL_MODE_LOWER).

N

Host

In

Number of rows and columns of sub(A).

NRHS

Host

In

Number of columns of sub(B).

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). Contains the local pieces of the N-by-N distributed L or U factors of sub(A) as computed by cusolverMpPotrf().

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension NB_A.

JA

Host

In

Column index of the first column of the sub(A). JA must be a multiple of the column blocking dimension MB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_B

Device

In/Out

Pointer into the local memory to an array of dimension (LLD_B,LOCc(JB+NRHS-1)). On entry, the right hand sides sub(B). On exit, sub(B) is overwritten by the solution distributed matrix X.

IB

Host

In

Row index of the first row of the sub(B). This function does not make any assumptions on the alignment of IB.

JB

Host

In

Column index of the first column of the sub(B). This function does not make any assumptions on the alignment of JB.

descB

Host

In

Matrix descriptor associated to the global matrix B.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cusolverMpPotrs_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cusolverMpPotrs_bufferSize().

info

Device

Out

info < 0 indicates an incorrect value of the i-th argument of the function. info > 0 indicates the index of the leading minor in the case of a singular matrix.

This function requires square block size (MB_A == NB_A) and alignment of sub(A) and sub(B) matrices, meaning (MB_A == MB_B) and (IA == IB).
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpPotrs_bufferSize#

cusolverStatus_t cusolverMpPotrs_bufferSize(
        cusolverMpHandle_t handle,
        cublasFillMode_t uplo,
        int64_t n,
        int64_t nrhs,
        const void *a,
        int64_t ia,
        int64_t ja,
        cusolverMpMatrixDescriptor_t descA,
        const void *b,
        int64_t ib,
        int64_t jb,
        cusolverMpMatrixDescriptor_t descB,
        cudaDataType_t computeType,
        size_t* workspaceInBytesOnDevice,
        size_t* workspaceInBytesOnHost)
Computes the size in bytes of the host and device working buffers required by cusolverMpPotrs().

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

uplo

Host

In

Specifies if A is upper (CUBLAS_FILL_MODE_UPPER) or lower triangular matrix (CUBLAS_FILL_MODE_LOWER).

N

Host

In

Number of rows and columns of sub(A).

NRHS

Host

In

Number of columns of sub(B).

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). Contains the local pieces of the N-by-N distributed L or U factors of sub(A) as computed by cusolverMpPotrf().

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension NB_A.

JA

Host

In

Column index of the first column of the sub(A). JA must be a multiple of the column blocking dimension MB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_B

Device

In

Pointer into the local memory to an array of dimension (LLD_B,LOCc(JB+NRHS-1)). On entry, the right hand sides sub(B). On exit, sub(B) is overwritten by the solution distributed matrix X.

IB

Host

In

Row index of the first row of the sub(B). This function does not make any assumptions on the alignment of IB.

JB

Host

In

Column index of the first column of the sub(B). This function does not make any assumptions on the alignment of JB.

descB

Host

In

Matrix descriptor associated to the global matrix B.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cusolverMpPotrs().

workspaceInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cusolverMpPotrs().

This function requires square block size (MB_A == NB_A) and alignment of sub(A) and sub(B) matrices, meaning (MB_A == MB_B) and (IA == IB).
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpGeqrf#

cusolverStatus_t cusolverMpGeqrf(
        cusolverMpHandle_t handle,
        int64_t M,
        int64_t N,
        void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        void *d_tau,
        cudaDataType_t computeType,
        void *d_work,
        size_t workspaceInBytesOnDevice,
        void *h_work,
        size_t workspaceInBytesOnHost,
        int *info)
Computes the QR factorization of a distributed M-by-N matrix sub(A) denoting A(IA:IA+M-1, JA:JA+N-1).
\[sub(A) = Q \cdot R\]
where Q is an orthogonal matrix represented by a product of Householder reflectors with the array of tau and R is upper triangular matrix.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

M

Host

In

Number of rows of sub(A).

N

Host

In

Number of columns of sub(A).

d_A

Device

In/Out

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). On entry, this array contains the local pieces of the M-by-N distributed matrix sub(A). On output, this array contains the R factors of A and Householder reflectors below of diagonals with tau vector.

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension MB_A.

JA

Host

In

Column index of the first column of the sub(A). JA must be a multiple of the column blocking dimension NB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_tau

Device

Out

Pointer into the local memory to an array of dimension LOCc(JA+N-1). This array contains scalar factors of the Householder reflectors

computeType

Host

In

Data type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cusolverMpGeqrf_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cusolverMpGeqrf_bufferSize().

info

Device

Out

info < 0 indicates an incorrect value of the i-th argument of the function.

This function requires square block size (MB_A == NB_A).
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpGeqrf_bufferSize#

cusolverStatus_t cusolverMpGeqrf_bufferSize(
        cusolverMpHandle_t handle,
        int64_t M,
        int64_t N,
        void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        cudaDataType_t computeType,
        size_t* workspaceInBytesOnDevice,
        size_t* workspaceInBytesOnHost)
Computes the size in bytes of the host and device working buffers required by cusolverMpGeqrf().

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

M

Host

In

Number of rows of sub(A).

N

Host

In

Number of columns of sub(A).

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)).

IA

Host

In

Row index in the global matrix A indicating the first row of sub(A). This function does not make any assumptions on the alignment of IA.

JA

Host

In

Column index in the global matrix A indicating the first column of sub(A). This function does not make any assumptions on the alignment of JA.

descA

Host

In

Matrix descriptor associated to the global matrix A.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cusolverMpGeqrf().

workspaceInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cusolverMpGeqrf().

This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpOrmqr#

cusolverStatus_t cusolverMpOrmqr(
        cusolverMpHandle_t handle,
        cublasSideMode_t side,
        cublasOperation_t trans,
        int64_t M,
        int64_t N,
        int64_t K,
        const void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        const void *d_tau,
        void *d_C,
        int64_t IC,
        int64_t JC,
        cusolverMpMatrixDescriptor_t descC,
        cudaDataType_t computeType,
        void *d_work,
        size_t workspaceInBytesOnDevice,
        void *h_work,
        size_t workspaceInBytesOnHost,
        int *info)
Multiply distributed M-by-N matrix sub(C) denoting C(IC:IC+M-1, JC:JC+N-1) by the orthogonal matrix Q can be given from cusolverMpGeqrf().
The function can perform the following matrix product and overwrite the result on sub(C).
\[\begin{split}sub(C) = op(Q) \cdot sub(C) \\ sub(C) = sub(C) \cdot op(Q)\end{split}\]
for the side of CUBLAS_SIDE_LEFT and CUBLAS_SIDE_RIGHT respectively. Note that the current implementation only support for CUBLAS_SIDE_LEFT.
Q is a orthogonal matrix formed as the product of Householder reflectors returned from cusolverMpGeqrf().
\[Q = H(1) H(2) ... H(K)\]
The number of the Householder reflectors is constrained by K <= M and K <= N for CUBLAS_SIDE_LEFT and CUBLAS_SIDE_RIGHT respectively.
The op can be translated to \(Q\), \(Q^T\), \(Q^H\) based on the trans argument CUBLAS_OP_N, CUBLAS_OP_T and CUBLAS_OP_H.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

side

Host

In

Indicate that Q is applied from left or right side.

trans

Host

In

Indicate that Q is applied with no-transpose or (conj)transpose.

M

Host

In

Number of rows of sub(A).

N

Host

In

Number of columns of sub(A).

K

Host

In

Number of Householder reflectors defining Q.

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). This array contains Householder reflectors below of diagonals with tau vector

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension MB_A.

JA

Host

In

Column index of the first column of the sub(A).`JA` must be a multiple of the column blocking dimension NB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_tau

Device

In

Pointer into the local memory to an array of dimension LOCc(JA+N-1). This array contains scalar factors of the Householder reflectors

d_C

Device

In/Out

Pointer into the local memory to an array of dimension (LLD_C, LOCc(JC+N-1)). On entry, the array contains the local pieces of the M-by-N distributed matrix sub(C). On exit, the sub(C) is overwritten by op(Q)*sub(C) or sub(C)*op(Q).

IC

Host

In

Row index of the first row of the sub(C). IC must be a multiple of the row blocking dimension MB_C.

JC

Host

In

Column index of the first column of the sub(C). JC must be a multiple of the column blocking dimension NB_C.

descC

Host

In

Matrix descriptor associated to the global matrix C.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cusolverMpOrmqr_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cusolverMpOrmqr_bufferSize().

info

Device

Out

info < 0 indicates an incorrect value of the i-th argument of the function.

This function requires square block size (MB_A == NB_A) and alignment of sub(A) and sub(C) matrices, meaning (MB_A == MB_C) and (IA == IC).
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpOrmqr_bufferSize#

cusolverStatus_t cusolverMpOrmqr_bufferSize(
        cusolverMpHandle_t handle,
        cublasSideMode_t side,
        cublasOperation_t trans,
        int64_t M,
        int64_t N,
        int64_t K,
        const void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        const void *d_tau,
        void *d_C,
        int64_t IC,
        int64_t JC,
        cusolverMpMatrixDescriptor_t descC,
        cudaDataType_t computeType,
        size_t* workspaceInBytesOnDevice,
        size_t* workspaceInBytesOnHost)
Computes the size in bytes of the host and device working buffers required by cusolverMpOrmqr().

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

side

Host

In

Indicate that Q is applied from left or right side.

trans

Host

In

Indicate that Q is applied with no-transpose or (conj)transpose.

M

Host

In

Number of rows of sub(A).

N

Host

In

Number of columns of sub(A).

K

Host

In

Number of Householder reflectors defining Q.

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). This array contains Householder reflectors below of diagonals with tau vector

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension MB_A.

JA

Host

In

Column index of the first column of the sub(A).`JA` must be a multiple of the column blocking dimension NB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_tau

Device

In

Pointer into the local memory to an array of dimension LOCc(JA+N-1). This array contains scalar factors of the Householder reflectors

d_C

Device

In

Pointer into the local memory to an array of dimension (LLD_C, LOCc(JC+N-1)). On entry, the array contains the local pieces of the M-by-N distributed matrix sub(C). On exit, the sub(C) is overwritten by op(Q)*sub(C) or sub(C)*op(Q).

IC

Host

In

Row index of the first row of the sub(C). IC must be a multiple of the row blocking dimension MB_C.

JC

Host

In

Column index of the first column of the sub(C). JC must be a multiple of the column blocking dimension NB_C.

descC

Host

In

Matrix descriptor associated to the global matrix C.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

The size in bytes of the local device workspace needed by cusolverMpOrmqr().

workspaceInBytesOnHost

Host

Out

The size in bytes of the local host workspace needed by cusolverMpOrmqr().

This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpOrgqr#

cusolverStatus_t cusolverMpOrgqr(
        cusolverMpHandle_t handle,
        int64_t M,
        int64_t N,
        int64_t K,
        void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        const void *d_tau,
        cudaDataType_t computeType,
        void *d_work,
        size_t workspaceInBytesOnDevice,
        void *h_work,
        size_t workspaceInBytesOnHost,
        int *d_info)
Generates the M-by-N matrix Q with orthonormal columns from the QR factorization computed by cusolverMpGeqrf(). Q is defined as the product of K elementary Householder reflectors of order M:
\[Q = H(1) \cdot H(2) \cdot \ldots \cdot H(k)\]
where H(i) are the elementary reflectors stored in the lower triangular part of A(IA:IA+M-1, JA:JA+K-1) as returned by cusolverMpGeqrf(), with corresponding scalar factors in d_tau.

Requires M >= N >= K >= 0. When K = 0, the routine sets Q to the identity matrix.

On input, d_A contains the Householder reflectors and d_tau as output by cusolverMpGeqrf(). On output, the submatrix A(IA:IA+M-1, JA:JA+N-1) is overwritten with the first N columns of Q.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

M

Host

In

Number of rows of the matrix Q. M >= 0.

N

Host

In

Number of columns of the matrix Q. M >= N >= 0.

K

Host

In

Number of elementary reflectors. N >= K >= 0.

d_A

Device

In/Out

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). On entry, contains the Householder reflectors as returned by cusolverMpGeqrf(). On exit, overwritten with the first N columns of Q.

IA

Host

In

Row index of the first row of the submatrix. IA >= 1.

JA

Host

In

Column index of the first column of the submatrix. JA >= 1.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_tau

Device

In

Pointer into the local memory to an array of dimension LOCc(JA+K-1). Contains the scalar factors of the Householder reflectors as returned by cusolverMpGeqrf(). Must be non-null when K > 0, even on ranks that own no local tau elements.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cusolverMpOrgqr_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cusolverMpOrgqr_bufferSize().

d_info

Device

Out

d_info = 0 on success. d_info = -i indicates the i-th argument (in LAPACK parameter ordering) has an invalid value.

This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpOrgqr_bufferSize#

cusolverStatus_t cusolverMpOrgqr_bufferSize(
        cusolverMpHandle_t handle,
        int64_t M,
        int64_t N,
        int64_t K,
        const void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        const void *d_tau,
        cudaDataType_t computeType,
        size_t *workspaceInBytesOnDevice,
        size_t *workspaceInBytesOnHost)
Computes the size in bytes of the host and device working buffers required by cusolverMpOrgqr().

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

M

Host

In

Number of rows of the matrix Q. M >= 0.

N

Host

In

Number of columns of the matrix Q. M >= N >= 0.

K

Host

In

Number of elementary reflectors. N >= K >= 0.

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)).

IA

Host

In

Row index of the first row of the submatrix. IA >= 1.

JA

Host

In

Column index of the first column of the submatrix. JA >= 1.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_tau

Device

In

Pointer into the local memory to an array of dimension LOCc(JA+K-1).

computeType

Host

In

Data type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cusolverMpOrgqr().

workspaceInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cusolverMpOrgqr().

This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpGels#

cusolverStatus_t cusolverMpGels(
        cusolverMpHandle_t handle,
        cublasOperation_t trans,
        int64_t M,
        int64_t N,
        int64_t NRHS,
        void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        void *d_B,
        int64_t IB,
        int64_t JB,
        cusolverMpMatrixDescriptor_t descB,
        cudaDataType_t computeType,
        void *d_work,
        size_t workspaceInBytesOnDevice,
        void *h_work,
        size_t workspaceInBytesOnHost,
        int *info)
Solves overdetermined or underdetermined linear systems involving a distributed M-by-N matrix sub(A) denoting A(IA:IA+M-1, JA:JA+N-1) or its transpose, using QR or LQ factorization of sub(A).
Note that the solution of overdetermined systems (M >= N) with a no-transpose option is only supported via QR factorization cusolverMpGeqrf().
\[X \leftarrow \mbox{argmin} | sub(B) - sub(A) \cdot X |\]
where sub(B) is a distributed M-by-NRHS multi-vector denoting B(IB:IB+M-1, JB:JB+NRHS-1) and the solution multi-vector X is overwritten on the sub(B).

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

trans

Host

In

Indicate that the linear system of sub(A) involves with no-transpose or (conj)transpose.

M

Host

In

Number of rows of sub(A).

N

Host

In

Number of columns of sub(A).

NRHS

Host

In

Number of right hand side vectors i.e., number of columns of sub(B) and X.

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)).

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension MB_A.

JA

Host

In

Column index of the first column of the sub(A). JA must be a multiple of the column blocking dimension NB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_B

Device

In/Out

Pointer into the local memory to an array of dimension (LLD_B, LOCc(JB+NRHS-1)). On entry, the array contains the local pieces of the M-by-NRHS distributed matrix sub(B). On exit, the sub(B) is overwritten by the solution of the solution vectors.

IB

Host

In

Row index of the first row of the sub(B). IB must be a multiple of the row blocking dimension MB_B.

JB

Host

In

Column index of the first column of the sub(B). JB must be a multiple of the column blocking dimension NB_B.

descB

Host

In

Matrix descriptor associated to the global matrix B.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cusolverMpGels_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cusolverMpGels_bufferSize().

info

Device

Out

info < 0 indicates an incorrect value of the i-th argument of the function.

This function requires square block size (MB_A == NB_A) and alignment of sub(A) and sub(B) matrices, meaning (MB_A == MB_B) and (IA == IB).
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpGels_bufferSize#

cusolverStatus_t cusolverMpGels_bufferSize(
        cusolverMpHandle_t handle,
        cublasOperation_t trans,
        int64_t M,
        int64_t N,
        int64_t NRHS,
        void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        void *d_B,
        int64_t IB,
        int64_t JB,
        cusolverMpMatrixDescriptor_t descB,
        cudaDataType_t computeType,
        size_t* workspaceInBytesOnDevice,
        size_t* workspaceInBytesOnHost)
Computes the size in bytes of the host and device working buffers required by cusolverMpGels().

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

trans

Host

In

Indicate that the linear system of sub(A) involves with no-transpose or (conj)transpose.

M

Host

In

Number of rows of sub(A).

N

Host

In

Number of columns of sub(A).

NRHS

Host

In

Number of right hand side vectors i.e., number of columns of sub(B) and X.

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)).

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension MB_A.

JA

Host

In

Column index of the first column of the sub(A). JA must be a multiple of the column blocking dimension NB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_B

Device

In

Pointer into the local memory to an array of dimension (LLD_B, LOCc(JB+NRHS-1)).

IB

Host

In

Row index of the first row of the sub(B). IB must be a multiple of the row blocking dimension MB_B.

JB

Host

In

Column index of the first column of the sub(B). JB must be a multiple of the column blocking dimension NB_B.

descB

Host

In

Matrix descriptor associated to the global matrix B.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

The size in bytes of the local device workspace needed by cusolverMpGels().

workspaceInBytesOnHost

Host

Out

The size in bytes of the local host workspace needed by cusolverMpGels().

This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpSytrd#

cusolverStatus_t cusolverMpSytrd(
        cusolverMpHandle_t handle,
        cublasFillMode_t uplo,
        int64_t N,
        void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        void *d_d,
        void *d_e,
        void *d_tau,
        cudaDataType_t computeType,
        void *d_work,
        size_t workspaceInBytesOnDevice,
        void *h_work,
        size_t workspaceInBytesOnHost,
        int *info)
Reduces a symmetric (or hermitian for a complex value type) distributed N-by-N matrix sub(A) denoting A(IA:IA+N-1, JA:JA+N-1) to a tridiagonal form.
\[A \rightarrow Q \cdot T \cdot Q^H\]
Currently, the function is implemented for CUBLAS_FILL_MODE_LOWER only.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

uplo

Host

In

Indicate that the function uses either upper or lower triangular part of sub(A).

N

Host

In

Number of rows/columns of square matrix sub(A).

d_A

Device

In/Out

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). On entry, the array contains the local part of symmetric distributed matrix sub(A). On exit, if the CUBLAS_FILL_MODE_UPPER is set, the diagonal and first superdiagonal of the tridiagonal of sub(A) is overwritten by the corresponding tridiagonal matrix, and Householder reflectors are stored above the superdiagonal of sub(A). If CUBLAS_FILL_MODE_LOWER is set, the diagonal and first subdiagonal of sub(A) is overwritten by the corresponding tridiagonal matrix, and Householder reflectors are stored below the subdiagonal of sub(A).

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension MB_A.

JA

Host

In

Column index of the first column of the sub(A). JA must be a multiple of the column blocking dimension NB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_d

Device

Out

Pointer into the local memory to an array of dimension LOCc(JA+N-1). The diagonal elements of tridiagonal matrix is stored: d(i) = A(i,i).

d_e

Device

Out

Pointer into the local memory to an array of dimension LOCc(JA+N-1). The off-diagonal elements of tridiagonal matrix is stored: e(i) = A(i,i+1) for CUBLAS_FILL_MODE_UPPER and e(i) = A(i+1,i) for CUBLAS_FILL_MODE_LOWER.

d_tau

Device

Out

Pointer into the local memory to an array of dimension LOCc(JA+N-1). This array contains scalar factors of the Householder reflectors

computeType

Host

In

Data type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cusolverMpSytrd_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cusolverMpSytrd_bufferSize().

info

Device

Out

info < 0 indicates an incorrect value of the i-th argument of the function.

This function requires square block size (MB_A == NB_A).
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpSytrd_bufferSize#

cusolverStatus_t cusolverMpSytrd_bufferSize(
        cusolverMpHandle_t handle,
        cublasFillMode_t uplo,
        int64_t N,
        void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        void *d_d,
        void *d_e,
        void *d_tau,
        cudaDataType_t computeType,
        size_t *workspaceInBytesOnDevice,
        size_t *workspaceInBytesOnHost)
Computes the size in bytes of the host and device working buffers required by cusolverMpSytrd().

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

uplo

Host

In

Indicate that the function uses either upper or lower triangular part of sub(A).

N

Host

In

Number of rows/columns of square matrix sub(A).

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)).

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension MB_A.

JA

Host

In

Column index of the first column of the sub(A). JA must be a multiple of the column blocking dimension NB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_d

Device

In

Pointer into the local memory to an array of dimension LOCc(JA+N-1).

d_e

Device

In

Pointer into the local memory to an array of dimension LOCc(JA+N-1).

d_tau

Device

In

Pointer into the local memory to an array of dimension LOCc(JA+N-1).

computeType

Host

In

Data type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

The size in bytes of the local device workspace needed by cusolverMpSytrd().

workspaceInBytesOnHost

Host

Out

The size in bytes of the local host workspace needed by cusolverMpSytrd().

This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpStedc#

cusolverStatus_t cusolverMpStedc(
        cusolverMpHandle_t handle,
        char *compz,
        int64_t N,
        void *d_d,
        void *d_e,
        void *d_Q,
        int64_t IQ,
        int64_t JQ,
        cusolverMpMatrixDescriptor_t descQ,
        cudaDataType_t computeType,
        void *d_work,
        size_t workspaceInBytesOnDevice,
        void *h_work,
        size_t workspaceInBytesOnHost,
        int *info)
Computes all eigenvalues and eigenvectors of a symmetric tridiagonal matrix using the divide and conquer algorithm.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

compz

Host

In

Option to compute eigenvalues only(compz=N) or both eigenvalues/vectors(compz=I). Currently, I is only implemented.

N

Host

In

Number of rows/columns of square matrix sub(A).

d_d

Device

In/Out

Pointer to an array of dimension N. On entry, the array contains diagonal elements of the tridiagonal matrix. On exit, the eigenvalues are stored in descending order.

d_e

Device

In/Out

Pointer to an array of dimension N-1. On entry, the array contains subdiagonal elements of the tridiagonal matrix. On exit, the content of the array is destroyed.

d_Q

Device

Out

Pointer into the local memory to an array of dimension (LLD_Q, LOCc(JQ+N-1)). On output, the array contains the local elements of orthonormal eigenvectors of the symmetric diagonal matrix.

IQ

Host

In

Row index of the first row of the sub(Q). IQ must be a multiple of the row blocking dimension MB_Q.

JQ

Host

In

Column index of the first column of the sub(A). JQ must be a multiple of the column blocking dimension NB_Q.

descQ

Host

In

Matrix descriptor associated to the global matrix Q.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cusolverMpStedc_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cusolverMpStedc_bufferSize().

info

Device

Out

info < 0 indicates an incorrect value of the i-th argument of the function.

This function requires a square block size for Q, (MB_Q == NB_Q).
This routine supports the following combinations of data types:

Data Type of Tridiagonal Matrix

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.
Note that this function uses the divide and conquer algorithm to compute eigen pairs. Currently, the function can be failed when the blocksize is not a power of two. This is a known issue and we will address the problem later.

cusolverMpStedc_bufferSize#

cusolverStatus_t cusolverMpStedc_bufferSize(
        cusolverMpHandle_t handle,
        char *compz,
        int64_t N,
        void *d_d,
        void *d_e,
        void *d_Q,
        int64_t IQ,
        int64_t JQ,
        cusolverMpMatrixDescriptor_t descQ,
        cudaDataType_t computeType,
        size_t *workspaceInBytesOnDevice,
        size_t *workspaceInBytesOnHost,
        int *iwork)
Computes the size in bytes of the host and device working buffers required by cusolverMpStedc().

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

compz

Host

In

Option to compute eigenvalues only(compz=N) or both eigenvalues/vectors(compz=I). Currently, I is only implemented.

N

Host

In

Number of rows/columns of square matrix sub(A).

d_d

Device

In

Pointer to an array of dimension N.

d_e

Device

In

Pointer to an array of dimension N-1.

d_Q

Device

In

Pointer into the local memory to an array of dimension (LLD_Q, LOCc(JQ+N-1)).

IQ

Host

In

Row index of the first row of the sub(Q). IQ must be a multiple of the row blocking dimension MB_Q.

JQ

Host

In

Column index of the first column of the sub(A). JQ must be a multiple of the column blocking dimension NB_Q.

descQ

Host

In

Matrix descriptor associated to the global matrix Q.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

The size in bytes of the local device workspace needed by the routine cusolverMpStedc().

workspaceInBytesOnHost

Host

Out

The size in bytes of the local host workspace needed by cusolverMpStedc().

info

Device

Out

info < 0 indicates an incorrect value of the i-th argument of the function.

This routine currently supports the following combinations of data types:

Data Type of Tridiagonal Matrix

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpOrmtr#

cusolverStatus_t cusolverMpOrmtr(
        cusolverMpHandle_t handle,
        cublasSideMode_t side,
        cublasFillMode_t uplo,
        cublasOperation_t trans,
        int64_t M,
        int64_t N,
        const void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        const void *d_tau,
        void *d_C,
        int64_t IC,
        int64_t JC,
        cusolverMpMatrixDescriptor_t descC,
        cudaDataType_t computeType,
        void *d_work,
        size_t workspaceInBytesOnDevice,
        void *h_work,
        size_t workspaceInBytesOnHost,
        int *info)
Multiply distributed M-by-N matrix sub(C) denoting C(IC:IC+M-1, JC:JC+N-1) by the orthogonal matrix Q can be given from cusolverMpSytrd().
The function can perform the following matrix product and overwrite the result on sub(C) for the side parameter CUBLAS_SIDE_LEFT and CUBLAS_SIDE_RIGHT.
\[\begin{split}sub(C) = op(Q) \cdot sub(C) \\ sub(C) = sub(C) \cdot op(Q)\end{split}\]
The op can be translated to \(Q\), \(Q^T\), \(Q^H\) based on the trans argument CUBLAS_OP_N, CUBLAS_OP_T and CUBLAS_OP_H.
Q is an orthogonal matrix formed as the product of Householder reflectors as follows for the uplo parameter CUBLAS_FILL_MODE_UPPER and CUBLAS_FILL_MODE_LOWER
\[\begin{split}Q = H(1) H(2) ... H(nq-1) \\ Q = H(nq-1) H(nq-2) ... H(1)\end{split}\]
where nq is either m or n according to the side parameter of CUBLAS_SIDE_LEFT or CUBLAS_SIDE_RIGHT respectively.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

side

Host

In

Indicate that Q is applied from left or right side.

uplo

Host

In

Indicate that upper or lower triangular of sub(A) contains Householder reflectors.

trans

Host

In

Indicate that Q is applied with no-transpose or (conj)transpose.

M

Host

In

Number of rows of sub(A).

N

Host

In

Number of columns of sub(A).

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). This array contains Householder reflectors below of diagonals with tau vector

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension MB_A.

JA

Host

In

Column index of the first column of the sub(A).`JA` must be a multiple of the column blocking dimension NB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_tau

Device

In

Pointer into the local memory to an array of dimension LOCc(JA+N-1). This array contains scalar factors of the Householder reflectors

d_C

Device

In/Out

Pointer into the local memory to an array of dimension (LLD_C, LOCc(JC+N-1)). On entry, the array contains the local pieces of the M-by-N distributed matrix sub(C). On exit, the sub(C) is overwritten by op(Q)*sub(C) or sub(C)*op(Q).

IC

Host

In

Row index of the first row of the sub(C). IC must be a multiple of the row blocking dimension MB_C.

JC

Host

In

Column index of the first column of the sub(C). JC must be a multiple of the column blocking dimension NB_C.

descC

Host

In

Matrix descriptor associated to the global matrix C.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cusolverMpOrmtr_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cusolverMpOrmtr_bufferSize().

info

Device

Out

info < 0 indicates an incorrect value of the i-th argument of the function.

This function requires square block size (MB_A == NB_A) and alignment of sub(A) and sub(B) matrices, meaning (MB_A == MB_C) and (IA == IC).
This routine supports the following combinations of data types:

Data Type of A and C

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpOrmtr_bufferSize#

cusolverStatus_t cusolverMpOrmtr_bufferSize(
        cusolverMpHandle_t handle,
        cublasSideMode_t side,
        cublasFillMode_t uplo,
        cublasOperation_t trans,
        int64_t M,
        int64_t N,
        const void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        const void *d_tau,
        void *d_C,
        int64_t IC,
        int64_t JC,
        cusolverMpMatrixDescriptor_t descC,
        cudaDataType_t computeType,
        size_t* workspaceInBytesOnDevice,
        size_t* workspaceInBytesOnHost)
Computes the size in bytes of the host and device working buffers required by cusolverMpOrmtr().

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

side

Host

In

Indicate that Q is applied from left or right side.

uplo

Host

In

Indicate that upper or lower triangular of sub(A) contains Householder reflectors.

trans

Host

In

Indicate that Q is applied with no-transpose or (conj)transpose.

M

Host

In

Number of rows of sub(A).

N

Host

In

Number of columns of sub(A).

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). This array contains Householder reflectors below of diagonals with tau vector

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension MB_A.

JA

Host

In

Column index of the first column of the sub(A).`JA` must be a multiple of the column blocking dimension NB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_tau

Device

In

Pointer into the local memory to an array of dimension LOCc(JA+N-1). This array contains scalar factors of the Householder reflectors

d_C

Device

In

Pointer into the local memory to an array of dimension (LLD_C, LOCc(JC+N-1)). On entry, the array contains the local pieces of the M-by-N distributed matrix sub(C). On exit, the sub(C) is overwritten by op(Q)*sub(C) or sub(C)*op(Q) based on the side input CUBLAS_SIDE_LEFT or CUBLAS_SIDE_RIGHT respectively.

IC

Host

In

Row index of the first row of the sub(C). IC must be a multiple of the row blocking dimension MB_C.

JC

Host

In

Column index of the first column of the sub(C).`JC` must be a multiple of the column blocking dimension NB_C.

descC

Host

In

Matrix descriptor associated to the global matrix C.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

The size in bytes of the local device workspace needed by cusolverMpOrmtr().

workspaceInBytesOnHost

Host

Out

The size in bytes of the local host workspace needed by cusolverMpOrmtr().

This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpSyevd#

cusolverStatus_t cusolverMpSyevd(
        cusolverMpHandle_t handle,
        char *jobz,
        cublasFillMode_t uplo,
        int64_t N,
        void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        void *d_d,
        void *d_Z,
        int64_t IZ,
        int64_t JZ,
        cusolverMpMatrixDescriptor_t descZ,
        cudaDataType_t computeType,
        void *d_work,
        size_t workspaceInBytesOnDevice,
        void *h_work,
        size_t workspaceInBytesOnHost,
        int *info)
Computes all eigenvalues and optionally eigenvectors of a symmetric distributed N-by-N matrix sub(A) A(IA:IA+N-1, JA:JA+N-1) using the divide and conquer algorithm cusolverMpStedc(). Note that the current implementation of the cusolverMpStedc may fail when the blocksize is not a power of two.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

jobz

Host

In

If jobz = N, then eigenvalues are computed and if jobz = V, then eigenvalues and eigenvectors are computed.

uplo

Host

In

Indicate that upper or lower triangular of sub(A) is used to compute eigen solutions.

N

Host

In

Number of rows and columns of sub(A).

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). This array contains local parts of the symmetric matrix A.

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension MB_A.

JA

Host

In

Column index of the first column of the sub(A).`JA` must be a multiple of the column blocking dimension NB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_d

Device

Out

Pointer into the memory to an array of global size N. On exit, this array contains real eigen values of the matrix A.

d_Z

Device

Out

Pointer into the local memory to an array of dimension (LLD_Z, LOCc(JZ+N-1)). On exit, the array local parts of orthonormal eigenvectors of the matrix A.

IZ

Host

In

Row index of the first row of the sub(Z). IZ must be a multiple of the row blocking dimension MB_Z.

JZ

Host

In

Column index of the first column of the sub(Z). JZ must be a multiple of the column blocking dimension NB_Z.

descZ

Host

In

Matrix descriptor associated to the global matrix Z.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cusolverMpSyevd_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cusolverMpSyevd_bufferSize().

info

Device

Out

info < 0 indicates an incorrect value of the i-th argument of the function.

This function requires square block size (MB_A == NB_A) and alignment of sub(A) and sub(B) matrices, meaning (MB_A == MB_Z) and (IZ == IZ).
This routine supports the following combinations of data types:

Data Type of A and C

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpSyevd_bufferSize#

cusolverStatus_t cusolverMpSyevd_bufferSize(
        cusolverMpHandle_t handle,
        char *jobz,
        cublasFillMode_t uplo,
        int64_t N,
        void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        void *d_d,
        void *d_Z,
        int64_t IZ,
        int64_t JZ,
        cusolverMpMatrixDescriptor_t descZ,
        cudaDataType_t computeType,
        size_t *workspaceInBytesOnDevice,
        size_t *workspaceInBytesOnHost)
Computes the size in bytes of the host and device working buffers required by cusolverMpSyevd().

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

jobz

Host

In

If jobz = N, then eigenvalues are computed and if jobz = V, then eigenvalues and eigenvectors are computed.

uplo

Host

In

Indicate that upper or lower triangular of sub(A) is used to compute eigen solutions.

N

Host

In

Number of rows and columns of sub(A).

d_A

Device

In

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). This array contains local parts of the symmetric matrix A.

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension MB_A.

JA

Host

In

Column index of the first column of the sub(A).`JA` must be a multiple of the column blocking dimension NB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_d

Device

In

Pointer into the memory to an array of global size N. On exit, this array contains real eigen values of the matrix A.

d_Z

Device

In

Pointer into the local memory to an array of dimension (LLD_Z, LOCc(JZ+N-1)). On exit, the array local parts of orthonormal eigenvectors of the matrix A.

IZ

Host

In

Row index of the first row of the sub(Z). IZ must be a multiple of the row blocking dimension MB_Z.

JZ

Host

In

Column index of the first column of the sub(Z). JZ must be a multiple of the column blocking dimension NB_Z.

descZ

Host

In

Matrix descriptor associated to the global matrix Z.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

The size in bytes of the local device workspace needed by cusolverMpSyevd().

workspaceInBytesOnHost

Host

Out

The size in bytes of the local host workspace needed by cusolverMpSyevd().

This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpSygst#

cusolverStatus_t cusolverMpSygst(
        cusolverMpHandle_t handle,
        cusolverEigType_t ibtype,
        cublasFillMode_t uplo,
        int64_t N,
        void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        const void *d_B,
        int64_t IB,
        int64_t JB,
        cusolverMpMatrixDescriptor_t descB,
        cudaDataType_t computeType,
        void *d_work,
        size_t workspaceInBytesOnDevice,
        void *h_work,
        size_t workspaceInBytesOnHost,
        int *info)
Reduces a hermitian-definite generalized eigenproblem to standard form. Denoting sub(A) and sub(B) as A(IA:IA+N-1, JA:JA+N-1) and B(IB:IB+N-1, JB:JB+N-1) respectively, the routine considers the following cases.
  • ibtype = CUSOLVER_EIG_TYPE_1: the problem is sub(A)*x = lambda*sub(B)*x, and sub(A) is overwritten by inv(L)*sub(A)*inv(L^H) or inv(U^H)*sub(A)*inv(U).

  • ibtype = CUSOLVER_EIG_TYPE_2 or 3: the problem is sub(A)*sub(B)*x = lambda*x or sub(B)*sub(A)*x = lambda*x, and sub(A) is overwritten by L^H*sub(A)*L or U*sub(A)*U^H.

The sub(B) includes lower or upper Cholesky factors previously computed by cusolverMpPotrf().

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

ibtype

Host

In

Indicate the eigen problem type sub(A)*x=(lambda)*sub(B)*x, sub(A)*sub(B)x=(lambda)*x, or sub(B)*sub(A)*x=(lambda)*x.

uplo

Host

In

Indicate that lower CUBLAS_FILL_MODE_LOWER or upper CUBLAS_FILL_MODE_UPPER triangular of sub(A) and sub(B) are used to compute eigen solutions.

N

Host

In

Number of rows and columns of sub(A) and sub(B).

d_A

Device

In/Out

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). This array contains local parts of the symmetric matrix A.

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension MB_A.

JA

Host

In

Column index of the first column of the sub(A). JA must be a multiple of the column blocking dimension NB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_B

Device

In

Pointer into the local memory to an array of dimension (LLD_B, LOCc(JB+N-1)). This array contains local parts of the symmetric matrix B.

IB

Host

In

Row index of the first row of the sub(B). IB must be a multiple of the row blocking dimension MB_B.

JB

Host

In

Column index of the first column of the sub(B). JB must be a multiple of the column blocking dimension NB_B.

descB

Host

In

Matrix descriptor associated to the global matrix B.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by cusolverMpSygst().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by cusolverMpSygst().

info

Device

Out

info < 0 indicates an incorrect value of the i-th argument of the function.

The routine requires some alignment properties
  • Same square blocksize is used (MB == NB) for the matrix A and B.

  • The beginning row and column of A and B are aligned each other i.e., (IA == IB) and (JA == JB.

Note that the current implementation supports the inputs of ibtype = CUSOLVER_EIG_TYPE_1, uplo = CUBLAS_FILL_MODE_LOWER.
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpSygst_bufferSize#

cusolverStatus_t cusolverMpSygst_bufferSize(
        cusolverMpHandle_t handle,
        cusolverEigType_t ibtype,
        cublasFillMode_t uplo,
        int64_t N,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        int64_t IB,
        int64_t JB,
        cusolverMpMatrixDescriptor_t descB,
        cudaDataType_t computeType,
        size_t *workspaceInBytesOnDevice,
        size_t *workspaceInBytesOnHost)
Computes the size in bytes of the host and device working buffers required by cusolverMpSygst().

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

ibtype

Host

In

Indicate the eigen problem type sub(A)*x=(lambda)*sub(B)*x, sub(A)*sub(B)x=(lambda)*x, or sub(B)*sub(A)*x=(lambda)*x.

uplo

Host

In

Indicate that lower CUBLAS_FILL_MODE_LOWER or upper CUBLAS_FILL_MODE_UPPER triangular of sub(A) and sub(B) are used to compute eigen solutions.

N

Host

In

Number of rows and columns of sub(A) and sub(B).

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension MB_A.

JA

Host

In

Column index of the first column of the sub(A). JA must be a multiple of the column blocking dimension NB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

IB

Host

In

Row index of the first row of the sub(B). IB must be a multiple of the row blocking dimension MB_B.

JB

Host

In

Column index of the first column of the sub(B). JB must be a multiple of the column blocking dimension NB_B.

descB

Host

In

Matrix descriptor associated to the global matrix B.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

The size in bytes of the local device workspace needed by cusolverMpSygst().

workspaceInBytesOnHost

Host

Out

The size in bytes of the local host workspace needed by cusolverMpSygst().

The routine requires some alignment properties
  • Same square blocksize is used (MB == NB) for the matrix A and B.

  • The beginning row and column of A and B are aligned each other i.e., (IA == IB) and (JA == JB.

Note that the current implementation supports the inputs of ibtype = CUSOLVER_EIG_TYPE_1, uplo = CUBLAS_FILL_MODE_LOWER.
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpSygvd#

cusolverStatus_t cusolverMpSygvd(
        cusolverMpHandle_t handle,
        cusolverEigType_t ibtype,
        cusolverEigMode_t jobz,
        cublasFillMode_t uplo,
        int64_t N,
        void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        void *d_B,
        int64_t IB,
        int64_t JB,
        cusolverMpMatrixDescriptor_t descB,
        void *d_d,
        void *d_Z,
        int64_t IZ,
        int64_t JZ,
        cusolverMpMatrixDescriptor_t descZ,
        cudaDataType_t computeType,
        void *d_work,
        size_t workspaceInBytesOnDevice,
        void *h_work,
        size_t workspaceInBytesOnHost,
        int *info)
Computes a hermitian-definite generalized eigenproblem using cusolverMpSyevd(). Denoting sub(A) and sub(B) as A(IA:IA+N-1, JA:JA+N-1) and B(IB:IB+N-1, JB:JB+N-1) respectively, the routine considers the following cases.
  • ibtype = CUSOLVER_EIG_TYPE_1: the problem is sub(A)*x = lambda*sub(B)*x.

  • ibtype = CUSOLVER_EIG_TYPE_2: the problem is sub(A)*sub(B)*x = lambda*x.

  • ibtype = CUSOLVER_EIG_TYPE_3: the problem is sub(B)*sub(A)*x = lambda*x.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

ibtype

Host

In

Indicate the eigen problem type sub(A)*x=(lambda)*sub(B)*x, sub(A)*sub(B)x=(lambda)*x, or sub(B)*sub(A)*x=(lambda)*x.

jobz

Host

In

Indicate whether the routine computes eigenvalues only CUSOLVER_EIG_MODE_NOVECTOR or includes eigenvectors as well CUSOLVER_EIG_MODE_VECTOR.

uplo

Host

In

Indicate that lower CUBLAS_FILL_MODE_LOWER or upper CUBLAS_FILL_MODE_UPPER triangular of sub(A) and sub(B) are used to compute eigen solutions.

N

Host

In

Number of rows and columns of sub(A) and sub(B).

d_A

Device

In/Out

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). This array contains local parts of the symmetric matrix A and will be overwritten with standard eigen problem.

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension MB_A.

JA

Host

In

Column index of the first column of the sub(A). JA must be a multiple of the column blocking dimension NB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_B

Device

In/Out

Pointer into the local memory to an array of dimension (LLD_B, LOCc(JB+N-1)). This array contains local parts of the symmetric matrix B and will be overwritten with Cholesky factors.

IB

Host

In

Row index of the first row of the sub(B). IB must be a multiple of the row blocking dimension MB_B.

JB

Host

In

Column index of the first column of the sub(B). JB must be a multiple of the column blocking dimension NB_B.

descB

Host

In

Matrix descriptor associated to the global matrix B.

d_d

Device

Out

Pointer into the memory to an array of global size N. On exit, this array contains real eigen values of the matrix A.

d_Z

Device

Out

Pointer into the local memory to an array of dimension (LLD_Z, LOCc(JZ+N-1)). On exit, the array local parts of orthonormal eigenvectors of the matrix A.

IZ

Host

In

Row index of the first row of the sub(Z). IZ must be a multiple of the row blocking dimension MB_Z.

JZ

Host

In

Column index of the first column of the sub(Z). JZ must be a multiple of the column blocking dimension NB_Z.

descZ

Host

In

Matrix descriptor associated to the global matrix Z.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cusolverMpSygvd_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cusolverMpSygvd_bufferSize().

info

Device

Out

info < 0 indicates an incorrect value of the i-th argument of the function.

The routine requires some alignment properties
  • Same square blocksize is used (MB == NB) for the matrix A, B, and Z.

  • The beginning row and column of A, B and Z are aligned each other i.e., (IA == IB == IZ) and (JA == JB == JZ.

Note that the current implementation supports the inputs of ibtype = CUSOLVER_EIG_TYPE_1, jobz = CUSOLVER_EIG_MODE_VECTOR, uplo = CUBLAS_FILL_MODE_LOWER.
This routine supports the following combinations of data types:

Data Type of A and C

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpSygvd_bufferSize#

cusolverStatus_t cusolverMpSygvd_bufferSize(
        cusolverMpHandle_t handle,
        cusolverEigType_t ibtype,
        cusolverEigMode_t jobz,
        cublasFillMode_t uplo,
        int64_t N,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        int64_t IB,
        int64_t JB,
        cusolverMpMatrixDescriptor_t descB,
        int64_t IZ,
        int64_t JZ,
        cusolverMpMatrixDescriptor_t descZ,
        cudaDataType_t computeType,
        size_t *workspaceInBytesOnDevice,
        size_t *workspaceInBytesOnHost)
Computes the size in bytes of the host and device working buffers required by cusolverMpSygvd().

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

ibtype

Host

In

Indicate the eigen problem type sub(A)*x=(lambda)*sub(B)*x, sub(A)*sub(B)x=(lambda)*x, or sub(B)*sub(A)*x=(lambda)*x.

jobz

Host

In

Indicate whether the routine computes eigenvalues only CUSOLVER_EIG_MODE_NOVECTOR or includes eigenvectors as well CUSOLVER_EIG_MODE_VECTOR.

uplo

Host

In

Indicate that lower CUBLAS_FILL_MODE_LOWER or upper CUBLAS_FILL_MODE_UPPER triangular of sub(A) and sub(B) are used to compute eigen solutions.

N

Host

In

Number of rows and columns of sub(A) and sub(B).

IA

Host

In

Row index of the first row of the sub(A). IA must be a multiple of the row blocking dimension MB_A.

JA

Host

In

Column index of the first column of the sub(A). JA must be a multiple of the column blocking dimension NB_A.

descA

Host

In

Matrix descriptor associated to the global matrix A.

IB

Host

In

Row index of the first row of the sub(B). IB must be a multiple of the row blocking dimension MB_B.

JB

Host

In

Column index of the first column of the sub(B). JB must be a multiple of the column blocking dimension NB_B.

descB

Host

In

Matrix descriptor associated to the global matrix B.

IZ

Host

In

Row index of the first row of the sub(Z). IZ must be a multiple of the row blocking dimension MB_Z.

JZ

Host

In

Column index of the first column of the sub(Z). JZ must be a multiple of the column blocking dimension NB_Z.

descZ

Host

In

Matrix descriptor associated to the global matrix Z.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

workspaceInBytesOnDevice

Host

Out

The size in bytes of the local device workspace needed by cusolverMpSygvd().

workspaceInBytesOnHost

Host

Out

The size in bytes of the local host workspace needed by cusolverMpSygvd().

The routine requires some alignment properties
  • Same square blocksize is used (MB == NB) for the matrix A, B, and Z.

  • The beginning row and column of A, B and Z are aligned each other i.e., (IA == IB == IZ) and (JA == JB == JZ.

Note that the current implementation supports the inputs of ibtype = CUSOLVER_EIG_TYPE_1, jobz = CUSOLVER_EIG_MODE_VECTOR, uplo = CUBLAS_FILL_MODE_LOWER.
This routine supports the following combinations of data types:

Data Type of A

computeType

Output Data Type

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

CUDA_R_64F

CUDA_R_64F

CUDA_R_64F

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_64F

CUDA_C_64F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpLaset#

cusolverStatus_t cusolverMpLaset(
        cusolverMpHandle_t handle,
        cublasFillMode_t uplo,
        int64_t M,
        int64_t N,
        const void *alpha,
        const void *beta,
        void *d_A,
        int64_t IA,
        int64_t JA,
        cusolverMpMatrixDescriptor_t descA,
        int *d_info)
Initializes the off-diagonal elements of the M-by-N distributed submatrix A(IA:IA+M-1, JA:JA+N-1) with alpha and the diagonal elements with beta. This is the distributed equivalent of LAPACK’s xLASET.

The uplo parameter controls which part of the submatrix is initialized:
  • CUBLAS_FILL_MODE_LOWER: only the lower triangular part (below and including the first subdiagonal) is set to alpha, and diagonal elements are set to beta.

  • CUBLAS_FILL_MODE_UPPER: only the upper triangular part (above and including the first superdiagonal) is set to alpha, and diagonal elements are set to beta.

  • CUBLAS_FILL_MODE_FULL: all off-diagonal elements are set to alpha, and diagonal elements are set to beta.

The alpha and beta scalars may reside in either host or device memory. The pointer type is detected automatically at runtime.
This routine requires no workspace and does not perform any inter-process communication.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

uplo

Host

In

Specifies the part of the submatrix to initialize: CUBLAS_FILL_MODE_LOWER, CUBLAS_FILL_MODE_UPPER, or CUBLAS_FILL_MODE_FULL.

M

Host

In

Number of rows of the submatrix. M >= 0.

N

Host

In

Number of columns of the submatrix. N >= 0.

alpha

Host/Device

In

Scalar value for off-diagonal elements. Must match the data type of matrix A.

beta

Host/Device

In

Scalar value for diagonal elements. Must match the data type of matrix A.

d_A

Device

Out

Pointer into the local memory to an array of dimension (LLD_A, LOCc(JA+N-1)). On exit, the specified elements of the distributed submatrix are overwritten.

IA

Host

In

Row index of the first row of the submatrix. IA >= 1.

JA

Host

In

Column index of the first column of the submatrix. JA >= 1.

descA

Host

In

Matrix descriptor associated to the global matrix A.

d_info

Device

Out

d_info = 0 on success. d_info = -i indicates the i-th argument (in LAPACK parameter ordering) has an invalid value.

This routine supports the following data types (determined by the matrix descriptor):

Data Type of A

CUDA_R_32F

CUDA_R_64F

CUDA_C_32F

CUDA_C_64F

See cusolverStatus_t for the description of the return status.

cusolverMpNewtonSchulz#

cusolverStatus_t cusolverMpNewtonSchulz(
        cusolverMpHandle_t handle,
        cusolverMpNewtonSchulzDescriptor_t nsDesc,
        int64_t M,
        int64_t N,
        void *d_X,
        int64_t IX,
        int64_t JX,
        const cusolverMpMatrixDescriptor_t descX,
        int64_t numberOfNewtonSchulzIterations,
        const void *h_coeffs,
        cudaDataType_t computeType,
        void *d_work,
        size_t workspaceInBytesOnDevice,
        void *h_work,
        size_t workspaceInBytesOnHost,
        int *d_info)
Performs Newton-Schulz iterations to orthogonalize a tall or square (M >= N) distributed matrix X(IX:IX+M-1, JX:JX+N-1) in-place on supported Px1 process grids. The routine approximates the orthogonal polar factor U from the polar decomposition X = U * H, where U has orthonormal columns.

The routine performs a fixed number of user-specified iterations and does not check for convergence. The quality of the approximation depends on the number of iterations, the choice of coefficients, and the conditioning of the input matrix.

Algorithm. For the currently supported tall/square case (M >= N), each iteration i applies a polynomial update using three user-supplied coefficients (alpha_i, beta_i, gamma_i):
\[X \leftarrow \alpha_i \cdot X + \beta_i \cdot X (X^T X) + \gamma_i \cdot X (X^T X)^2\]
The input must be normalized for the iterations to converge. By default, the routine normalizes the input by its Frobenius norm: X := X / ||X||_F. This step can be disabled via the descriptor when the input is already normalized.

The coefficients h_coeffs must be provided as a host array of float triplets, with 3 * numberOfNewtonSchulzIterations elements stored as [alpha_0, beta_0, gamma_0, alpha_1, beta_1, gamma_1, ...]. See the Newton-Schulz sample for example coefficients optimized for quintic convergence in 5 iterations. The classical Newton-Schulz iteration can be recovered by setting (alpha, beta, gamma) = (1.5, -0.5, 0.0) for each iteration, though more iterations will be needed to converge.

Performance tip. For best collective communication performance, allocate the device workspace with ncclMemAlloc and register it via cusolverMpBufferRegister() before calling this routine.

Limitations:
  • Only Px1 process grids (1D row distribution with numColDevices = 1) are supported. 2D block-cyclic grids are not yet implemented.

  • Only tall or square matrices (M >= N) are supported. Wide rectangular matrices (M < N) are not yet supported.

  • Only IX = JX = 1 is supported (no submatrix offsets).

  • Only CUDA_R_16BF (bfloat16) and CUDA_R_32F (float32) value types are supported.

  • The only supported compute type is CUDA_R_32F.

  • RSRC = CSRC = 0 is required.

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

nsDesc

Host

In

cusolverMpNewtonSchulzDescriptor_t descriptor (may be NULL for default behavior). See cusolverMpNewtonSchulzDescriptorCreate().

M

Host

In

Number of rows of the submatrix X. M >= 0.

N

Host

In

Number of columns of the submatrix X. N >= 0.

d_X

Device

In/Out

Pointer into the local memory to an array of dimension (LLD_X, LOCc(JX+N-1)). On entry, contains the input matrix. On exit, overwritten with the orthogonalized matrix.

IX

Host

In

Row index of the first row of the submatrix. IX >= 1.

JX

Host

In

Column index of the first column of the submatrix. JX >= 1.

descX

Host

In

Matrix descriptor associated to the global matrix X.

numberOfNewtonSchulzIterations

Host

In

Number of Newton-Schulz iterations to perform. >= 0. Typical value is 5.

h_coeffs

Host

In

Host array of triplets with 3 * numberOfNewtonSchulzIterations elements, storing the iteration coefficients (alpha, beta, gamma) for each iteration. The element type must match computeType.

computeType

Host

In

Data type used for computations. See table below for supported combinations.

d_work

Device

Out

Device workspace of size workspaceInBytesOnDevice.

workspaceInBytesOnDevice

Host

In

The size in bytes of the local device workspace needed by the routine as provided by cusolverMpNewtonSchulz_bufferSize().

h_work

Host

Out

Host workspace of size workspaceInBytesOnHost.

workspaceInBytesOnHost

Host

In

The size in bytes of the local host workspace needed by the routine as provided by cusolverMpNewtonSchulz_bufferSize().

d_info

Device

Out

d_info = 0 on success. d_info < 0 indicates invalid input.

This routine supports the following combinations of data types:

Data Type of X

computeType

Output Data Type

CUDA_R_16BF

CUDA_R_32F

CUDA_R_16BF

CUDA_R_32F

CUDA_R_32F

CUDA_R_32F

See cusolverStatus_t for the description of the return status.

cusolverMpNewtonSchulz_bufferSize#

cusolverStatus_t cusolverMpNewtonSchulz_bufferSize(
        cusolverMpHandle_t handle,
        cusolverMpNewtonSchulzDescriptor_t nsDesc,
        int64_t M,
        int64_t N,
        void *d_X,
        int64_t IX,
        int64_t JX,
        const cusolverMpMatrixDescriptor_t descX,
        int64_t numberOfNewtonSchulzIterations,
        const void *h_coeffs,
        cudaDataType_t computeType,
        size_t *workspaceInBytesOnDevice,
        size_t *workspaceInBytesOnHost)
Computes the size in bytes of the host and device working buffers required by cusolverMpNewtonSchulz().
This routine has the same input constraints, supported data types, and current limitations as cusolverMpNewtonSchulz().

Parameter

Memory

In/Out

Description

handle

Host

In

cuSOLVERMp library handle.

nsDesc

Host

In

Newton-Schulz descriptor (may be NULL).

M

Host

In

Number of rows of the submatrix X. M >= 0.

N

Host

In

Number of columns of the submatrix X. N >= 0.

d_X

Device

In

Pointer into the local memory to an array of dimension (LLD_X, LOCc(JX+N-1)).

IX

Host

In

Row index of the first row of the submatrix. IX >= 1.

JX

Host

In

Column index of the first column of the submatrix. JX >= 1.

descX

Host

In

Matrix descriptor associated to the global matrix X.

numberOfNewtonSchulzIterations

Host

In

Number of Newton-Schulz iterations to perform.

h_coeffs

Host

In

Host array of triplets with 3 * numberOfNewtonSchulzIterations elements, storing the iteration coefficients (alpha, beta, gamma) for each iteration. The element type must match computeType.

computeType

Host

In

Data type used for computations.

workspaceInBytesOnDevice

Host

Out

On output, contains the size in bytes of the local device workspace needed by cusolverMpNewtonSchulz().

workspaceInBytesOnHost

Host

Out

On output, contains the size in bytes of the local host workspace needed by cusolverMpNewtonSchulz().

This routine supports the same data type combinations as cusolverMpNewtonSchulz().
See cusolverStatus_t for the description of the return status.