cuSOLVER API Reference
The API reference guide for cuSOLVER, a GPU accelerated library for decompositions and linear system solutions for both dense and sparse matrices.
1. Introduction
The cuSolver library is a highlevel package based on the cuBLAS and cuSPARSE libraries. It consists of two modules corresponding to two sets of API:
The cuSolver API on a single GPU
The cuSolverMG API on a single node multiGPU
Each of these can be used independently or in concert with other toolkit libraries. To simplify the notation, cuSolver denotes single GPU API and cuSolverMg denotes multiGPU API.
The intent of cuSolver is to provide useful LAPACKlike features, such as common matrix factorization and triangular solve routines for dense matrices, a sparse leastsquares solver and an eigenvalue solver. In addition cuSolver provides a new refactorization library useful for solving sequences of matrices with a shared sparsity pattern.
cuSolver combines three separate components under a single umbrella. The first part of cuSolver is called cuSolverDN, and deals with dense matrix factorization and solve routines such as LU, QR, SVD and LDLT, as well as useful utilities such as matrix and vector permutations.
Next, cuSolverSP provides a new set of sparse routines based on a sparse QR factorization. Not all matrices have a good sparsity pattern for parallelism in factorization, so the cuSolverSP library also provides a CPU path to handle those sequentiallike matrices. For those matrices with abundant parallelism, the GPU path will deliver higher performance. The library is designed to be called from C and C++.
The final part is cuSolverRF, a sparse refactorization package that can provide very good performance when solving a sequence of matrices where only the coefficients are changed but the sparsity pattern remains the same.
The GPU path of the cuSolver library assumes data is already in the device memory. It is the responsibility of the developer to allocate memory and to copy data between GPU memory and CPU memory using standard CUDA runtime API routines, such as cudaMalloc()
, cudaFree()
, cudaMemcpy()
, and cudaMemcpyAsync()
.
cuSolverMg is GPUaccelerated ScaLAPACK. By now, cuSolverMg supports 1D column block cyclic layout and provides symmetric eigenvalue solver.
Note
The cuSolver library requires hardware with a CUDA compute capability (CC) of at least 2.0 or higher. Please see the CUDA C++ Programming Guide for a list of the Compute Capabilities corresponding to all NVIDIA GPUs.
1.1. cuSolverDN: Dense LAPACK
The cuSolverDN library was designed to solve dense linear systems of the form
\(Ax = b\) 
where the coefficient matrix \(A\in R^{nxn}\) , righthandside vector \(b\in R^{n}\) and solution vector \(x\in R^{n}\)
The cuSolverDN library provides QR factorization and LU with partial pivoting to handle a general matrix A
, which may be nonsymmetric. Cholesky factorization is also provided for symmetric/Hermitian matrices. For symmetric indefinite matrices, we provide BunchKaufman (LDL) factorization.
The cuSolverDN library also provides a helpful bidiagonalization routine and singular value decomposition (SVD).
The cuSolverDN library targets computationallyintensive and popular routines in LAPACK, and provides an API compatible with LAPACK. The user can accelerate these timeconsuming routines with cuSolverDN and keep others in LAPACK without a major change to existing code.
1.2. cuSolverSP: Sparse LAPACK
The cuSolverSP library was mainly designed to a solve sparse linear system
\(Ax = b\) 
and the leastsquares problem
\(x = {argmin}{}A*z  b{}\) 
where sparse matrix \(A\in R^{mxn}\) , righthandside vector \(b\in R^{m}\) and solution vector \(x\in R^{n}\) . For a linear system, we require m=n
.
The core algorithm is based on sparse QR factorization. The matrix A
is accepted in CSR format. If matrix A
is symmetric/Hermitian, the user has to provide a full matrix, ie fill missing lower or upper part.
If matrix A
is symmetric positive definite and the user only needs to solve \(Ax = b\) , Cholesky factorization can work and the user only needs to provide the lower triangular part of A
.
On top of the linear and leastsquares solvers, the cuSolverSP
library provides a simple eigenvalue solver based on shiftinverse power method, and a function to count the number of eigenvalues contained in a box in the complex plane.
1.3. cuSolverRF: Refactorization
The cuSolverRF library was designed to accelerate solution of sets of linear systems by fast refactorization when given new coefficients in the same sparsity pattern
\(A_{i}x_{i} = f_{i}\) 
where a sequence of coefficient matrices \(A_{i}\in R^{nxn}\) , righthandsides \(f_{i}\in R^{n}\) and solutions \(x_{i}\in R^{n}\) are given for i=1,...,k
.
The cuSolverRF library is applicable when the sparsity pattern of the coefficient matrices \(A_{i}\) as well as the reordering to minimize fillin and the pivoting used during the LU factorization remain the same across these linear systems. In that case, the first linear system (i=1
) requires a full LU factorization, while the subsequent linear systems (i=2,...,k
) require only the LU refactorization. The later can be performed using the cuSolverRF library.
Notice that because the sparsity pattern of the coefficient matrices, the reordering and pivoting remain the same, the sparsity pattern of the resulting triangular factors \(L_{i}\) and \(U_{i}\) also remains the same. Therefore, the real difference between the full LU factorization and LU refactorization is that the required memory is known ahead of time.
1.4. Naming Conventions
The cuSolverDN library provides two different APIs; legacy
and generic
.
The functions in the legacy API are available for data types float
, double
, cuComplex
, and cuDoubleComplex
. The naming convention for the legacy API is as follows:

where <t
> can be S
, D
, C
, Z
, or X
, corresponding to the data types float
, double
, cuComplex
, cuDoubleComplex
, and the generic type, respectively. <operation
> can be Cholesky factorization (potrf
), LU with partial pivoting (getrf
), QR factorization (geqrf
) and BunchKaufman factorization (sytrf
).
The functions in the generic API provide a single entry point for each routine and support for 64bit integers to define matrix and vector dimensions. The naming convention for the generic API is dataagnostic and is as follows:

where <operation
> can be Cholesky factorization (potrf
), LU with partial pivoting (getrf
) and QR factorization (geqrf
).
The cuSolverSP library functions are available for data types float
, double
, cuComplex
, and cuDoubleComplex
. The naming convention is as follows:

where cuSolverSp
is the GPU path and cusolverSpHost
is the corresponding CPU path. <t
> can be S
, D
, C
, Z
, or X
, corresponding to the data types float
, double
, cuComplex
, cuDoubleComplex
, and the generic type, respectively.
The <matrix data format
> is csr
, compressed sparse row format.
The <operation
> can be ls
, lsq
, eig
, eigs
, corresponding to linear solver, leastsquare solver, eigenvalue solver and number of eigenvalues in a box, respectively.
The <output matrix data format
> can be v
or m
, corresponding to a vector or a matrix.
<based on
> describes which algorithm is used. For example, qr
(sparse QR factorization) is used in linear solver and leastsquare solver.
All of the functions have the return type cusolverStatus_t
and are explained in more detail in the chapters that follow.
Routine 
Data format 
Operation 
Output format 
Based on 

































The cuSolverRF library routines are available for data type double
. Most of the routines follow the naming convention:

where the trailing optional Host qualifier indicates the data is accessed on the host versus on the device, which is the default. The <operation
> can be Setup
, Analyze
, Refactor
, Solve
, ResetValues
, AccessBundledFactors
and ExtractSplitFactors
.
Finally, the return type of the cuSolverRF library routines is cusolverStatus_t
.
1.5. Asynchronous Execution
The cuSolver library functions prefer to keep asynchronous execution as much as possible. Developers can always use the cudaDeviceSynchronize()
function to ensure that the execution of a particular cuSolver library routine has completed.
A developer can also use the cudaMemcpy()
routine to copy data from the device to the host and vice versa, using the cudaMemcpyDeviceToHost
and cudaMemcpyHostToDevice
parameters, respectively. In this case there is no need to add a call to cudaDeviceSynchronize()
because the call to cudaMemcpy()
with the above parameters is blocking and completes only when the results are ready on the host.
1.6. Library Property
The libraryPropertyType
data type is an enumeration of library property types. (ie. CUDA version X.Y.Z would yield MAJOR_VERSION=X
, MINOR_VERSION=Y
, PATCH_LEVEL=Z
)
typedef enum libraryPropertyType_t
{
MAJOR_VERSION,
MINOR_VERSION,
PATCH_LEVEL
} libraryPropertyType;
The following code can show the version of cusolver library.
int major=1,minor=1,patch=1;
cusolverGetProperty(MAJOR_VERSION, &major);
cusolverGetProperty(MINOR_VERSION, &minor);
cusolverGetProperty(PATCH_LEVEL, &patch);
printf("CUSOLVER Version (Major,Minor,PatchLevel): %d.%d.%d\n", major,minor,patch);
1.7. High Precision Package
The cusolver
library uses high precision for iterative refinement when necessary.
2. Using the CUSOLVER API
2.1. General Description
This chapter describes how to use the cuSolver library API. It is not a reference for the cuSolver API data types and functions; that is provided in subsequent chapters.
2.1.1. Thread Safety
The library is threadsafe, and its functions can be called from multiple host threads.
2.1.2. Scalar Parameters
In the cuSolver API, the scalar parameters can be passed by reference on the host.
2.1.3. Parallelism with Streams
If the application performs several small independent computations, or if it makes data transfers in parallel with the computation, then CUDA streams can be used to overlap these tasks.
The application can conceptually associate a stream with each task. To achieve the overlap of computation between the tasks, the developer should:
Create CUDA streams using the function
cudaStreamCreate()
, andSet the stream to be used by each individual cuSolver library routine by calling, for example,
cusolverDnSetStream()
, just prior to calling the actual cuSolverDN routine.
The computations performed in separate streams would then be overlapped automatically on the GPU, when possible. This approach is especially useful when the computation performed by a single task is relatively small, and is not enough to fill the GPU with work, or when there is a data transfer that can be performed in parallel with the computation.
2.1.4. How to Link cusolver Library
cusolver
library provides dynamic library libcusolver.so
and static library libcusolver_static.a
. If the user links the application with libcusolver.so
, libcublas.so
, libcublasLt.so
and libcusparse.so
are also required. If the user links the application with libcusolver_static.a
, the following libraries are also needed, libcudart_static.a
, libculibos.a
libcusolver_lapack_static.a
, libmetis_static.a
, libcublas_static.a
and libcusparse_static.a
.
2.1.5. Link Thirdparty LAPACK Library
Starting with CUDA 10.1 update 2, NVIDIA LAPACK library libcusolver_lapack_static.a
is a subset of LAPACK and only contains GPU accelerated stedc
and bdsqr
. The user has to link libcusolver_static.a
with libcusolver_lapack_static.a
in order to build the application successfully. Prior to CUDA 10.1 update 2, the user can replace libcusolver_lapack_static.a
with a thirdparty LAPACK library, for example, MKL. In CUDA 10.1 update 2, the thirdparty LAPACK library no longer affects the behavior of cusolver library, neither functionality nor performance. Furthermore the user cannot use libcusolver_lapack_static.a
as a standalone LAPACK library because it is only a subset of LAPACK.
If you use
libcusolver_static.a
, then you must link withlibcusolver_lapack_static.a
explicitly, otherwise the linker will report missing symbols. There are no symbol conflicts betweenlibcusolver_lapack_static.a
and other thirdparty LAPACK libraries, which allows linking the same application tolibcusolver_lapack_static.a
and another thirdparty LAPACK library.The
libcusolver_lapack_static.a
is built insidelibcusolver.so
. Hence, if you uselibcusolver.so
, then you don’t need to specify a LAPACK library. Thelibcusolver.so
will not pick up any routines from the thirdparty LAPACK library even if you link the application with it.
2.1.6. Convention of info
Each LAPACK routine returns an info
which indicates the position of invalid parameter. If info = i
, then ith parameter is invalid. To be consistent with base1 in LAPACK, cusolver
does not report invalid handle
into info
. Instead, cusolver
returns CUSOLVER_STATUS_NOT_INITIALIZED
for invalid handle
.
2.1.7. Usage of _bufferSize
There is no cudaMalloc inside cuSolver
library, the user must allocate the device workspace explicitly. The routine xyz_bufferSize
is to query the size of workspace of the routine xyz
, for example xyz = potrf
. To make the API simple, xyz_bufferSize
follows almost the same signature of xyz
even it only depends on some parameters, for example, device pointer is not used to decide the size of workspace. In most cases, xyz_bufferSize
is called in the beginning before actual device data (pointing by a device pointer) is prepared or before the device pointer is allocated. In such case, the user can pass null pointer to xyz_bufferSize
without breaking the functionality.
2.1.8. Deterministic Results
Throughout this documentation, a function is declared as deterministic if it computes the exact same bitwise results for every execution with the same input parameters, hard and software environment. Conversely, a nondeterministic function might compute bitwise different results due to a varying order of floating point operations, e.g., a sum s
of four values a
, b
, c
, d
can be computed in different orders:
s = (a + b) + (c + d)
s = (a + (b + c)) + d
s = a + (b + (c + d))
…
Due to the nonassociativity of floating point arithmetic, all results might be bitwise different.
By default, cuSolverDN computes deterministic results. For improved performance of some functions, it is possible to allow nondeterministic results with cusolverDnSetDeterministicMode()
.
2.2. cuSolver Types Reference
2.2.1. cuSolverDN Types
The float
, double
, cuComplex
, and cuDoubleComplex
data types are supported. The first two are standard C data types, while the last two are exported from cuComplex.h
. In addition, cuSolverDN uses some familiar types from cuBLAS.
2.2.1.1. cusolverDnHandle_t
This is a pointer type to an opaque cuSolverDN context, which the user must initialize by calling cusolverDnCreate()
prior to calling any other library function. An uninitialized Handle object will lead to unexpected behavior, including crashes of cuSolverDN. The handle created and returned by cusolverDnCreate()
must be passed to every cuSolverDN function.
2.2.1.2. cublasFillMode_t
The type indicates which part (lower or upper) of the dense matrix was filled and consequently should be used by the function. Its values correspond to Fortran characters ‘L’
or ‘l’
(lower) and ‘U’
or ‘u’
(upper) that are often used as parameters to legacy BLAS implementations.
Value 
Meaning 


The lower part of the matrix is filled. 

The upper part of the matrix is filled. 
2.2.1.3. cublasOperation_t
The cublasOperation_t
type indicates which operation needs to be performed with the dense matrix. Its values correspond to Fortran characters ‘N’
or ‘n’
(nontranspose), ‘T’
or ‘t’
(transpose) and ‘C’
or ‘c’
(conjugate transpose) that are often used as parameters to legacy BLAS implementations.
Value 
Meaning 


The nontranspose operation is selected. 

The transpose operation is selected. 

The conjugate transpose operation is selected. 
2.2.1.4. cusolverEigType_t
The cusolverEigType_t
type indicates which type of eigenvalue the solver is. Its values correspond to Fortran integer 1
(A*x = lambda*B*x), 2
(A*B*x = lambda*x), 3
(B*A*x = lambda*x), used as parameters to legacy LAPACK implementations.
Value 
Meaning 


A*x = lambda*B*x 

A*B*x = lambda*x 

B*A*x = lambda*x 
2.2.1.5. cusolverEigMode_t
The cusolverEigMode_t
type indicates whether or not eigenvectors are computed. Its values correspond to Fortran character 'N'
(only eigenvalues are computed), 'V'
(both eigenvalues and eigenvectors are computed) used as parameters to legacy LAPACK implementations.
Value 
Meaning 


Only eigenvalues are computed. 

Both eigenvalues and eigenvectors are computed. 
2.2.1.6. cusolverIRSRefinement_t
The cusolverIRSRefinement_t
type indicates which solver type would be used for the specific cusolver function. Most of our experimentation shows that CUSOLVER_IRS_REFINE_GMRES is the best option.
More details about the refinement process can be found in Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J. Higham. 2018. Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixedprecision iterative refinement solvers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ‘18). IEEE Press, Piscataway, NJ, USA, Article 47, 11 pages.
Value 
Meaning 


Solver is not set; this value is what is set when creating the 

No refinement solver, the IRS solver performs a factorisation followed by a solve without any refinement. For example if the IRS solver was 

Classical iterative refinement solver. Similar to the one used in LAPACK routines. 

GMRES (Generalized Minimal Residual) based iterative refinement solver. In recent study, the GMRES method has drawn the scientific community attention for its ability to be used as refinement solver that outperforms the classical iterative refinement method. based on our experimentation, we recommend this setting. 

Classical iterative refinement solver that uses the GMRES (Generalized Minimal Residual) internally to solve the correction equation at each iteration. We call the classical refinement iteration the outer iteration while the GMRES is called inner iteration. Note that if the tolerance of the inner GMRES is set very low, lets say to machine precision, then the outer classical refinement iteration will performs only one iteration and thus this option will behave like 

Similar to 
2.2.1.7. cusolverDnIRSParams_t
This is a pointer type to an opaque cusolverDnIRSParams_t
structure, which holds parameters for the iterative refinement linear solvers such as cusolverDnXgesv()
. Use corresponding helper functions described below to either Create/Destroy this structure or Set/Get solver parameters.
2.2.1.8. cusolverDnIRSInfos_t
This is a pointer type to an opaque cusolverDnIRSInfos_t
structure, which holds information about the performed call to an iterative refinement linear solver (e.g., cusolverDnXgesv()
). Use corresponding helper functions described below to either Create/Destroy this structure or retrieve solve information.
2.2.1.9. cusolverDnFunction_t
The cusolverDnFunction_t
type indicates which routine needs to be configured by cusolverDnSetAdvOptions()
. The value CUSOLVERDN_GETRF
corresponds to the routine Getrf
.
Value 
Meaning 


Corresponds to 
2.2.1.10. cusolverAlgMode_t
The cusolverAlgMode_t
type indicates which algorithm is selected by cusolverDnSetAdvOptions()
. The set of algorithms supported for each routine is described in detail along with the routine’s documentation.
The default algorithm is CUSOLVER_ALG_0
. The user can also provide NULL
to use the default algorithm.
2.2.1.11. cusolverStatus_t
This is the same as cusolverStatus_t
in the sparse LAPACK section.
2.2.1.12. cusolverDeterministicMode_t
The cusolverDeterministicMode_t
type indicates whether multiple cuSolver function executions with the same input have the same bitwise equal result (deterministic) or might have bitwise different results (nondeterministic). In comparison to cublasAtomicsMode_t, which only includes the usage of atomic functions, cusolverDeterministicMode_t
includes all nondeterministic programming patterns. The deterministic mode can be set and queried using cusolverDnSetDeterministicMode()
and cusolverDnGetDeterministicMode()
routines, respectively.
Value 
Meaning 


compute deterministic results. 

Allow nondeterministic results. 
2.2.2. cuSolverSP Types
The float
, double
, cuComplex
, and cuDoubleComplex
data types are supported. The first two are standard C data types, while the last two are exported from cuComplex.h
.
2.2.2.1. cusolverSpHandle_t
This is a pointer type to an opaque cuSolverSP context, which the user must initialize by calling cusolverSpCreate()
prior to calling any other library function. An uninitialized Handle object will lead to unexpected behavior, including crashes of cuSolverSP. The handle created and returned by cusolverSpCreate()
must be passed to every cuSolverSP function.
2.2.2.2. cusparseMatDescr_t
We have chosen to keep the same structure as exists in cuSparse to describe the shape and properties of a matrix. This enables calls to either cuSPARSE or cuSOLVER using the same matrix description.
typedef struct {
cusparseMatrixType_t MatrixType;
cusparseFillMode_t FillMode;
cusparseDiagType_t DiagType;
cusparseIndexBase_t IndexBase;
} cusparseMatDescr_t;
Please read documenation of the cuSPARSE Library to understand each field of cusparseMatDescr_t
.
2.2.2.3. cusolverStatus_t
This is a status type returned by the library functions and it can have the following values.

The operation completed successfully. 

The cuSolver library was not initialized. This is usually caused by the lack of a prior call, an error in the CUDA Runtime API called by the cuSolver routine, or an error in the hardware setup. To correct: call 

Resource allocation failed inside the cuSolver library. This is usually caused by a To correct: prior to the function call, deallocate previously allocated memory as much as possible. 

An unsupported value or parameter was passed to the function (a negative vector size, for example). To correct: ensure that all the parameters being passed have valid values. 

The function requires a feature absent from the device architecture; usually caused by the lack of support for atomic operations or double precision. To correct: compile and run the application on a device with compute capability 2.0 or above. 

The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons. To correct: check that the hardware, an appropriate version of the driver, and the cuSolver library are correctly installed. 

An internal cuSolver operation failed. This error is usually caused by a To correct: check that the hardware, an appropriate version of the driver, and the cuSolver library are correctly installed. Also, check that the memory passed as a parameter to the routine is not being deallocated prior to the routine’s completion. 

The matrix type is not supported by this function. This is usually caused by passing an invalid matrix descriptor to the function. To correct: check that the fields in 
2.2.3. cuSolverRF Types
cuSolverRF only supports double
.
2.2.3.1. cusolverRfHandle_t
The cusolverRfHandle_t
is a pointer to an opaque data structure that contains the cuSolverRF library handle. The user must initialize the handle by calling cusolverRfCreate()
prior to any other cuSolverRF library calls. The handle is passed to all other cuSolverRF library calls.
2.2.3.2. cusolverRfMatrixFormat_t
The cusolverRfMatrixFormat_t
is an enum that indicates the input/output matrix format assumed by the cusolverRfSetupDevice()
, cusolverRfSetupHost()
, cusolverRfResetValues()
, cusolveRfExtractBundledFactorsHost()
and cusolverRfExtractSplitFactorsHost()
routines.
Value 
Meaning 


Matrix format CSR is assumed. (default) 

Matrix format CSC is assumed. 
2.2.3.3. cusolverRfNumericBoostReport_t
The cusolverRfNumericBoostReport_t
is an enum that indicates whether numeric boosting (of the pivot) was used during the cusolverRfRefactor()
and cusolverRfSolve()
routines. The numeric boosting is disabled by default.
Value 
Meaning 


Numeric boosting not used. (default) 

Numeric boosting used. 
2.2.3.4. cusolverRfResetValuesFastMode_t
The cusolverRfResetValuesFastMode_t
is an enum that indicates the mode used for the cusolverRfResetValues()
routine. The fast mode requires extra memory and is recommended only if very fast calls to cusolverRfResetValues()
are needed.
Value 
Meaning 


Fast mode disabled. (default) 

Ffast mode enabled. 
2.2.3.5. cusolverRfFactorization_t
The cusolverRfFactorization_t
is an enum that indicates which (internal) algorithm is used for refactorization in the cusolverRfRefactor()
routine.
Value 
Meaning 


Algorithm 0. (default) 

Algorithm 1. 

Algorithm 2. Dominobased scheme. 
2.2.3.6. cusolverRfTriangularSolve_t
The cusolverRfTriangularSolve_t
is an enum that indicates which (internal) algorithm is used for triangular solve in the cusolverRfSolve()
routine.
Value 
Meaning 


Algorithm 1. (default) 

Algorithm 2. Dominobased scheme. 

Aalgorithm 3. Dominobased scheme. 
2.2.3.7. cusolverRfUnitDiagonal_t
The cusolverRfUnitDiagonal_t
is an enum that indicates whether and where the unit diagonal is stored in the input/output triangular factors in the cusolverRfSetupDevice()
, cusolverRfSetupHost()
and cusolverRfExtractSplitFactorsHost()
routines.
Value 
Meaning 


Unit diagonal is stored in lower triangular factor. (default) 

Unit diagonal is stored in upper triangular factor. 

Unit diagonal is assumed in lower triangular factor. 

Unit diagonal is assumed in upper triangular factor. 
2.2.3.8. cusolverStatus_t
The cusolverStatus_t
is an enum that indicates success or failure of the cuSolverRF library call. It is returned by all the cuSolver library routines, and it uses the same enumerated values as the sparse and dense Lapack routines.
2.3. cuSolver Formats Reference
2.3.1. Index Base Format
The CSR or CSC format requires either zerobased or onebased index for a sparse matrix A
. The GLU library supports only zerobased indexing. Otherwise, both onebased and zerobased indexing are supported in cuSolver.
2.3.2. Vector (Dense) Format
The vectors are assumed to be stored linearly in memory. For example, the vector
\(x = \begin{pmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \\ \end{pmatrix}\) 
is represented as
\(\begin{pmatrix} x_{1} & x_{2} & \ldots & x_{n} \\ \end{pmatrix}\) 
2.3.3. Matrix (Dense) Format
The dense matrices are assumed to be stored in columnmajor order in memory. The submatrix can be accessed using the leading dimension of the original matrix. For examle, the m*n
(sub)matrix
\(\begin{pmatrix} a_{1,1} & \ldots & a_{1,n} \\ a_{2,1} & \ldots & a_{2,n} \\ \vdots & & \\ a_{m,1} & \ldots & a_{m,n} \\ \end{pmatrix}\) 
is represented as
\(\begin{pmatrix} a_{1,1} & \ldots & a_{1,n} \\ a_{2,1} & \ldots & a_{2,n} \\ \vdots & \ddots & \vdots \\ a_{m,1} & \ldots & a_{m,n} \\ \vdots & \ddots & \vdots \\ a_{{lda},1} & \ldots & a_{{lda},n} \\ \end{pmatrix}\) 
with its elements arranged linearly in memory as
\(\begin{pmatrix} a_{1,1} & a_{2,1} & \ldots & a_{m,1} & \ldots & a_{{lda},1} & \ldots & a_{1,n} & a_{2,n} & \ldots & a_{m,n} & \ldots & a_{{lda},n} \\ \end{pmatrix}\) 
where lda
≥ m
is the leading dimension of A
.
2.3.4. Matrix (CSR) Format
In CSR format the matrix is represented by the following parameters:
Parameter 
Type 
Size 
Meaning 



The number of rows (and columns) in the matrix. 



The number of nonzero elements in the matrix. 




The array of offsets corresponding to the start of each row in the arrays 



The array of column indices corresponding to the nonzero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. 



The array of values corresponding to the nonzero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. 
Note that in our CSR format, sparse matrices are assumed to be stored in rowmajor order, in other words, the index arrays are first sorted by row indices and then within each row by column indices. Also it is assumed that each pair of row and column indices appears only once.
For example, the 4x4
matrix
\(A = \begin{pmatrix} {1.0} & {3.0} & {0.0} & {0.0} \\ {0.0} & {4.0} & {6.0} & {0.0} \\ {2.0} & {5.0} & {7.0} & {8.0} \\ {0.0} & {0.0} & {0.0} & {9.0} \\ \end{pmatrix}\) 
is represented as
\({csrRowPtr} = \begin{pmatrix} 0 & 2 & 4 & 8 & 9 \\ \end{pmatrix}\) 
\({csrColInd} = \begin{pmatrix} 0 & 1 & 1 & 2 & 0 & 1 & 2 & 3 & 3 \\ \end{pmatrix}\) 
\({csrVal} = \begin{pmatrix} 1.0 & 3.0 & 4.0 & 6.0 & 2.0 & 5.0 & 7.0 & 8.0 & 9.0 \\ \end{pmatrix}\) 
2.3.5. Matrix (CSC) Format
In CSC format the matrix is represented by the following parameters:
Parameter 
Type 
Size 
Meaning 



The number of rows (and columns) in the matrix. 



The number of nonzero elements in the matrix. 




The array of offsets corresponding to the start of each column in the arrays 



The array of row indices corresponding to the nonzero elements in the matrix. It is assumed that this array is sorted by column and by row within each column. 



The array of values corresponding to the nonzero elements in the matrix. It is assumed that this array is sorted by column and by row within each column. 
Note that in our CSC format, sparse matrices are assumed to be stored in columnmajor order, in other words, the index arrays are first sorted by column indices and then within each column by row indices. Also it is assumed that each pair of row and column indices appears only once.
For example, the 4x4
matrix
\(A = \begin{pmatrix} {1.0} & {3.0} & {0.0} & {0.0} \\ {0.0} & {4.0} & {6.0} & {0.0} \\ {2.0} & {5.0} & {7.0} & {8.0} \\ {0.0} & {0.0} & {0.0} & {9.0} \\ \end{pmatrix}\) 
is represented as
\({cscColPtr} = \begin{pmatrix} 0 & 2 & 5 & 7 & 9 \\ \end{pmatrix}\) 
\({cscRowInd} = \begin{pmatrix} 0 & 2 & 0 & 1 & 2 & 1 & 2 & 2 & 3 \\ \end{pmatrix}\) 
\({cscVal} = \begin{pmatrix} 1.0 & 2.0 & 3.0 & 4.0 & 5.0 & 6.0 & 7.0 & 8.0 & 9.0 \\ \end{pmatrix}\) 
2.4. cuSolverDN: dense LAPACK Function Reference
This section describes the API of cuSolverDN, which provides a subset of dense LAPACK functions.
2.4.1. cuSolverDN Helper Function Reference
The cuSolverDN helper functions are described in this section.
2.4.1.1. cusolverDnCreate()
cusolverStatus_t
cusolverDnCreate(cusolverDnHandle_t *handle);
This function initializes the cuSolverDN library and creates a handle on the cuSolverDN context. It must be called before any other cuSolverDN API function is invoked. It allocates hardware resources necessary for accessing the GPU.
Parameter 
Memory 
In/out 
Meaning 




The pointer to the handle to the cuSolverDN context. 
Status Returned

The initialization succeeded. 

The CUDA Runtime initialization failed. 

The resources could not be allocated. 

The device only supports compute capability 2.0 and above. 
2.4.1.2. cusolverDnDestroy()
cusolverStatus_t
cusolverDnDestroy(cusolverDnHandle_t handle);
This function releases CPUside resources used by the cuSolverDN library.
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 
Status Returned

The shutdown succeeded. 

The library was not initialized. 
2.4.1.3. cusolverDnSetStream()
cusolverStatus_t
cusolverDnSetStream(cusolverDnHandle_t handle, cudaStream_t streamId)
This function sets the stream to be used by the cuSolverDN library to execute its routines.
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



The stream to be used by the library. 
Status Returned

The stream was set successfully. 

The library was not initialized. 
2.4.1.4. cusolverDnGetStream()
cusolverStatus_t
cusolverDnGetStream(cusolverDnHandle_t handle, cudaStream_t *streamId)
This function queries the stream to be used by the cuSolverDN library to execute its routines.
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



The stream which is used by 
Status Returned

The stream was set successfully. 

The library was not initialized. 
2.4.1.5. cusolverDnSetDeterministicMode()
cusolverStatus_t
cusolverDnSetDeterministicMode(cusolverDnHandle_t handle, cusolverDeterministicMode_t mode)
This function sets the determinstic mode of all cuSolverDN functions for handle
. For improved performance,
nondeterministic results can be allowed. Affected functions are cusolverDn<t>geqrf()
, cusolverDn<t>syevd()
, cusolverDn<t>syevdx()
, cusolverDn<t>gesvdj()
, cusolverDnXgeqrf()
, cusolverDnXsyevd()
, cusolverDnXsyevdx()
, cusolverDnXgesvdr()
and cusolverDnXgesvdp()
.
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



The deterministic mode to be used with 
Status Returned

The mode was set successfully. 

The library was not initialized. 

An internal error occurred. 
2.4.1.6. cusolverDnGetDeterministicMode()
cusolverStatus_t
cusolverDnGetDeterministicMode(cusolverDnHandle_t handle, cusolverDeterministicMode_t* mode)
This function queries the determinstic mode which is set for handle
.
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



The deterministic mode of 
Status Returned

The mode was set successfully. 

The library was not initialized. 


2.4.1.7. cusolverDnCreateSyevjInfo()
cusolverStatus_t
cusolverDnCreateSyevjInfo(
syevjInfo_t *info);
This function creates and initializes the structure of syevj
, syevjBatched
and sygvj
to default values.
Parameter 
Memory 
In/out 
Meaning 




The pointer to the structure of 
Status Returned

The structure was initialized successfully. 

The resources could not be allocated. 
2.4.1.8. cusolverDnDestroySyevjInfo()
cusolverStatus_t
cusolverDnDestroySyevjInfo(
syevjInfo_t info);
This function destroys and releases any memory required by the structure.
Parameter 
Memory 
In/out 
Meaning 




The structure of 
Status Returned

The resources are released successfully. 
2.4.1.9. cusolverDnXsyevjSetTolerance()
cusolverStatus_t
cusolverDnXsyevjSetTolerance(
syevjInfo_t info,
double tolerance)
This function configures tolerance of syevj
.
Parameter 
Memory 
In/out 
Meaning 




The pointer to the structure of 



accuracy of numerical eigenvalues. 
Status Returned

The operation completed successfully. 
2.4.1.10. cusolverDnXsyevjSetMaxSweeps()
cusolverStatus_t
cusolverDnXsyevjSetMaxSweeps(
syevjInfo_t info,
int max_sweeps)
This function configures maximum number of sweeps in syevj
. The default value is 100.
Parameter 
Memory 
In/out 
Meaning 




The pointer to the structure of 



Maximum number of sweeps. 
Status Returned

The operation completed successfully. 
2.4.1.11. cusolverDnXsyevjSetSortEig()
cusolverStatus_t
cusolverDnXsyevjSetSortEig(
syevjInfo_t info,
int sort_eig)
If sort_eig
is zero, the eigenvalues are not sorted. This function only works for syevjBatched
. syevj
and sygvj
always sort eigenvalues in ascending order. By default, eigenvalues are always sorted in ascending order.
Parameter 
Memory 
In/out 
Meaning 




The pointer to the structure of syevj. 



If 
Status Returned

The operation completed successfully. 
2.4.1.12. cusolverDnXsyevjGetResidual()
cusolverStatus_t
cusolverDnXsyevjGetResidual(
cusolverDnHandle_t handle,
syevjInfo_t info,
double *residual)
This function reports residual of syevj
or sygvj
. It does not support syevjBatched
. If the user calls this function after syevjBatched
, the error CUSOLVER_STATUS_NOT_SUPPORTED
is returned.
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



The pointer to the structure of 



Residual of 
Status Returned

The operation completed successfully. 

Does not support batched version. 
2.4.1.13. cusolverDnXsyevjGetSweeps()
cusolverStatus_t
cusolverDnXsyevjGetSweeps(
cusolverDnHandle_t handle,
syevjInfo_t info,
int *executed_sweeps)
This function reports number of executed sweeps of syevj
or sygvj
. It does not support syevjBatched
. If the user calls this function after syevjBatched
, the error CUSOLVER_STATUS_NOT_SUPPORTED
is returned.
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



The pointer to the structure of 



Number of executed sweeps. 
Status Returned

The operation completed successfully. 

Does not support batched version. 
2.4.1.14. cusolverDnCreateGesvdjInfo()
cusolverStatus_t
cusolverDnCreateGesvdjInfo(
gesvdjInfo_t *info);
This function creates and initializes the structure of gesvdj
and gesvdjBatched
to default values.
Parameter 
Memory 
In/out 
Meaning 




The pointer to the structure of 
Status Returned

The structure was initialized successfully. 

The resources could not be allocated. 
2.4.1.15. cusolverDnDestroyGesvdjInfo()
cusolverStatus_t
cusolverDnDestroyGesvdjInfo(
gesvdjInfo_t info);
This function destroys and releases any memory required by the structure.
Parameter 
Memory 
In/out 
Meaning 




The structure of 
Status Returned

The resources are released successfully. 
2.4.1.16. cusolverDnXgesvdjSetTolerance()
cusolverStatus_t
cusolverDnXgesvdjSetTolerance(
gesvdjInfo_t info,
double tolerance)
This function configures tolerance of gesvdj
.
Parameter 
Memory 
In/out 
Meaning 




The pointer to the structure of 



Accuracy of numerical singular values. 
Status Returned

The operation completed successfully. 
2.4.1.17. cusolverDnXgesvdjSetMaxSweeps()
cusolverStatus_t
cusolverDnXgesvdjSetMaxSweeps(
gesvdjInfo_t info,
int max_sweeps)
This function configures the maximum number of sweeps in gesvdj
. The default value is 100.
Parameter 
Memory 
In/out 
Meaning 




The pointer to the structure of 



Maximum number of sweeps. 
Status Returned

The operation completed successfully. 
2.4.1.18. cusolverDnXgesvdjSetSortEig()
cusolverStatus_t
cusolverDnXgesvdjSetSortEig(
gesvdjInfo_t info,
int sort_svd)
If sort_svd
is zero, the singular values are not sorted. This function only works for gesvdjBatched
. gesvdj
always sorts singular values in descending order. By default, singular values are always sorted in descending order.
Parameter 
Memory 
In/out 
Meaning 




The pointer to the structure of 



If 
Status Returned

The operation completed successfully. 
2.4.1.19. cusolverDnXgesvdjGetResidual()
cusolverStatus_t
cusolverDnXgesvdjGetResidual(
cusolverDnHandle_t handle,
gesvdjInfo_t info,
double *residual)
This function reports residual of gesvdj
. It does not support gesvdjBatched
. If the user calls this function after gesvdjBatched
, the error CUSOLVER_STATUS_NOT_SUPPORTED
is returned.
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



The pointer to the structure of 



Residual of 
Status Returned

The operation completed successfully. 

Does not support batched version 
2.4.1.20. cusolverDnXgesvdjGetSweeps()
cusolverStatus_t
cusolverDnXgesvdjGetSweeps(
cusolverDnHandle_t handle,
gesvdjInfo_t info,
int *executed_sweeps)
This function reports number of executed sweeps of gesvdj
. It does not support gesvdjBatched
. If the user calls this function after gesvdjBatched
, the error CUSOLVER_STATUS_NOT_SUPPORTED
is returned.
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



The pointer to the structure of 



Number of executed sweeps. 
Status Returned

The operation completed successfully. 

Does not support batched version 
2.4.1.21. cusolverDnIRSParamsCreate()
cusolverStatus_t
cusolverDnIRSParamsCreate(cusolverDnIRSParams_t *params);
This function creates and initializes the structure of parameters for an IRS solver such as the cusolverDnIRSXgesv()
or the cusolverDnIRSXgels()
functions to default values. The params structure created by this function can be used by one or more call to the same or to a different IRS solver. Note that in CUDA 10.2, the behavior was different and a new params
structure was needed to be created per each call to an IRS solver. Also note that the user can also change configurations of the params and then call a new IRS instance, but be careful that the previous call was done because any change to the configuration before the previous call was done could affect it.
Parameter 
Memory 
In/out 
Meaning 




Pointer to the 
Status Returned

The structure was created and initialized successfully. 

The resources could not be allocated. 
2.4.1.22. cusolverDnIRSParamsDestroy()
cusolverStatus_t
cusolverDnIRSParamsDestroy(cusolverDnIRSParams_t params);
This function destroys and releases any memory required by the Params structure.
Parameter 
Memory 
In/out 
Meaning 




The 
Status Returned

The resources are released successfully. 

The 

Not all the 
2.4.1.23. cusolverDnIRSParamsSetSolverPrecisions()
cusolverStatus_t
cusolverDnIRSParamsSetSolverPrecisions(
cusolverDnIRSParams_t params,
cusolverPrecType_t solver_main_precision,
cusolverPrecType_t solver_lowest_precision );
This function sets both the main and the lowest precision for the Iterative Refinement Solver (IRS). By main precision, we mean the precision of the Input and Output datatype. By lowest precision, we mean the solver is allowed to use as lowest computational precision during the LU factorization process. Note that the user has to set both the main and lowest precision before the first call to the IRS solver because they are NOT set by default with the params
structure creation, as it depends on the Input Output data type and user request. It is a wrapper to both cusolverDnIRSParamsSetSolverMainPrecision()
and cusolverDnIRSParamsSetSolverLowestPrecision()
. All possible combinations of main/lowest precision are described in the table below. Usually the lowest precision defines the speedup that can be achieved. The ratio of the performance of the lowest precision over the main precision (e.g., Inputs/Outputs datatype) define the upper bound of the speedup that could be obtained. More precisely, it depends on many factors, but for large matrices sizes, it is the ratio of the matrixmatrix rankk product (e.g., GEMM where K is 256 and M=N=size of the matrix) that define the possible speedup. For instance, if the inout precision is real double precision CUSOLVER_R_64F and the lowest precision is CUSOLVER_R_32F, then we can expect a speedup of at most 2X for large problem sizes. If the lowest precision was CUSOLVER_R_16F, then we can expect 3X4X. A reasonable strategy should take the number of righthand sides, the size of the matrix as well as the convergence rate into account.
Parameter 
Memory 
In/out 
Meaning 




The 



Allowed Inputs/Outputs datatype (for example CUSOLVER_R_FP64 for a real double precision data). See the table below for the supported precisions. 



Allowed lowest compute type (for example CUSOLVER_R_16F for half precision computation). See the table below for the supported precisions. 
Status Returned

The operation completed successfully. 

The 
Inputs/Outputs Data Type (e.g., main precision) 
Supported values for the lowest precision 









2.4.1.24. cusolverDnIRSParamsSetSolverMainPrecision()
cusolverStatus_t
cusolverDnIRSParamsSetSolverMainPrecision(
cusolverDnIRSParams_t params,
cusolverPrecType_t solver_main_precision);
This function sets the main precision for the Iterative Refinement Solver (IRS). By main precision, we mean, the type of the Input and Output data. Note that the user has to set both the main and lowest precision before a first call to the IRS solver because they are NOT set by default with the params
structure creation, as it depends on the Input Output data type and user request. user can set it by either calling this function or by calling cusolverDnIRSParamsSetSolverPrecisions()
which set both the main and the lowest precision together. All possible combinations of main/lowest precision are described in the table in the cusolverDnIRSParamsSetSolverPrecisions()
section above.
Parameter 
Memory 
In/out 
Meaning 




The 



Allowed Inputs/Outputs datatype (for example CUSOLVER_R_FP64 for a real double precision data). See the table in the 
Status Returned

The operation completed successfully. 

The 
2.4.1.25. cusolverDnIRSParamsSetSolverLowestPrecision()
cusolverStatus_t
cusolverDnIRSParamsSetSolverLowestPrecision(
cusolverDnIRSParams_t params,
cusolverPrecType_t lowest_precision_type);
This function sets the lowest precision that will be used by Iterative Refinement Solver. By lowest precision, we mean the solver is allowed to use as lowest computational precision during the LU factorization process. Note that the user has to set both the main and lowest precision before a first call to the IRS solver because they are NOT set by default with the params
structure creation, as it depends on the Input Output data type and user request. Usually the lowest precision defines the speedup that can be achieved. The ratio of the performance of the lowest precision over the main precision (e.g., Inputs/Outputs datatype) define somehow the upper bound of the speedup that could be obtained. More precisely, it depends on many factors, but for large matrices sizes, it is the ratio of the matrixmatrix rankk product (e.g., GEMM where K is 256 and M=N=size of the matrix) that define the possible speedup. For instance, if the inout precision is real double precision CUSOLVER_R_64F and the lowest precision is CUSOLVER_R_32F, then we can expect a speedup of at most 2X for large problem sizes. If the lowest precision was CUSOLVER_R_16F, then we can expect 3X4X. A reasonable strategy should take the number of righthand sides, the size of the matrix as well as the convergence rate into account.
Parameter 
Memory 
In/out 
Meaning 




The 



Allowed lowest compute type (for example CUSOLVER_R_16F for half precision computation). See the table in the 
Status Returned

The operation completed successfully. 

The Params structure was not created. 
2.4.1.26. cusolverDnIRSParamsSetRefinementSolver()
cusolverStatus_t
cusolverDnIRSParamsSetRefinementSolver(
cusolverDnIRSParams_t params,
cusolverIRSRefinement_t solver);
This function sets the refinement solver to be used in the Iterative Refinement Solver functions such as the cusolverDnIRSXgesv()
or the cusolverDnIRSXgels()
functions. Note that the user has to set the refinement algorithm before a first call to the IRS solver because it is NOT set by default with the creating of params. Details about values that can be set to and theirs meaning are described in the table below.
Parameter 
Memory 
In/out 
Meaning 




The 



Type of the refinement solver to be used by the IRS solver such as 
Status Returned

The operation completed successfully. 

The 

Solver is not set, this value is what is set when creating the params structure. IRS solver will return an error. 

No refinement solver; the IRS solver performs a factorization followed by a solve without any refinement. For example, if the IRS solver was 

Classical iterative refinement solver. Similar to the one used in LAPACK routines. 

GMRES (Generalized Minimal Residual) based iterative refinement solver. In recent study, the GMRES method has drawn the scientific community attention for its ability to be used as refinement solver that outperforms the classical iterative refinement method. Based on our experimentation, we recommend this setting. 

Classical iterative refinement solver that uses the GMRES (Generalized Minimal Residual) internally to solve the correction equation at each iteration. We call the classical refinement iteration the outer iteration while the GMRES is called inner iteration. Note that if the tolerance of the inner GMRES is set very low, let say to machine precision, then the outer classical refinement iteration will performs only one iteration and thus this option will behaves like CUSOLVER_IRS_REFINE_GMRES. 

Similar to 
2.4.1.27. cusolverDnIRSParamsSetTol()
cusolverStatus_t
cusolverDnIRSParamsSetTol(
cusolverDnIRSParams_t params,
double val );
This function sets the tolerance for the refinement solver. By default it is such that all the RHS satisfy:
RNRM < SQRT(N)*XNRM*ANRM*EPS*BWDMAX
where
RNRM is the infinitynorm of the residual
XNRM is the infinitynorm of the solution
ANRM is the infinityoperatornorm of the matrix A
EPS is the machine epsilon for the Inputs/Outputs datatype that matches LAPACK <X>LAMCH(‘Epsilon’)
BWDMAX, the value BWDMAX is fixed to 1.0
The user can use this function to change the tolerance to a lower or higher value. Our goal is to give the user more control such a way he can investigate and control every detail of the IRS solver. Note that the tolerance value is always in real double precision whatever the Inputs/Outputs datatype is.
Parameter 
Memory 
In/out 
Meaning 




The 



Double precision real value to which the refinement tolerance will be set. 
Status Returned

The operation completed successfully. 

The 
2.4.1.28. cusolverDnIRSParamsSetTolInner()
cusolverStatus_t
cusolverDnIRSParamsSetTolInner(
cusolverDnIRSParams_t params,
double val );
This function sets the tolerance for the inner refinement solver when the refinement solver consists of twolevels solver (e.g., CUSOLVER_IRS_REFINE_CLASSICAL_GMRES or CUSOLVER_IRS_REFINE_GMRES_GMRES cases). It is not referenced in case of one level refinement solver such as CUSOLVER_IRS_REFINE_CLASSICAL or CUSOLVER_IRS_REFINE_GMRES. It is set to 1e4 by default. This function set the tolerance for the inner solver (e.g. the inner GMRES). For example, if the Refinement Solver was set to CUSOLVER_IRS_REFINE_CLASSICAL_GMRES, setting this tolerance mean that the inner GMRES solver will converge to that tolerance at each outer iteration of the classical refinement solver. Our goal is to give the user more control such a way he can investigate and control every detail of the IRS solver. Note the, the tolerance value is always in real double precision whatever the Inputs/Outputs datatype is.
Parameter 
Memory 
In/out 
Meaning 




The 



Double precision real value to which the tolerance of the inner refinement solver will be set. 
Status Returned

The operation completed successfully. 

The 
2.4.1.29. cusolverDnIRSParamsSetMaxIters()
cusolverStatus_t
cusolverDnIRSParamsSetMaxIters(
cusolverDnIRSParams_t params,
int max_iters);
This function sets the total number of allowed refinement iterations after which the solver will stop. Total means any iteration which means the sum of the outer and the inner iterations (inner is meaningful when twolevels refinement solver is set). Default value is set to 50. Our goal is to give the user more control such a way he can investigate and control every detail of the IRS solver.
Parameter 
Memory 
In/out 
Meaning 




The 



Maximum total number of iterations allowed for the refinement solver. 
Status Returned

The operation completed successfully. 

The 
2.4.1.30. cusolverDnIRSParamsSetMaxItersInner()
cusolverStatus_t
cusolverDnIRSParamsSetMaxItersInner(
cusolverDnIRSParams_t params,
cusolver_int_t maxiters_inner );
This function sets the maximal number of iterations allowed for the inner refinement solver. It is not referenced in case of one level refinement solver such as CUSOLVER_IRS_REFINE_CLASSICAL or CUSOLVER_IRS_REFINE_GMRES. The inner refinement solver will stop after reaching either the inner tolerance or the MaxItersInner value. By default, it is set to 50. Note that this value could not be larger than the MaxIters since MaxIters is the total number of allowed iterations. Note that if the user calls cusolverDnIRSParamsSetMaxIters
after calling this function, SetMaxIters
has priority and will overwrite MaxItersInner
to the minimum value of (MaxIters, MaxItersInner)
.
Parameter 
Memory 
In/out 
Meaning 




The 



Maximum number of allowed inner iterations for the inner refinement solver. Meaningful when the refinement solver is a twolevels solver such as CUSOLVER_IRS_REFINE_CLASSICAL_GMRES or CUSOLVER_IRS_REFINE_GMRES_GMRES. Value should be less or equal to 
Status Returned

The operation completed successfully. 

The 

If the value was larger than 
2.4.1.31. cusolverDnIRSParamsEnableFallback()
cusolverStatus_t
cusolverDnIRSParamsEnableFallback(
cusolverDnIRSParams_t params );
This function enable the fallback to the main precision in case the Iterative Refinement Solver (IRS) failed to converge. In other term, if the IRS solver failed to converge, the solver will return a no convergence code (e.g., niter
< 0), but can either return the nonconvergent solution as it is (e.g., disable fallback) or can fallback (e.g., enable fallback) to the main precision (which is the precision of the Inputs/Outputs data) and solve the problem from scratch returning the good solution. This is the behavior by default, and it will guarantee that the IRS solver always provide the good solution. This function is provided because we provided cusolverDnIRSParamsDisableFallback
which allows the user to disable the fallback and thus this function allow the user to reenable it.
Parameter 
Memory 
In/out 
Meaning 




The 
Status Returned

The operation completed successfully. 

The 
2.4.1.32. cusolverDnIRSParamsDisableFallback()
cusolverStatus_t
cusolverDnIRSParamsDisableFallback(
cusolverDnIRSParams_t params );
This function disables the fallback to the main precision in case the Iterative Refinement Solver (IRS) failed to converge. In other term, if the IRS solver failed to converge, the solver will return a no convergence code (e.g., niter
< 0), but can either return the nonconvergent solution as it is (e.g., disable fallback) or can fallback (e.g., enable fallback) to the main precision (which is the precision of the Inputs/Outputs data) and solve the problem from scratch returning the good solution. This function disables the fallback and the returned solution is whatever the refinement solver was able to reach before it returns. Disabling fallback does not guarantee that the solution is the good one. However, if users want to keep getting the solution of the lower precision in case the IRS did not converge after certain number of iterations, they need to disable the fallback. The user can reenable it by calling cusolverDnIRSParamsEnableFallback
.
Parameter 
Memory 
In/out 
Meaning 




The 
Status Returned

The operation completed successfully. 

The 
2.4.1.33. cusolverDnIRSParamsGetMaxIters()
cusolverStatus_t
cusolverDnIRSParamsGetMaxIters(
cusolverDnIRSParams_t params,
cusolver_int_t *maxiters );
This function returns the current setting in the params
structure for the maximal allowed number of iterations (e.g., either the default MaxIters
, or the one set by the user in case he set it using cusolverDnIRSParamsSetMaxIters
). Note that this function returns the current setting in the params
configuration and not to be confused with the cusolverDnIRSInfosGetMaxIters
which return the maximal allowed number of iterations for a particular call to an IRS solver. To be clearer, the params
structure can be used for many calls to an IRS solver. A user can change the allowed MaxIters
between calls while the Infos
structure in cusolverDnIRSInfosGetMaxIters
contains information about a particular call and cannot be reused for different calls, and thus, cusolverDnIRSInfosGetMaxIters
returns the allowed MaxIters
for that call.
Parameter 
Memory 
In/out 
Meaning 




The 



The maximal number of iterations that is currently set. 
Status Returned

The operation completed successfully. 

The 
2.4.1.34. cusolverDnIRSInfosCreate()
cusolverStatus_t
cusolverDnIRSInfosCreate(
cusolverDnIRSInfos_t* infos )
This function creates and initializes the Infos
structure that will hold the refinement information of an Iterative Refinement Solver (IRS) call. Such information includes the total number of iterations that was needed to converge (Niters
), the outer number of iterations (meaningful when twolevels preconditioner such as CUSOLVER_IRS_REFINE_CLASSICAL_GMRES is used ), the maximal number of iterations that was allowed for that call, and a pointer to the matrix of the convergence history residual norms. The Infos
structure needs to be created before a call to an IRS solver. The Infos
structure is valid for only one call to an IRS solver, since it holds info about that solve and thus each solve will requires its own Infos
structure.
Parameter 
Memory 
In/out 
Meaning 




Pointer to the 
Status Returned

The structure was initialized successfully. 

The resources could not be allocated. 
2.4.1.35. cusolverDnIRSInfosDestroy()
cusolverStatus_t
cusolverDnIRSInfosDestroy(
cusolverDnIRSInfos_t infos );
This function destroys and releases any memory required by the Infos
structure. This function destroys all the information (e.g., Niters performed, OuterNiters performed, residual history etc.) about a solver call; thus, this function should only be called after the user is finished with the information.
Parameter 
Memory 
In/out 
Meaning 




The 
Status Returned

The resources are released successfully. 

The 
2.4.1.36. cusolverDnIRSInfosGetMaxIters()
cusolverStatus_t
cusolverDnIRSInfosGetMaxIters(
cusolverDnIRSInfos_t infos,
cusolver_int_t *maxiters );
This function returns the maximal allowed number of iterations that was set for the corresponding call to the IRS solver. Note that this function returns the setting that was set when that call happened and is not to be confused with the cusolverDnIRSParamsGetMaxIters
which returns the current setting in the params
configuration structure. To be clearer, the params
structure can be used for many calls to an IRS solver. A user can change the allowed MaxIters
between calls while the Infos
structure in cusolverDnIRSInfosGetMaxIters
contains information about a particular call and cannot be reused for different calls, thus cusolverDnIRSInfosGetMaxIters
returns the allowed MaxIters
for that call.
Parameter 
Memory 
In/out 
Meaning 




The 



The maximal number of iterations that is currently set. 
Status Returned

The operation completed successfully. 

The 
2.4.1.37. cusolverDnIRSInfosGetNiters()
cusolverStatus_t cusolverDnIRSInfosGetNiters(
cusolverDnIRSInfos_t infos,
cusolver_int_t *niters );
This function returns the total number of iterations performed by the IRS solver. If it was negative, it means that the IRS solver did not converge and if the user did not disable the fallback to full precision, then the fallback to a full precision solution happened and solution is good. Please refer to the description of negative niters
values in the corresponding IRS linear solver functions such as cusolverDnXgesv()
or cusolverDnXgels()
.
Parameter 
Memory 
In/out 
Meaning 




The 



The total number of iterations performed by the IRS solver. 
Status Returned

The operation completed successfully. 

The 
2.4.1.38. cusolverDnIRSInfosGetOuterNiters()
cusolverStatus_t
cusolverDnIRSInfosGetOuterNiters(
cusolverDnIRSInfos_t infos,
cusolver_int_t *outer_niters );
This function returns the number of iterations performed by the outer refinement loop of the IRS solver. When the refinement solver consists of a onelevel solver such as CUSOLVER_IRS_REFINE_CLASSICAL or CUSOLVER_IRS_REFINE_GMRES, it is the same as Niters
. When the refinement solver consists of a twolevels solver such as CUSOLVER_IRS_REFINE_CLASSICAL_GMRES or CUSOLVER_IRS_REFINE_GMRES_GMRES, it is the number of iterations of the outer loop. Refer to the description of the `cusolverIRSRefinement_t
<index.html#cusolverIRSRefinement>`__ section for more details.
Parameter 
Memory 
In/out 
Meaning 




The 



The number of iterations of the outer refinement loop of the IRS solver. 
Status Returned

The operation completed successfully. 

The 
2.4.1.39. cusolverDnIRSInfosRequestResidual()
cusolverStatus_t cusolverDnIRSInfosRequestResidual(
cusolverDnIRSInfos_t infos );
This function tells the IRS solver to store the convergence history (residual norms) of the refinement phase in a matrix that can be accessed via a pointer returned by the `cusolverDnIRSInfosGetResidualHistory()
<index.html#cusolverDnIRSInfosGetResidualHistory>`__ function.
Parameter 
Memory 
In/out 
Meaning 




The 
Status Returned

The operation completed successfully. 

The 
2.4.1.40. cusolverDnIRSInfosGetResidualHistory()
cusolverStatus_t
cusolverDnIRSInfosGetResidualHistory(
cusolverDnIRSInfos_t infos,
void **residual_history );
If the user called cusolverDnIRSInfosRequestResidual()
before the call to the IRS function, then the IRS solver will store the convergence history (residual norms) of the refinement phase in a matrix that can be accessed via a pointer returned by this function. The datatype of the residual norms depends on the input and output data type. If the Inputs/Outputs datatype is double precision real or complex (CUSOLVER_R_FP64 or CUSOLVER_C_FP64), this residual will be of type real double precision (FP64) double, otherwise if the Inputs/Outputs datatype is single precision real or complex (CUSOLVER_R_FP32 or CUSOLVER_C_FP32), this residual will be real single precision FP32 float.
The residual history matrix consists of two columns (even for the multiple righthand side case NRHS) of MaxIters+1
row, thus a matrix of size (MaxIters+1,2
). Only the first OuterNiters+1
rows contains the residual norms the other (e.g., OuterNiters+2:Maxiters+1) are garbage. On the first column, each row “i” specify the total number of iterations happened till this outer iteration “i” and on the second columns the residual norm corresponding to this outer iteration “i”. Thus, the first row (e.g., outer iteration “0”) consists of the initial residual (e.g., the residual before the refinement loop start) then the consecutive rows are the residual obtained at each outer iteration of the refinement loop. Note, it only consists of the history of the outer loop.
If the refinement solver was CUSOLVER_IRS_REFINE_CLASSICAL or CUSOLVER_IRS_REFINE_GMRES, then OuterNiters=Niters (Niters is the total number of iterations performed) and there is Niters+1 rows of norms that correspond to the Niters outer iterations.
If the refinement solver was CUSOLVER_IRS_REFINE_CLASSICAL_GMRES or CUSOLVER_IRS_REFINE_GMRES_GMRES, then OuterNiters <= Niters corresponds to the outer iterations performed by the outer refinement loop. Thus, there is OuterNiters+1 residual norms where row “i” correspond to the outer iteration “i” and the first column specify the total number of iterations (outer and inner) that were performed till this step the second columns correspond to the residual norm at this step.
For example, let’s say the user specifies CUSOLVER_IRS_REFINE_CLASSICAL_GMRES as a refinement solver and say it needed 3 outer iterations to converge and 4,3,3 inner iterations at each outer, respectively. This consists of 10 total iterations. Row 0 corresponds to the first residual before the refinement start, so it has 0 in its first column. On row 1 which corresponds to the outer iteration 1, it will be 4 (4 is the total number of iterations that were performed till now), on row 2 it will be 7, and on row 3 it will be 10.
In summary, let’s define ldh=Maxiters+1
, the leading dimension of the residual matrix. then residual_history[i]
shows the total number of iterations performed at the outer iteration “i” and residual_history[i+ldh]
corresponds to the norm of the residual at this outer iteration.
Parameter 
Memory 
In/out 
Meaning 




The 



Returns a void pointer to the matrix of the convergence history residual norms. See the description above for the relation between the residual norm datatype and the inout datatype. 
Status Returned

The operation completed successfully. 

The 

This function was called without calling 
2.4.1.41. cusolverDnCreateParams()
cusolverStatus_t
cusolverDnCreateParams(
cusolverDnParams_t *params);
This function creates and initializes the structure of 64bit API
to default values.
Parameter 
Memory 
In/out 
Meaning 




The pointer to the structure of 
Status Returned

The structure was initialized successfully. 

The resources could not be allocated. 
2.4.1.42. cusolverDnDestroyParams()
cusolverStatus_t
cusolverDnDestroyParams(
cusolverDnParams_t params);
This function destroys and releases any memory required by the structure.
Parameter 
Memory 
In/out 
Meaning 




The structure of 
Status Returned

The resources were released successfully. 
2.4.1.43. cusolverDnSetAdvOptions()
cusolverStatus_t
cusolverDnSetAdvOptions (
cusolverDnParams_t params,
cusolverDnFunction_t function,
cusolverAlgMode_t algo );
This function configures algorithm algo
of function
, a 64bit API
routine.
Parameter 
Memory 
In/out 
Meaning 




The pointer to the structure of 



The routine to be configured. 



The algorithm to be configured. 
Status Returned

The operation completed successfully. 

Wrong combination of 
2.4.2. Dense Linear Solver Reference (legacy)
This section describes linear solver API of cuSolverDN, including Cholesky factorization, LU with partial pivoting, QR factorization and BunchKaufman (LDLT) factorization.
2.4.2.1. cusolverDn<t>potrf()
These helper functions calculate the necessary size of work buffers.
cusolverStatus_t
cusolverDnSpotrf_bufferSize(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
float *A,
int lda,
int *Lwork );
cusolverStatus_t
cusolverDnDpotrf_bufferSize(cusolveDnHandle_t handle,
cublasFillMode_t uplo,
int n,
double *A,
int lda,
int *Lwork );
cusolverStatus_t
cusolverDnCpotrf_bufferSize(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuComplex *A,
int lda,
int *Lwork );
cusolverStatus_t
cusolverDnZpotrf_bufferSize(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *A,
int lda,
int *Lwork);
The S and D data types are real valued single and double precision, respectively.
cusolverStatus_t
cusolverDnSpotrf(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
float *A,
int lda,
float *Workspace,
int Lwork,
int *devInfo );
cusolverStatus_t
cusolverDnDpotrf(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
double *A,
int lda,
double *Workspace,
int Lwork,
int *devInfo );
The C and Z data types are complex valued single and double precision, respectively.
cusolverStatus_t
cusolverDnCpotrf(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuComplex *A,
int lda,
cuComplex *Workspace,
int Lwork,
int *devInfo );
cusolverStatus_t
cusolverDnZpotrf(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *A,
int lda,
cuDoubleComplex *Workspace,
int Lwork,
int *devInfo );
This function computes the Cholesky factorization of a Hermitian positivedefinite matrix.
A
is an n×n
Hermitian matrix, only the lower or upper part is meaningful. The input parameter uplo
indicates which part of the matrix is used. The function would leave other parts untouched.
If input parameter uplo
is CUBLAS_FILL_MODE_LOWER
, only the lower triangular part of A
is processed, and replaced by the lower triangular Cholesky factor L
.
\(A = L*L^{H}\) 
If input parameter uplo
is CUBLAS_FILL_MODE_UPPER
, only upper triangular part of A
is processed, and replaced by upper triangular Cholesky factor U
.
\(A = U^{H}*U\) 
The user has to provide working space which is pointed by input parameter Workspace
. The input parameter Lwork
is size of the working space, and it is returned by potrf_bufferSize()
.
If Cholesky factorization failed, i.e. some leading minor of A
is not positive definite, or equivalently some diagonal elements of L
or U
is not a real number. The output parameter devInfo
would indicate smallest leading minor of A
which is not positive definite.
If output parameter devInfo = i
(less than zero), the ith
parameter is wrong (not counting handle).
API of potrf
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



Indicates if matrix 



Number of rows and columns of matrix 



<type> array of dimension 



Leading dimension of twodimensional array used to store matrix 



Working space, <type> array of size 



Size of 



If 
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

The device only supports compute capability 2.0 and above. 

An internal operation failed. 
2.4.2.2. cusolverDnPotrf()[DEPRECATED]
[[DEPRECATED]] use cusolverDnXpotrf()
instead. The routine will be removed in the next major release.
The helper functions below can calculate the sizes needed for preallocated buffer.
cusolverStatus_t
cusolverDnPotrf_bufferSize(
cusolverDnHandle_t handle,
cusolverDnParams_t params,
cublasFillMode_t uplo,
int64_t n,
cudaDataType dataTypeA,
const void *A,
int64_t lda,
cudaDataType computeType,
size_t *workspaceInBytes )
The routine bellow
cusolverStatus_t
cusolverDnPotrf(
cusolverDnHandle_t handle,
cusolverDnParams_t params,
cublasFillMode_t uplo,
int64_t n,
cudaDataType dataTypeA,
void *A,
int64_t lda,
cudaDataType computeType,
void *pBuffer,
size_t workspaceInBytes,
int *info )
Computes the Cholesky factorization of a Hermitian positivedefinite matrix using the generic API interfacte.
A
is an n×n
Hermitian matrix, only lower or upper part is meaningful. The input parameter uplo
indicates which part of the matrix is used. The function would leave other part untouched.
If input parameter uplo
is CUBLAS_FILL_MODE_LOWER
, only lower triangular part of A
is processed, and replaced by lower triangular Cholesky factor L
.
\(A = L*L^{H}\) 
If input parameter uplo
is CUBLAS_FILL_MODE_UPPER
, only upper triangular part of A
is processed, and replaced by upper triangular Cholesky factor U
.
\(A = U^{H}*U\) 
The user has to provide working space which is pointed by input parameter pBuffer
. The input parameter workspaceInBytes
is size in bytes of the working space, and it is returned by cusolverDnPotrf_bufferSize()
.
If Cholesky factorization failed, i.e. some leading minor of A
is not positive definite, or equivalently some diagonal elements of L
or U
is not a real number. The output parameter info
would indicate smallest leading minor of A
which is not positive definite.
If output parameter info = i
(less than zero), the ith
parameter is wrong (not counting handle).
Currently, cusolverDnPotrf
supports only the default algorithm.
Table of algorithms supported by cusolverDnPotrf

Default algorithm. 
List of input arguments for cusolverDnPotrf_bufferSize
and cusolverDnPotrf
:
API of potrf
Parameter 
Memory 
In/out 
Meaning 




handle to the cuSolverDN library context. 



structure with information collected by 



indicates if matrix 



number of rows and columns of matrix 



data type of array 



array of dimension 



leading dimension of twodimensional array used to store matrix 



data type of computation. 



Working space. Array of type 



size in bytes of 



if 
The generic API has two different types, dataTypeA
is data type of the matrix A
, computeType
is compute type of the operation. cusolverDnPotrf
only supports the following four combinations.
Valid combination of data type and compute type















Status Returned

the operation completed successfully. 

the library was not initialized. 

invalid parameters were passed ( 

the device only supports compute capability 2.0 and above. 

an internal operation failed. 
2.4.2.3. cusolverDn<t>potrs()
cusolverStatus_t
cusolverDnSpotrs(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
int nrhs,
const float *A,
int lda,
float *B,
int ldb,
int *devInfo);
cusolverStatus_t
cusolverDnDpotrs(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
int nrhs,
const double *A,
int lda,
double *B,
int ldb,
int *devInfo);
cusolverStatus_t
cusolverDnCpotrs(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
int nrhs,
const cuComplex *A,
int lda,
cuComplex *B,
int ldb,
int *devInfo);
cusolverStatus_t
cusolverDnZpotrs(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
int nrhs,
const cuDoubleComplex *A,
int lda,
cuDoubleComplex *B,
int ldb,
int *devInfo);
This function solves a system of linear equations
\(A*X = B\) 
where A
is an n×n
Hermitian matrix, only lower or upper part is meaningful. The input parameter uplo
indicates which part of the matrix is used. The function would leave other part untouched.
The user has to call potrf
first to factorize matrix A
. If input parameter uplo
is CUBLAS_FILL_MODE_LOWER
, A
is lower triangular Cholesky factor L
correspoding to \(A = L*L^{H}\) . If input parameter uplo
is CUBLAS_FILL_MODE_UPPER
, A
is upper triangular Cholesky factor U
corresponding to \(A = U^{H}*U\) .
The operation is inplace, i.e. matrix X
overwrites matrix B
with the same leading dimension ldb
.
If output parameter devInfo = i
(less than zero), the ith
parameter is wrong (not counting handle).
API of potrs
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolveDN library context. 



Indicates if matrix 



Number of rows and columns of matrix 



Number of columns of matrix 



<type> array of dimension 



Leading dimension of twodimensional array used to store matrix 



<type> array of dimension 



If 
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

The device only supports compute capability 2.0 and above. 

An internal operation failed. 
2.4.2.4. cusolverDnPotrs()[DEPRECATED]
[[DEPRECATED]] use cusolverDnXpotrs()
instead. The routine will be removed in the next major release.
cusolverStatus_t
cusolverDnPotrs(
cusolverDnHandle_t handle,
cusolverDnParams_t params,
cublasFillMode_t uplo,
int64_t n,
int64_t nrhs,
cudaDataType dataTypeA,
const void *A,
int64_t lda,
cudaDataType dataTypeB,
void *B,
int64_t ldb,
int *info)
This function solves a system of linear equations
\(A*X = B\) 
where A
is a n×n
Hermitian matrix, only lower or upper part is meaningful using the generic API interface. The input parameter uplo
indicates which part of the matrix is used. The function would leave other part untouched.
The user has to call cusolverDnPotrf
first to factorize matrix A
. If input parameter uplo
is CUBLAS_FILL_MODE_LOWER
, A
is lower triangular Cholesky factor L
correspoding to \(A = L*L^{H}\) . If input parameter uplo
is CUBLAS_FILL_MODE_UPPER
, A
is upper triangular Cholesky factor U
corresponding to \(A = U^{H}*U\) .
The operation is inplace, i.e. matrix X
overwrites matrix B
with the same leading dimension ldb
.
If output parameter info = i
(less than zero), the ith
parameter is wrong (not counting handle).
Currently, cusolverDnPotrs
supports only the default algorithm.
Table of algorithms supported by cusolverDnPotrs

Default algorithm. 
List of input arguments for cusolverDnPotrs
:
API of potrs
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolveDN library context. 



Structure with information collected by 



Indicates if matrix 



Number of rows and columns of matrix 



Number of columns of matrix 



Data type of array 



Array of dimension 



Leading dimension of twodimensional array used to store matrix 



Data type of array 



Array of dimension 



If 
The generic API has two different types, dataTypeA
is data type of the matrix A
, dataTypeB
is data type of the matrix B
. cusolverDnPotrs
only supports the following four combinations.
Valid combination of data type and compute type















Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

The device only supports compute capability 2.0 and above. 

An internal operation failed. 
2.4.2.5. cusolverDn<t>potri()
These helper functions calculate the necessary size of work buffers.
cusolverStatus_t
cusolverDnSpotri_bufferSize(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
float *A,
int lda,
int *Lwork );
cusolverStatus_t
cusolverDnDpotri_bufferSize(cusolveDnHandle_t handle,
cublasFillMode_t uplo,
int n,
double *A,
int lda,
int *Lwork );
cusolverStatus_t
cusolverDnCpotri_bufferSize(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuComplex *A,
int lda,
int *Lwork );
cusolverStatus_t
cusolverDnZpotri_bufferSize(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *A,
int lda,
int *Lwork);
The S and D data types are real valued single and double precision, respectively.
cusolverStatus_t
cusolverDnSpotri(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
float *A,
int lda,
float *Workspace,
int Lwork,
int *devInfo );
cusolverStatus_t
cusolverDnDpotri(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
double *A,
int lda,
double *Workspace,
int Lwork,
int *devInfo );
The C and Z data types are complex valued single and double precision, respectively.
cusolverStatus_t
cusolverDnCpotri(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuComplex *A,
int lda,
cuComplex *Workspace,
int Lwork,
int *devInfo );
cusolverStatus_t
cusolverDnZpotri(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *A,
int lda,
cuDoubleComplex *Workspace,
int Lwork,
int *devInfo );
This function computes the inverse of a positivedefinite matrix A
using the Cholesky factorization
\(A = L*L^{H} = U^{H}*U\) 
computed by potrf()
.
A
is a n×n
matrix containing the triangular factor L
or U
computed by the Cholesky factorization. Only lower or upper part is meaningful and the input parameter uplo
indicates which part of the matrix is used. The function would leave the other part untouched.
If the input parameter uplo
is CUBLAS_FILL_MODE_LOWER
, only lower triangular part of A
is processed, and replaced the by lower triangular part of the inverse of A
.
If the input parameter uplo
is CUBLAS_FILL_MODE_UPPER
, only upper triangular part of A
is processed, and replaced by the upper triangular part of the inverse of A
.
The user has to provide the working space which is pointed to by input parameter Workspace
. The input parameter Lwork
is the size of the working space, returned by potri_bufferSize()
.
If the computation of the inverse fails, i.e. some leading minor of L
or U
, is null, the output parameter devInfo
would indicate the smallest leading minor of L
or U
which is not positive definite.
If the output parameter devInfo = i
(less than zero), the ith
parameter is wrong (not counting the handle).
API of potri
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



Indicates if matrix 



Number of rows and columns of matrix 



<type> array of dimension 



Leading dimension of twodimensional array used to store matrix 



Working space, <type> array of size 



Size of 



If 
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

The device only supports compute capability 2.0 and above. 

An internal operation failed. 
2.4.2.6. cusolverDn<t>getrf()
These helper functions calculate the size of work buffers needed.
Please visit cuSOLVER Library Samples  getrf for a code example.
cusolverStatus_t
cusolverDnSgetrf_bufferSize(cusolverDnHandle_t handle,
int m,
int n,
float *A,
int lda,
int *Lwork );
cusolverStatus_t
cusolverDnDgetrf_bufferSize(cusolverDnHandle_t handle,
int m,
int n,
double *A,
int lda,
int *Lwork );
cusolverStatus_t
cusolverDnCgetrf_bufferSize(cusolverDnHandle_t handle,
int m,
int n,
cuComplex *A,
int lda,
int *Lwork );
cusolverStatus_t
cusolverDnZgetrf_bufferSize(cusolverDnHandle_t handle,
int m,
int n,
cuDoubleComplex *A,
int lda,
int *Lwork );
The S and D data types are real single and double precision, respectively.
cusolverStatus_t
cusolverDnSgetrf(cusolverDnHandle_t handle,
int m,
int n,
float *A,
int lda,
float *Workspace,
int *devIpiv,
int *devInfo );
cusolverStatus_t
cusolverDnDgetrf(cusolverDnHandle_t handle,
int m,
int n,
double *A,
int lda,
double *Workspace,
int *devIpiv,
int *devInfo );
The C and Z data types are complex valued single and double precision, respectively.
cusolverStatus_t
cusolverDnCgetrf(cusolverDnHandle_t handle,
int m,
int n,
cuComplex *A,
int lda,
cuComplex *Workspace,
int *devIpiv,
int *devInfo );
cusolverStatus_t
cusolverDnZgetrf(cusolverDnHandle_t handle,
int m,
int n,
cuDoubleComplex *A,
int lda,
cuDoubleComplex *Workspace,
int *devIpiv,
int *devInfo );
This function computes the LU factorization of a m×n
matrix
\(P*A = L*U\) 
where A
is a m×n
matrix, P
is a permutation matrix, L
is a lower triangular matrix with unit diagonal, and U
is an upper triangular matrix.
The user has to provide working space which is pointed by input parameter Workspace
. The input parameter Lwork
is size of the working space, and it is returned by getrf_bufferSize()
.
If LU factorization failed, i.e. matrix A
(U
) is singular, The output parameter devInfo=i
indicates U(i,i) = 0
.
If output parameter devInfo = i
(less than zero), the ith
parameter is wrong (not counting handle).
If devIpiv
is null, no pivoting is performed. The factorization is A=L*U
, which is not numerically stable.
No matter LU factorization failed or not, the output parameter devIpiv
contains pivoting sequence, row i
is interchanged with row devIpiv(i)
.
The user can combine getrf
and getrs
to complete a linear solver.
Remark: getrf
uses fastest implementation with large workspace of size m*n
. The user can choose the legacy implementation with minimal workspace by Getrf
and cusolverDnSetAdvOptions(params, CUSOLVERDN_GETRF, CUSOLVER_ALG_1)
.
API of getrf
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



Number of rows of matrix 



Number of columns of matrix 



<type> array of dimension 



Leading dimension of twodimensional array used to store matrix 



Working space, <type> array of size 



Array of size at least 



If 
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

The device only supports compute capability 2.0 and above. 

An internal operation failed. 
2.4.2.7. cusolverDnGetrf()[DEPRECATED]
[[DEPRECATED]] use cusolverDnXgetrf()
instead. The routine will be removed in the next major release.
The helper function below can calculate the sizes needed for preallocated buffer.
cusolverStatus_t
cusolverDnGetrf_bufferSize(
cusolverDnHandle_t handle,
cusolverDnParams_t params,
int64_t m,
int64_t n,
cudaDataType dataTypeA,
const void *A,
int64_t lda,
cudaDataType computeType,
size_t *workspaceInBytes )
The following function:
cusolverStatus_t
cusolverDnGetrf(
cusolverDnHandle_t handle,
cusolverDnParams_t params,
int64_t m,
int64_t n,
cudaDataType dataTypeA,
void *A,
int64_t lda,
int64_t *ipiv,
cudaDataType computeType,
void *pBuffer,
size_t workspaceInBytes,
int *info )
computes the LU factorization of a m×n
matrix
\(P*A = L*U\) 
where A
is an m×n
matrix, P
is a permutation matrix, L
is a lower triangular matrix with unit diagonal, and U
is an upper triangular matrix using the generic API interface.
If LU factorization failed, i.e. matrix A
(U
) is singular, The output parameter info=i
indicates U(i,i) = 0
.
If output parameter info = i
(less than zero), the ith
parameter is wrong (not counting handle).
If ipiv
is null, no pivoting is performed. The factorization is A=L*U
, which is not numerically stable.
No matter LU factorization failed or not, the output parameter ipiv
contains pivoting sequence, row i
is interchanged with row ipiv(i)
.
The user has to provide working space which is pointed by input parameter pBuffer
. The input parameter workspaceInBytes
is size in bytes of the working space, and it is returned by cusolverDnGetrf_bufferSize()
.
The user can combine cusolverDnGetrf
and cusolverDnGetrs
to complete a linear solver.
Currently, cusolverDnGetrf
supports two algorithms. To select legacy implementation, the user has to call cusolverDnSetAdvOptions
.
Table of algorithms supported by cusolverDnGetrf

Default algorithm. The fastest, requires a large workspace of 

Legacy implementation 
List of input arguments for cusolverDnGetrf_bufferSize
and cusolverDnGetrf
:
API of cusolverDnGetrf
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



Structure with information collected by 



number of rows of matrix 



number of columns of matrix 



data type of array 



<type> array of dimension 



leading dimension of twodimensional array used to store matrix 



array of size at least 



data type of computation. 



Working space. Array of type 



size in bytes of 



if 
The generic API has two different types, dataTypeA
is data type of the matrix A
, computeType
is compute type of the operation. cusolverDnGetrf
only supports the following four combinations.
valid combination of data type and compute type















Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

The device only supports compute capability 2.0 and above. 

An internal operation failed. 
2.4.2.8. cusolverDn<t>getrs()
Please visit cuSOLVER Library Samples  getrf for a code example.
cusolverStatus_t
cusolverDnSgetrs(cusolverDnHandle_t handle,
cublasOperation_t trans,
int n,
int nrhs,
const float *A,
int lda,
const int *devIpiv,
float *B,
int ldb,
int *devInfo );
cusolverStatus_t
cusolverDnDgetrs(cusolverDnHandle_t handle,
cublasOperation_t trans,
int n,
int nrhs,
const double *A,
int lda,
const int *devIpiv,
double *B,
int ldb,
int *devInfo );
cusolverStatus_t
cusolverDnCgetrs(cusolverDnHandle_t handle,
cublasOperation_t trans,
int n,
int nrhs,
const cuComplex *A,
int lda,
const int *devIpiv,
cuComplex *B,
int ldb,
int *devInfo );
cusolverStatus_t
cusolverDnZgetrs(cusolverDnHandle_t handle,
cublasOperation_t trans,
int n,
int nrhs,
const cuDoubleComplex *A,
int lda,
const int *devIpiv,
cuDoubleComplex *B,
int ldb,
int *devInfo );
This function solves a linear system of multiple righthand sides
\({op(A)}*X = B\) 
where A
is an n×n
matrix, and was LUfactored by getrf
, that is, lower trianular part of A is L
, and upper triangular part (including diagonal elements) of A
is U
. B
is a n×nrhs
righthand side matrix.
The input parameter trans
is defined by
\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if~}\textsf{trans\ ==\ CUBLAS\_OP\_N}} \\ A^{T} & {\text{if~}\textsf{trans\ ==\ CUBLAS\_OP\_T}} \\ A^{H} & {\text{if~}\textsf{trans\ ==\ CUBLAS\_OP\_C}} \\ \end{matrix} \right.\)
The input parameter devIpiv
is an output of getrf
. It contains pivot indices, which are used to permutate righthand sides.
If output parameter devInfo = i
(less than zero), the ith
parameter is wrong (not counting handle).
The user can combine getrf
and getrs
to complete a linear solver.
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



Operation 



Number of rows and columns of matrix 



Number of righthand sides. 



<type> array of dimension 



Leading dimension of twodimensional array used to store matrix 



Array of size at least 



<type> array of dimension 



Leading dimension of twodimensional array used to store matrix 



If 
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

The device only supports compute capability 2.0 and above. 

An internal operation failed. 
2.4.2.9. cusolverDnGetrs()[DEPRECATED]
[[DEPRECATED]] use cusolverDnXgetrs()
instead. The routine will be removed in the next major release.
cusolverStatus_t
cusolverDnGetrs(
cusolverDnHandle_t handle,
cusolverDnParams_t params,
cublasOperation_t trans,
int64_t n,
int64_t nrhs,
cudaDataType dataTypeA,
const void *A,
int64_t lda,
const int64_t *ipiv,
cudaDataType dataTypeB,
void *B,
int64_t ldb,
int *info )
This function solves a linear system of multiple righthand sides
\({op(A)}*X = B\) 
where A
is a n×n
matrix, and was LUfactored by cusolverDnGetrf
, that is, lower trianular part of A is L
, and upper triangular part (including diagonal elements) of A
is U
. B
is a n×nrhs
righthand side matrix using the generic API interface.
The input parameter trans
is defined by
\(\text{op}(A) = \left\{ \begin{matrix} A & {\text{if~}\textsf{trans\ ==\ CUBLAS\_OP\_N}} \\ A^{T} & {\text{if~}\textsf{trans\ ==\ CUBLAS\_OP\_T}} \\ A^{H} & {\text{if~}\textsf{trans\ ==\ CUBLAS\_OP\_C}} \\ \end{matrix} \right.\)
The input parameter ipiv
is an output of cusolverDnGetrf
. It contains pivot indices, which are used to permutate righthand sides.
If output parameter info = i
(less than zero), the ith
parameter is wrong (not counting handle).
The user can combine cusolverDnGetrf
and cusolverDnGetrs
to complete a linear solver.
Currently, cusolverDnGetrs
supports only the default algorithm.
Table of algorithms supported by cusolverDnGetrs

Default algorithm. 
List of input arguments for cusolverDnGetrss
:
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



Structure with information collected by 



Operation 



Number of rows and columns of matrix 



Number of righthand sides. 



Data type of array 



Array of dimension 



Leading dimension of twodimensional array used to store matrix 



Array of size at least 



Data type of array 



<type> array of dimension 



Leading dimension of twodimensional array used to store matrix 



If 
The generic API has two different types, dataTypeA
is data type of the matrix A
and dataTypeB
is data type of the matrix B
. cusolverDnGetrs
only supports the following four combinations.
Valid combination of data type and compute type















Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

The device only supports compute capability 2.0 and above. 

An internal operation failed. 
2.4.2.10. cusolverDn<t1><t2>gesv()
These functions are modelled after functions DSGESV and ZCGESV from LAPACK. They compute the solution of a system of linear equations with one or multiple right hand sides using mixed precision iterative refinement techniques based on the LU factorization Xgesv
. These functions are similar in term of functionalities to the full precision LU solver (Xgesv
, where X denotes Z,C,D,S) but it uses lower precision internally in order to provide faster time to solution, from here cames the name mixed precision. Mixed precision iterative refinement techniques means that the solver compute an LU factorization in lower precision and then iteratively refine the solution to achieve the accuracy of the Inputs/Outputs datatype precision. The <t1> corresponds to the Inputs/Outputs datatype precision while <t2> represent the internal lower precision at which the factorization will be carried on.
\(A \times X = B\) 
Where A
is nbyn
matrix and X
and B
are nbynrhs
matrices.
Functions API are designed to be as close as possible to LAPACK API to be considered as a quick and easy dropin replacement. Parameters and behavior are mostly the same as LAPACK counterparts. Description of these functions and differences from LAPACK is given below. <t1><t2>gesv()
functions are designated by two floating point precisions The <t1> corresponds to the main precision (e.g., Inputs/Outputs datatype precision) and the <t2> represent the internal lower precision at which the factorization will be carried on. cusolver<t1><t2>gesv()
first attempts to factorize the matrix in lower precision and use this factorization within an iterative refinement procedure to obtain a solution with same normwise backward error as the main precision <t1>. If the approach fails to converge, then the method fallback to the main precision factorization and solve (Xgesv) such a way that there is always a good solution at the output of these functions. If <t2> is equal to <t1>, then it is not a mixed precision process but rather a full one precision factorisation, solve and refinement within the same main precision.
The iterative refinement process is stopped if
ITER > ITERMAX
or for all the RHS we have:
RNRM < SQRT(N)*XNRM*ANRM*EPS*BWDMAX
where
ITER is the number of the current iteration in the iterative refinement process
RNRM is the infinitynorm of the residual
XNRM is the infinitynorm of the solution
ANRM is the infinityoperatornorm of the matrix A
EPS is the machine epsilon that matches LAPACK <t1>LAMCH(‘Epsilon’)
The value ITERMAX and BWDMAX are fixed to 50 and 1.0 respectively.
The function returns value describes the results of the solving process. A CUSOLVER_STATUS_SUCCESS indicates that the function finished with success otherwise, it indicates if one of the API arguments is incorrect, or if the function did not finish with success. More details about the error will be in the niters
and the dinfo
API parameters. See their description below for more details. User should provide the required workspace allocated on device memory. The amount of bytes required can be queried by calling the respective function <t1><t2>gesv_bufferSize()
.
Note that in addition to the two mixed precision functions available in LAPACK (e.g., dsgesv
and zcgesv
), we provide a large set of mixed precision functions that include half, bfloat and tensorfloat as a lower precision as well as same precision functions (e.g., main and lowest precision are equal <t2> is equal to <t1>). The following table specifies which precisions will be used for which interface function.
Tensor Float (TF32), introduced with NVIDIA Ampere Architecture GPUs, is the most robust tensor core accelerated compute mode for the iterative refinement solver. It is able to solve the widest range of problems in HPC arising from different applications and provides up to 4X and 5X speedup for real and complex systems, respectively. On Volta and Turing architecture GPUs, half precision tensor core acceleration is recommended. In cases where the iterative refinement solver fails to converge to the desired accuracy (main precision, INOUT data precision), it is recommended to use main precision as internal lowest precision (i.e., cusolverDn[DD,ZZ]gesv
for the FP64 case).
Interface function 
Main precision (matrix, rhs and solution datatype) 
Lowest precision allowed to be used internally 























































cusolverDn<t1><t2>gesv_bufferSize()
functions will return workspace buffer size in bytes required for the corresponding cusolverDn<t1><t2>gesv()
function.
cusolverStatus_t
cusolverDnZZgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
int * dipiv,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnZCgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
int * dipiv,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnZKgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
int * dipiv,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnZEgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
int * dipiv,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnZYgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
int * dipiv,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnCCgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
cuComplex * dA,
int ldda,
int * dipiv,
cuComplex * dB,
int lddb,
cuComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnCKgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
cuComplex * dA,
int ldda,
int * dipiv,
cuComplex * dB,
int lddb,
cuComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnCEgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
cuComplex * dA,
int ldda,
int * dipiv,
cuComplex * dB,
int lddb,
cuComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnCYgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
cuComplex * dA,
int ldda,
int * dipiv,
cuComplex * dB,
int lddb,
cuComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnDDgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
double * dA,
int ldda,
int * dipiv,
double * dB,
int lddb,
double * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnDSgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
double * dA,
int ldda,
int * dipiv,
double * dB,
int lddb,
double * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnDHgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
double * dA,
int ldda,
int * dipiv,
double * dB,
int lddb,
double * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnDBgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
double * dA,
int ldda,
int * dipiv,
double * dB,
int lddb,
double * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnDXgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
double * dA,
int ldda,
int * dipiv,
double * dB,
int lddb,
double * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnSSgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
float * dA,
int ldda,
int * dipiv,
float * dB,
int lddb,
float * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnSHgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
float * dA,
int ldda,
int * dipiv,
float * dB,
int lddb,
float * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnSBgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
float * dA,
int ldda,
int * dipiv,
float * dB,
int lddb,
float * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnSXgesv_bufferSize(
cusolverHandle_t handle,
int n,
int nrhs,
float * dA,
int ldda,
int * dipiv,
float * dB,
int lddb,
float * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
Parameters of cusolverDn<T1><T2>gesv_bufferSize() functions
Parameter 
Memory 
In/out 
Meaning 




Handle to the cusolverDN library context. 



Number of rows and columns of square matrix 



Number of right hand sides to solve. Should be nonnegative. 



Matrix 



Leading dimension of twodimensional array used to store matrix 



Pivoting sequence. Not used and can be 



Set of right hand sides 



Leading dimension of twodimensional array used to store matrix of right hand sides 



Set of soultion vectors 



Leading dimension of twodimensional array used to store matrix of solution vectors 



Pointer to device workspace. Not used and can be 



Pointer to a variable where required size of temporary workspace in bytes will be stored. Can’t be NULL. 
cusolverStatus_t cusolverDnZZgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
int * dipiv,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnZCgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
int * dipiv,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnZKgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
int * dipiv,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnZEgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
int * dipiv,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnZYgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
int * dipiv,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnCCgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
cuComplex * dA,
int ldda,
int * dipiv,
cuComplex * dB,
int lddb,
cuComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnCKgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
cuComplex * dA,
int ldda,
int * dipiv,
cuComplex * dB,
int lddb,
cuComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnCEgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
cuComplex * dA,
int ldda,
int * dipiv,
cuComplex * dB,
int lddb,
cuComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnCYgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
cuComplex * dA,
int ldda,
int * dipiv,
cuComplex * dB,
int lddb,
cuComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnDDgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
double * dA,
int ldda,
int * dipiv,
double * dB,
int lddb,
double * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnDSgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
double * dA,
int ldda,
int * dipiv,
double * dB,
int lddb,
double * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnDHgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
double * dA,
int ldda,
int * dipiv,
double * dB,
int lddb,
double * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnDBgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
double * dA,
int ldda,
int * dipiv,
double * dB,
int lddb,
double * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnDXgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
double * dA,
int ldda,
int * dipiv,
double * dB,
int lddb,
double * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnSSgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
float * dA,
int ldda,
int * dipiv,
float * dB,
int lddb,
float * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnSHgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
float * dA,
int ldda,
int * dipiv,
float * dB,
int lddb,
float * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnSBgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
float * dA,
int ldda,
int * dipiv,
float * dB,
int lddb,
float * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnSXgesv(
cusolverDnHandle_t handle,
int n,
int nrhs,
float * dA,
int ldda,
int * dipiv,
float * dB,
int lddb,
float * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
Parameters of cusolverDn<T1><T2>gesv() functions
Parameter 
Memory 
In/out 
Meaning 




Handle to the cusolverDN library context. 



Number of rows and columns of square matrix 



Number of right hand sides to solve. Should be nonnegative. 



Matrix 



Leading dimension of twodimensional array used to store matrix 



Vector that defines permutation for the factorization  row 



Set of right hand sides 



Leading dimension of twodimensional array used to store matrix of right hand sides 



Set of soultion vectors 



Leading dimension of twodimensional array used to store matrix of solution vectors 



Pointer to an allocated workspace in device memory of size 



Size of the allocated device workspace. Should be at least what was returned by 



If




Status of the IRS solver on the return. If 0  solve was successful. If 
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed, for example:


The IRS solver supports compute capability 7.0 and above. The lowest precision options CUSOLVER_[CR]_16BF and CUSOLVER_[CR]_TF32 are only available on compute capability 8.0 and above. 



Numerical error related to niters <0, see 

An internal error occured, check the 
2.4.2.11. cusolverDnIRSXgesv()
This function is designed to perform same functionality as cusolverDn<T1><T2>gesv()
functions, but wrapped in a more generic and expert interface that gives user more control to parametrize the function as well as it provides more informations on output. cusolverDnIRSXgesv()
allows additional control of the solver parameters such as setting:
the main precision (Inputs/Outputs precision) of the solver
the lowest precision to be used internally by the solver
the refinement solver type
the maximum allowed number of iterations in the refinement phase
the tolerance of the refinement solver
the fallback to main precision
and more
through the configuration parameters structure gesv_irs_params
and its helper functions. For more details about what configuration can be set and its meaning please refer to all the functions in the cuSolverDN Helper Function Section that start with cusolverDnIRSParamsxxxx()
. Moreover, cusolverDnIRSXgesv()
provides additional informations on the output such as the convergence history (e.g., the residual norms) at each iteration and the number of iterations needed to converge. For more details about what informations can be retrieved and its meaning please refer to all the functions in the cuSolverDN Helper Function Section that start with cusolverDnIRSInfosxxxx()
The function returns value describes the results of the solving process. A CUSOLVER_STATUS_SUCCESS indicates that the function finished with success otherwise, it indicates if one of the API arguments is incorrect, or if the configurations of params/infos structure is incorrect or if the function did not finish with success. More details about the error can be found by checking the niters
and the dinfo
API parameters. See their description below for further details. User should provide the required workspace allocated on device for the cusolverDnIRSXgesv()
function. The amount of bytes required for the function can be queried by calling the respective function cusolverDnIRSXgesv_bufferSize()
. Note that, if the user would like a praticular configuration to be set via the params structure, it should be set before the call to cusolverDnIRSXgesv_bufferSize()
to get the size of the required workspace.
Tensor Float (TF32), introduced with NVIDIA Ampere Architecture GPUs, is the most robust tensor core accelerated compute mode for the iterative refinement solver. It is able to solve the widest range of problems in HPC arising from different applications and provides up to 4X and 5X speedup for real and complex systems, respectively. On Volta and Turing architecture GPUs, half precision tensor core acceleration is recommended. In cases where the iterative refinement solver fails to converge to the desired accuracy (main precision, INOUT data precision), it is recommended to use main precision as internal lowest precision.
The following table provides all possible combinations values for the lowest precision corresponding to the Inputs/Outputs data type. Note that if the lowest precision matches the Inputs/Outputs datatype, then the main precision factorization will be used.
Inputs/Outputs Data Type (e.g., main precision) 
Supported values for the lowest precision 









The cusolverDnIRSXgesv_bufferSize()
function returns the required workspace buffer size in bytes for the corresponding cusolverDnXgesv()
call with the given gesv_irs_params
configuration.
cusolverStatus_t
cusolverDnIRSXgesv_bufferSize(
cusolverDnHandle_t handle,
cusolverDnIRSParams_t gesv_irs_params,
cusolver_int_t n,
cusolver_int_t nrhs,
size_t * lwork_bytes);
Parameter 
Memory 
In/out 
Meaning 




Handle to the cusolverDn library context. 







Number of rows and columns of the square matrix 



Number of right hand sides to solve. Should be nonnegative. Note that 



Pointer to a variable, where the required size in bytes, of the workspace will be stored after a call to 
cusolverStatus_t cusolverDnIRSXgesv(
cusolverDnHandle_t handle,
cusolverDnIRSParams_t gesv_irs_params,
cusolverDnIRSInfos_t gesv_irs_infos,
int n,
int nrhs,
void * dA,
int ldda,
void * dB,
int lddb,
void * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * dinfo);
Parameter 
Memory 
In/out 
Meaning 




Handle to the cusolverDn library context. 



Configuration parameters structure, can serve one or more calls to any IRS solver 



Info structure, where information about a particular solve will be stored. The 



Number of rows and columns of square matrix 



Number of right hand sides to solve. Should be nonnegative. Note that, 



Matrix 



Leading dimension of twodimensional array used to store matrix 



Set of right hand sides 



Leading dimension of twodimensional array used to store matrix of right hand sides 



Set of soultion vectors 



Leading dimension of twodimensional array used to store matrix of solution vectors 



Pointer to an allocated workspace in device memory of size lwork_bytes. 



Size of the allocated device workspace. Should be at least what was returned by 



If iter is




Status of the IRS solver on the return. If 0  solve was successful. If dinfo =  
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed, for example:


The IRS solver supports compute capability 7.0 and above. The lowest precision options CUSOLVER_[CR]_16BF and CUSOLVER_[CR]_TF32 are only available on compute capability 8.0 and above. 



Numerical error related to niters <0, see niters description for more details. 

An internal error occured, check the 

The configuration parameter 

One of the configuration parameter in the 

The main and/or the lowest precision configuration parameter in the 

The maxiter configuration parameter in the 

The refinement solver configuration parameter in the 

One of the configuration parameter in the 

The information structure 

CPU memory allocation failed, most likely during the allocation of the residual array that store the residual norms. 
2.4.2.12. cusolverDn<t>geqrf()
These helper functions calculate the size of work buffers needed.
cusolverStatus_t
cusolverDnSgeqrf_bufferSize(cusolverDnHandle_t handle,
int m,
int n,
float *A,
int lda,
int *Lwork );
cusolverStatus_t
cusolverDnDgeqrf_bufferSize(cusolverDnHandle_t handle,
int m,
int n,
double *A,
int lda,
int *Lwork );
cusolverStatus_t
cusolverDnCgeqrf_bufferSize(cusolverDnHandle_t handle,
int m,
int n,
cuComplex *A,
int lda,
int *Lwork );
cusolverStatus_t
cusolverDnZgeqrf_bufferSize(cusolverDnHandle_t handle,
int m,
int n,
cuDoubleComplex *A,
int lda,
int *Lwork );
The S and D data types are real valued single and double precision, respectively.
cusolverStatus_t
cusolverDnSgeqrf(cusolverDnHandle_t handle,
int m,
int n,
float *A,
int lda,
float *TAU,
float *Workspace,
int Lwork,
int *devInfo );
cusolverStatus_t
cusolverDnDgeqrf(cusolverDnHandle_t handle,
int m,
int n,
double *A,
int lda,
double *TAU,
double *Workspace,
int Lwork,
int *devInfo );
The C and Z data types are complex valued single and double precision, respectively.
cusolverStatus_t
cusolverDnCgeqrf(cusolverDnHandle_t handle,
int m,
int n,
cuComplex *A,
int lda,
cuComplex *TAU,
cuComplex *Workspace,
int Lwork,
int *devInfo );
cusolverStatus_t
cusolverDnZgeqrf(cusolverDnHandle_t handle,
int m,
int n,
cuDoubleComplex *A,
int lda,
cuDoubleComplex *TAU,
cuDoubleComplex *Workspace,
int Lwork,
int *devInfo );
This function computes the QR factorization of a m×n
matrix
\(A = Q*R\) 
where A
is an m×n
matrix, Q
is an m×n
matrix, and R
is a n×n
upper triangular matrix.
The user has to provide working space which is pointed by input parameter Workspace
. The input parameter Lwork
is size of the working space, and it is returned by geqrf_bufferSize()
.
The matrix R
is overwritten in upper triangular part of A
, including diagonal elements.
The matrix Q
is not formed explicitly, instead, a sequence of householder vectors are stored in lower triangular part of A
. The leading nonzero element of householder vector is assumed to be 1 such that output parameter TAU
contains the scaling factor τ
. If v
is original householder vector, q
is the new householder vector corresponding to τ
, satisying the following relation
\(I  2*v*v^{H} = I  \tau*q*q^{H}\) 
If output parameter devInfo = i
(less than zero), the ith
parameter is wrong (not counting handle).
API of geqrf
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



Number of rows of matrix 



Number of columns of matrix 



<type> array of dimension 



Leading dimension of twodimensional array used to store matrix 



<type> array of dimension at least 



Working space, <type> array of size 



Size of working array 



If 
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

The device only supports compute capability 2.0 and above. 

An internal operation failed. 
2.4.2.13. cusolverDnGeqrf()[DEPRECATED]
[[DEPRECATED]] use cusolverDnXgeqrf()
instead. The routine will be removed in the next major release.
The helper functions below can calculate the sizes needed for preallocated buffer.
cusolverStatus_t
cusolverDnGeqrf_bufferSize(
cusolverDnHandle_t handle,
cusolverDnParams_t params,
int64_t m,
int64_t n,
cudaDataType dataTypeA,
const void *A,
int64_t lda,
cudaDataType dataTypeTau,
const void *tau,
cudaDataType computeType,
size_t *workspaceInBytes )
The following routine:
cusolverStatus_t
cusolverDnGeqrf(
cusolverDnHandle_t handle,
cusolverDnParams_t params,
int64_t m,
int64_t n,
cudaDataType dataTypeA,
void *A,
int64_t lda,
cudaDataType dataTypeTau,
void *tau,
cudaDataType computeType,
void *pBuffer,
size_t workspaceInBytes,
int *info )
computes the QR factorization of an m×n
matrix
\(A = Q*R\) 
where A
is a m×n
matrix, Q
is an m×n
matrix, and R
is an n×n
upper triangular matrix using the generic API interface.
The user has to provide working space which is pointed by input parameter pBuffer
. The input parameter workspaceInBytes
is size in bytes of the working space, and it is returned by cusolverDnGeqrf_bufferSize()
.
The matrix R
is overwritten in upper triangular part of A
, including diagonal elements.
The matrix Q
is not formed explicitly, instead, a sequence of householder vectors are stored in lower triangular part of A
. The leading nonzero element of householder vector is assumed to be 1 such that output parameter TAU
contains the scaling factor τ
. If v
is original householder vector, q
is the new householder vector corresponding to τ
, satisying the following relation
\(I  2*v*v^{H} = I  \tau*q*q^{H}\) 
If output parameter info = i
(less than zero), the ith
parameter is wrong (not counting handle).
Currently, cusolverDnGeqrf
supports only the default algorithm.
Table of algorithms supported by cusolverDnGeqrf

Default algorithm. 
List of input arguments for cusolverDnGeqrf_bufferSize
and cusolverDnGeqrf
:
API of geqrf
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



Structure with information collected by 



Number of rows of matrix 



Number of columns of matrix 



Data type of array 



Array of dimension 



Leading dimension of twodimensional array used to store matrix 



Array of dimension at least 



Data type of computation. 



Working space. Array of type 



Size in bytes of working array 



If 
The generic API has two different types, dataTypeA
is data type of the matrix A
and array tau
and computeType
is compute type of the operation. cusolverDnGeqrf
only supports the following four combinations.
Valid combination of data type and compute type
DataTypeA 
ComputeType 
Meaning 












Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

The device only supports compute capability 2.0 and above. 

An internal operation failed. 
2.4.2.14. cusolverDn<t1><t2>gels()
These functions compute the solution of a system of linear equations with one or multiple right hand sides using mixed precision iterative refinement techniques based on the QR factorization Xgels. These functions are similar in term of functionalities to the full precision LAPACK QR (least squares) solver (Xgels, where X denotes Z,C,D,S) but it uses lower precision internally in order to provide faster time to solution, from here cames the name mixed precision. Mixed precision iterative refinement techniques means that the solver compute an QR factorization in lower precision and then iteratively refine the solution to achieve the accuracy of the Inputs/Outputs datatype precision. The <t1> corresponds to the Inputs/Outputs datatype precision while <t2> represent the internal lower precision at which the factorization will be carried on.
\(A \times X = B\) 
Where A
is mbyn
matrix and X
is nbynrhs
and B
is mbynrhs
matrices.
Functions API are designed to be as close as possible to LAPACK API to be considered as a quick and easy dropin replacement. Description of these functions is given below. <t1><t2>gels()
functions are designated by two floating point precisions The <t1>
corresponds to the main precision (e.g., Inputs/Outputs datatype precision) and the <t2>
represent the internal lower precision at which the factorization will be carried on. cusolver<t1><t2>gels()
first attempts to factorize the matrix in lower precision and use this factorization within an iterative refinement procedure to obtain a solution with same normwise backward error as the main precision <t1>
. If the approach fails to converge, then the method fallback to the main precision factorization and solve (Xgels) such a way that there is always a good solution at the output of these functions. If <t2>
is equal to <t1>
, then it is not a mixed precision process but rather a full one precision factorisation, solve and refinement within the same main precision.
The iterative refinement process is stopped if:
ITER > ITERMAX
or for all the RHS we have:
RNRM < SQRT(N)*XNRM*ANRM*EPS*BWDMAX
where
ITER is the number of the current iteration in the iterative refinement process
RNRM is the infinitynorm of the residual
XNRM is the infinitynorm of the solution
ANRM is the infinityoperatornorm of the matrix A
EPS is the machine epsilon that matches
LAPACK <t1>LAMCH('Epsilon')
The values ITERMAX and BWDMAX are fixed to 50 and 1.0 respectively.
The function returns value describes the results of the solving process. A CUSOLVER_STATUS_SUCCESS indicates that the function finished with success otherwise, it indicates if one of the API arguments is incorrect, or if the function did not finish with success. More details about the error will be in the niters
and the dinfo
API parameters. See their description below for more details. User should provide the required workspace allocated on device memory. The amount of bytes required can be queried by calling the respective function <t1><t2>gels_bufferSize()
.
We provide a large set of mixed precision functions that include half, bfloat and tensorfloat as a lower precision as well as same precision functions (e.g., main and lowest precision are equal <t2>
is equal to <t1>
). The following table specifies which precisions will be used for which interface function:
Tensor Float (TF32), introduced with NVIDIA Ampere Architecture GPUs, is the most robust tensor core accelerated compute mode for the iterative refinement solver. It is able to solve the widest range of problems in HPC arising from different applications and provides up to 4X and 5X speedup for real and complex systems, respectively. On Volta and Turing architecture GPUs, half precision tensor core acceleration is recommended. In cases where the iterative refinement solver fails to converge to the desired accuracy (main precision, INOUT data precision), it is recommended to use main precision as internal lowest precision (i.e., cusolverDn[DD,ZZ]gels
for the FP64 case).
Interface function 
Main precision (matrix, rhs and solution datatype) 
Lowest precision allowed to be used internally 























































cusolverDn<t1><t2>gels_bufferSize()
functions will return workspace buffer size in bytes required for the corresponding cusolverDn<t1><t2>gels()
function.
cusolverStatus_t
cusolverDnZZgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnZCgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnZKgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnZEgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnZYgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnCCgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
cuComplex * dA,
int ldda,
cuComplex * dB,
int lddb,
cuComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnCKgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
cuComplex * dA,
int ldda,
cuComplex * dB,
int lddb,
cuComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnCEgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
cuComplex * dA,
int ldda,
cuComplex * dB,
int lddb,
cuComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnCYgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
cuComplex * dA,
int ldda,
cuComplex * dB,
int lddb,
cuComplex * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnDDgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
double * dA,
int ldda,
double * dB,
int lddb,
double * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnDSgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
double * dA,
int ldda,
double * dB,
int lddb,
double * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnDHgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
double * dA,
int ldda,
double * dB,
int lddb,
double * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnDBgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
double * dA,
int ldda,
double * dB,
int lddb,
double * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnDXgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
double * dA,
int ldda,
double * dB,
int lddb,
double * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnSSgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
float * dA,
int ldda,
float * dB,
int lddb,
float * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnSHgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
float * dA,
int ldda,
float * dB,
int lddb,
float * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnSBgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
float * dA,
int ldda,
float * dB,
int lddb,
float * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
cusolverStatus_t
cusolverDnSXgels_bufferSize(
cusolverHandle_t handle,
int m,
int n,
int nrhs,
float * dA,
int ldda,
float * dB,
int lddb,
float * dX,
int lddx,
void * dwork,
size_t * lwork_bytes);
Parameter 
Memory 
In/out 
Meaning 




Handle to the cusolverDN library context. 



Number of rows of the matrix 



Number of columns of the matrix 



Number of right hand sides to solve. Should be nonnegative. 



Matrix 



Leading dimension of twodimensional array used to store matrix 



Set of right hand sides 



Leading dimension of twodimensional array used to store matrix of right hand sides 



Set of soultion vectors 



Leading dimension of twodimensional array used to store matrix of solution vectors 



Pointer to device workspace. Not used and can be 



Pointer to a variable where required size of temporary workspace in bytes will be stored. Can’t be NULL. 
cusolverStatus_t cusolverDnZZgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnZCgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnZKgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnZEgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnZYgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
cuDoubleComplex * dA,
int ldda,
cuDoubleComplex * dB,
int lddb,
cuDoubleComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnCCgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
cuComplex * dA,
int ldda,
cuComplex * dB,
int lddb,
cuComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnCKgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
cuComplex * dA,
int ldda,
cuComplex * dB,
int lddb,
cuComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnCEgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
cuComplex * dA,
int ldda,
cuComplex * dB,
int lddb,
cuComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnCYgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
cuComplex * dA,
int ldda,
cuComplex * dB,
int lddb,
cuComplex * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnDDgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
double * dA,
int ldda,
double * dB,
int lddb,
double * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnDSgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
double * dA,
int ldda,
double * dB,
int lddb,
double * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnDHgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
double * dA,
int ldda,
double * dB,
int lddb,
double * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnDBgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
double * dA,
int ldda,
double * dB,
int lddb,
double * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnDXgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
double * dA,
int ldda,
double * dB,
int lddb,
double * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnSSgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
float * dA,
int ldda,
float * dB,
int lddb,
float * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnSHgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
float * dA,
int ldda,
float * dB,
int lddb,
float * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnSBgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
float * dA,
int ldda,
float * dB,
int lddb,
float * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
cusolverStatus_t cusolverDnSXgels(
cusolverDnHandle_t handle,
int m,
int n,
int nrhs,
float * dA,
int ldda,
float * dB,
int lddb,
float * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * niter,
int * dinfo);
Parameter 
Memory 
In/out 
Meaning 




Handle to the cusolverDN library context. 



Number of rows of the matrix 



Number of columns of the matrix 



Number of right hand sides to solve. Should be nonnegative. 



Matrix 



Leading dimension of twodimensional array used to store matrix 



Set of right hand sides 



Leading dimension of twodimensional array used to store matrix of right hand sides 



Set of soultion vectors 



Leading dimension of twodimensional array used to store matrix of solution vectors 



Pointer to an allocated workspace in device memory of size lwork_bytes. 



Size of the allocated device workspace. Should be at least what was returned by 



If iter is




Status of the IRS solver on the return. If 0  solve was successful. If dinfo =  
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed, for example:


The IRS solver supports compute capability 7.0 and above. The lowest precision options CUSOLVER_[CR]_16BF and CUSOLVER_[CR]_TF32 are only available on compute capability 8.0 and above. 



Numerical error related to niters <0, see niters description for more details. 

An internal error occurred; check the 
2.4.2.15. cusolverDnIRSXgels()
This function is designed to perform same functionality as cusolverDn<T1><T2>gels()
functions, but wrapped in a more generic and expert interface that gives user more control to parametrize the function as well as it provides more informations on output. cusolverDnIRSXgels()
allows additional control of the solver parameters such as setting:
the main precision (Inputs/Outputs precision) of the solver,
the lowest precision to be used internally by the solver,
the refinement solver type
the maximum allowed number of iterations in the refinement phase
the tolerance of the refinement solver
the fallback to main precision
and others
through the configuration parameters structure gels_irs_params
and its helper functions. For more details about what configuration can be set and its meaning please refer to all the functions in the cuSolverDN Helper Function Section that start with cusolverDnIRSParamsxxxx()
. Moreover, cusolverDnIRSXgels()
provides additional informations on the output such as the convergence history (e.g., the residual norms) at each iteration and the number of iterations needed to converge. For more details about what informations can be retrieved and its meaning please refer to all the functions in the cuSolverDN Helper Function Section that start with cusolverDnIRSInfosxxxx()
.
The function returns value describes the results of the solving process. A CUSOLVER_STATUS_SUCCESS indicates that the function finished with success otherwise, it indicates if one of the API arguments is incorrect, or if the configurations of params/infos structure is incorrect or if the function did not finish with success. More details about the error can be found by checking the niters
and the dinfo
API parameters. See their description below for further details. Users should provide the required workspace allocated on device for the cusolverDnIRSXgels()
function. The amount of bytes required for the function can be queried by calling the respective function cusolverDnIRSXgels_bufferSize()
. Note that, if the user would like a praticular configuration to be set via the params structure, it should be set before the call to cusolverDnIRSXgels_bufferSize()
to get the size of the required workspace.
The following table provides all possible combinations values for the lowest precision corresponding to the Inputs/Outputs data type. Note that if the lowest precision matches the Inputs/Outputs datatype, then main precision factorization will be used
Tensor Float (TF32), introduced with NVIDIA Ampere Architecture GPUs, is the most robust tensor core accelerated compute mode for the iterative refinement solver. It is able to solve the widest range of problems in HPC arising from different applications and provides up to 4X and 5X speedup for real and complex systems, respectively. On Volta and Turing architecture GPUs, half precision tensor core acceleration is recommended. In cases where the iterative refinement solver fails to converge to the desired accuracy (main precision, INOUT data precision), it is recommended to use main precision as internal lowest precision.
Inputs/Outputs Data Type (e.g., main precision) 
Supported values for the lowest precision 









The cusolverDnIRSXgels_bufferSize()
function return the required workspace buffer size in bytes for the corresponding cusolverDnXgels()
call with given gels_irs_params
configuration.
cusolverStatus_t
cusolverDnIRSXgels_bufferSize(
cusolverDnHandle_t handle,
cusolverDnIRSParams_t gels_irs_params,
cusolver_int_t m,
cusolver_int_t n,
cusolver_int_t nrhs,
size_t * lwork_bytes);
Parameters of cusolverDnIRSXgels_bufferSize() functions
Parameter 
Memory 
In/out 
Meaning 




Handle to the cusolverDn library context. 



Xgels configuration parameters 



Number of rows of the matrix 



Number of columns of the matrix 



Number of right hand sides to solve. Should be nonnegative. Note that, 



Pointer to a variable, where the required size in bytes, of the workspace will be stored after a call to cusolverDnIRSXgels_bufferSize. Can’t be NULL. 
cusolverStatus_t cusolverDnIRSXgels(
cusolverDnHandle_t handle,
cusolverDnIRSParams_t gels_irs_params,
cusolverDnIRSInfos_t gels_irs_infos,
int m,
int n,
int nrhs,
void * dA,
int ldda,
void * dB,
int lddb,
void * dX,
int lddx,
void * dWorkspace,
size_t lwork_bytes,
int * dinfo);
Parameter 
Memory 
In/out 
Meaning 




Handle to the cusolverDn library context. 



Configuration parameters structure, can serve one or more calls to any IRS solver 



Info structure, where information about a particular solve will be stored. The 



Number of rows of the matrix 



Number of columns of the matrix 



Number of right hand sides to solve. Should be nonnegative. Note that, 



Matrix 



Leading dimension of twodimensional array used to store matrix 



Set of right hand sides 



Leading dimension of twodimensional array used to store matrix of right hand sides 



Set of soultion vectors 



Leading dimension of twodimensional array used to store matrix of solution vectors 



Pointer to an allocated workspace in device memory of size lwork_bytes. 



Size of the allocated device workspace. Should be at least what was returned by 



If




Status of the IRS solver on the return. If 0  solve was successful. If dinfo =  
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed, for example:


The IRS solver supports compute capability 7.0 and above. The lowest precision options CUSOLVER_[CR]_16BF and CUSOLVER_[CR]_TF32 are only available on compute capability 8.0 and above. 



Numerical error related to 

An internal error occured, check the 

The configuration parameter 

One of the configuration parameter in the 

The main and/or the lowest precision configuration parameter in the 

The maxiter configuration parameter in the 

The refinement solver configuration parameter in the gels_irs_params structure is not valid. 

One of the configuration parameter in the 

The information structure 

CPU memory allocation failed, most likely during the allocation of the residual array that store the residual norms. 
2.4.2.16. cusolverDn<t>ormqr()
These helper functions calculate the size of work buffers needed. Please visit cuSOLVER Library Samples  ormqr for a code example.
cusolverStatus_t
cusolverDnSormqr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasOperation_t trans,
int m,
int n,
int k,
const float *A,
int lda,
const float *tau,
const float *C,
int ldc,
int *lwork);
cusolverStatus_t
cusolverDnDormqr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasOperation_t trans,
int m,
int n,
int k,
const double *A,
int lda,
const double *tau,
const double *C,
int ldc,
int *lwork);
cusolverStatus_t
cusolverDnCunmqr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasOperation_t trans,
int m,
int n,
int k,
const cuComplex *A,
int lda,
const cuComplex *tau,
const cuComplex *C,
int ldc,
int *lwork);
cusolverStatus_t
cusolverDnZunmqr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasOperation_t trans,
int m,
int n,
int k,
const cuDoubleComplex *A,
int lda,
const cuDoubleComplex *tau,
const cuDoubleComplex *C,
int ldc,
int *lwork);
The S and D data types are real valued single and double precision, respectively.
cusolverStatus_t
cusolverDnSormqr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasOperation_t trans,
int m,
int n,
int k,
const float *A,
int lda,
const float *tau,
float *C,
int ldc,
float *work,
int lwork,
int *devInfo);
cusolverStatus_t
cusolverDnDormqr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasOperation_t trans,
int m,
int n,
int k,
const double *A,
int lda,
const double *tau,
double *C,
int ldc,
double *work,
int lwork,
int *devInfo);
The C and Z data types are complex valued single and double precision, respectively.
cusolverStatus_t
cusolverDnCunmqr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasOperation_t trans,
int m,
int n,
int k,
const cuComplex *A,
int lda,
const cuComplex *tau,
cuComplex *C,
int ldc,
cuComplex *work,
int lwork,
int *devInfo);
cusolverStatus_t
cusolverDnZunmqr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasOperation_t trans,
int m,
int n,
int k,
const cuDoubleComplex *A,
int lda,
const cuDoubleComplex *tau,
cuDoubleComplex *C,
int ldc,
cuDoubleComplex *work,
int lwork,
int *devInfo);
This function overwrites m×n
matrix C
by
\(C = \left\{ \begin{matrix} {\text{op}(Q)*C} & {\text{if~}\textsf{side\ ==\ CUBLAS\_SIDE\_LEFT}} \\ {C*\text{op}(Q)} & {\text{if~}\textsf{side\ ==\ CUBLAS\_SIDE\_RIGHT}} \\ \end{matrix} \right.\) 
The operation of Q
is defined by
\(\text{op}(Q) = \left\{ \begin{matrix} Q & {\text{if~}\textsf{transa\ ==\ CUBLAS\_OP\_N}} \\ Q^{T} & {\text{if~}\textsf{transa\ ==\ CUBLAS\_OP\_T}} \\ Q^{H} & {\text{if~}\textsf{transa\ ==\ CUBLAS\_OP\_C}} \\ \end{matrix} \right.\) 
Q
is a unitary matrix formed by a sequence of elementary reflection vectors from QR factorization (geqrf
) of A
.
Q
=H(1)
H(2)
… H(k)
Q
is of order m
if side
= CUBLAS_SIDE_LEFT
and of order n
if side
= CUBLAS_SIDE_RIGHT
.
The user has to provide working space which is pointed by input parameter work
. The input parameter lwork
is size of the working space, and it is returned by geqrf_bufferSize()
or ormqr_bufferSize()
.
If output parameter devInfo = i
(less than zero), the ith
parameter is wrong (not counting handle).
The user can combine geqrf
, ormqr
and trsm
to complete a linear solver or a leastsquare solver.
API of ormqr
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDn library context. 



Indicates if matrix 



Operation 



Number of rows of matrix 



Number of columns of matrix 



Number of elementary relfections whose product defines the matrix Q. 



<type> array of dimension 



Leading dimension of twodimensional array used to store matrix 



<type> array of dimension at least 



<type> array of size 



Leading dimension of twodimensional array of matrix 



Working space, <type> array of size 



Size of working array 



If 
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

The device only supports compute capability 2.0 and above. 

An internal operation failed. 
2.4.2.17. cusolverDn<t>orgqr()
These helper functions calculate the size of work buffers needed. Please visit cuSOLVER Library Samples  orgqr for a code example.
cusolverStatus_t
cusolverDnSorgqr_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int k,
const float *A,
int lda,
const float *tau,
int *lwork);
cusolverStatus_t
cusolverDnDorgqr_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int k,
const double *A,
int lda,
const double *tau,
int *lwork);
cusolverStatus_t
cusolverDnCungqr_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int k,
const cuComplex *A,
int lda,
const cuComplex *tau,
int *lwork);
cusolverStatus_t
cusolverDnZungqr_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int k,
const cuDoubleComplex *A,
int lda,
const cuDoubleComplex *tau,
int *lwork);
The S and D data types are real valued single and double precision, respectively.
cusolverStatus_t
cusolverDnSorgqr(
cusolverDnHandle_t handle,
int m,
int n,
int k,
float *A,
int lda,
const float *tau,
float *work,
int lwork,
int *devInfo);
cusolverStatus_t
cusolverDnDorgqr(
cusolverDnHandle_t handle,
int m,
int n,
int k,
double *A,
int lda,
const double *tau,
double *work,
int lwork,
int *devInfo);
The C and Z data types are complex valued single and double precision, respectively.
cusolverStatus_t
cusolverDnCungqr(
cusolverDnHandle_t handle,
int m,
int n,
int k,
cuComplex *A,
int lda,
const cuComplex *tau,
cuComplex *work,
int lwork,
int *devInfo);
cusolverStatus_t
cusolverDnZungqr(
cusolverDnHandle_t handle,
int m,
int n,
int k,
cuDoubleComplex *A,
int lda,
const cuDoubleComplex *tau,
cuDoubleComplex *work,
int lwork,
int *devInfo);
This function overwrites m×n
matrix A
by
\(Q = {H(1)}*{H(2)}*{...}*{H(k)}\) 
where Q
is a unitary matrix formed by a sequence of elementary reflection vectors stored in A
.
The user has to provide working space which is pointed by input parameter work
. The input parameter lwork
is size of the working space, and it is returned by orgqr_bufferSize()
.
If output parameter devInfo = i
(less than zero), the ith
parameter is wrong (not counting handle).
The user can combine geqrf
, orgqr
to complete orthogonalization.
API of ormqr
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



Number of rows of matrix 



Number of columns of matrix 



Number of elementary relfections whose product defines the matrix 



<type> array of dimension 



Leading dimension of twodimensional array used to store matrix 



<type> array of dimension 



Working space, <type> array of size 



Size of working array 



If 
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

The device only supports compute capability 2.0 and above. 

An internal operation failed. 
2.4.2.18. cusolverDn<t>sytrf()
These helper functions calculate the size of the needed buffers.
cusolverStatus_t
cusolverDnSsytrf_bufferSize(cusolverDnHandle_t handle,
int n,
float *A,
int lda,
int *Lwork );
cusolverStatus_t
cusolverDnDsytrf_bufferSize(cusolverDnHandle_t handle,
int n,
double *A,
int lda,
int *Lwork );
cusolverStatus_t
cusolverDnCsytrf_bufferSize(cusolverDnHandle_t handle,
int n,
cuComplex *A,
int lda,
int *Lwork );
cusolverStatus_t
cusolverDnZsytrf_bufferSize(cusolverDnHandle_t handle,
int n,
cuDoubleComplex *A,
int lda,
int *Lwork );
The S and D data types are real valued single and double precision, respectively.
cusolverStatus_t
cusolverDnSsytrf(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
float *A,
int lda,
int *ipiv,
float *work,
int lwork,
int *devInfo );
cusolverStatus_t
cusolverDnDsytrf(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
double *A,
int lda,
int *ipiv,
double *work,
int lwork,
int *devInfo );
The C and Z data types are complex valued single and double precision, respectively.
cusolverStatus_t
cusolverDnCsytrf(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuComplex *A,
int lda,
int *ipiv,
cuComplex *work,
int lwork,
int *devInfo );
cusolverStatus_t
cusolverDnZsytrf(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *A,
int lda,
int *ipiv,
cuDoubleComplex *work,
int lwork,
int *devInfo );
This function computes the BunchKaufman factorization of a n×n
symmetric indefinite matrix
A
is a n×n
symmetric matrix, only lower or upper part is meaningful. The input parameter uplo
which part of the matrix is used. The function would leave other part untouched.
If input parameter uplo
is CUBLAS_FILL_MODE_LOWER
, only lower triangular part of A
is processed, and replaced by lower triangular factor L
and block diagonal matrix D
. Each block of D
is either 1x1 or 2x2 block, depending on pivoting.
\(P*A*P^{T} = L*D*L^{T}\) 
If input parameter uplo
is CUBLAS_FILL_MODE_UPPER
, only upper triangular part of A
is processed, and replaced by upper triangular factor U
and block diagonal matrix D
.
\(P*A*P^{T} = U*D*U^{T}\) 
The user has to provide working space which is pointed by input parameter work
. The input parameter lwork
is size of the working space, and it is returned by sytrf_bufferSize()
.
If BunchKaufman factorization failed, i.e. A
is singular. The output parameter devInfo = i
would indicate D(i,i)=0
.
If output parameter devInfo = i
(less than zero), the ith
parameter is wrong (not counting handle).
The output parameter devIpiv
contains pivoting sequence. If devIpiv(i) = k > 0
, D(i,i)
is 1x1 block, and ith
row/column of A
is interchanged with kth
row/column of A
. If uplo
is CUBLAS_FILL_MODE_UPPER
and devIpiv(i1) = devIpiv(i) = m < 0
, D(i1:i,i1:i)
is a 2x2 block, and (i1)th
row/column is interchanged with mth
row/column. If uplo
is CUBLAS_FILL_MODE_LOWER
and devIpiv(i+1) = devIpiv(i) = m < 0
, D(i:i+1,i:i+1)
is a 2x2 block, and (i+1)th
row/column is interchanged with mth
row/column.
API of sytrf
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



Indicates if matrix 



Number of rows and columns of matrix 



<type> array of dimension 



Leading dimension of twodimensional array used to store matrix 



Array of size at least 



Working space, <type> array of size 



Size of working space 



if 
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

The device only supports compute capability 2.0 and above. 

An internal operation failed. 
2.4.2.19. cusolverDn<t>potrfBatched()
The S and D data types are real valued single and double precision, respectively. Please visit cuSOLVER Library Samples  potrfBatched for a code example.
cusolverStatus_t
cusolverDnSpotrfBatched(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
float *Aarray[],
int lda,
int *infoArray,
int batchSize);
cusolverStatus_t
cusolverDnDpotrfBatched(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
double *Aarray[],
int lda,
int *infoArray,
int batchSize);
The C and Z data types are complex valued single and double precision, respectively.
cusolverStatus_t
cusolverDnCpotrfBatched(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuComplex *Aarray[],
int lda,
int *infoArray,
int batchSize);
cusolverStatus_t
cusolverDnZpotrfBatched(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *Aarray[],
int lda,
int *infoArray,
int batchSize);
This function computes the Cholesky factorization of a squence of Hermitian positivedefinite matrices.
Each Aarray[i] for i=0,1,..., batchSize1
is a n×n
Hermitian matrix, only lower or upper part is meaningful. The input parameter uplo
indicates which part of the matrix is used.
If input parameter uplo
is CUBLAS_FILL_MODE_LOWER
, only lower triangular part of A
is processed, and replaced by lower triangular Cholesky factor L
.
\(A = L*L^{H}\) 
If input parameter uplo
is CUBLAS_FILL_MODE_UPPER
, only upper triangular part of A
is processed, and replaced by upper triangular Cholesky factor U
.
\(A = U^{H}*U\) 
If Cholesky factorization failed, i.e. some leading minor of A
is not positive definite, or equivalently some diagonal elements of L
or U
is not a real number. The output parameter infoArray
would indicate smallest leading minor of A
which is not positive definite.
infoArray
is an integer array of size batchsize
. If potrfBatched
returns CUSOLVER_STATUS_INVALID_VALUE
, infoArray[0] = i
(less than zero), meaning that the ith
parameter is wrong (not counting handle). If potrfBatched
returns CUSOLVER_STATUS_SUCCESS
but infoArray[i] = k
is positive, then ith
matrix is not positive definite and the Cholesky factorization failed at row k
.
Remark: the other part of A
is used as a workspace. For example, if uplo
is CUBLAS_FILL_MODE_UPPER
, upper triangle of A
contains cholesky factor U
and lower triangle of A
is destroyed after potrfBatched
.
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



Indicates if lower or upper part is stored; the other part is used as a workspace. 



Number of rows and columns of matrix 



Array of pointers to <type> array of dimension 



Leading dimension of twodimensional array used to store each matrix 



Array of size 



Number of pointers in 
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

An internal operation failed. 
2.4.2.20. cusolverDn<t>potrsBatched()
cusolverStatus_t
cusolverDnSpotrsBatched(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
int nrhs,
float *Aarray[],
int lda,
float *Barray[],
int ldb,
int *info,
int batchSize);
cusolverStatus_t
cusolverDnDpotrsBatched(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
int nrhs,
double *Aarray[],
int lda,
double *Barray[],
int ldb,
int *info,
int batchSize);
cusolverStatus_t
cusolverDnCpotrsBatched(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
int nrhs,
cuComplex *Aarray[],
int lda,
cuComplex *Barray[],
int ldb,
int *info,
int batchSize);
cusolverStatus_t
cusolverDnZpotrsBatched(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
int nrhs,
cuDoubleComplex *Aarray[],
int lda,
cuDoubleComplex *Barray[],
int ldb,
int *info,
int batchSize);
This function solves a squence of linear systems
\({A\lbrack i\rbrack}*{X\lbrack i\rbrack} = {B\lbrack i\rbrack}\) 
where each Aarray[i] for i=0,1,..., batchSize1
is a n×n
Hermitian matrix, only lower or upper part is meaningful. The input parameter uplo
indicates which part of the matrix is used.
The user has to call potrfBatched
first to factorize matrix Aarray[i]
. If input parameter uplo
is CUBLAS_FILL_MODE_LOWER
, A
is lower triangular Cholesky factor L
correspoding to \(A = L*L^{H}\) . If input parameter uplo
is CUBLAS_FILL_MODE_UPPER
, A
is upper triangular Cholesky factor U
corresponding to \(A = U^{H}*U\) .
The operation is inplace, i.e. matrix X
overwrites matrix B
with the same leading dimension ldb
.
The output parameter info
is a scalar. If info = i
(less than zero), the ith
parameter is wrong (not counting handle).
Remark 1: only nrhs=1
is supported.
Remark 2: infoArray
from potrfBatched
indicates if the matrix is positive definite. info
from potrsBatched
only shows which input parameter is wrong (not counting handle).
Remark 3: the other part of A
is used as a workspace. For example, if uplo
is CUBLAS_FILL_MODE_UPPER
, upper triangle of A
contains cholesky factor U
and lower triangle of A
is destroyed after potrsBatched
.
Please visit cuSOLVER Library Samples  potrfBatched for a code example.
API of potrsBatched
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolveDN library context. 



Indicates if matrix 



Number of rows and columns of matrix 



Number of columns of matrix 



Array of pointers to <type> array of dimension 



Leading dimension of twodimensional array used to store each matrix 



Array of pointers to <type> array of dimension 



Leading dimension of twodimensional array used to store each matrix 



If 



Number of pointers in 
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

An internal operation failed. 
2.4.3. Dense Eigenvalue Solver Reference (legacy)
This chapter describes eigenvalue solver API of cuSolverDN, including bidiagonalization and SVD.
2.4.3.1. cusolverDn<t>gebrd()
These helper functions calculate the size of work buffers needed.
cusolverStatus_t
cusolverDnSgebrd_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int *Lwork );
cusolverStatus_t
cusolverDnDgebrd_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int *Lwork );
cusolverStatus_t
cusolverDnCgebrd_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int *Lwork );
cusolverStatus_t
cusolverDnZgebrd_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int *Lwork );
The S and D data types are real valued single and double precision, respectively.
cusolverStatus_t
cusolverDnSgebrd(cusolverDnHandle_t handle,
int m,
int n,
float *A,
int lda,
float *D,
float *E,
float *TAUQ,
float *TAUP,
float *Work,
int Lwork,
int *devInfo );
cusolverStatus_t
cusolverDnDgebrd(cusolverDnHandle_t handle,
int m,
int n,
double *A,
int lda,
double *D,
double *E,
double *TAUQ,
double *TAUP,
double *Work,
int Lwork,
int *devInfo );
The C and Z data types are complex valued single and double precision, respectively.
cusolverStatus_t
cusolverDnCgebrd(cusolverDnHandle_t handle,
int m,
int n,
cuComplex *A,
int lda,
float *D,
float *E,
cuComplex *TAUQ,
cuComplex *TAUP,
cuComplex *Work,
int Lwork,
int *devInfo );
cusolverStatus_t
cusolverDnZgebrd(cusolverDnHandle_t handle,
int m,
int n,
cuDoubleComplex *A,
int lda,
double *D,
double *E,
cuDoubleComplex *TAUQ,
cuDoubleComplex *TAUP,
cuDoubleComplex *Work,
int Lwork,
int *devInfo );
This function reduces a general m×n
matrix A
to a real upper or lower bidiagonal form B
by an orthogonal transformation: \(Q^{H}*A*P = B\)
If m>=n
, B
is upper bidiagonal; if m<n
, B
is lower bidiagonal.
The matrix Q
and P
are overwritten into matrix A
in the following sense:
if
m>=n
, the diagonal and the first superdiagonal are overwritten with the upper bidiagonal matrixB
; the elements below the diagonal, with the arrayTAUQ
, represent the orthogonal matrixQ
as a product of elementary reflectors, and the elements above the first superdiagonal, with the arrayTAUP
, represent the orthogonal matrixP
as a product of elementary reflectors.if
m<n
, the diagonal and the first subdiagonal are overwritten with the lower bidiagonal matrixB
; the elements below the first subdiagonal, with the arrayTAUQ
, represent the orthogonal matrixQ
as a product of elementary reflectors, and the elements above the diagonal, with the arrayTAUP
, represent the orthogonal matrixP
as a product of elementary reflectors.
The user has to provide working space which is pointed by input parameter Work
. The input parameter Lwork
is size of the working space, and it is returned by gebrd_bufferSize()
.
If output parameter devInfo = i
(less than zero), the ith
parameter is wrong (not counting handle).
Remark: gebrd
only supports m>=n
.
API of gebrd
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



Number of rows of matrix 



Number of columns of matrix 



<type> array of dimension 



Leading dimension of twodimensional array used to store matrix 



Real array of dimension 



Real array of dimension 



<type> array of dimension 



<type> array of dimension 



Working space, <type> array of size 



Size of 



If 
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

The device only supports compute capability 2.0 and above. 

An internal operation failed. 
2.4.3.2. cusolverDn<t>orgbr()
These helper functions calculate the size of work buffers needed.
cusolverStatus_t
cusolverDnSorgbr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
int m,
int n,
int k,
const float *A,
int lda,
const float *tau,
int *lwork);
cusolverStatus_t
cusolverDnDorgbr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
int m,
int n,
int k,
const double *A,
int lda,
const double *tau,
int *lwork);
cusolverStatus_t
cusolverDnCungbr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
int m,
int n,
int k,
const cuComplex *A,
int lda,
const cuComplex *tau,
int *lwork);
cusolverStatus_t
cusolverDnZungbr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
int m,
int n,
int k,
const cuDoubleComplex *A,
int lda,
const cuDoubleComplex *tau,
int *lwork);
The S and D data types are real valued single and double precision, respectively.
cusolverStatus_t
cusolverDnSorgbr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
int m,
int n,
int k,
float *A,
int lda,
const float *tau,
float *work,
int lwork,
int *devInfo);
cusolverStatus_t
cusolverDnDorgbr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
int m,
int n,
int k,
double *A,
int lda,
const double *tau,
double *work,
int lwork,
int *devInfo);
The C and Z data types are complex valued single and double precision, respectively.
cusolverStatus_t
cusolverDnCungbr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
int m,
int n,
int k,
cuComplex *A,
int lda,
const cuComplex *tau,
cuComplex *work,
int lwork,
int *devInfo);
cusolverStatus_t
cusolverDnZungbr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
int m,
int n,
int k,
cuDoubleComplex *A,
int lda,
const cuDoubleComplex *tau,
cuDoubleComplex *work,
int lwork,
int *devInfo);
This function generates one of the unitary matrices Q
or P**H
determined by gebrd
when reducing a matrix A to bidiagonal form: \(Q^{H}*A*P = B\)
Q
and P**H
are defined as products of elementary reflectors H(i) or G(i) respectively.
The user has to provide working space which is pointed by input parameter work
. The input parameter lwork
is size of the working space, and it is returned by orgbr_bufferSize()
.
If output parameter devInfo = i
(less than zero), the ith
parameter is wrong (not counting handle).
API of orgbr
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



If 



Number of rows of matrix 



If 



If 



<type> array of dimension 



Leading dimension of twodimensional array used to store matrix 



<type> array of dimension 



Working space, <type> array of size 



Size of working array 



If 
Status Returned

the operation completed successfully. 

the library was not initialized. 

invalid parameters were passed ( 

the device only supports compute capability 2.0 and above. 

an internal operation failed. 
2.4.3.3. cusolverDn<t>sytrd()
These helper functions calculate the size of work buffers needed.
cusolverStatus_t
cusolverDnSsytrd_bufferSize(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
const float *A,
int lda,
const float *d,
const float *e,
const float *tau,
int *lwork);
cusolverStatus_t
cusolverDnDsytrd_bufferSize(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
const double *A,
int lda,
const double *d,
const double *e,
const double *tau,
int *lwork);
cusolverStatus_t
cusolverDnChetrd_bufferSize(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
const cuComplex *A,
int lda,
const float *d,
const float *e,
const cuComplex *tau,
int *lwork);
cusolverStatus_t
cusolverDnZhetrd_bufferSize(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
const cuDoubleComplex *A,
int lda,
const double *d,
const double *e,
const cuDoubleComplex *tau,
int *lwork);
The S and D data types are real valued single and double precision, respectively.
cusolverStatus_t
cusolverDnSsytrd(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
float *A,
int lda,
float *d,
float *e,
float *tau,
float *work,
int lwork,
int *devInfo);
cusolverStatus_t
cusolverDnDsytrd(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
double *A,
int lda,
double *d,
double *e,
double *tau,
double *work,
int lwork,
int *devInfo);
The C and Z data types are complex valued single and double precision, respectively.
cusolverStatus_t
cusolverDnChetrd(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuComplex *A,
int lda,
float *d,
float *e,
cuComplex *tau,
cuComplex *work,
int lwork,
int *devInfo);
cusolverStatus_t CUDENSEAPI cusolverDnZhetrd(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *A,
int lda,
double *d,
double *e,
cuDoubleComplex *tau,
cuDoubleComplex *work,
int lwork,
int *devInfo);
This function reduces a general symmetric (Hermitian) n×n
matrix A
to real symmetric tridiagonal form T
by an orthogonal transformation: \(Q^{H}*A*Q = T\)
As an output, A
contains T
and householder reflection vectors. If uplo
= CUBLAS_FILL_MODE_UPPER
, the diagonal and first superdiagonal of A
are overwritten by the corresponding elements of the tridiagonal matrix T
, and the elements above the first superdiagonal, with the array tau
, represent the orthogonal matrix Q
as a product of elementary reflectors; If uplo
= CUBLAS_FILL_MODE_LOWER
, the diagonal and first subdiagonal of A
are overwritten by the corresponding elements of the tridiagonal matrix T
, and the elements below the first subdiagonal, with the array tau
, represent the orthogonal matrix Q
as a product of elementary reflectors.
The user has to provide working space which is pointed by input parameter work
. The input parameter lwork
is size of the working space, and it is returned by sytrd_bufferSize()
.
If output parameter devInfo = i
(less than zero), the ith
parameter is wrong (not counting handle).
API of sytrd
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 



Specifies which part of 



Number of rows (columns) of matrix 



<type> array of dimension 



Leading dimension of twodimensional array used to store matrix 



Real array of dimension 



Real array of dimension 



<type> array of dimension 



Working space, <type> array of size 



Size of 



If 
Status Returned

The operation completed successfully. 

The library was not initialized. 

Invalid parameters were passed ( 

The device only supports compute capability 2.0 and above. 

An internal operation failed. 
2.4.3.4. cusolverDn<t>ormtr()
These helper functions calculate the size of work buffers needed.
cusolverStatus_t
cusolverDnSormtr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
int m,
int n,
const float *A,
int lda,
const float *tau,
const float *C,
int ldc,
int *lwork);
cusolverStatus_t
cusolverDnDormtr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
int m,
int n,
const double *A,
int lda,
const double *tau,
const double *C,
int ldc,
int *lwork);
cusolverStatus_t
cusolverDnCunmtr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
int m,
int n,
const cuComplex *A,
int lda,
const cuComplex *tau,
const cuComplex *C,
int ldc,
int *lwork);
cusolverStatus_t
cusolverDnZunmtr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
int m,
int n,
const cuDoubleComplex *A,
int lda,
const cuDoubleComplex *tau,
const cuDoubleComplex *C,
int ldc,
int *lwork);
The S and D data types are real valued single and double precision, respectively.
cusolverStatus_t
cusolverDnSormtr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
int m,
int n,
float *A,
int lda,
float *tau,
float *C,
int ldc,
float *work,
int lwork,
int *devInfo);
cusolverStatus_t
cusolverDnDormtr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
int m,
int n,
double *A,
int lda,
double *tau,
double *C,
int ldc,
double *work,
int lwork,
int *devInfo);
The C and Z data types are complex valued single and double precision, respectively.
cusolverStatus_t
cusolverDnCunmtr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
int m,
int n,
cuComplex *A,
int lda,
cuComplex *tau,
cuComplex *C,
int ldc,
cuComplex *work,
int lwork,
int *devInfo);
cusolverStatus_t
cusolverDnZunmtr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
int m,
int n,
cuDoubleComplex *A,
int lda,
cuDoubleComplex *tau,
cuDoubleComplex *C,
int ldc,
cuDoubleComplex *work,
int lwork,
int *devInfo);
This function overwrites m×n
matrix C
by
\(C = \left\{ \begin{matrix} {\text{op}(Q)*C} & {\text{if~}\textsf{side\ ==\ CUBLAS\_SIDE\_LEFT}} \\ {C*\text{op}(Q)} & {\text{if~}\textsf{side\ ==\ CUBLAS\_SIDE\_RIGHT}} \\ \end{matrix} \right.\) 
where Q
is a unitary matrix formed by a sequence of elementary reflection vectors from sytrd
.
The operation on Q
is defined by
\(\text{op}(Q) = \left\{ \begin{matrix} Q & {\text{if~}\textsf{transa\ ==\ CUBLAS\_OP\_N}} \\ Q^{T} & {\text{if~}\textsf{transa\ ==\ CUBLAS\_OP\_T}} \\ Q^{H} & {\text{if~}\textsf{transa\ ==\ CUBLAS\_OP\_C}} \\ \end{matrix} \right.\) 
The user has to provide working space which is pointed by input parameter work
. The input parameter lwork
is size of the working space, and it is returned by ormtr_bufferSize()
.
If output parameter devInfo = i
(less than zero), the ith
parameter is wrong (not counting handle).
API of ormtr
Parameter 
Memory 
In/out 
Meaning 




Handle to the cuSolverDN library context. 











Operation 



Number of rows of matrix 



Number of columns of matrix 



<type> array of dimension 