1. Introduction
 The cuSolver API on a single GPU
 The cuSolverMG API on a single node multiGPU
The intent of cuSolver is to provide useful LAPACKlike features, such as common matrix factorization and triangular solve routines for dense matrices, a sparse leastsquares solver and an eigenvalue solver. In addition cuSolver provides a new refactorization library useful for solving sequences of matrices with a shared sparsity pattern.
cuSolver combines three separate components under a single umbrella. The first part of cuSolver is called cuSolverDN, and deals with dense matrix factorization and solve routines such as LU, QR, SVD and LDLT, as well as useful utilities such as matrix and vector permutations.
Next, cuSolverSP provides a new set of sparse routines based on a sparse QR factorization. Not all matrices have a good sparsity pattern for parallelism in factorization, so the cuSolverSP library also provides a CPU path to handle those sequentiallike matrices. For those matrices with abundant parallelism, the GPU path will deliver higher performance. The library is designed to be called from C and C++.
The final part is cuSolverRF, a sparse refactorization package that can provide very good performance when solving a sequence of matrices where only the coefficients are changed but the sparsity pattern remains the same.
The GPU path of the cuSolver library assumes data is already in the device memory. It is the responsibility of the developer to allocate memory and to copy data between GPU memory and CPU memory using standard CUDA runtime API routines, such as cudaMalloc(), cudaFree(), cudaMemcpy(), and cudaMemcpyAsync().
cuSolverMg is GPUaccelerated ScaLAPACK. By now, cuSolverMg supports 1D column block cyclic layout and provides symmetric eigenvalue solver.
cuSolverDN: Dense LAPACK
The cuSolverDN library was designed to solve dense linear systems of the form
$Ax=b$ 
where the coefficient matrix $A\in {R}^{\mathrm{nxn}}$ , righthandside vector $b\in {R}^{n}$ and solution vector $x\in {R}^{n}$
The cuSolverDN library provides QR factorization and LU with partial pivoting to handle a general matrix A, which may be nonsymmetric. Cholesky factorization is also provided for symmetric/Hermitian matrices. For symmetric indefinite matrices, we provide BunchKaufman (LDL) factorization.
The cuSolverDN library also provides a helpful bidiagonalization routine and singular value decomposition (SVD).
The cuSolverDN library targets computationallyintensive and popular routines in LAPACK, and provides an API compatible with LAPACK. The user can accelerate these timeconsuming routines with cuSolverDN and keep others in LAPACK without a major change to existing code.
cuSolverSP: Sparse LAPACK
The cuSolverSP library was mainly designed to a solve sparse linear system
$Ax=b$ 
and the leastsquares problem
$x=\mathrm{argmin}\mathrm{}A*zb\mathrm{}$ 
where sparse matrix $A\in {R}^{\mathrm{mxn}}$ , righthandside vector $b\in {R}^{m}$ and solution vector $x\in {R}^{n}$ . For a linear system, we require m=n.
The core algorithm is based on sparse QR factorization. The matrix A is accepted in CSR format. If matrix A is symmetric/Hermitian, the user has to provide a full matrix, ie fill missing lower or upper part.
If matrix A is symmetric positive definite and the user only needs to solve $Ax=b$ , Cholesky factorization can work and the user only needs to provide the lower triangular part of A.
On top of the linear and leastsquares solvers, the cuSolverSP library provides a simple eigenvalue solver based on shiftinverse power method, and a function to count the number of eigenvalues contained in a box in the complex plane.
cuSolverRF: Refactorization
The cuSolverRF library was designed to accelerate solution of sets of linear systems by fast refactorization when given new coefficients in the same sparsity pattern
${A}_{i}{x}_{i}={f}_{i}$ 
where a sequence of coefficient matrices ${A}_{i}\in {R}^{\mathrm{nxn}}$ , righthandsides ${f}_{i}\in {R}^{n}$ and solutions ${x}_{i}\in {R}^{n}$ are given for i=1,...,k.
The cuSolverRF library is applicable when the sparsity pattern of the coefficient matrices ${A}_{i}$ as well as the reordering to minimize fillin and the pivoting used during the LU factorization remain the same across these linear systems. In that case, the first linear system (i=1) requires a full LU factorization, while the subsequent linear systems (i=2,...,k) require only the LU refactorization. The later can be performed using the cuSolverRF library.
Notice that because the sparsity pattern of the coefficient matrices, the reordering and pivoting remain the same, the sparsity pattern of the resulting triangular factors ${L}_{i}$ and ${U}_{i}$ also remains the same. Therefore, the real difference between the full LU factorization and LU refactorization is that the required memory is known ahead of time.
1.4. Naming Conventions
The cuSolverDN library functions are available for data types float, double, cuComplex, and cuDoubleComplex. The naming convention is as follows:
cusolverDn<t><operation> 
where <t> can be S, D, C, Z, or X, corresponding to the data types float, double, cuComplex, cuDoubleComplex, and the generic type, respectively. <operation> can be Cholesky factorization (potrf), LU with partial pivoting (getrf), QR factorization (geqrf) and BunchKaufman factorization (sytrf).
The cuSolverSP library functions are available for data types float, double, cuComplex, and cuDoubleComplex. The naming convention is as follows:
cusolverSp[Host]<t>[<matrix data format>]<operation>[<output matrix data format>]<based on> 
where cuSolverSp is the GPU path and cusolverSpHost is the corresponding CPU path. <t> can be S, D, C, Z, or X, corresponding to the data types float, double, cuComplex, cuDoubleComplex, and the generic type, respectively.
The <matrix data format> is csr, compressed sparse row format.
The <operation> can be ls, lsq, eig, eigs, corresponding to linear solver, leastsquare solver, eigenvalue solver and number of eigenvalues in a box, respectively.
The <output matrix data format> can be v or m, corresponding to a vector or a matrix.
<based on> describes which algorithm is used. For example, qr (sparse QR factorization) is used in linear solver and leastsquare solver.
All of the functions have the return type cusolverStatus_t and are explained in more detail in the chapters that follow.
routine  data format  operation  output format  based on 
csrlsvlu  csr  linear solver (ls)  vector (v)  LU (lu) with partial pivoting 
csrlsvqr  csr  linear solver (ls)  vector (v)  QR factorization (qr) 
csrlsvchol  csr  linear solver (ls)  vector (v)  Cholesky factorization (chol) 
csrlsqvqr  csr  leastsquare solver (lsq)  vector (v)  QR factorization (qr) 
csreigvsi  csr  eigenvalue solver (eig)  vector (v)  shiftinverse 
csreigs  csr  number of eigenvalues in a box (eigs)  
csrsymrcm  csr  Symmetric Reverse CuthillMcKee (symrcm) 
The cuSolverRF library routines are available for data type double. Most of the routines follow the naming convention:
cusolverRf_<operation>_[[Host]](...) 
where the trailing optional Host qualifier indicates the data is accessed on the host versus on the device, which is the default. The <operation> can be Setup, Analyze, Refactor, Solve, ResetValues, AccessBundledFactors and ExtractSplitFactors.
Finally, the return type of the cuSolverRF library routines is cusolverStatus_t.
1.5. Asynchronous Execution
The cuSolver library functions prefer to keep asynchronous execution as much as possible. Developers can always use the cudaDeviceSynchronize() function to ensure that the execution of a particular cuSolver library routine has completed.
A developer can also use the cudaMemcpy() routine to copy data from the device to the host and vice versa, using the cudaMemcpyDeviceToHost and cudaMemcpyHostToDevice parameters, respectively. In this case there is no need to add a call to cudaDeviceSynchronize() because the call to cudaMemcpy() with the above parameters is blocking and completes only when the results are ready on the host.
1.6. Library Property
The libraryPropertyType data type is an enumeration of library property types. (ie. CUDA version X.Y.Z would yield MAJOR_VERSION=X, MINOR_VERSION=Y, PATCH_LEVEL=Z)
typedef enum libraryPropertyType_t { MAJOR_VERSION, MINOR_VERSION, PATCH_LEVEL } libraryPropertyType;
The following code can show the version of cusolver library.
int major=1,minor=1,patch=1; cusolverGetProperty(MAJOR_VERSION, &major); cusolverGetProperty(MINOR_VERSION, &minor); cusolverGetProperty(PATCH_LEVEL, &patch); printf("CUSOLVER Version (Major,Minor,PatchLevel): %d.%d.%d\n", major,minor,patch);
1.7. high precision package
The cusolver library uses high precision for iterative refinement when necessary.
2. Using the CUSOLVER API
2.1. General description
This chapter describes how to use the cuSolver library API. It is not a reference for the cuSolver API data types and functions; that is provided in subsequent chapters.
2.1.1. Thread Safety
The library is threadsafe, and its functions can be called from multiple host threads.
2.1.2. Scalar Parameters
In the cuSolver API, the scalar parameters can be passed by reference on the host.
2.1.3. Parallelism with Streams
If the application performs several small independent computations, or if it makes data transfers in parallel with the computation, then CUDA streams can be used to overlap these tasks.
 Create CUDA streams using the function cudaStreamCreate(), and
 Set the stream to be used by each individual cuSolver library routine by calling, for example, cusolverDnSetStream(), just prior to calling the actual cuSolverDN routine.
The computations performed in separate streams would then be overlapped automatically on the GPU, when possible. This approach is especially useful when the computation performed by a single task is relatively small, and is not enough to fill the GPU with work, or when there is a data transfer that can be performed in parallel with the computation.
2.1.4. Link Thirdparty LAPACK Library
 If you use libcusolver_static.a, then you must link with liblapack_static.a explicitly, otherwise the linker will report missing symbols. No conflict of symbols between liblapack_static.a and other thirdparty LAPACK library, you are free to link the latter to your application.
 The liblapack_static.a is built inside libcusolver.so. Hence, if you use libcusolver.so, then you don't need to specify a LAPACK library. The libcusolver.so will not pick up any routines from the thirdparty LAPACK library even you link the application with it.
2.1.5. convention of info
Each LAPACK routine returns an info which indicates the position of invalid parameter. If info = i, then ith parameter is invalid. To be consistent with base1 in LAPACK, cusolver does not report invalid handle into info. Instead, cusolver returns CUSOLVER_STATUS_NOT_INITIALIZED for invalid handle.
2.1.6. usage of _bufferSize
There is no cudaMalloc inside cuSolver library, the user must allocate the device workspace explicitly. The routine xyz_bufferSize is to query the size of workspace of the routine xyz, for example xyz = potrf. To make the API simple, xyz_bufferSize follows almost the same signature of xyz even it only depends on some parameters, for example, device pointer is not used to decide the size of workspace. In most cases, xyz_bufferSize is called in the beginning before actual device data (pointing by a device pointer) is prepared or before the device pointer is allocated. In such case, the user can pass null pointer to xyz_bufferSize without breaking the functionality.
2.2. cuSolver Types Reference
2.2.1. cuSolverDN Types
The float, double, cuComplex, and cuDoubleComplex data types are supported. The first two are standard C data types, while the last two are exported from cuComplex.h. In addition, cuSolverDN uses some familiar types from cuBlas.
2.2.1.1. cusolverDnHandle_t
This is a pointer type to an opaque cuSolverDN context, which the user must initialize by calling cusolverDnCreate() prior to calling any other library function. An uninitialized Handle object will lead to unexpected behavior, including crashes of cuSolverDN. The handle created and returned by cusolverDnCreate() must be passed to every cuSolverDN function.
2.2.1.2. cublasFillMode_t
The type indicates which part (lower or upper) of the dense matrix was filled and consequently should be used by the function. Its values correspond to Fortran characters ‘L’ or ‘l’ (lower) and ‘U’ or ‘u’ (upper) that are often used as parameters to legacy BLAS implementations.
Value  Meaning 
CUBLAS_FILL_MODE_LOWER  the lower part of the matrix is filled 
CUBLAS_FILL_MODE_UPPER  the upper part of the matrix is filled 
2.2.1.3. cublasOperation_t
The cublasOperation_t type indicates which operation needs to be performed with the dense matrix. Its values correspond to Fortran characters ‘N’ or ‘n’ (nontranspose), ‘T’ or ‘t’ (transpose) and ‘C’ or ‘c’ (conjugate transpose) that are often used as parameters to legacy BLAS implementations.
Value  Meaning 
CUBLAS_OP_N  the nontranspose operation is selected 
CUBLAS_OP_T  the transpose operation is selected 
CUBLAS_OP_C  the conjugate transpose operation is selected 
2.2.1.4. cusolverEigType_t
The cusolverEigType_t type indicates which type of eigenvalue solver is. Its values correspond to Fortran integer 1 (A*x = lambda*B*x), 2 (A*B*x = lambda*x), 3 (B*A*x = lambda*x), used as parameters to legacy LAPACK implementations.
Value  Meaning 
CUSOLVER_EIG_TYPE_1  A*x = lambda*B*x 
CUSOLVER_EIG_TYPE_2  A*B*x = lambda*x 
CUSOLVER_EIG_TYPE_3  B*A*x = lambda*x 
2.2.1.5. cusolverEigMode_t
The cusolverEigMode_t type indicates whether or not eigenvectors are computed. Its values correspond to Fortran character 'N' (only eigenvalues are computed), 'V' (both eigenvalues and eigenvectors are computed) used as parameters to legacy LAPACK implementations.
Value  Meaning 
CUSOLVER_EIG_MODE_NOVECTOR  only eigenvalues are computed 
CUSOLVER_EIG_MODE_VECTOR  both eigenvalues and eigenvectors are computed 
2.2.1.6. cusolverIRSRefinement_t
The cusolverIRSRefinement_t type indicates which solver type would be used for the specific cusolver function. Most of our experimentation shows that CUSOLVER_IRS_REFINE_GMRES is the best option.
More details about the refinement process can be found in Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J. Higham. 2018. Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixedprecision iterative refinement solvers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '18). IEEE Press, Piscataway, NJ, USA, Article 47, 11 pages.
Value  Meaning 
CUSOLVER_IRS_REFINE_NOT_SET  Solver is not set. Default value. 
CUSOLVER_IRS_REFINE_NONE  No solver 
CUSOLVER_IRS_REFINE_CLASSICAL  Classical iterative refinement solver. Similar to the one used in LAPACK routines. 
CUSOLVER_IRS_REFINE_GMRES  GMRES (Generalized Minimal Residual) based iterative refinement solver. In recent study, the GMRES method has drawn the scientific community attention for its ability to be used as refinement solver that outperforms the classical iterative refinement method. 
CUSOLVER_IRS_REFINE_CLASSICAL_GMRES  Classical iterative refinement solver that uses the GMRES (Generalized Minimal Residual) internally to solve the correction equation at each iteration. We call the classical refinement iteration the outer iteration while the GMRES is called inner iteration. Note that if the tolerance of the inner GMRES is set very low, let say to machine precision, then the outer classical refinement iteration will performs only one iteration and thus this option will behaves like CUSOLVER_IRS_REFINE_GMRES. 
CUSOLVER_IRS_REFINE_GMRES_GMRES  Similar to CUSOLVER_IRS_REFINE_CLASSICAL_GMRES which is classical refinement process that uses GMRES to solve the inner correction system, here it is a GMRES (Generalized Minimal Residual) based iterative refinement solver that uses another GMRES internally to solve the preconditionned system. 
2.2.1.7. cusolverDnIRSParams_t
This is a pointer type to an opaque cusolverDnIRSParams_t structure, which holds parameters for the iterative refinement linear solvers such as cusolverDnXgesv(). Use corresponding helper functions described below to either Create/Destroy this structure or Set/Get solver parameters.
2.2.1.8. cusolverDnIRSInfos_t
This is a pointer type to an opaque cusolverDnIRSInfos_t structure, which holds information about the performed call to an iterative refinement linear solver (e.g., cusolverDnXgesv()). Use corresponding helper functions described below to either Create/Destroy this structure or retrieve solve information.
2.2.2. cuSolverSP Types
The float, double, cuComplex, and cuDoubleComplex data types are supported. The first two are standard C data types, while the last two are exported from cuComplex.h.
2.2.2.1. cusolverSpHandle_t
This is a pointer type to an opaque cuSolverSP context, which the user must initialize by calling cusolverSpCreate() prior to calling any other library function. An uninitialized Handle object will lead to unexpected behavior, including crashes of cuSolverSP. The handle created and returned by cusolverSpCreate() must be passed to every cuSolverSP function.
2.2.2.2. cusparseMatDescr_t
We have chosen to keep the same structure as exists in cuSparse to describe the shape and properties of a matrix. This enables calls to either cuSparse or cuSolver using the same matrix description.
typedef struct { cusparseMatrixType_t MatrixType; cusparseFillMode_t FillMode; cusparseDiagType_t DiagType; cusparseIndexBase_t IndexBase; } cusparseMatDescr_t;
Please read documenation of CUSPARSE Library to understand each field of cusparseMatDescr_t.
2.2.2.3. cusolverStatus_t
This is a status type returned by the library functions and it can have the following values.
CUSOLVER_STATUS_SUCCESS 
The operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED 
The cuSolver library was not initialized. This is usually caused by the lack of a prior call, an error in the CUDA Runtime API called by the cuSolver routine, or an error in the hardware setup. To correct: call cusolverCreate() prior to the function call; and check that the hardware, an appropriate version of the driver, and the cuSolver library are correctly installed. 
CUSOLVER_STATUS_ALLOC_FAILED 
Resource allocation failed inside the cuSolver library. This is usually caused by a cudaMalloc() failure. To correct: prior to the function call, deallocate previously allocated memory as much as possible. 
CUSOLVER_STATUS_INVALID_VALUE 
An unsupported value or parameter was passed to the function (a negative vector size, for example). To correct: ensure that all the parameters being passed have valid values. 
CUSOLVER_STATUS_ARCH_MISMATCH 
The function requires a feature absent from the device architecture; usually caused by the lack of support for atomic operations or double precision. To correct: compile and run the application on a device with compute capability 2.0 or above. 
CUSOLVER_STATUS_EXECUTION_FAILED 
The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons. To correct: check that the hardware, an appropriate version of the driver, and the cuSolver library are correctly installed. 
CUSOLVER_STATUS_INTERNAL_ERROR 
An internal cuSolver operation failed. This error is usually caused by a cudaMemcpyAsync() failure. To correct: check that the hardware, an appropriate version of the driver, and the cuSolver library are correctly installed. Also, check that the memory passed as a parameter to the routine is not being deallocated prior to the routine’s completion. 
CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED 
The matrix type is not supported by this function. This is usually caused by passing an invalid matrix descriptor to the function. To correct: check that the fields in descrA were set correctly. 
2.2.3. cuSolverRF Types
cuSolverRF only supports double.
cusolverRfHandle_t
The cusolverRfHandle_t is a pointer to an opaque data structure that contains the cuSolverRF library handle. The user must initialize the handle by calling cusolverRfCreate() prior to any other cuSolverRF library calls. The handle is passed to all other cuSolverRF library calls.
cusolverRfMatrixFormat_t
The cusolverRfMatrixFormat_t is an enum that indicates the input/output matrix format assumed by the cusolverRfSetupDevice(), cusolverRfSetupHost(), cusolverRfResetValues(), cusolveRfExtractBundledFactorsHost() and cusolverRfExtractSplitFactorsHost() routines.
Value  Meaning 
CUSOLVER_MATRIX_FORMAT_CSR  matrix format CSR is assumed. (default) 
CUSOLVER_MATRIX_FORMAT_CSC  matrix format CSC is assumed. 
cusolverRfNumericBoostReport_t
The cusolverRfNumericBoostReport_t is an enum that indicates whether numeric boosting (of the pivot) was used during the cusolverRfRefactor() and cusolverRfSolve() routines. The numeric boosting is disabled by default.
Value  Meaning 
CUSOLVER_NUMERIC_BOOST_NOT_USED  numeric boosting not used. (default) 
CUSOLVER_NUMERIC_BOOST_USED  numeric boosting used. 
cusolverRfResetValuesFastMode_t
The cusolverRfResetValuesFastMode_t is an enum that indicates the mode used for the cusolverRfResetValues() routine. The fast mode requires extra memory and is recommended only if very fast calls to cusolverRfResetValues() are needed.
Value  Meaning 
CUSOLVER_RESET_VALUES_FAST_MODE_OFF  fast mode disabled. (default) 
CUSOLVER_RESET_VALUES_FAST_MODE_ON  fast mode enabled. 
cusolverRfFactorization_t
The cusolverRfFactorization_t is an enum that indicates which (internal) algorithm is used for refactorization in the cusolverRfRefactor() routine.
Value  Meaning 
CUSOLVER_FACTORIZATION_ALG0  algorithm 0. (default) 
CUSOLVER_FACTORIZATION_ALG1  algorithm 1. 
CUSOLVER_FACTORIZATION_ALG2  algorithm 2. Dominobased scheme. 
cusolverRfTriangularSolve_t
The cusolverRfTriangularSolve_t is an enum that indicates which (internal) algorithm is used for triangular solve in the cusolverRfSolve() routine.
Value  Meaning 
CUSOLVER_TRIANGULAR_SOLVE_ALG0  algorithm 0. 
CUSOLVER_TRIANGULAR_SOLVE_ALG1  algorithm 1. (default) 
CUSOLVER_TRIANGULAR_SOLVE_ALG2  algorithm 2. Dominobased scheme. 
CUSOLVER_TRIANGULAR_SOLVE_ALG3  algorithm 3. Dominobased scheme. 
cusolverRfUnitDiagonal_t
The cusolverRfUnitDiagonal_t is an enum that indicates whether and where the unit diagonal is stored in the input/output triangular factors in the cusolverRfSetupDevice(), cusolverRfSetupHost() and cusolverRfExtractSplitFactorsHost() routines.
Value  Meaning 
CUSOLVER_UNIT_DIAGONAL_STORED_L  unit diagonal is stored in lower triangular factor. (default) 
CUSOLVER_UNIT_DIAGONAL_STORED_U  unit diagonal is stored in upper triangular factor. 
CUSOLVER_UNIT_DIAGONAL_ASSUMED_L  unit diagonal is assumed in lower triangular factor. 
CUSOLVER_UNIT_DIAGONAL_ASSUMED_U  unit diagonal is assumed in upper triangular factor. 
cusolverStatus_t
The cusolverStatus_t is an enum that indicates success or failure of the cuSolverRF library call. It is returned by all the cuSolver library routines, and it uses the same enumerated values as the sparse and dense Lapack routines.
2.3. cuSolver Formats Reference
2.3.1. Index Base Format
The CSR or CSC format requires either zerobased or onebased index for a sparse matrix A. The GLU library supports only zerobased indexing. Otherwise, both onebased and zerobased indexing are supported in cuSolver.
2.3.2. Vector (Dense) Format
The vectors are assumed to be stored linearly in memory. For example, the vector
$x=\left(\begin{array}{c}{x}_{1}\\ {x}_{2}\\ \vdots \\ {x}_{\mathrm{n}}\end{array}\right)$ 
is represented as
$\left(\begin{array}{cccc}{x}_{1}& {x}_{2}& \dots & {x}_{\mathrm{n}}\end{array}\right)$ 
2.3.3. Matrix (Dense) Format
The dense matrices are assumed to be stored in columnmajor order in memory. The submatrix can be accessed using the leading dimension of the original matrix. For examle, the m*n (sub)matrix
$\left(\begin{array}{ccc}{a}_{1,1}& \dots & {a}_{1,n}\\ {a}_{2,1}& \dots & {a}_{2,n}\\ \vdots \\ {a}_{m,1}& \dots & {a}_{m,n}\end{array}\right)$ 
is represented as
$\left(\begin{array}{ccc}{a}_{1,1}& \dots & {a}_{1,n}\\ {a}_{2,1}& \dots & {a}_{2,n}\\ \vdots & \ddots & \vdots \\ {a}_{m,1}& \dots & {a}_{m,n}\\ \vdots & \ddots & \vdots \\ {a}_{\mathrm{lda},1}& \dots & {a}_{\mathrm{lda},n}\end{array}\right)$ 
with its elements arranged linearly in memory as
$\left(\begin{array}{ccccccccccccc}{a}_{1,1}& {a}_{2,1}& \dots & {a}_{m,1}& \dots & {a}_{\mathrm{lda},1}& \dots & {a}_{1,n}& {a}_{2,n}& \dots & {a}_{m,n}& \dots & {a}_{\mathrm{lda},n}\end{array}\right)$ 
where lda ≥ m is the leading dimension of A.
2.3.4. Matrix (CSR) Format
In CSR format the matrix is represented by the following parameters
parameter  type  size  Meaning 
n  (int)  the number of rows (and columns) in the matrix.  
nnz  (int)  the number of nonzero elements in the matrix.  
csrRowPtr  (int *)  n+1  the array of offsets corresponding to the start of each row in the arrays csrColInd and csrVal. This array has also an extra entry at the end that stores the number of nonzero elements in the matrix. 
csrColInd  (int *)  nnz  the array of column indices corresponding to the nonzero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. 
csrVal  (SDCZ)*  nnz  the array of values corresponding to the nonzero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. 
Note that in our CSR format sparse matrices are assumed to be stored in rowmajor order, in other words, the index arrays are first sorted by row indices and then within each row by column indices. Also it is assumed that each pair of row and column indices appears only once.
For example, the 4x4 matrix
$A=\left(\begin{array}{cccc}\mathrm{1.0}& \mathrm{3.0}& \mathrm{0.0}& \mathrm{0.0}\\ \mathrm{0.0}& \mathrm{4.0}& \mathrm{6.0}& \mathrm{0.0}\\ \mathrm{2.0}& \mathrm{5.0}& \mathrm{7.0}& \mathrm{8.0}\\ \mathrm{0.0}& \mathrm{0.0}& \mathrm{0.0}& \mathrm{9.0}\end{array}\right)$ 
is represented as
$\mathrm{csrRowPtr}=\left(\begin{array}{ccccc}0& 2& 4& 8& 9\end{array}\right)$ 
$\mathrm{csrColInd}=\left(\begin{array}{ccccccccc}0& 1& 1& 2& 0& 1& 2& 3& 3\end{array}\right)$ 
$\mathrm{csrVal}=\left(\begin{array}{ccccccccc}1.0& 3.0& 4.0& 6.0& 2.0& 5.0& 7.0& 8.0& 9.0\end{array}\right)$ 
2.3.5. Matrix (CSC) Format
In CSC format the matrix is represented by the following parameters
parameter  type  size  Meaning 
n  (int)  the number of rows (and columns) in the matrix.  
nnz  (int)  the number of nonzero elements in the matrix.  
cscColPtr  (int *)  n+1  the array of offsets corresponding to the start of each column in the arrays cscRowInd and cscVal. This array has also an extra entry at the end that stores the number of nonzero elements in the matrix. 
cscRowInd  (int *)  nnz  the array of row indices corresponding to the nonzero elements in the matrix. It is assumed that this array is sorted by column and by row within each column. 
cscVal  (SDCZ)*  nnz  the array of values corresponding to the nonzero elements in the matrix. It is assumed that this array is sorted by column and by row within each column. 
Note that in our CSC format sparse matrices are assumed to be stored in columnmajor order, in other words, the index arrays are first sorted by column indices and then within each column by row indices. Also it is assumed that each pair of row and column indices appears only once.
For example, the 4x4 matrix
$A=\left(\begin{array}{cccc}\mathrm{1.0}& \mathrm{3.0}& \mathrm{0.0}& \mathrm{0.0}\\ \mathrm{0.0}& \mathrm{4.0}& \mathrm{6.0}& \mathrm{0.0}\\ \mathrm{2.0}& \mathrm{5.0}& \mathrm{7.0}& \mathrm{8.0}\\ \mathrm{0.0}& \mathrm{0.0}& \mathrm{0.0}& \mathrm{9.0}\end{array}\right)$ 
is represented as
$\mathrm{cscColPtr}=\left(\begin{array}{ccccc}0& 2& 5& 7& 9\end{array}\right)$ 
$\mathrm{cscRowInd}=\left(\begin{array}{ccccccccc}0& 2& 0& 1& 2& 1& 2& 2& 3\end{array}\right)$ 
$\mathrm{cscVal}=\left(\begin{array}{ccccccccc}1.0& 2.0& 3.0& 4.0& 5.0& 6.0& 7.0& 8.0& 9.0\end{array}\right)$ 
cuSolverDN: dense LAPACK Function Reference
This chapter describes the API of cuSolverDN, which provides a subset of dense LAPACK functions.
cuSolverDN Helper Function Reference
The cuSolverDN helper functions are described in this section.
2.4.1.1. cusolverDnCreate()
cusolverStatus_t cusolverDnCreate(cusolverDnHandle_t *handle);
This function initializes the cuSolverDN library and creates a handle on the cuSolverDN context. It must be called before any other cuSolverDN API function is invoked. It allocates hardware resources necessary for accessing the GPU.
parameter  Memory  In/out  Meaning 
handle  host  output  the pointer to the handle to the cuSolverDN context. 
CUSOLVER_STATUS_SUCCESS  the initialization succeeded. 
CUSOLVER_STATUS_NOT_INITIALIZED  the CUDA Runtime initialization failed. 
CUSOLVER_STATUS_ALLOC_FAILED  the resources could not be allocated. 
CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 2.0 and above. 
2.4.1.2. cusolverDnDestroy()
cusolverStatus_t cusolverDnDestroy(cusolverDnHandle_t handle);
This function releases CPUside resources used by the cuSolverDN library.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
CUSOLVER_STATUS_SUCCESS  the shutdown succeeded. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
cusolverDnSetStream()
cusolverStatus_t cusolverDnSetStream(cusolverDnHandle_t handle, cudaStream_t streamId)
This function sets the stream to be used by the cuSolverDN library to execute its routines.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
streamId  host  input  the stream to be used by the library. 
CUSOLVER_STATUS_SUCCESS  the stream was set successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
cusolverDnGetStream()
cusolverStatus_t cusolverDnGetStream(cusolverDnHandle_t handle, cudaStream_t *streamId)
This function sets the stream to be used by the cuSolverDN library to execute its routines.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
streamId  host  output  the stream to be used by the library. 
CUSOLVER_STATUS_SUCCESS  the stream was set successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
2.4.1.5. cusolverDnCreateSyevjInfo()
cusolverStatus_t cusolverDnCreateSyevjInfo( syevjInfo_t *info);
This function creates and initializes the structure of syevj, syevjBatched and sygvj to default values.
parameter  Memory  In/out  Meaning 
info  host  output  the pointer to the structure of syevj. 
CUSOLVER_STATUS_SUCCESS  the structure was initialized successfully. 
CUSOLVER_STATUS_ALLOC_FAILED  the resources could not be allocated. 
2.4.1.6. cusolverDnDestroySyevjInfo()
cusolverStatus_t cusolverDnDestroySyevjInfo( syevjInfo_t info);
This function destroys and releases any memory required by the structure.
parameter  Memory  In/out  Meaning 
info  host  input  the structure of syevj. 
CUSOLVER_STATUS_SUCCESS  the resources are released successfully. 
2.4.1.7. cusolverDnXsyevjSetTolerance()
cusolverStatus_t cusolverDnXsyevjSetTolerance( syevjInfo_t info, double tolerance)
This function configures tolerance of syevj.
parameter  Memory  In/out  Meaning 
info  host  in/out  the pointer to the structure of syevj. 
tolerance  host  input  accuracy of numerical eigenvalues. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
2.4.1.8. cusolverDnXsyevjSetMaxSweeps()
cusolverStatus_t cusolverDnXsyevjSetMaxSweeps( syevjInfo_t info, int max_sweeps)
This function configures maximum number of sweeps in syevj. The default value is 100.
parameter  Memory  In/out  Meaning 
info  host  in/out  the pointer to the structure of syevj. 
max_sweeps  host  input  maximum number of sweeps. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
2.4.1.9. cusolverDnXsyevjSetSortEig()
cusolverStatus_t cusolverDnXsyevjSetSortEig( syevjInfo_t info, int sort_eig)
if sort_eig is zero, the eigenvalues are not sorted. This function only works for syevjBatched. syevj and sygvj always sort eigenvalues in ascending order. By default, eigenvalues are always sorted in ascending order.
parameter  Memory  In/out  Meaning 
info  host  in/out  the pointer to the structure of syevj. 
sort_eig  host  input  if sort_eig is zero, the eigenvalues are not sorted. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
2.4.1.10. cusolverDnXsyevjGetResidual()
cusolverStatus_t cusolverDnXsyevjGetResidual( cusolverDnHandle_t handle, syevjInfo_t info, double *residual)
This function reports residual of syevj or sygvj. It does not support syevjBatched. If the user calls this function after syevjBatched, the error CUSOLVER_STATUS_NOT_SUPPORTED is returned.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
info  host  input  the pointer to the structure of syevj. 
residual  host  output  residual of syevj. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_SUPPORTED  does not support batched version 
2.4.1.11. cusolverDnXsyevjGetSweeps()
cusolverStatus_t cusolverDnXsyevjGetSweeps( cusolverDnHandle_t handle, syevjInfo_t info, int *executed_sweeps)
This function reports number of executed sweeps of syevj or sygvj. It does not support syevjBatched. If the user calls this function after syevjBatched, the error CUSOLVER_STATUS_NOT_SUPPORTED is returned.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
info  host  input  the pointer to the structure of syevj. 
executed_sweeps  host  output  number of executed sweeps. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_SUPPORTED  does not support batched version 
2.4.1.12. cusolverDnCreateGesvdjInfo()
cusolverStatus_t cusolverDnCreateGesvdjInfo( gesvdjInfo_t *info);
This function creates and initializes the structure of gesvdj and gesvdjBatched to default values.
parameter  Memory  In/out  Meaning 
info  host  output  the pointer to the structure of gesvdj. 
CUSOLVER_STATUS_SUCCESS  the structure was initialized successfully. 
CUSOLVER_STATUS_ALLOC_FAILED  the resources could not be allocated. 
2.4.1.13. cusolverDnDestroyGesvdjInfo()
cusolverStatus_t cusolverDnDestroyGesvdjInfo( gesvdjInfo_t info);
This function destroys and releases any memory required by the structure.
parameter  Memory  In/out  Meaning 
info  host  input  the structure of gesvdj. 
CUSOLVER_STATUS_SUCCESS  the resources are released successfully. 
2.4.1.14. cusolverDnXgesvdjSetTolerance()
cusolverStatus_t cusolverDnXgesvdjSetTolerance( gesvdjInfo_t info, double tolerance)
This function configures tolerance of gesvdj.
parameter  Memory  In/out  Meaning 
info  host  in/out  the pointer to the structure of gesvdj. 
tolerance  host  input  accuracy of numerical singular values. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
2.4.1.15. cusolverDnXgesvdjSetMaxSweeps()
cusolverStatus_t cusolverDnXgesvdjSetMaxSweeps( gesvdjInfo_t info, int max_sweeps)
This function configures maximum number of sweeps in gesvdj. The default value is 100.
parameter  Memory  In/out  Meaning 
info  host  in/out  the pointer to the structure of gesvdj. 
max_sweeps  host  input  maximum number of sweeps. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
2.4.1.16. cusolverDnXgesvdjSetSortEig()
cusolverStatus_t cusolverDnXgesvdjSetSortEig( gesvdjInfo_t info, int sort_svd)
if sort_svd is zero, the singular values are not sorted. This function only works for gesvdjBatched. gesvdj always sorts singular values in descending order. By default, singular values are always sorted in descending order.
parameter  Memory  In/out  Meaning 
info  host  in/out  the pointer to the structure of gesvdj. 
sort_svd  host  input  if sort_svd is zero, the singular values are not sorted. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
2.4.1.17. cusolverDnXgesvdjGetResidual()
cusolverStatus_t cusolverDnXgesvdjGetResidual( cusolverDnHandle_t handle, gesvdjInfo_t info, double *residual)
This function reports residual of gesvdj. It does not support gesvdjBatched. If the user calls this function after gesvdjBatched, the error CUSOLVER_STATUS_NOT_SUPPORTED is returned.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
info  host  input  the pointer to the structure of gesvdj. 
residual  host  output  residual of gesvdj. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_SUPPORTED  does not support batched version 
2.4.1.18. cusolverDnXgesvdjGetSweeps()
cusolverStatus_t cusolverDnXgesvdjGetSweeps( cusolverDnHandle_t handle, gesvdjInfo_t info, int *executed_sweeps)
This function reports number of executed sweeps of gesvdj. It does not support gesvdjBatched. If the user calls this function after gesvdjBatched, the error CUSOLVER_STATUS_NOT_SUPPORTED is returned.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
info  host  input  the pointer to the structure of gesvdj. 
executed_sweeps  host  output  number of executed sweeps. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_SUPPORTED  does not support batched version 
2.4.1.19. cusolverDnIRSParamsCreate()
cusolverStatus_t cusolverDnIRSParamsCreate(cusolverDnIRSParams_t *params);
This function creates and initializes the structure of parameters for an IRS solver such as the cusolverDnIRSXgesv() function to default values. For this release, this function is valid for only one call to an IRS solver, thus each new call to the IRS solver will requires its own Params structure. This restriction is going to be removed in future release and then if user want to reuse the same configuration to many call to an IRS solver it will allow him.
parameter  Memory  In/out  Meaning 
params  host  output  Pointer to the cusolverDnIRSParams_t Params structure 
CUSOLVER_STATUS_SUCCESS  The structure was created and initialized successfully. 
CUSOLVER_STATUS_ALLOC_FAILED  The resources could not be allocated. 
2.4.1.20. cusolverDnIRSParamsDestroy()
cusolverStatus_t cusolverDnIRSParamsDestroy(cusolverDnIRSParams_t params);
This function destroys and releases any memory required by the Params structure. Since the Infos structure (see cusolverDnIRSInfosCreate() for more details) depends on the Params structure, this function cannot be called to destroy the Params structure if the Infos structure was not destroyed. For this release, this function is valid for only one call to an IRS solver, thus each call to an IRS solver should have its own Params and Infos structure. This restriction is going to be removed in future release and then if user want to reuse the same configuration to many call to an IRS solver it will allow him.
parameter  Memory  In/out  Meaning 
params  host  input  The cusolverDnIRSParams_t Params structure 
CUSOLVER_STATUS_SUCCESS  The resources are released successfully. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
CUSOLVER_STATUS_IRS_INFOS_NOT_DESTROYED  Not all the Infos structure associated with this Params structure have been destroyed yet. 
2.4.1.21. cusolverDnIRSParamsSetRefinementSolver()
cusolverStatus_t cusolverDnIRSParamsSetRefinementSolver( cusolverDnIRSParams_t params, cusolverIRSRefinement_t solver);
This function sets the refinement solver to be used in the Iterative Refinement Solver functions such as the cusolverDnIRSXgesv() function. Details about values that can be set to and their meaning can be found in the cusolverIRSRefinement_t type section.
parameter  Memory  In/out  Meaning 
params  host  in/out  The cusolverDnIRSParams_t Params structure 
solver  host  input  Type of the refinement solver to be used by the IRS solver such as cusolverDnIRSXgesv() 
CUSOLVER_STATUS_SUCCESS  The operation completed successfully. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
2.4.1.22. cusolverDnIRSParamsSetSolverMainPrecision()
cusolverStatus_t cusolverDnIRSParamsSetSolverMainPrecision( cusolverDnIRSParams_t params, cudaDataType solver_main_precision);
This function sets the main precision (e.g., the INOUT data type) of the IRS solver. The value set here should be the same cuda datatype as the third argument on the call to the IRS solver. Note that, the user has to set the main precision before a first call to the IRS solver because it is NOT set by default with the Params creation. He can set it by either calling this function or cusolverDnIRSParamsSetSolverPrecisions(). Possible values are described in the table of the corresponding IRS solver for example, see the description of the third argument of the cusolverDnIRSXgesv() IRS function.
parameter  Memory  In/out  Meaning 
params  host  in/out  The cusolverDnIRSParams_t Params structure 
solver_main_precision  host  input  Allowed cuda datatype (for example CUDA_R_FP64). See the corresponding IRS solver for the table of supported precisions. 
CUSOLVER_STATUS_SUCCESS  The operation completed successfully. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
2.4.1.23. cusolverDnIRSParamsSetSolverLowestPrecision()
cusolverStatus_t cusolverDnIRSParamsSetSolverLowestPrecision( cusolverDnIRSParams_t params, cudaDataType lowest_precision_type);
This function sets the lowest precision that will be used by Iterative Refinement Solver. Note that, the user has to set the lowest precision before a first call to the IRS solver because it is NOT set by default with the Params creation. He can set it by either calling this function or cusolverDnIRSParamsSetSolverPrecisions(). Usually this precision define the speedup that can be achieved. The ratio of the performance of the lowest precision over the inout_data_type precision define somehow the upper bound of the speedup that could be obtained. More precisely, it depends of many factors, but for large matrices sizes, it is the ratio of the matrixmatrix rankk product (e.g., rankk GEMM where k is around 256) that define the possible speedup. For instance, if the inout precision is real double precision FP64 and the lowest precision is FP32, then we can expect a speedup of at most 2X for large problem sizes. If the lowest precision was FP16, then we can expect 3X4X. A reasonable strategy should take the number of righthand sides and the size of the matrix as well as the convergence rate into account.
parameter  Memory  In/out  Meaning 
params  host  in/out  The cusolverDnIRSParams_t Params structure 
lowest_precision_type  host  input  Allowed cuda datatype (for example CUDA_R_FP16). See the corresponding IRS solver for the table of supported precisions. 
CUSOLVER_STATUS_SUCCESS  The operation completed successfully. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
2.4.1.24. cusolverDnIRSParamsSetSolverPrecisions()
cusolverStatus_t cusolverDnIRSParamsSetSolverPrecisions( cusolverDnIRSParams_t params, cudaDataType solver_main_precision, cudaDataType solver_lowest_precision );
This function set both, the main and the lowest precision of the IRS solver. It is a wrappers to both cusolverDnIRSParamsSetSolverMainPrecision() and cusolverDnIRSParamsSetSolverLowestPrecision(). Note that, the user has to set both the main and the lowest precision before a first call to the IRS solver because they are NOT set by default with the Params creation. He can set it by either calling this function or both functions cusolverDnIRSParamsSetSolverLowestPrecisions() and cusolverDnIRSParamsSetSolverMainPrecisions().
parameter  Memory  In/out  Meaning 
params  host  in/out  The cusolverDnIRSParams_t Params structure 
solver_main_precision  host  input  Allowed cuda datatype (for example CUDA_R_FP64). See the corresponding IRS solver for the table of supported precisions. 
solver_lowest_precision  host  input  Allowed cuda datatype (for example CUDA_R_FP16). See the corresponding IRS solver for the table of supported precisions. 
CUSOLVER_STATUS_SUCCESS  The operation completed successfully. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
2.4.1.25. cusolverDnIRSParamsSetTol()
cusolverStatus_t cusolverDnIRSParamsSetTol( cusolverDnIRSParams_t params, cudaDataType data_type, double val );
This function sets the tolerance for the refinement solver. By default it is set such that, all the RHS satisfy:
 RNRM is the infinitynorm of the residual
 XNRM is the infinitynorm of the solution
 ANRM is the infinityoperatornorm of the matrix A
 EPS is the machine epsilon that matches LAPACK <t1>LAMCH('Epsilon')
The user can use this function to change the tolerance to a lower or higher value. Our goal is to give the user as much control as we can such a way he can investigate and control every detail of the IRS solver.
parameter  Memory  In/out  Meaning 
params  host  in/out  The cusolverDnIRSParams_t Params structure 
data_type  host  input  cuda datatype of the inout_data_type 
val  host  input  double precision value to which the refinement tolerance will be set internally. 
CUSOLVER_STATUS_SUCCESS  The operation completed successfully. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
2.4.1.26. cusolverDnIRSParamsSetTolInner()
cusolverStatus_t cusolverDnIRSParamsSetTolInner( cusolverDnIRSParams_t params, cudaDataType data_type, double val );
This function sets the tolerance for the inner refinement solver when the refinement solver consists of two level solvers (e.g., CUSOLVER_IRS_REFINE_CLASSICAL_GMRES or CUSOLVER_IRS_REFINE_GMRES_GMRES). It is not referenced in case of one level refinement solver such as CUSOLVER_IRS_REFINE_CLASSICAL or CUSOLVER_IRS_REFINE_GMRES. This function set the tolerance for the inner solver (e.g. the inner GMRES). For example, if the RefinementSolver was set to CUSOLVER_IRS_REFINE_CLASSICAL_GMRES setting this tolerance mean that the inner GMRES solver will converge to that tolerance at each outer iteration of the classical refinement solver. It is set to 1e4 by default. Our goal is to give the user as much control as we can such a way he can investigate and control every detail of the IRS solver.
parameter  Memory  In/out  Meaning 
params  host  in/out  The cusolverDnIRSParams_t Params structure 
data_type  host  input  The cuda datatype (for example CUDA_R_FP64) of the inout data 
val  host  input  Double precision real value to which the refinement tolerance will be set internally. 
CUSOLVER_STATUS_SUCCESS  The operation completed successfully. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
2.4.1.27. cusolverDnIRSParamsSetMaxIters()
cusolverStatus_t cusolverDnIRSParamsSetMaxIters( cusolverDnIRSParams_t params, int max_iters);
This function sets the total number of allowed refinement iterations after which solver will stop. Total means the maximum number of iterations allowed (e.g., outer and inner iterations when two level refinement solver is set). Default value is set to 50. Our goal is to give the user as much control as we can such a way he can investigate and control every detail of the IRS solver.
parameter  Memory  In/out  Meaning 
params  host  in/out  The cusolverDnIRSParams_t Params structure 
max_iters  host  input  Maximum total number of iterations allowed for the refinement solver 
CUSOLVER_STATUS_SUCCESS  The operation completed successfully. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
2.4.1.28. cusolverDnIRSParamsSetMaxItersInner()
cusolverStatus_t cusolverDnIRSParamsSetMaxItersInner( cusolverDnIRSParams_t params, cusolver_int_t maxiters_inner );
This function sets the maximal number of iterations allowed for the inner refinement solver. It is not referenced in case of one level refinement solver such as CUSOLVER_IRS_REFINE_CLASSICAL or CUSOLVER_IRS_REFINE_GMRES. The inner refinement solver will stop after reaching either the inner tolerance or the MaxItersInner value. By default, it is set to MaxIters (e.g., 50). Note that this value could not be larger than MaxIters since MaxIters is the total number of allowed iterations. Note that, if user call to set MaxIters after calling this function, the MaxIters has priority and will overwrite MaxItersInner to the minimum value of (MaxIters, MaxItersInner). Our goal is to give the user as much control as we can such a way he can investigate and control every detail of the IRS solver.
parameter  Memory  In/out  Meaning 
params  host  in/out  The cusolverDnIRSParams_t Params structure 
maxiters_inner  host  input  Maximum number of allowed inner iterations for the refinement solver. Meaningful when the refinement solver is a two levels solver such as CUSOLVER_IRS_REFINE_CLASSICAL_GMRES or CUSOLVER_IRS_REFINE_GMRES_GMRES. Value should be less or equal to MaxIters. 
CUSOLVER_STATUS_SUCCESS  The operation completed successfully. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
CUSOLVER_STATUS_IRS_PARAMS_INVALID  if the value was larger than MaxIters. 
2.4.1.29. cusolverDnIRSParamsGetMaxIters()
cusolverStatus_t cusolverDnIRSParamsGetMaxIters( cusolverDnIRSParams_t params, cusolver_int_t *maxiters );
This function returns the maximal number of iterations MaxIters that is currently set within the current Params structure. Thus, it returns the value of the MaxIters parameter.
parameter  Memory  In/out  Meaning 
params  host  in  The cusolverDnIRSParams_t Params structure 
maxiters  host  output  The maximal number of iterations that is currently set 
CUSOLVER_STATUS_SUCCESS  The operation completed successfully. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
2.4.1.30. cusolverDnIRSInfosCreate()
cusolverStatus_t cusolverDnIRSInfosCreate( cusolverDnIRSParams_t params, cusolverDnIRSInfos_t* infos )
This function creates and initializes the Infos structure that will hold the refinement informations of an IRS solver. Such information includes the total number of iterations needed to converge (Niters), the outer number of iterations, and a pointer to the matrix of the convergence history residual norms. This function need to be called after the Params structure (see cusolverDnIRSParamsCreate()) has been created and before the call to an IRS solver such as cusolverDnIRSXgesv(). This function is valid for only one call to an IRS solver, since it hold info about that solve and thus each solve will requires its own Infos structure.
parameter  Memory  In/out  Meaning 
params  host  in  The cusolverDnIRSParams_t Params structure 
info  host  output  Pointer to the cusolverDnIRSInfos_t Infos structure 
CUSOLVER_STATUS_SUCCESS  The structure was initialized successfully. 
CUSOLVER_STATUS_ALLOC_FAILED  The resources could not be allocated. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
2.4.1.31. cusolverDnIRSInfosDestroy()
cusolverStatus_t cusolverDnIRSInfosDestroy( cusolverDnIRSParams_t params, cusolverDnIRSInfos_t infos );
This function destroys and releases any memory required by the Infos structure. This function destroy all the informations (e.g., Niters performed, OuterNiters performed, residual history etc) about a solver call, thus a user is supposed to call it once he is done from the informations he need.
parameter  Memory  In/out  Meaning 
params  host  in/out  The cusolverDnIRSParams_t Params structure 
info  host  in/out  The cusolverDnIRSInfos_t Infos structure 
CUSOLVER_STATUS_SUCCESS  the resources are released successfully. 
CUSOLVER_STATUS_IRS_INFOS_NOT_INITIALIZED  The Infos structure was not created. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
2.4.1.32. cusolverDnIRSInfosGetMaxIters()
cusolverStatus_t cusolverDnIRSInfosGetMaxIters( cusolverDnIRSParams_t params, cusolverDnIRSInfos_t infos, cusolver_int_t *maxiters );
This function returns the maximal number of iterations that is currently set within the current Params structure. This function is a duplicate of cusolverDnIRSParamsGetMaxIters().
parameter  Memory  In/out  Meaning 
params  host  in  The cusolverDnIRSParams_t Params structure 
infos  host  in  The cusolverDnIRSInfos_t Infos structure 
maxiters  host  output  The maximal number of iterations that is currently set 
CUSOLVER_STATUS_SUCCESS  The operation completed successfully. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
CUSOLVER_STATUS_IRS_INFOS_NOT_INITIALIZED  The Infos structure was not created. 
2.4.1.33. cusolverDnIRSInfosGetNiters()
cusolverStatus_t cusolverDnIRSInfosGetNiters( cusolverDnIRSParams_t params, cusolverDnIRSInfos_t infos, cusolver_int_t *niters );
This function returns the total number of iterations performed by the IRS solver. If it was negative it means that the IRS solver had numerical issues a fall back to a full precision solution most like happened. Please refer to the description of negative niters values in the corresponding IRS linear solver functions such as cusolverDnXgesv().
parameter  Memory  In/out  Meaning 
params  host  in  The cusolverDnIRSParams_t Params structure 
infos  host  in  The cusolverDnIRSInfos_t Infos structure 
niters  host  output  The total number of iterations performed by the IRS solver 
CUSOLVER_STATUS_SUCCESS  The operation completed successfully. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
CUSOLVER_STATUS_IRS_INFOS_NOT_INITIALIZED  The Infos structure was not created. 
2.4.1.34. cusolverDnIRSInfosGetOuterNiters()
cusolverStatus_t cusolverDnIRSInfosGetOuterNiters( cusolverDnIRSParams_t params, cusolverDnIRSInfos_t infos, cusolver_int_t *outer_niters );
This function returns the number of iterations performed by outer refinement loop of the IRS solver. When RefinementSolver consists of one level solver such as CUSOLVER_IRS_REFINE_CLASSICAL or CUSOLVER_IRS_REFINE_GMRES, it is the same as Niters. When RefinementSolver consists of two levels solver such as CUSOLVER_IRS_REFINE_CLASSICAL_GMRES or CUSOLVER_IRS_REFINE_GMRES_GMRES, it is the number of the outer iteration. See description of cusolverIRSRefinementSolver_t type section for more details.
parameter  Memory  In/out  Meaning 
params  host  in  The cusolverDnIRSParams_t Params structure 
infos  host  in  The cusolverDnIRSInfos_t Infos structure 
outer_niters  host  output  The number of iterations of the outer refinement loop of the IRS solver 
CUSOLVER_STATUS_SUCCESS  The operation completed successfully. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
CUSOLVER_STATUS_IRS_INFOS_NOT_INITIALIZED  The Infos structure was not created. 
2.4.1.35. cusolverDnIRSInfosRequestResidual()
cusolverStatus_t cusolverDnIRSInfosRequestResidual( cusolverDnIRSParams_t params, cusolverDnIRSInfos_t infos );
This function, once called, tell the IRS solver to store the convergence history of the refinement phase in a matrix, that could be accessed via a pointer returned by the cusolverDnIRSInfosGetResidualHistory() function.
parameter  Memory  In/out  Meaning 
params  host  in  The cusolverDnIRSParams_t Params structure 
infos  host  in  The cusolverDnIRSInfos_t Infos structure 
CUSOLVER_STATUS_SUCCESS  The operation completed successfully. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
CUSOLVER_STATUS_IRS_INFOS_NOT_INITIALIZED  The Infos structure was not created. 
2.4.1.36. cusolverDnIRSInfosGetResidualHistory()
cusolverStatus_t cusolverDnIRSInfosGetResidualHistory( cusolverDnIRSParams_t params, cusolverDnIRSInfos_t infos, void **residual_history );
If the user called cusolverDnIRSInfosRequestResidual() before the call to the IRS solve function, this function return a pointer to the matrix of the convergence history residual norms. Precision of residual norms depends on the IRS input data type. If inout datatype has double precision (CUDA_R_FP64 or CUDA_C_FP64 inout datatype), this residual will be real double precision. Otherwise (CUDA_R_FP32 or CUDA_C_FP32 inout datatype)  residual will be with real single precision.
The residual history matrix consists of two columns (even for NRHS case) of MaxIters+1 rows, thus a matrix of size (MaxIters+1,2). Only the first OuterNiters+1 rows contains the needed informations the other (e.g., OuterNiters+2:Maxiters+1) are garbage. On the first column, each row "i" specify the total number of iterations happened at this outer iteration "i" and on the second columns the residual norm corresponding to this outer iteration "i". Thus, the first row (e.g., outer iteration "0") consists of the initial residual (e.g., the residual before the refinement loop start) then the consecutive OuterNiters rows are the residual obtained at each outer iteration of the refinement loop. Note, it only consists of the history of the outer loop.
Thus, if the Refinementsolver was CUSOLVER_IRS_REFINE_CLASSICAL or CUSOLVER_IRS_REFINE_GMRES, then OuterNiters=Niters (Niters is the total number of iterations performed) and there is Niters+1 rows of norms that correspond to the Niters outer iterations.
If the Refinementsolver was CUSOLVER_IRS_REFINE_CLASSICAL_GMRES or CUSOLVER_IRS_REFINE_GMRES_GMRES, then OuterNiters <= Niters corresponds to the outer iterations performed by the outer refinement loop. Thus there is OuterNiters+1 residual norms where row "i" correspond to the outer iteration "i" and the first column specify the total number of iterations (outer and inner) that were performed while the second columns correspond to the residual norm at this stage.
For example, let say the user specify CUSOLVER_IRS_REFINE_CLASSICAL_GMRES as a Refinementsolver and let say it needed 3 outer iterations to converge and 4,3,3 inner iterations at each outer respectively. This consists of 10 total iterations. Thus on row 0 is for the first residual before the refinement start, so it has 0 iteration. On row 1 which correspond to the outer iteration 1, it will be shown 4 (4 is the total number of iterations that were performed till now) on row 2, it will be 7 and on row 3 it will be 10.
As summary, let define ldh=Maxiters+1, the leading dimension of the residual matrix. then residual_history[i] shows the total number of iterations performed at the outer iteration "i" and residual_history[i+ldh] correspond to its norm of the residual at this stage.
parameter  Memory  In/out  Meaning 
params  host  in  The cusolverDnIRSParams_t Params structure 
infos  host  in  The cusolverDnIRSInfos_t Infos structure 
residual_history  host  output  Returns a void pointer to the matrix of the convergence history residual norms. See the description above for the relation between the residual norm types and the inout datatype. 
CUSOLVER_STATUS_SUCCESS  The operation completed successfully. 
CUSOLVER_STATUS_IRS_PARAMS_NOT_INITIALIZED  The Params structure was not created. 
CUSOLVER_STATUS_IRS_INFOS_NOT_INITIALIZED  The Infos structure was not created. 
CUSOLVER_STATUS_INVALID_VALUE  This function was called without calling cusolverDnIRSInfosRequestResidual() in advance. 
Dense Linear Solver Reference
This chapter describes linear solver API of cuSolverDN, including Cholesky factorization, LU with partial pivoting, QR factorization and BunchKaufman (LDLT) factorization.
cusolverDn<t>potrf()
cusolverStatus_t cusolverDnSpotrf_bufferSize(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, float *A, int lda, int *Lwork ); cusolverStatus_t cusolverDnDpotrf_bufferSize(cusolveDnHandle_t handle, cublasFillMode_t uplo, int n, double *A, int lda, int *Lwork ); cusolverStatus_t cusolverDnCpotrf_bufferSize(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuComplex *A, int lda, int *Lwork ); cusolverStatus_t cusolverDnZpotrf_bufferSize(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnSpotrf(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, float *A, int lda, float *Workspace, int Lwork, int *devInfo ); cusolverStatus_t cusolverDnDpotrf(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, double *A, int lda, double *Workspace, int Lwork, int *devInfo );
cusolverStatus_t cusolverDnCpotrf(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuComplex *A, int lda, cuComplex *Workspace, int Lwork, int *devInfo ); cusolverStatus_t cusolverDnZpotrf(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, cuDoubleComplex *Workspace, int Lwork, int *devInfo );
This function computes the Cholesky factorization of a Hermitian positivedefinite matrix.
A is a n×n Hermitian matrix, only lower or upper part is meaningful. The input parameter uplo indicates which part of the matrix is used. The function would leave other part untouched.
If input parameter uplo is CUBLAS_FILL_MODE_LOWER, only lower triangular part of A is processed, and replaced by lower triangular Cholesky factor L.
$A=L*{L}^{H}$ 
If input parameter uplo is CUBLAS_FILL_MODE_UPPER, only upper triangular part of A is processed, and replaced by upper triangular Cholesky factor U.
$A={U}^{H}*U$ 
The user has to provide working space which is pointed by input parameter Workspace. The input parameter Lwork is size of the working space, and it is returned by potrf_bufferSize().
If Cholesky factorization failed, i.e. some leading minor of A is not positive definite, or equivalently some diagonal elements of L or U is not a real number. The output parameter devInfo would indicate smallest leading minor of A which is not positive definite.
If output parameter devInfo = i (less than zero), the ith parameter is wrong (not counting handle).
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
uplo  host  input  indicates if matrix A lower or upper part is stored, the other part is not referenced. 
n  host  input  number of rows and columns of matrix A. 
A  device  in/out  <type> array of dimension lda * n with lda is not less than max(1,n). 
lda  host  input  leading dimension of twodimensional array used to store matrix A. 
Workspace  device  in/out  working space, <type> array of size Lwork. 
Lwork  host  input  size of Workspace, returned by potrf_bufferSize. 
devInfo  device  output  if devInfo = 0, the Cholesky factorization is successful. if devInfo = i, the ith parameter is wrong (not counting handle). if devInfo = i, the leading minor of order i is not positive definite. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (n<0 or lda<max(1,n)). 
CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 2.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>potrs()
cusolverStatus_t cusolverDnSpotrs(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, int nrhs, const float *A, int lda, float *B, int ldb, int *devInfo); cusolverStatus_t cusolverDnDpotrs(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, int nrhs, const double *A, int lda, double *B, int ldb, int *devInfo); cusolverStatus_t cusolverDnCpotrs(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, int nrhs, const cuComplex *A, int lda, cuComplex *B, int ldb, int *devInfo); cusolverStatus_t cusolverDnZpotrs(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, int nrhs, const cuDoubleComplex *A, int lda, cuDoubleComplex *B, int ldb, int *devInfo);
This function solves a system of linear equations
$A*X=B$ 
where A is a n×n Hermitian matrix, only lower or upper part is meaningful. The input parameter uplo indicates which part of the matrix is used. The function would leave other part untouched.
The user has to call potrf first to factorize matrix A. If input parameter uplo is CUBLAS_FILL_MODE_LOWER, A is lower triangular Cholesky factor L correspoding to $A=L*{L}^{H}$ . If input parameter uplo is CUBLAS_FILL_MODE_UPPER, A is upper triangular Cholesky factor U corresponding to $A={U}^{H}*U$ .
The operation is inplace, i.e. matrix X overwrites matrix B with the same leading dimension ldb.
If output parameter devInfo = i (less than zero), the ith parameter is wrong (not counting handle).
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolveDN library context. 
uplo  host  input  indicates if matrix A lower or upper part is stored, the other part is not referenced. 
n  host  input  number of rows and columns of matrix A. 
nrhs  host  input  number of columns of matrix X and B. 
A  device  input  <type> array of dimension lda * n with lda is not less than max(1,n). A is either lower cholesky factor L or upper Cholesky factor U. 
lda  host  input  leading dimension of twodimensional array used to store matrix A. 
B  device  in/out  <type> array of dimension ldb * nrhs. ldb is not less than max(1,n). As an input, B is right hand side matrix. As an output, B is the solution matrix. 
devInfo  device  output  if devInfo = 0, the Cholesky factorization is successful. if devInfo = i, the ith parameter is wrong (not counting handle). 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (n<0, nrhs<0, lda<max(1,n) or ldb<max(1,n)). 
CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 2.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>potri()
cusolverStatus_t cusolverDnSpotri_bufferSize(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, float *A, int lda, int *Lwork ); cusolverStatus_t cusolverDnDpotri_bufferSize(cusolveDnHandle_t handle, cublasFillMode_t uplo, int n, double *A, int lda, int *Lwork ); cusolverStatus_t cusolverDnCpotri_bufferSize(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuComplex *A, int lda, int *Lwork ); cusolverStatus_t cusolverDnZpotri_bufferSize(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, int *Lwork);
cusolverStatus_t cusolverDnSpotri(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, float *A, int lda, float *Workspace, int Lwork, int *devInfo ); cusolverStatus_t cusolverDnDpotri(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, double *A, int lda, double *Workspace, int Lwork, int *devInfo );
cusolverStatus_t cusolverDnCpotri(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuComplex *A, int lda, cuComplex *Workspace, int Lwork, int *devInfo ); cusolverStatus_t cusolverDnZpotri(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, cuDoubleComplex *Workspace, int Lwork, int *devInfo );
This function computes the inverse of a positivedefinite matrix A using the Cholesky factorization
$A=L*{L}^{H}={U}^{H}*U$ 
computed by potrf().
A is a n×n matrix containing the triangular factor L or U computed by the Cholesky factorization. Only lower or upper part is meaningful and the input parameter uplo indicates which part of the matrix is used. The function would leave the other part untouched.
If the input parameter uplo is CUBLAS_FILL_MODE_LOWER, only lower triangular part of A is processed, and replaced the by lower triangular part of the inverse of A.
If the input parameter uplo is CUBLAS_FILL_MODE_UPPER, only upper triangular part of A is processed, and replaced by the upper triangular part of the inverse of A.
The user has to provide the working space which is pointed to by input parameter Workspace. The input parameter Lwork is the size of the working space, returned by potri_bufferSize().
If the computation of the inverse fails, i.e. some leading minor of L or U, is null, the output parameter devInfo would indicate the smallest leading minor of L or U which is not positive definite.
If the output parameter devInfo = i (less than zero), the ith parameter is wrong (not counting the handle).
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
uplo  host  input  indicates if matrix A lower or upper part is stored, the other part is not referenced. 
n  host  input  number of rows and columns of matrix A. 
A  device  in/out  <type> array of dimension lda * n where lda is not less than max(1,n). 
lda  host  input  leading dimension of twodimensional array used to store matrix A. 
Workspace  device  in/out  working space, <type> array of size Lwork. 
Lwork  host  input  size of Workspace, returned by potri_bufferSize. 
devInfo  device  output  if devInfo = 0, the computation of the inverse is successful. if devInfo = i, the ith parameter is wrong (not counting handle). if devInfo = i, the leading minor of order i is zero. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (n<0 or lda<max(1,n)). 
CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 2.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>getrf()
cusolverStatus_t cusolverDnSgetrf_bufferSize(cusolverDnHandle_t handle, int m, int n, float *A, int lda, int *Lwork ); cusolverStatus_t cusolverDnDgetrf_bufferSize(cusolverDnHandle_t handle, int m, int n, double *A, int lda, int *Lwork ); cusolverStatus_t cusolverDnCgetrf_bufferSize(cusolverDnHandle_t handle, int m, int n, cuComplex *A, int lda, int *Lwork ); cusolverStatus_t cusolverDnZgetrf_bufferSize(cusolverDnHandle_t handle, int m, int n, cuDoubleComplex *A, int lda, int *Lwork );
cusolverStatus_t cusolverDnSgetrf(cusolverDnHandle_t handle, int m, int n, float *A, int lda, float *Workspace, int *devIpiv, int *devInfo ); cusolverStatus_t cusolverDnDgetrf(cusolverDnHandle_t handle, int m, int n, double *A, int lda, double *Workspace, int *devIpiv, int *devInfo );
cusolverStatus_t cusolverDnCgetrf(cusolverDnHandle_t handle, int m, int n, cuComplex *A, int lda, cuComplex *Workspace, int *devIpiv, int *devInfo ); cusolverStatus_t cusolverDnZgetrf(cusolverDnHandle_t handle, int m, int n, cuDoubleComplex *A, int lda, cuDoubleComplex *Workspace, int *devIpiv, int *devInfo );
This function computes the LU factorization of a m×n matrix
$P*A=L*U$ 
where A is a m×n matrix, P is a permutation matrix, L is a lower triangular matrix with unit diagonal, and U is an upper triangular matrix.
The user has to provide working space which is pointed by input parameter Workspace. The input parameter Lwork is size of the working space, and it is returned by getrf_bufferSize().
If LU factorization failed, i.e. matrix A (U) is singular, The output parameter devInfo=i indicates U(i,i) = 0.
If output parameter devInfo = i (less than zero), the ith parameter is wrong (not counting handle).
If devIpiv is null, no pivoting is performed. The factorization is A=L*U, which is not numerically stable.
No matter LU factorization failed or not, the output parameter devIpiv contains pivoting sequence, row i is interchanged with row devIpiv(i).
The user can combine getrf and getrs to complete a linear solver. Please refer to appendix D.1.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
m  host  input  number of rows of matrix A. 
n  host  input  number of columns of matrix A. 
A  device  in/out  <type> array of dimension lda * n with lda is not less than max(1,m). 
lda  host  input  leading dimension of twodimensional array used to store matrix A. 
Workspace  device  in/out  working space, <type> array of size Lwork. 
devIpiv  device  output  array of size at least min(m,n), containing pivot indices. 
devInfo  device  output  if devInfo = 0, the LU factorization is successful. if devInfo = i, the ith parameter is wrong (not counting handle). if devInfo = i, the U(i,i) = 0. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (m,n<0 or lda<max(1,m)). 
CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 2.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>getrs()
cusolverStatus_t cusolverDnSgetrs(cusolverDnHandle_t handle, cublasOperation_t trans, int n, int nrhs, const float *A, int lda, const int *devIpiv, float *B, int ldb, int *devInfo ); cusolverStatus_t cusolverDnDgetrs(cusolverDnHandle_t handle, cublasOperation_t trans, int n, int nrhs, const double *A, int lda, const int *devIpiv, double *B, int ldb, int *devInfo ); cusolverStatus_t cusolverDnCgetrs(cusolverDnHandle_t handle, cublasOperation_t trans, int n, int nrhs, const cuComplex *A, int lda, const int *devIpiv, cuComplex *B, int ldb, int *devInfo ); cusolverStatus_t cusolverDnZgetrs(cusolverDnHandle_t handle, cublasOperation_t trans, int n, int nrhs, const cuDoubleComplex *A, int lda, const int *devIpiv, cuDoubleComplex *B, int ldb, int *devInfo );
This function solves a linear system of multiple righthand sides
$\mathrm{op(A)}*X=B$ 
where A is a n×n matrix, and was LUfactored by getrf, that is, lower trianular part of A is L, and upper triangular part (including diagonal elements) of A is U. B is a n×nrhs righthand side matrix.
The input parameter trans is defined by
$\text{op}(A)=\left\{\begin{array}{ll}A& \text{if}\mathsf{\text{trans == CUBLAS\_OP\_N}}\\ {A}^{T}& \text{if}\mathsf{\text{trans == CUBLAS\_OP\_T}}\\ {A}^{H}& \text{if}\mathsf{\text{trans == CUBLAS\_OP\_C}}\end{array}\right.$
The input parameter devIpiv is an output of getrf. It contains pivot indices, which are used to permutate righthand sides.
If output parameter devInfo = i (less than zero), the ith parameter is wrong (not counting handle).
The user can combine getrf and getrs to complete a linear solver. Please refer to appendix D.1.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
trans  host  input  operation op(A) that is non or (conj.) transpose. 
n  host  input  number of rows and columns of matrix A. 
nrhs  host  input  number of righthand sides. 
A  device  input  <type> array of dimension lda * n with lda is not less than max(1,n). 
lda  host  input  leading dimension of twodimensional array used to store matrix A. 
devIpiv  device  input  array of size at least n, containing pivot indices. 
B  device  output  <type> array of dimension ldb * nrhs with ldb is not less than max(1,n). 
ldb  host  input  leading dimension of twodimensional array used to store matrix B. 
devInfo  device  output  if devInfo = 0, the operation is successful. if devInfo = i, the ith parameter is wrong (not counting handle). 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (n<0 or lda<max(1,n) or ldb<max(1,n)). 
CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 2.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>geqrf()
cusolverStatus_t cusolverDnSgeqrf_bufferSize(cusolverDnHandle_t handle, int m, int n, float *A, int lda, int *Lwork ); cusolverStatus_t cusolverDnDgeqrf_bufferSize(cusolverDnHandle_t handle, int m, int n, double *A, int lda, int *Lwork ); cusolverStatus_t cusolverDnCgeqrf_bufferSize(cusolverDnHandle_t handle, int m, int n, cuComplex *A, int lda, int *Lwork ); cusolverStatus_t cusolverDnZgeqrf_bufferSize(cusolverDnHandle_t handle, int m, int n, cuDoubleComplex *A, int lda, int *Lwork );
cusolverStatus_t cusolverDnSgeqrf(cusolverDnHandle_t handle, int m, int n, float *A, int lda, float *TAU, float *Workspace, int Lwork, int *devInfo ); cusolverStatus_t cusolverDnDgeqrf(cusolverDnHandle_t handle, int m, int n, double *A, int lda, double *TAU, double *Workspace, int Lwork, int *devInfo );
cusolverStatus_t cusolverDnCgeqrf(cusolverDnHandle_t handle, int m, int n, cuComplex *A, int lda, cuComplex *TAU, cuComplex *Workspace, int Lwork, int *devInfo ); cusolverStatus_t cusolverDnZgeqrf(cusolverDnHandle_t handle, int m, int n, cuDoubleComplex *A, int lda, cuDoubleComplex *TAU, cuDoubleComplex *Workspace, int Lwork, int *devInfo );
This function computes the QR factorization of a m×n matrix
$A=Q*R$ 
where A is a m×n matrix, Q is a m×n matrix, and R is a n×n upper triangular matrix.
The user has to provide working space which is pointed by input parameter Workspace. The input parameter Lwork is size of the working space, and it is returned by geqrf_bufferSize().
The matrix R is overwritten in upper triangular part of A, including diagonal elements.
The matrix Q is not formed explicitly, instead, a sequence of householder vectors are stored in lower triangular part of A. The leading nonzero element of householder vector is assumed to be 1 such that output parameter TAU contains the scaling factor τ. If v is original householder vector, q is the new householder vector corresponding to τ, satisying the following relation
$I2*v*{v}^{H}=I\tau *q*{q}^{H}$ 
If output parameter devInfo = i (less than zero), the ith parameter is wrong (not counting handle).
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
m  host  input  number of rows of matrix A. 
n  host  input  number of columns of matrix A. 
A  device  in/out  <type> array of dimension lda * n with lda is not less than max(1,m). 
lda  host  input  leading dimension of twodimensional array used to store matrix A. 
TAU  device  output  <type> array of dimension at least min(m,n). 
Workspace  device  in/out  working space, <type> array of size Lwork. 
Lwork  host  input  size of working array Workspace. 
devInfo  device  output  if devInfo = 0, the LU factorization is successful. if devInfo = i, the ith parameter is wrong (not counting handle). 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (m,n<0 or lda<max(1,m)). 
CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 2.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>ormqr()
cusolverStatus_t cusolverDnSormqr_bufferSize( cusolverDnHandle_t handle, cublasSideMode_t side, cublasOperation_t trans, int m, int n, int k, const float *A, int lda, const float *tau, const float *C, int ldc, int *lwork); cusolverStatus_t cusolverDnDormqr_bufferSize( cusolverDnHandle_t handle, cublasSideMode_t side, cublasOperation_t trans, int m, int n, int k, const double *A, int lda, const double *tau, const double *C, int ldc, int *lwork); cusolverStatus_t cusolverDnCunmqr_bufferSize( cusolverDnHandle_t handle, cublasSideMode_t side, cublasOperation_t trans, int m, int n, int k, const cuComplex *A, int lda, const cuComplex *tau, const cuComplex *C, int ldc, int *lwork); cusolverStatus_t cusolverDnZunmqr_bufferSize( cusolverDnHandle_t handle, cublasSideMode_t side, cublasOperation_t trans, int m, int n, int k, const cuDoubleComplex *A, int lda, const cuDoubleComplex *tau, const cuDoubleComplex *C, int ldc, int *lwork);
cusolverStatus_t cusolverDnSormqr( cusolverDnHandle_t handle, cublasSideMode_t side, cublasOperation_t trans, int m, int n, int k, const float *A, int lda, const float *tau, float *C, int ldc, float *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnDormqr( cusolverDnHandle_t handle, cublasSideMode_t side, cublasOperation_t trans, int m, int n, int k, const double *A, int lda, const double *tau, double *C, int ldc, double *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnCunmqr( cusolverDnHandle_t handle, cublasSideMode_t side, cublasOperation_t trans, int m, int n, int k, const cuComplex *A, int lda, const cuComplex *tau, cuComplex *C, int ldc, cuComplex *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnZunmqr( cusolverDnHandle_t handle, cublasSideMode_t side, cublasOperation_t trans, int m, int n, int k, const cuDoubleComplex *A, int lda, const cuDoubleComplex *tau, cuDoubleComplex *C, int ldc, cuDoubleComplex *work, int lwork, int *devInfo);
This function overwrites m×n matrix C by
$C=\left\{\begin{array}{ll}\text{op}(Q)*C& \text{if}\mathsf{\text{side == CUBLAS\_SIDE\_LEFT}}\\ C*\text{op}(Q)& \text{if}\mathsf{\text{side == CUBLAS\_SIDE\_RIGHT}}\end{array}\right.$ 
The operation of Q is defined by
$\text{op}(Q)=\left\{\begin{array}{ll}Q& \text{if}\mathsf{\text{transa == CUBLAS\_OP\_N}}\\ {Q}^{T}& \text{if}\mathsf{\text{transa == CUBLAS\_OP\_T}}\\ {Q}^{H}& \text{if}\mathsf{\text{transa == CUBLAS\_OP\_C}}\end{array}\right.$ 
Q is a unitary matrix formed by a sequence of elementary reflection vectors from QR factorization (geqrf) of A.
Q=H(1)H(2) ... H(k)
Q is of order m if side = CUBLAS_SIDE_LEFT and of order n if side = CUBLAS_SIDE_RIGHT.
The user has to provide working space which is pointed by input parameter work. The input parameter lwork is size of the working space, and it is returned by geqrf_bufferSize() or ormqr_bufferSize().
If output parameter devInfo = i (less than zero), the ith parameter is wrong (not counting handle).
The user can combine geqrf, ormqr and trsm to complete a linear solver or a leastsquare solver. Please refer to appendix C.1.
parameter  Memory  In/out  Meaning 
handle  host  input  Handle to the cuSolverDn library context. 
side  host  input  Indicates if matrix Q is on the left or right of C. 
trans  host  input  Operation op(Q) that is non or (conj.) transpose. 
m  host  input  Number of columns of matrix C. 
n  host  input  Number of rows of matrix C. 
k  host  input  Number of elementary relfections whose product defines the matrix Q. 
A  device  in/out  <type> array of dimension lda * k with lda is not less than max(1,m). The matrix A is from geqrf, so ith column contains elementary reflection vector. 
lda  host  input  Leading dimension of twodimensional array used to store matrix A. if side is CUBLAS_SIDE_LEFT, lda >= max(1,m); if side is CUBLAS_SIDE_RIGHT, lda >= max(1,n). 
tau  device  output  <type> array of dimension at least min(m,n). The vector tau is from geqrf, so tau(i) is the scalar of ith elementary reflection vector. 
C  device  in/out  <type> array of size ldc * n. On exit, C is overwritten by op(Q)*C. 
ldc  host  input  Leading dimension of twodimensional array of matrix C. ldc >= max(1,m). 
work  device  in/out  Working space, <type> array of size lwork. 
lwork  host  input  Size of working array work. 
devInfo  device  output  If devInfo = 0, the ormqr is successful. If devInfo = i, the ith parameter is wrong (not counting handle). 
CUSOLVER_STATUS_SUCCESS  The operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  The library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  Invalid parameters were passed (m,n<0 or wrong lda or ldc). 
CUSOLVER_STATUS_ARCH_MISMATCH  The device only supports compute capability 2.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  An internal operation failed. 
cusolverDn<t>orgqr()
cusolverStatus_t cusolverDnSorgqr_bufferSize( cusolverDnHandle_t handle, int m, int n, int k, const float *A, int lda, const float *tau, int *lwork); cusolverStatus_t cusolverDnDorgqr_bufferSize( cusolverDnHandle_t handle, int m, int n, int k, const double *A, int lda, const double *tau, int *lwork); cusolverStatus_t cusolverDnCungqr_bufferSize( cusolverDnHandle_t handle, int m, int n, int k, const cuComplex *A, int lda, const cuComplex *tau, int *lwork); cusolverStatus_t cusolverDnZungqr_bufferSize( cusolverDnHandle_t handle, int m, int n, int k, const cuDoubleComplex *A, int lda, const cuDoubleComplex *tau, int *lwork);
cusolverStatus_t cusolverDnSorgqr( cusolverDnHandle_t handle, int m, int n, int k, float *A, int lda, const float *tau, float *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnDorgqr( cusolverDnHandle_t handle, int m, int n, int k, double *A, int lda, const double *tau, double *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnCungqr( cusolverDnHandle_t handle, int m, int n, int k, cuComplex *A, int lda, const cuComplex *tau, cuComplex *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnZungqr( cusolverDnHandle_t handle, int m, int n, int k, cuDoubleComplex *A, int lda, const cuDoubleComplex *tau, cuDoubleComplex *work, int lwork, int *devInfo);
This function overwrites m×n matrix A by
$Q=\mathrm{H(1)}*\mathrm{H(2)}*\mathrm{...}*\mathrm{H(k)}$ 
where Q is a unitary matrix formed by a sequence of elementary reflection vectors stored in A.
The user has to provide working space which is pointed by input parameter work. The input parameter lwork is size of the working space, and it is returned by orgqr_bufferSize().
If output parameter devInfo = i (less than zero), the ith parameter is wrong (not counting handle).
The user can combine geqrf, orgqr to complete orthogonalization. Please refer to appendix C.2.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
m  host  input  number of rows of matrix Q. m >= 0; 
n  host  input  number of columns of matrix Q. m >= n >= 0; 
k  host  input  number of elementary relfections whose product defines the matrix Q. n >= k >= 0; 
A  device  in/out  <type> array of dimension lda * n with lda is not less than max(1,m). ith column of A contains elementary reflection vector. 
lda  host  input  leading dimension of twodimensional array used to store matrix A. lda >= max(1,m). 
tau  device  output  <type> array of dimension k. tau(i) is the scalar of ith elementary reflection vector. 
work  device  in/out  working space, <type> array of size lwork. 
lwork  host  input  size of working array work. 
devInfo  device  output  if info = 0, the orgqr is successful. if info = i, the ith parameter is wrong (not counting handle). 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (m,n,k<0, n>m, k>n or lda<m). 
CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 2.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>sytrf()
cusolverStatus_t cusolverDnSsytrf_bufferSize(cusolverDnHandle_t handle, int n, float *A, int lda, int *Lwork ); cusolverStatus_t cusolverDnDsytrf_bufferSize(cusolverDnHandle_t handle, int n, double *A, int lda, int *Lwork ); cusolverStatus_t cusolverDnCsytrf_bufferSize(cusolverDnHandle_t handle, int n, cuComplex *A, int lda, int *Lwork ); cusolverStatus_t cusolverDnZsytrf_bufferSize(cusolverDnHandle_t handle, int n, cuDoubleComplex *A, int lda, int *Lwork );
cusolverStatus_t cusolverDnSsytrf(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, float *A, int lda, int *ipiv, float *work, int lwork, int *devInfo ); cusolverStatus_t cusolverDnDsytrf(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, double *A, int lda, int *ipiv, double *work, int lwork, int *devInfo );
cusolverStatus_t cusolverDnCsytrf(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuComplex *A, int lda, int *ipiv, cuComplex *work, int lwork, int *devInfo ); cusolverStatus_t cusolverDnZsytrf(cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, int *ipiv, cuDoubleComplex *work, int lwork, int *devInfo );
This function computes the BunchKaufman factorization of a n×n symmetric indefinite matrix
A is a n×n symmetric matrix, only lower or upper part is meaningful. The input parameter uplo which part of the matrix is used. The function would leave other part untouched.
If input parameter uplo is CUBLAS_FILL_MODE_LOWER, only lower triangular part of A is processed, and replaced by lower triangular factor L and block diagonal matrix D. Each block of D is either 1x1 or 2x2 block, depending on pivoting.
$P*A*{P}^{T}=L*D*{L}^{T}$ 
If input parameter uplo is CUBLAS_FILL_MODE_UPPER, only upper triangular part of A is processed, and replaced by upper triangular factor U and block diagonal matrix D.
$P*A*{P}^{T}=U*D*{U}^{T}$ 
The user has to provide working space which is pointed by input parameter work. The input parameter lwork is size of the working space, and it is returned by sytrf_bufferSize().
If BunchKaufman factorization failed, i.e. A is singular. The output parameter devInfo = i would indicate D(i,i)=0.
If output parameter devInfo = i (less than zero), the ith parameter is wrong (not counting handle).
The output parameter devIpiv contains pivoting sequence. If devIpiv(i) = k > 0, D(i,i) is 1x1 block, and ith row/column of A is interchanged with kth row/column of A. If uplo is CUBLAS_FILL_MODE_UPPER and devIpiv(i1) = devIpiv(i) = m < 0, D(i1:i,i1:i) is a 2x2 block, and (i1)th row/column is interchanged with mth row/column. If uplo is CUBLAS_FILL_MODE_LOWER and devIpiv(i+1) = devIpiv(i) = m < 0, D(i:i+1,i:i+1) is a 2x2 block, and (i+1)th row/column is interchanged with mth row/column.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
uplo  host  input  indicates if matrix A lower or upper part is stored, the other part is not referenced. 
n  host  input  number of rows and columns of matrix A. 
A  device  in/out  <type> array of dimension lda * n with lda is not less than max(1,n). 
lda  host  input  leading dimension of twodimensional array used to store matrix A. 
ipiv  device  output  array of size at least n, containing pivot indices. 
work  device  in/out  working space, <type> array of size lwork. 
lwork  host  input  size of working space work. 
devInfo  device  output  if devInfo = 0, the LU factorization is successful. if devInfo = i, the ith parameter is wrong (not counting handle). if devInfo = i, the D(i,i) = 0. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (n<0 or lda<max(1,n)). 
CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 2.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>potrfBatched()
cusolverStatus_t cusolverDnSpotrfBatched( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, float *Aarray[], int lda, int *infoArray, int batchSize); cusolverStatus_t cusolverDnDpotrfBatched( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, double *Aarray[], int lda, int *infoArray, int batchSize);
cusolverStatus_t cusolverDnCpotrfBatched( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuComplex *Aarray[], int lda, int *infoArray, int batchSize); cusolverStatus_t cusolverDnZpotrfBatched( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuDoubleComplex *Aarray[], int lda, int *infoArray, int batchSize);
This function computes the Cholesky factorization of a squence of Hermitian positivedefinite matrices.
Each Aarray[i] for i=0,1,..., batchSize1 is a n×n Hermitian matrix, only lower or upper part is meaningful. The input parameter uplo indicates which part of the matrix is used.
If input parameter uplo is CUBLAS_FILL_MODE_LOWER, only lower triangular part of A is processed, and replaced by lower triangular Cholesky factor L.
$A=L*{L}^{H}$ 
If input parameter uplo is CUBLAS_FILL_MODE_UPPER, only upper triangular part of A is processed, and replaced by upper triangular Cholesky factor U.
$A={U}^{H}*U$ 
If Cholesky factorization failed, i.e. some leading minor of A is not positive definite, or equivalently some diagonal elements of L or U is not a real number. The output parameter infoArray would indicate smallest leading minor of A which is not positive definite.
infoArray is an integer array of size batchsize. If potrfBatched returns CUSOLVER_STATUS_INVALID_VALUE, infoArray[0] = i (less than zero), meaning that the ith parameter is wrong (not counting handle). If potrfBatched returns CUSOLVER_STATUS_SUCCESS but infoArray[i] = k is positive, then ith matrix is not positive definite and the Cholesky factorization failed at row k.
Remark: the other part of A is used as a workspace. For example, if uplo is CUBLAS_FILL_MODE_UPPER, upper triangle of A contains cholesky factor U and lower triangle of A is destroyed after potrfBatched.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
uplo  host  input  indicates if lower or upper part is stored, the other part is used as a workspace. 
n  host  input  number of rows and columns of matrix A. 
Aarray  device  in/out  array of pointers to <type> array of dimension lda * n with lda is not less than max(1,n). 
lda  host  input  leading dimension of twodimensional array used to store each matrix Aarray[i]. 
infoArray  device  output  array of size batchSize. infoArray[i] contains information of factorization of Aarray[i]. if potrfBatched returns CUSOLVER_STATUS_INVALID_VALUE, infoArray[0] = i (less than zero) means the ith parameter is wrong (not counting handle). if potrfBatched returns CUSOLVER_STATUS_SUCCESS, infoArray[i] = 0 means the Cholesky factorization of ith matrix is successful, and infoArray[i] = k means the leading submatrix of order k of ith matrix is not positive definite. 
batchSize  host  input  number of pointers in Aarray. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (n<0 or lda<max(1,n) or batchSize<1). 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>potrsBatched()
cusolverStatus_t cusolverDnSpotrsBatched( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, int nrhs, float *Aarray[], int lda, float *Barray[], int ldb, int *info, int batchSize); cusolverStatus_t cusolverDnDpotrsBatched( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, int nrhs, double *Aarray[], int lda, double *Barray[], int ldb, int *info, int batchSize); cusolverStatus_t cusolverDnCpotrsBatched( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, int nrhs, cuComplex *Aarray[], int lda, cuComplex *Barray[], int ldb, int *info, int batchSize); cusolverStatus_t cusolverDnZpotrsBatched( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, int nrhs, cuDoubleComplex *Aarray[], int lda, cuDoubleComplex *Barray[], int ldb, int *info, int batchSize);
This function solves a squence of linear systems
$\mathrm{A[i]}*\mathrm{X[i]}=\mathrm{B[i]}$ 
where each Aarray[i] for i=0,1,..., batchSize1 is a n×n Hermitian matrix, only lower or upper part is meaningful. The input parameter uplo indicates which part of the matrix is used.
The user has to call potrfBatched first to factorize matrix Aarray[i]. If input parameter uplo is CUBLAS_FILL_MODE_LOWER, A is lower triangular Cholesky factor L correspoding to $A=L*{L}^{H}$ . If input parameter uplo is CUBLAS_FILL_MODE_UPPER, A is upper triangular Cholesky factor U corresponding to $A={U}^{H}*U$ .
The operation is inplace, i.e. matrix X overwrites matrix B with the same leading dimension ldb.
The output parameter info is a scalar. If info = i (less than zero), the ith parameter is wrong (not counting handle).
Remark 1: only nrhs=1 is supported.
Remark 2: infoArray from potrfBatched indicates if the matrix is positive definite. info from potrsBatched only shows which input parameter is wrong (not counting handle).
Remark 3: the other part of A is used as a workspace. For example, if uplo is CUBLAS_FILL_MODE_UPPER, upper triangle of A contains cholesky factor U and lower triangle of A is destroyed after potrsBatched.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolveDN library context. 
uplo  host  input  indicates if matrix A lower or upper part is stored. 
n  host  input  number of rows and columns of matrix A. 
nrhs  host  input  number of columns of matrix X and B. 
Aarray  device  in/out  array of pointers to <type> array of dimension lda * n with lda is not less than max(1,n). Aarray[i] is either lower cholesky factor L or upper Cholesky factor U. 
lda  host  input  leading dimension of twodimensional array used to store each matrix Aarray[i]. 
Barray  device  in/out  array of pointers to <type> array of dimension ldb * nrhs. ldb is not less than max(1,n). As an input, Barray[i] is right hand side matrix. As an output, Barray[i] is the solution matrix. 
ldb  host  input  leading dimension of twodimensional array used to store each matrix Barray[i]. 
info  device  output  if info = 0, all parameters are correct. if info = i, the ith parameter is wrong (not counting handle). 
batchSize  host  input  number of pointers in Aarray. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (n<0, nrhs<0, lda<max(1,n), ldb<max(1,n) or batchSize<0). 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t1><t2>gesv()
These functions are modelled after functions DSGESV and ZCGESV from LAPACK and compute the solution to a system of linear equations with multiple right hand sides using mixed precision iterative refinement
$A\times X=B$ 
Where A is nbyn matrix and X and B are nbynrhs matrices.
Functions are designed to be as close to LAPACK dropin replacements as possible. Parameters and behaviour are mostly the same as LAPACK counterparts. Description of LAPACK functions and differences from them are below.
<t1><t2>gesv() functions are designated by two floating point precisions  data type (full) precision and internal lower precision. cusolver<t1><t2>gesv() first attempts to factorize the matrix in lower precision and use this factorization within an iterative refinement procedure to obtain a solution with same normwise backward error as full precision. If the approach fails to converge, then the method switches to a full precision factorization and solve.
Note that in addition to the data type / lower floating point precision functions available in LAPACK  also functions with half precision as a lower precision are present. The following table specifies which precisions will be used for which interface function:
Interface function  Data type (matrix, rhs and solution)  Data floating point precision  Internal (compute) floating point precision 
cusolverDnDDgesv  double  double  double 
cusolverDnDSgesv  double  double  single 
cusolverDnDHgesv  double  double  half 
cusolverDnSSgesv  float  single  single 
cusolverDnSHgesv  float  single  half 
cusolverDnZZgesv  cuDoubleComplex  double  double 
cusolverDnZCgesv  cuDoubleComplex  double  single 
cusolverDnZKgesv  cuDoubleComplex  double  half 
cusolverDnCCgesv  cuComplex  single  single 
cusolverDnCKgesv  cuComplex  single  half 
ITER > ITERMAX
or for all the RHS we have:
RNRM < SQRT(N)*XNRM*ANRM*EPS*BWDMAX
where
 ITER is the number of the current iteration in the iterative refinement process
 RNRM is the infinitynorm of the residual
 XNRM is the infinitynorm of the solution
 ANRM is the infinityoperatornorm of the matrix A
 EPS is the machine epsilon that matches LAPACK <t1>LAMCH('Epsilon')
Solve process results will be indicated by output parameter info , see parameter description.
User should provide a large enough workspace allocated on the device for the <t1><t2>gesv() functions. The amount of bytes required can be queried by the respective <t1><t2>gesv_bufferSize() functions.
cusolverStatus_t cusolverDnDDgesv_bufferSize( cusolverHandle_t handle, int n, int nrhs, double * dA, int ldda, int * dipiv, double * dB, int lddb, double * dX, int lddx, void * dwork, size_t * lwork_bytes); cusolverStatus_t cusolverDnDSgesv_bufferSize( cusolverHandle_t handle, int n, int nrhs, double * dA, int ldda, int * dipiv, double * dB, int lddb, double * dX, int lddx, void * dwork, size_t * lwork_bytes); cusolverStatus_t cusolverDnDHgesv_bufferSize( cusolverHandle_t handle, int n, int nrhs, double * dA, int ldda, int * dipiv, double * dB, int lddb, double * dX, int lddx, void * dwork, size_t * lwork_bytes); cusolverStatus_t cusolverDnSSgesv_bufferSize( cusolverHandle_t handle, int n, int nrhs, float * dA, int ldda, int * dipiv, float * dB, int lddb, float * dX, int lddx, void * dwork, size_t * lwork_bytes); cusolverStatus_t cusolverDnSHgesv_bufferSize( cusolverHandle_t handle, int n, int nrhs, float * dA, int ldda, int * dipiv, float * dB, int lddb, float * dX, int lddx, void * dwork, size_t * lwork_bytes); cusolverStatus_t cusolverDnZZgesv_bufferSize( cusolverHandle_t handle, int n, int nrhs, cuDoubleComplex * dA, int ldda, int * dipiv, cuDoubleComplex * dB, int lddb, cuDoubleComplex * dX, int lddx, void * dwork, size_t * lwork_bytes); cusolverStatus_t cusolverDnZCgesv_bufferSize( cusolverHandle_t handle, int n, int nrhs, cuDoubleComplex * dA, int ldda, int * dipiv, cuDoubleComplex * dB, int lddb, cuDoubleComplex * dX, int lddx, void * dwork, size_t * lwork_bytes); cusolverStatus_t cusolverDnZKgesv_bufferSize( cusolverHandle_t handle, int n, int nrhs, cuDoubleComplex * dA, int ldda, int * dipiv, cuDoubleComplex * dB, int lddb, cuDoubleComplex * dX, int lddx, void * dwork, size_t * lwork_bytes); cusolverStatus_t cusolverDnCCgesv_bufferSize( cusolverHandle_t handle, int n, int nrhs, cuComplex * dA, int ldda, int * dipiv, cuComplex * dB, int lddb, cuComplex * dX, int lddx, void * dwork, size_t * lwork_bytes); cusolverStatus_t cusolverDnCKgesv_bufferSize( cusolverHandle_t handle, int n, int nrhs, cuComplex * dA, int ldda, int * dipiv, cuComplex * dB, int lddb, cuComplex * dX, int lddx, void * dwork, size_t * lwork_bytes);
parameter  Memory  In/out  Meaning 
handle  host  input  Handle to the cusolverDN library context. 
n  host  input  Number of rows and columns of square matrix A. Should be nonnegative. 
nrhs  host  input  Number of right hand sides to solve. Should be nonnegative. nrhs is limited to 1 if selected IRS solver is CUSOLVER_IRS_GMRES. 
dA  device  in  Matrix A with size nbyn. Can be NULL. 
ldda  host  input  leading dimension of twodimensional array used to store matrix A. lda >= n. 
dipiv  device  None  Pivoting sequence. Not used and can be NULL. 
dB  device  in  Set of right hand sides B of size nbynrhs. Can be NULL. 
lddb  host  input  leading dimension of twodimensional array used to store matrix of right hand sides B. ldb >= n. 
dX  device  in  Set of soultion vectors X of size nbynrhs. Can be NULL. 
lddx  host  input  leading dimension of twodimensional array used to store matrix of solution vectors X. ldx >= n. 
dwork  device  none  Pointer to device workspace. Not used and can be NULL. 
lwork_bytes  host  out  Pointer to a variable where required size of temporary workspace in bytes will be stored. Can't be NULL. 
cusolverStatus_t cusolverDnZZgesv( cusolverDnHandle_t handle, int n, int nrhs, cuDoubleComplex * dA, int ldda, int * dipiv, cuDoubleComplex * dB, int lddb, cuDoubleComplex * dX, int lddx, void * dWorkspace, size_t lwork_bytes, int * iter, int * d_info); cusolverStatus_t cusolverDnZCgesv( cusolverDnHandle_t handle, int n, int nrhs, cuDoubleComplex * dA, int ldda, int * dipiv, cuDoubleComplex * dB, int lddb, cuDoubleComplex * dX, int lddx, void * dWorkspace, size_t lwork_bytes, int * iter, int * d_info); cusolverStatus_t cusolverDnZKgesv( cusolverDnHandle_t handle, int n, int nrhs, cuDoubleComplex * dA, int ldda, int * dipiv, cuDoubleComplex * dB, int lddb, cuDoubleComplex * dX, int lddx, void * dWorkspace, size_t lwork_bytes, int * iter, int * d_info); cusolverStatus_t cusolverDnCCgesv( cusolverDnHandle_t handle, int n, int nrhs, cuComplex * dA, int ldda, int * dipiv, cuComplex * dB, int lddb, cuComplex * dX, int lddx, void * dWorkspace, size_t lwork_bytes, int * iter, int * d_info); cusolverStatus_t cusolverDnCKgesv( cusolverDnHandle_t handle, int n, int nrhs, cuComplex * dA, int ldda, int * dipiv, cuComplex * dB, int lddb, cuComplex * dX, int lddx, void * dWorkspace, size_t lwork_bytes, int * iter, int * d_info); cusolverStatus_t cusolverDnDDgesv( cusolverDnHandle_t handle, int n, int nrhs, double * dA, int ldda, int * dipiv, double * dB, int lddb, double * dX, int lddx, void * dWorkspace, size_t lwork_bytes, int * iter, int * d_info); cusolverStatus_t cusolverDnDSgesv( cusolverDnHandle_t handle, int n, int nrhs, double * dA, int ldda, int * dipiv, double * dB, int lddb, double * dX, int lddx, void * dWorkspace, size_t lwork_bytes, int * iter, int * d_info); cusolverStatus_t cusolverDnDHgesv( cusolverDnHandle_t handle, int n, int nrhs, double * dA, int ldda, int * dipiv, double * dB, int lddb, double * dX, int lddx, void * dWorkspace, size_t lwork_bytes, int * iter, int * d_info); cusolverStatus_t cusolverDnSSgesv( cusolverDnHandle_t handle, int n, int nrhs, float * dA, int ldda, int * dipiv, float * dB, int lddb, float * dX, int lddx, void * dWorkspace, size_t lwork_bytes, int * iter, int * d_info); cusolverStatus_t cusolverDnSHgesv( cusolverDnHandle_t handle, int n, int nrhs, float * dA, int ldda, int * dipiv, float * dB, int lddb, float * dX, int lddx, void * dWorkspace, size_t lwork_bytes, int * iter, int * d_info);
parameter  Memory  In/out  Meaning 
handle  host  input  Handle to the cusolverDN library context. 
n  host  input  Number of rows and columns of square matrix A. Should be nonnegative. 
nrhs  host  input  Number of right hand sides to solve. Should be nonnegative. nrhs is limited to 1 if selected IRS solver is CUSOLVER_IRS_GMRES. 
dA  device  in/out  Matrix A with size nbyn. Can't be NULL. On return  unchanged if solve process iterative refinement converged. If not  will contain full precision factorization of matrix A : A = P * L * U, where P  permutation matrix defined by vector ipiv, L and U  lower and upper triangular matrices. 
ldda  host  input  leading dimension of twodimensional array used to store matrix A. lda >= n. 
dipiv  device  in/out  Vector that defines permutation matrix for factorization  row i was interchanged with row ipiv[i] If NULL then no pivoting is performed. 
dB  device  in/out  Set of right hand sides B of size nbynrhs . Can't be NULL. 
lddb  host  input  leading dimension of twodimensional array used to store matrix of right hand sides B. ldb >= n. 
dX  device  in/out  Set of soultion vectors X of size nbynrhs. Can't be NULL. 
lddx  host  input  leading dimension of twodimensional array used to store matrix of solution vectors X. ldx >= n. 
dWorkspace  device  in/out  Pointer to a workspace in device memory of size lwork_bytes. 
lwork_bytes  host  in  Size of provided device workspace in bytes. Should be at least what was returned by cusolverDn<T1><T2>gesv_bufferSize() function 
iter  host  output  If iter is

info  device  output  Status of the iterative refinement on the return. If 0  solve was successful. If info = i then ith argument has is not valid. If info = i, then U(i,i) computed in full precision is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution could not be computed. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed:

CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 7.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
2.4.2.13. cusolverDnIRSXgesv()
This function is designed to perform same functionality as cusolverDn<T1><T2>gesv() functions, but wrapped in a more generic and expert interface that gives user more control to parametrize the function as well as it provides more informations on output. See cusolverDn<T1><T2>gesv() description for detailed explanation of the algorithm functionality and behaviour. cusolverDnIRSXgesv() allows additional control of the solver parameters such as setting the  lowest floating point precision authorized to be used by the solver, refinement solver type, maximum allowed number of iterative solver iterations, tolerance of the refinement solver  through Xgesv parameters structure and helper functions. cusolverDnIRSXgesv() provides additional informations on the output such as the convergence history of the residual array and the number of iterations needed to converge.
Following table provides authorized values for lowest precision parameter for specified full data type. Note that if lowest precision matches full datatype, then full precision factorization will be used
Data Type  Supported values for lowest precision in Xgesv parameters structure 
CUDA_R_32F  CUDA_R_32F, CUDA_R_16F 
CUDA_R_64F  CUDA_R_64F, CUDA_R_32F, CUDA_R_16F 
CUDA_C_32F  CUDA_C_32F, CUDA_C_16F 
CUDA_C_64F  CUDA_C_64F, CUDA_C_32F, CUDA_C_16F 
Solve process results will be indicated by output parameter info, see parameter description.
User should provide large enough workspace allocated on device for the cusolverDnIRSXgesv() function. Amount of bytes required for the function can be retrieved by respective function cusolverDnIRSXgesv_bufferSize()
cusolverStatus_t cusolverDnIRSXgesv_bufferSize( cusolverDnHandle_t handle, cusolverDnIRSParams_t params, cusolver_int_t n, cusolver_int_t nrhs, size_t * lwork_bytes);
parameter  Memory  In/out  Meaning 
handle  host  input  Handle to the cusolverDn library context. 
params  host  input  Xgesv solve parameters 
n  host  input  Number of rows and columns of the square matrix A. Should be nonnegative. 
nrhs  host  input  Number of right hand sides to solve. Should be nonnegative. Note that, nrhs is limited to 1 if the selected IRS refinement solver is CUSOLVER_IRS_REFINE_GMRES, CUSOLVER_IRS_REFINE_GMRES_GMRES, CUSOLVER_IRS_REFINE_CLASSICAL_GMRES. 
lwork_bytes  host  out  Pointer to a variable, where the required size in bytes, of the workspace will be stored after a call to cusolverDnIRSXgesv_bufferSize. Can't be NULL. 
cusolverStatus_t cusolverDnIRSXgesv( cusolverDnHandle_t handle, cusolverDnIRSParams_t gesv_irs_params, cusolverDnIRSInfos_t gesv_irs_infos, cudaDataType inout_data_type, int n, int nrhs, void * dA, int ldda, int * dipiv, void * dB, int lddb, void * dX, int lddx, void * dWorkspace, size_t lwork_bytes, int * niters, int * dinfo);
parameter  Memory  In/out  Meaning 
handle  host  input  Handle to the cusolverDn library context. 
gesv_irs_params  host  input  Solve parameters handle 
gesv_irs_infos  host  in/out  Info structure parameter handle where information about performed solve will be stored. 
inout_data_type  host  input  Datatype of Matrix, right hand side and solution 
n  host  input  Number of rows and columns of square matrix A. Should be nonnegative. 
nrhs  host  input  Number of right hand sides to solve. Should be nonnegative. Note that, nrhs is limited to 1 if the selected IRS refinement solver is CUSOLVER_IRS_REFINE_GMRES, CUSOLVER_IRS_REFINE_GMRES_GMRES, CUSOLVER_IRS_REFINE_CLASSICAL_GMRES. Number of right hand sides to solve. Should be nonnegative. 
dA  device  in  Matrix A with size nbyn. Can't be NULL. On return  unchanged if the iterative refinement solver converged. If not  will contain full precision factorization of matrix A : A = P * L * U, where P  permutation matrix defined by vector ipiv, L and U  lower and upper triangular matrices. 
ldda  host  input  leading dimension of twodimensional array used to store matrix A. lda >= n.. 
dipiv  device  in/out  Vector that defines permutation matrix for factorization  row i was interchanged with row ipiv[i] If NULL then no pivoting is performed. 
dB  device  in  Set of right hand sides B of size nbynrhs . Can be NULL. 
lddb  host  input  leading dimension of twodimensional array used to store matrix of right hand sides B. ldb >= n.. 
dX  device  in  Set of soultion vectors X of size nbynrhs . Can be NULL. 
lddx  host  input  leading dimension of twodimensional array used to store matrix of solution vectors X. ldx >= n.. 
dWorkspace  device  in/out  Pointer to a workspace in device memory of size lwork_bytes. 
lwork_bytes  host  in  Size of device workspace. Should be at least what was returned by cusolverDnIRSXgesv_bufferSize() function 
niters  host  output  If iter is

dinfo  device  output  Status of the iterative refinement on the return. If 0  solve was successful. If dinfo = i then ith argument has is not valid. If dinfo = i, then U(i,i) computed in full precision is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution could not be computed. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed:

CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 7.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
Dense Eigenvalue Solver Reference
This chapter describes eigenvalue solver API of cuSolverDN, including bidiagonalization and SVD.
cusolverDn<t>gebrd()
cusolverStatus_t cusolverDnSgebrd_bufferSize( cusolverDnHandle_t handle, int m, int n, int *Lwork ); cusolverStatus_t cusolverDnDgebrd_bufferSize( cusolverDnHandle_t handle, int m, int n, int *Lwork ); cusolverStatus_t cusolverDnCgebrd_bufferSize( cusolverDnHandle_t handle, int m, int n, int *Lwork ); cusolverStatus_t cusolverDnZgebrd_bufferSize( cusolverDnHandle_t handle, int m, int n, int *Lwork );
cusolverStatus_t cusolverDnSgebrd(cusolverDnHandle_t handle, int m, int n, float *A, int lda, float *D, float *E, float *TAUQ, float *TAUP, float *Work, int Lwork, int *devInfo ); cusolverStatus_t cusolverDnDgebrd(cusolverDnHandle_t handle, int m, int n, double *A, int lda, double *D, double *E, double *TAUQ, double *TAUP, double *Work, int Lwork, int *devInfo );
cusolverStatus_t cusolverDnCgebrd(cusolverDnHandle_t handle, int m, int n, cuComplex *A, int lda, float *D, float *E, cuComplex *TAUQ, cuComplex *TAUP, cuComplex *Work, int Lwork, int *devInfo ); cusolverStatus_t cusolverDnZgebrd(cusolverDnHandle_t handle, int m, int n, cuDoubleComplex *A, int lda, double *D, double *E, cuDoubleComplex *TAUQ, cuDoubleComplex *TAUP, cuDoubleComplex *Work, int Lwork, int *devInfo );
This function reduces a general m×n matrix A to a real upper or lower bidiagonal form B by an orthogonal transformation: ${Q}^{H}*A*P=B$
If m>=n, B is upper bidiagonal; if m<n, B is lower bidiagonal.
The matrix Q and P are overwritten into matrix A in the following sense:
if m>=n, the diagonal and the first superdiagonal are overwritten with the upper bidiagonal matrix B; the elements below the diagonal, with the array TAUQ, represent the orthogonal matrix Q as a product of elementary reflectors, and the elements above the first superdiagonal, with the array TAUP, represent the orthogonal matrix P as a product of elementary reflectors.
if m<n, the diagonal and the first subdiagonal are overwritten with the lower bidiagonal matrix B; the elements below the first subdiagonal, with the array TAUQ, represent the orthogonal matrix Q as a product of elementary reflectors, and the elements above the diagonal, with the array TAUP, represent the orthogonal matrix P as a product of elementary reflectors.
The user has to provide working space which is pointed by input parameter Work. The input parameter Lwork is size of the working space, and it is returned by gebrd_bufferSize().
If output parameter devInfo = i (less than zero), the ith parameter is wrong (not counting handle).
Remark: gebrd only supports m>=n.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
m  host  input  number of rows of matrix A. 
n  host  input  number of columns of matrix A. 
A  device  in/out  <type> array of dimension lda * n with lda is not less than max(1,n). 
lda  host  input  leading dimension of twodimensional array used to store matrix A. 
D  device  output  real array of dimension min(m,n). The diagonal elements of the bidiagonal matrix B: D(i) = A(i,i). 
E  device  output  real array of dimension min(m,n). The offdiagonal elements of the bidiagonal matrix B: if m>=n, E(i) = A(i,i+1) for i = 1,2,...,n1; if m<n, E(i) = A(i+1,i) for i = 1,2,...,m1. 
TAUQ  device  output  <type> array of dimension min(m,n). The scalar factors of the elementary reflectors which represent the orthogonal matrix Q. 
TAUP  device  output  <type> array of dimension min(m,n). The scalar factors of the elementary reflectors which represent the orthogonal matrix P. 
Work  device  in/out  working space, <type> array of size Lwork. 
Lwork  host  input  size of Work, returned by gebrd_bufferSize. 
devInfo  device  output  if devInfo = 0, the operation is successful. if devInfo = i, the ith parameter is wrong (not counting handle). 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (m,n<0, or lda<max(1,m)). 
CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 2.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>orgbr()
cusolverStatus_t cusolverDnSorgbr_bufferSize( cusolverDnHandle_t handle, cublasSideMode_t side, int m, int n, int k, const float *A, int lda, const float *tau, int *lwork); cusolverStatus_t cusolverDnDorgbr_bufferSize( cusolverDnHandle_t handle, cublasSideMode_t side, int m, int n, int k, const double *A, int lda, const double *tau, int *lwork); cusolverStatus_t cusolverDnCungbr_bufferSize( cusolverDnHandle_t handle, cublasSideMode_t side, int m, int n, int k, const cuComplex *A, int lda, const cuComplex *tau, int *lwork); cusolverStatus_t cusolverDnZungbr_bufferSize( cusolverDnHandle_t handle, cublasSideMode_t side, int m, int n, int k, const cuDoubleComplex *A, int lda, const cuDoubleComplex *tau, int *lwork);
cusolverStatus_t cusolverDnSorgbr( cusolverDnHandle_t handle, cublasSideMode_t side, int m, int n, int k, float *A, int lda, const float *tau, float *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnDorgbr( cusolverDnHandle_t handle, cublasSideMode_t side, int m, int n, int k, double *A, int lda, const double *tau, double *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnCungbr( cusolverDnHandle_t handle, cublasSideMode_t side, int m, int n, int k, cuComplex *A, int lda, const cuComplex *tau, cuComplex *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnZungbr( cusolverDnHandle_t handle, cublasSideMode_t side, int m, int n, int k, cuDoubleComplex *A, int lda, const cuDoubleComplex *tau, cuDoubleComplex *work, int lwork, int *devInfo);
This function generates one of the unitary matrices Q or P**H determined by gebrd when reducing a matrix A to bidiagonal form: ${Q}^{H}*A*P=B$
Q and P**H are defined as products of elementary reflectors H(i) or G(i) respectively.
The user has to provide working space which is pointed by input parameter work. The input parameter lwork is size of the working space, and it is returned by orgbr_bufferSize().
If output parameter devInfo = i (less than zero), the ith parameter is wrong (not counting handle).
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
side  host  input  if side = CUBLAS_SIDE_LEFT, generate Q. if side = CUBLAS_SIDE_RIGHT, generate P**T. 
m  host  input  number of rows of matrix Q or P**T. 
n  host  input  if side = CUBLAS_SIDE_LEFT, m>= n>= min(m,k). if side = CUBLAS_SIDE_RIGHT, n>= m>= min(n,k). 
k  host  input  if side = CUBLAS_SIDE_LEFT, the number of columns in the original mbyk matrix reduced by gebrd. if side = CUBLAS_SIDE_RIGHT, the number of rows in the original kbyn matrix reduced by gebrd. 
A  device  in/out  <type> array of dimension lda * n On entry, the vectors which define the elementary reflectors, as returned by gebrd. On exit, the mbyn matrix Q or P**T. 
lda  host  input  leading dimension of twodimensional array used to store matrix A. lda >= max(1,m); 
tau  device  output  <type> array of dimension min(m,k) if side is CUBLAS_SIDE_LEFT; of dimension min(n,k) if side is CUBLAS_SIDE_RIGHT; tau(i) must contain the scalar factor of the elementary reflector H(i) or G(i), which determines Q or P**T, as returned by gebrd in its array argument TAUQ or TAUP. 
work  device  in/out  working space, <type> array of size lwork. 
lwork  host  input  size of working array work. 
devInfo  device  output  if info = 0, the ormqr is successful. if info = i, the ith parameter is wrong (not counting handle). 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (m,n<0 or wrong lda ). 
CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 2.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>sytrd()
cusolverStatus_t cusolverDnSsytrd_bufferSize( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, const float *A, int lda, const float *d, const float *e, const float *tau, int *lwork); cusolverStatus_t cusolverDnDsytrd_bufferSize( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, const double *A, int lda, const double *d, const double *e, const double *tau, int *lwork); cusolverStatus_t cusolverDnChetrd_bufferSize( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, const cuComplex *A, int lda, const float *d, const float *e, const cuComplex *tau, int *lwork); cusolverStatus_t cusolverDnZhetrd_bufferSize( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, const cuDoubleComplex *A, int lda, const double *d, const double *e, const cuDoubleComplex *tau, int *lwork);
cusolverStatus_t cusolverDnSsytrd( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, float *A, int lda, float *d, float *e, float *tau, float *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnDsytrd( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, double *A, int lda, double *d, double *e, double *tau, double *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnChetrd( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuComplex *A, int lda, float *d, float *e, cuComplex *tau, cuComplex *work, int lwork, int *devInfo); cusolverStatus_t CUDENSEAPI cusolverDnZhetrd( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, double *d, double *e, cuDoubleComplex *tau, cuDoubleComplex *work, int lwork, int *devInfo);
This function reduces a general symmetric (Hermitian) n×n matrix A to real symmetric tridiagonal form T by an orthogonal transformation: ${Q}^{H}*A*Q=T$
As an output, A contains T and householder reflection vectors. If uplo = CUBLAS_FILL_MODE_UPPER, the diagonal and first superdiagonal of A are overwritten by the corresponding elements of the tridiagonal matrix T, and the elements above the first superdiagonal, with the array tau, represent the orthogonal matrix Q as a product of elementary reflectors; If uplo = CUBLAS_FILL_MODE_LOWER, the diagonal and first subdiagonal of A are overwritten by the corresponding elements of the tridiagonal matrix T, and the elements below the first subdiagonal, with the array tau, represent the orthogonal matrix Q as a product of elementary reflectors.
The user has to provide working space which is pointed by input parameter work. The input parameter lwork is size of the working space, and it is returned by sytrd_bufferSize().
If output parameter devInfo = i (less than zero), the ith parameter is wrong (not counting handle).
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
uplo  host  input  specifies which part of A is stored. uplo = CUBLAS_FILL_MODE_LOWER: Lower triangle of A is stored. uplo = CUBLAS_FILL_MODE_UPPER: Upper triangle of A is stored. 
n  host  input  number of rows (columns) of matrix A. 
A  device  in/out  <type> array of dimension lda * n with lda is not less than max(1,n). If uplo = CUBLAS_FILL_MODE_UPPER, the leading nbyn upper triangular part of A contains the upper triangular part of the matrix A, and the strictly lower triangular part of A is not referenced. If uplo = CUBLAS_FILL_MODE_LOWER, the leading nbyn lower triangular part of A contains the lower triangular part of the matrix A, and the strictly upper triangular part of A is not referenced. On exit, A is overwritten by T and householder reflection vectors. 
lda  host  input  leading dimension of twodimensional array used to store matrix A. lda >= max(1,n). 
D  device  output  real array of dimension n. The diagonal elements of the tridiagonal matrix T: D(i) = A(i,i). 
E  device  output  real array of dimension (n1). The offdiagonal elements of the tridiagonal matrix T: if uplo = CUBLAS_FILL_MODE_UPPER, E(i) = A(i,i+1). if uplo = CUBLAS_FILL_MODE_LOWERE(i) = A(i+1,i). 
tau  device  output  <type> array of dimension (n1). The scalar factors of the elementary reflectors which represent the orthogonal matrix Q. 
work  device  in/out  working space, <type> array of size lwork. 
lwork  host  input  size of work, returned by sytrd_bufferSize. 
devInfo  device  output  if devInfo = 0, the operation is successful. if devInfo = i, the ith parameter is wrong (not counting handle). 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (n<0, or lda<max(1,n), or uplo is not CUBLAS_FILL_MODE_LOWER or CUBLAS_FILL_MODE_UPPER). 
CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 2.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>ormtr()
cusolverStatus_t cusolverDnSormtr_bufferSize( cusolverDnHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, int m, int n, const float *A, int lda, const float *tau, const float *C, int ldc, int *lwork); cusolverStatus_t cusolverDnDormtr_bufferSize( cusolverDnHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, int m, int n, const double *A, int lda, const double *tau, const double *C, int ldc, int *lwork); cusolverStatus_t cusolverDnCunmtr_bufferSize( cusolverDnHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, int m, int n, const cuComplex *A, int lda, const cuComplex *tau, const cuComplex *C, int ldc, int *lwork); cusolverStatus_t cusolverDnZunmtr_bufferSize( cusolverDnHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, int m, int n, const cuDoubleComplex *A, int lda, const cuDoubleComplex *tau, const cuDoubleComplex *C, int ldc, int *lwork);
cusolverStatus_t cusolverDnSormtr( cusolverDnHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, int m, int n, float *A, int lda, float *tau, float *C, int ldc, float *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnDormtr( cusolverDnHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, int m, int n, double *A, int lda, double *tau, double *C, int ldc, double *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnCunmtr( cusolverDnHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, int m, int n, cuComplex *A, int lda, cuComplex *tau, cuComplex *C, int ldc, cuComplex *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnZunmtr( cusolverDnHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, int m, int n, cuDoubleComplex *A, int lda, cuDoubleComplex *tau, cuDoubleComplex *C, int ldc, cuDoubleComplex *work, int lwork, int *devInfo);
This function overwrites m×n matrix C by
$C=\left\{\begin{array}{ll}\text{op}(Q)*C& \text{if}\mathsf{\text{side == CUBLAS\_SIDE\_LEFT}}\\ C*\text{op}(Q)& \text{if}\mathsf{\text{side == CUBLAS\_SIDE\_RIGHT}}\end{array}\right.$ 
where Q is a unitary matrix formed by a sequence of elementary reflection vectors from sytrd.
The operation on Q is defined by
$\text{op}(Q)=\left\{\begin{array}{ll}Q& \text{if}\mathsf{\text{transa == CUBLAS\_OP\_N}}\\ {Q}^{T}& \text{if}\mathsf{\text{transa == CUBLAS\_OP\_T}}\\ {Q}^{H}& \text{if}\mathsf{\text{transa == CUBLAS\_OP\_C}}\end{array}\right.$ 
The user has to provide working space which is pointed by input parameter work. The input parameter lwork is size of the working space, and it is returned by ormtr_bufferSize().
If output parameter devInfo = i (less than zero), the ith parameter is wrong (not counting handle).
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
side  host  input  side = CUBLAS_SIDE_LEFT, apply Q or Q**T from the Left; side = CUBLAS_SIDE_RIGHT, apply Q or Q**T from the Right. 
uplo  host  input  uplo = CUBLAS_FILL_MODE_LOWER: Lower triangle of A contains elementary reflectors from sytrd. uplo = CUBLAS_FILL_MODE_UPPER: Upper triangle of A contains elementary reflectors from sytrd. 
trans  host  input  operation op(Q) that is non or (conj.) transpose. 
m  host  input  number of rows of matrix C. 
n  host  input  number of columns of matrix C. 
A  device  in/out  <type> array of dimension lda * m if side = CUBLAS_SIDE_LEFT; lda * n if side = CUBLAS_SIDE_RIGHT. The matrix A from sytrd contains the elementary reflectors. 
lda  host  input  leading dimension of twodimensional array used to store matrix A. if side is CUBLAS_SIDE_LEFT, lda >= max(1,m); if side is CUBLAS_SIDE_RIGHT, lda >= max(1,n). 
tau  device  output  <type> array of dimension (m1) if side is CUBLAS_SIDE_LEFT; of dimension (n1) if side is CUBLAS_SIDE_RIGHT; The vector tau is from sytrd, so tau(i) is the scalar of ith elementary reflection vector. 
C  device  in/out  <type> array of size ldc * n. On exit, C is overwritten by op(Q)*C or C*op(Q). 
ldc  host  input  leading dimension of twodimensional array of matrix C. ldc >= max(1,m). 
work  device  in/out  working space, <type> array of size lwork. 
lwork  host  input  size of working array work. 
devInfo  device  output  if devInfo = 0, the ormqr is successful. if devInfo = i, the ith parameter is wrong (not counting handle). 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (m,n<0 or wrong lda or ldc). 
CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 2.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>orgtr()
cusolverStatus_t cusolverDnSorgtr_bufferSize( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, const float *A, int lda, const float *tau, int *lwork); cusolverStatus_t cusolverDnDorgtr_bufferSize( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, const double *A, int lda, const double *tau, int *lwork); cusolverStatus_t cusolverDnCungtr_bufferSize( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, const cuComplex *A, int lda, const cuComplex *tau, int *lwork); cusolverStatus_t cusolverDnZungtr_bufferSize( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, const cuDoubleComplex *A, int lda, const cuDoubleComplex *tau, int *lwork);
cusolverStatus_t cusolverDnSorgtr( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, float *A, int lda, const float *tau, float *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnDorgtr( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, double *A, int lda, const double *tau, double *work, int lwork, int *devInfo);
cusolverStatus_t cusolverDnCungtr( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuComplex *A, int lda, const cuComplex *tau, cuComplex *work, int lwork, int *devInfo); cusolverStatus_t cusolverDnZungtr( cusolverDnHandle_t handle, cublasFillMode_t uplo, int n, cuDoubleComplex *A, int lda, const cuDoubleComplex *tau, cuDoubleComplex *work, int lwork, int *devInfo);
This function generates a unitary matrix Q which is defined as the product of n1 elementary reflectors of order n, as returned by sytrd:
The user has to provide working space which is pointed by input parameter work. The input parameter lwork is size of the working space, and it is returned by orgtr_bufferSize().
If output parameter devInfo = i (less than zero), the ith parameter is wrong (not counting handle).
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
uplo  host  input  uplo = CUBLAS_FILL_MODE_LOWER: Lower triangle of A contains elementary reflectors from sytrd. uplo = CUBLAS_FILL_MODE_UPPER: Upper triangle of A contains elementary reflectors from sytrd. 
n  host  input  number of rows (columns) of matrix Q. 
A  device  in/out  <type> array of dimension lda * n On entry, matrix A from sytrd contains the elementary reflectors. On exit, matrix A contains the nbyn orthogonal matrix Q. 
lda  host  input  leading dimension of twodimensional array used to store matrix A. lda >= max(1,n). 
tau  device  output  <type> array of dimension (n1)tau(i) is the scalar of ith elementary reflection vector. 
work  device  in/out  working space, <type> array of size lwork. 
lwork  host  input  size of working array work. 
devInfo  device  output  if devInfo = 0, the orgtr is successful. if devInfo = i, the ith parameter is wrong (not counting handle). 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (n<0 or wrong lda ). 
CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 2.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>gesvd()
cusolverStatus_t cusolverDnSgesvd_bufferSize( cusolverDnHandle_t handle, int m, int n, int *lwork ); cusolverStatus_t cusolverDnDgesvd_bufferSize( cusolverDnHandle_t handle, int m, int n, int *lwork ); cusolverStatus_t cusolverDnCgesvd_bufferSize( cusolverDnHandle_t handle, int m, int n, int *lwork ); cusolverStatus_t cusolverDnZgesvd_bufferSize( cusolverDnHandle_t handle, int m, int n, int *lwork );
cusolverStatus_t cusolverDnSgesvd ( cusolverDnHandle_t handle, signed char jobu, signed char jobvt, int m, int n, float *A, int lda, float *S, float *U, int ldu, float *VT, int ldvt, float *work, int lwork, float *rwork, int *devInfo); cusolverStatus_t cusolverDnDgesvd ( cusolverDnHandle_t handle, signed char jobu, signed char jobvt, int m, int n, double *A, int lda, double *S, double *U, int ldu, double *VT, int ldvt, double *work, int lwork, double *rwork, int *devInfo);
cusolverStatus_t cusolverDnCgesvd ( cusolverDnHandle_t handle, signed char jobu, signed char jobvt, int m, int n, cuComplex *A, int lda, float *S, cuComplex *U, int ldu, cuComplex *VT, int ldvt, cuComplex *work, int lwork, float *rwork, int *devInfo); cusolverStatus_t cusolverDnZgesvd ( cusolverDnHandle_t handle, signed char jobu, signed char jobvt, int m, int n, cuDoubleComplex *A, int lda, double *S, cuDoubleComplex *U, int ldu, cuDoubleComplex *VT, int ldvt, cuDoubleComplex *work, int lwork, double *rwork, int *devInfo);
This function computes the singular value decomposition (SVD) of a m×n matrix A and corresponding the left and/or right singular vectors. The SVD is written
$A=U*\Sigma *{V}^{H}$ 
where Σ is an m×n matrix which is zero except for its min(m,n) diagonal elements, U is an m×m unitary matrix, and V is an n×n unitary matrix. The diagonal elements of Σ are the singular values of A; they are real and nonnegative, and are returned in descending order. The first min(m,n) columns of U and V are the left and right singular vectors of A.
The user has to provide working space which is pointed by input parameter work. The input parameter lwork is size of the working space, and it is returned by gesvd_bufferSize().
If output parameter devInfo = i (less than zero), the ith parameter is wrong (not counting handle). if bdsqr did not converge, devInfo specifies how many superdiagonals of an intermediate bidiagonal form did not converge to zero.
The rwork is real array of dimension (min(m,n)1). If devInfo>0 and rwork is not nil, rwork contains the unconverged superdiagonal elements of an upper bidiagonal matrix. This is slightly different from LAPACK which puts unconverged superdiagonal elements in work if type is real; in rwork if type is complex. rwork can be a NULL pointer if the user does not want the information from supperdiagonal.
Appendix G.1 provides a simple example of gesvd.
Remark 1: gesvd only supports m>=n.
Remark 2: the routine returns ${V}^{H}$ , not V.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
jobu  host  input  specifies options for computing all or part of the matrix U: = 'A': all m columns of U are returned in array U: = 'S': the first min(m,n) columns of U (the left singular vectors) are returned in the array U; = 'O': the first min(m,n) columns of U (the left singular vectors) are overwritten on the array A; = 'N': no columns of U (no left singular vectors) are computed. 
jobvt  host  input  specifies options for computing all or part of the matrix V**T: = 'A': all N rows of V**T are returned in the array VT; = 'S': the first min(m,n) rows of V**T (the right singular vectors) are returned in the array VT; = 'O': the first min(m,n) rows of V**T (the right singular vectors) are overwritten on the array A; = 'N': no rows of V**T (no right singular vectors) are computed. 
m  host  input  number of rows of matrix A. 
n  host  input  number of columns of matrix A. 
A  device  in/out  <type> array of dimension lda * n with lda is not less than max(1,m). On exit, the contents of A are destroyed. 
lda  host  input  leading dimension of twodimensional array used to store matrix A. 
S  device  output  real array of dimension min(m,n). The singular values of A, sorted so that S(i) >= S(i+1). 
U  device  output  <type> array of dimension ldu * m with ldu is not less than max(1,m). U contains the m×m unitary matrix U. 
ldu  host  input  leading dimension of twodimensional array used to store matrix U. 
VT  device  output  <type> array of dimension ldvt * n with ldvt is not less than max(1,n). VT contains the n×n unitary matrix V**T. 
ldvt  host  input  leading dimension of twodimensional array used to store matrix Vt. 
work  device  in/out  working space, <type> array of size lwork. 
lwork  host  input  size of work, returned by gesvd_bufferSize. 
rwork  device  input  real array of dimension min(m,n)1. It contains the unconverged superdiagonal elements of an upper bidiagonal matrix if devInfo > 0. 
devInfo  device  output  if devInfo = 0, the operation is successful. if devInfo = i, the ith parameter is wrong (not counting handle). if devInfo > 0, devInfo indicates how many superdiagonals of an intermediate bidiagonal form did not converge to zero. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (m,n<0 or lda<max(1,m) or ldu<max(1,m) or ldvt<max(1,n) ). 
CUSOLVER_STATUS_ARCH_MISMATCH  the device only supports compute capability 2.0 and above. 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>gesvdj()
cusolverStatus_t cusolverDnSgesvdj_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int econ, int m, int n, const float *A, int lda, const float *S, const float *U, int ldu, const float *V, int ldv, int *lwork, gesvdjInfo_t params); cusolverStatus_t cusolverDnDgesvdj_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int econ, int m, int n, const double *A, int lda, const double *S, const double *U, int ldu, const double *V, int ldv, int *lwork, gesvdjInfo_t params); cusolverStatus_t cusolverDnCgesvdj_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int econ, int m, int n, const cuComplex *A, int lda, const float *S, const cuComplex *U, int ldu, const cuComplex *V, int ldv, int *lwork, gesvdjInfo_t params); cusolverStatus_t cusolverDnZgesvdj_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int econ, int m, int n, const cuDoubleComplex *A, int lda, const double *S, const cuDoubleComplex *U, int ldu, const cuDoubleComplex *V, int ldv, int *lwork, gesvdjInfo_t params);
cusolverStatus_t cusolverDnSgesvdj( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int econ, int m, int n, float *A, int lda, float *S, float *U, int ldu, float *V, int ldv, float *work, int lwork, int *info, gesvdjInfo_t params); cusolverStatus_t cusolverDnDgesvdj( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int econ, int m, int n, double *A, int lda, double *S, double *U, int ldu, double *V, int ldv, double *work, int lwork, int *info, gesvdjInfo_t params);
cusolverStatus_t cusolverDnCgesvdj( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int econ, int m, int n, cuComplex *A, int lda, float *S, cuComplex *U, int ldu, cuComplex *V, int ldv, cuComplex *work, int lwork, int *info, gesvdjInfo_t params); cusolverStatus_t cusolverDnZgesvdj( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int econ, int m, int n, cuDoubleComplex *A, int lda, double *S, cuDoubleComplex *U, int ldu, cuDoubleComplex *V, int ldv, cuDoubleComplex *work, int lwork, int *info, gesvdjInfo_t params);
This function computes the singular value decomposition (SVD) of a m×n matrix A and corresponding the left and/or right singular vectors. The SVD is written
$A=U*\Sigma *{V}^{H}$ 
where Σ is an m×n matrix which is zero except for its min(m,n) diagonal elements, U is an m×m unitary matrix, and V is an n×n unitary matrix. The diagonal elements of Σ are the singular values of A; they are real and nonnegative, and are returned in descending order. The first min(m,n) columns of U and V are the left and right singular vectors of A.
gesvdj has the same functionality as gesvd. The difference is that gesvd uses QR algorithm and gesvdj uses Jacobi method. The parallelism of Jacobi method gives GPU better performance on small and medium size matrices. Moreover the user can configure gesvdj to perform approximation up to certain accuracy.
gesvdj iteratively generates a sequence of unitary matrices to transform matrix A to the following form
${U}^{H}*A*V=S+E$ 
where S is diagonal and diagonal of E is zero.
During the iterations, the Frobenius norm of E decreases monotonically. As E goes down to zero, S is the set of singular values. In practice, Jacobi method stops if
${\mathrm{E}}_{F}\le eps*{\mathrm{A}}_{F}$ 
where eps is given tolerance.
gesvdj has two parameters to control the accuracy. First parameter is tolerance (eps). The default value is machine accuracy but The user can use function cusolverDnXgesvdjSetTolerance to set a priori tolerance. The second parameter is maximum number of sweeps which controls number of iterations of Jacobi method. The default value is 100 but the user can use function cusolverDnXgesvdjSetMaxSweeps to set a proper bound. The experimentis show 15 sweeps are good enough to converge to machine accuracy. gesvdj stops either tolerance is met or maximum number of sweeps is met.
Jacobi method has quadratic convergence, so the accuracy is not proportional to number of sweeps. To guarantee certain accuracy, the user should configure tolerance only.
The user has to provide working space which is pointed by input parameter work. The input parameter lwork is the size of the working space, and it is returned by gesvdj_bufferSize().
If output parameter info = i (less than zero), the ith parameter is wrong (not counting handle). If info = min(m,n)+1, gesvdj does not converge under given tolerance and maximum sweeps.
If the user sets an improper tolerance, gesvdj may not converge. For example, tolerance should not be smaller than machine accuracy.
Appendix G.2 provides a simple example of gesvdj.
Remark 1: gesvdj supports any combination of m and n.
Remark 2: the routine returns V, not ${V}^{H}$ . This is different from gesvd.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
jobz  host  input  specifies options to either compute singular value only or singular vectors as well: jobz = CUSOLVER_EIG_MODE_NOVECTOR : Compute singular values only; jobz = CUSOLVER_EIG_MODE_VECTOR : Compute singular values and singular vectors. 
econ  host  input  econ = 1 for economy size for U and V. 
m  host  input  number of rows of matrix A. 
n  host  input  number of columns of matrix A. 
A  device  in/out  <type> array of dimension lda * n with lda is not less than max(1,m). On exit, the contents of A are destroyed. 
lda  host  input  leading dimension of twodimensional array used to store matrix A. 
S  device  output  real array of dimension min(m,n). The singular values of A, sorted so that S(i) >= S(i+1). 
U  device  output  <type> array of dimension ldu * m if econ is zero. If econ is nonzero, the dimension is ldu * min(m,n). U contains the left singular vectors. 
ldu  host  input  leading dimension of twodimensional array used to store matrix U. ldu is not less than max(1,m). 
V  device  output  <type> array of dimension ldv * n if econ is zero. If econ is nonzero, the dimension is ldv * min(m,n). V contains the right singular vectors. 
ldv  host  input  leading dimension of twodimensional array used to store matrix V. ldv is not less than max(1,n). 
work  device  in/out  <type> array of size lwork, working space. 
lwork  host  input  size of work, returned by gesvdj_bufferSize. 
info  device  output  if info = 0, the operation is successful. if info = i, the ith parameter is wrong (not counting handle). if info = min(m,n)+1, gesvdj dose not converge under given tolerance and maximum sweeps. 
params  host  in/out  structure filled with parameters of Jacobi algorithm and results of gesvdj. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (m,n<0 or lda<max(1,m) or ldu<max(1,m) or ldv<max(1,n) or jobz is not CUSOLVER_EIG_MODE_NOVECTOR or CUSOLVER_EIG_MODE_VECTOR ). 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>gesvdjBatched()
cusolverStatus_t cusolverDnSgesvdjBatched_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int m, int n, const float *A, int lda, const float *S, const float *U, int ldu, const float *V, int ldv, int *lwork, gesvdjInfo_t params, int batchSize); cusolverStatus_t cusolverDnDgesvdjBatched_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int m, int n, const double *A, int lda, const double *S, const double *U, int ldu, const double *V, int ldv, int *lwork, gesvdjInfo_t params, int batchSize); cusolverStatus_t cusolverDnCgesvdjBatched_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int m, int n, const cuComplex *A, int lda, const float *S, const cuComplex *U, int ldu, const cuComplex *V, int ldv, int *lwork, gesvdjInfo_t params, int batchSize); cusolverStatus_t cusolverDnZgesvdjBatched_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int m, int n, const cuDoubleComplex *A, int lda, const double *S, const cuDoubleComplex *U, int ldu, const cuDoubleComplex *V, int ldv, int *lwork, gesvdjInfo_t params, int batchSize);
cusolverStatus_t cusolverDnSgesvdjBatched( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int m, int n, float *A, int lda, float *S, float *U, int ldu, float *V, int ldv, float *work, int lwork, int *info, gesvdjInfo_t params, int batchSize); cusolverStatus_t cusolverDnDgesvdjBatched( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int m, int n, double *A, int lda, double *S, double *U, int ldu, double *V, int ldv, double *work, int lwork, int *info, gesvdjInfo_t params, int batchSize);
cusolverStatus_t cusolverDnCgesvdjBatched( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int m, int n, cuComplex *A, int lda, float *S, cuComplex *U, int ldu, cuComplex *V, int ldv, cuComplex *work, int lwork, int *info, gesvdjInfo_t params, int batchSize); cusolverStatus_t cusolverDnZgesvdjBatched( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int m, int n, cuDoubleComplex *A, int lda, double *S, cuDoubleComplex *U, int ldu, cuDoubleComplex *V, int ldv, cuDoubleComplex *work, int lwork, int *info, gesvdjInfo_t params, int batchSize);
This function computes singular values and singular vectors of a squence of general m×n matrices
${A}_{j}={U}_{j}*{\Sigma}_{j}*{V}_{j}^{H}$ 
where ${\Sigma}_{j}$ is a real m×n diagonal matrix which is zero except for its min(m,n) diagonal elements. ${U}_{j}$ (left singular vectors) is a m×m unitary matrix and ${V}_{j}$ (right singular vectors) is a n×n unitary matrix. The diagonal elements of ${\Sigma}_{j}$ are the singular values of ${A}_{j}$ in either descending order or nonsorting order.
gesvdjBatched performs gesvdj on each matrix. It requires that all matrices are of the same size m,n no greater than 32 and are packed in contiguous way,
$A=\left(\begin{array}{ccc}\mathrm{A0}& \mathrm{A1}& \cdots \end{array}\right)$ 
Each matrix is columnmajor with leading dimension lda, so the formula for random access is ${A}_{k}(i,j)=\mathrm{A[\; i\; +\; lda*j\; +\; lda*n*k]}$ .
The parameter S also contains singular values of each matrix in contiguous way,
$S=\left(\begin{array}{ccc}\mathrm{S0}& \mathrm{S1}& \cdots \end{array}\right)$ 
The formula for random access of S is ${S}_{k}(j)=\mathrm{S[\; j\; +\; min(m,n)*k]}$ .
Except for tolerance and maximum sweeps, gesvdjBatched can either sort the singular values in descending order (default) or chose asis (without sorting) by the function cusolverDnXgesvdjSetSortEig. If the user packs several tiny matrices into diagonal blocks of one matrix, nonsorting option can separate singular values of those tiny matrices.
gesvdjBatched cannot report residual and executed sweeps by function cusolverDnXgesvdjGetResidual and cusolverDnXgesvdjGetSweeps. Any call of the above two returns CUSOLVER_STATUS_NOT_SUPPORTED. The user needs to compute residual explicitly.
The user has to provide working space pointed by input parameter work. The input parameter lwork is the size of the working space, and it is returned by gesvdjBatched_bufferSize().
The output parameter info is an integer array of size batchSize. If the function returns CUSOLVER_STATUS_INVALID_VALUE, the first element info[0] = i (less than zero) indicates ith parameter is wrong (not counting handle). Otherwise, if info[i] = min(m,n)+1, gesvdjBatched does not converge on ith matrix under given tolerance and maximum sweeps.
Appendix G.3 provides a simple example of gesvdjBatched.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
jobz  host  input  specifies options to either compute singular value only or singular vectors as well: jobz = CUSOLVER_EIG_MODE_NOVECTOR : Compute singular values only; jobz = CUSOLVER_EIG_MODE_VECTOR : Compute singular values and singular vectors. 
m  host  input  number of rows of matrix Aj. m is no greater than 32. 
n  host  input  number of columns of matrix Aj. n is no greater than 32. 
A  device  in/out  <type> array of dimension lda * n * batchSize with lda is not less than max(1,n). on Exit: the contents of Aj are destroyed. 
lda  host  input  leading dimension of twodimensional array used to store matrix Aj. 
S  device  output  a real array of dimension min(m,n)*batchSize. It stores the singular values of Aj in descending order or nonsorting order. 
U  device  output  <type> array of dimension ldu * m * batchSize. Uj contains the left singular vectors of Aj. 
ldu  host  input  leading dimension of twodimensional array used to store matrix Uj. ldu is not less than max(1,m). 
V  device  output  <type> array of dimension ldv * n * batchSize. Vj contains the right singular vectors of Aj. 
ldv  host  input  leading dimension of twodimensional array used to store matrix Vj. ldv is not less than max(1,n). 
work  device  in/out  <type> array of size lwork, working space. 
lwork  host  input  size of work, returned by gesvdjBatched_bufferSize. 
info  device  output  an integer array of dimension batchSize. If CUSOLVER_STATUS_INVALID_VALUE is returned, info[0] = i (less than zero) indicates ith parameter is wrong (not counting handle). Otherwise, if info[i] = 0, the operation is successful. if info[i] = min(m,n)+1, gesvdjBatched dose not converge on ith matrix under given tolerance and maximum sweeps. 
params  host  in/out  structure filled with parameters of Jacobi algorithm. 
batchSize  host  input  number of matrices. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (m,n<0 or lda<max(1,m) or ldu<max(1,m) or ldv<max(1,n) or jobz is not CUSOLVER_EIG_MODE_NOVECTOR or CUSOLVER_EIG_MODE_VECTOR , or batchSize<0 ). 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 
cusolverDn<t>gesvdaStridedBatched()
cusolverStatus_t cusolverDnSgesvdaStridedBatched_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int rank, int m, int n, const float *A, int lda, long long int strideA, const float *S, long long int strideS, const float *U, int ldu, long long int strideU, const float *V, int ldv, long long int strideV, int *lwork, int batchSize); cusolverStatus_t cusolverDnDgesvdaStridedBatched_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int rank, int m, int n, const double *A, int lda, long long int strideA, const double *S, long long int strideS, const double *U, int ldu, long long int strideU, const double *V, int ldv, long long int strideV, int *lwork, int batchSize); cusolverStatus_t cusolverDnCgesvdaStridedBatched_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int rank, int m, int n, const cuComplex *A, int lda, long long int strideA, const float *S, long long int strideS, const cuComplex *U, int ldu, long long int strideU, const cuComplex *V, int ldv, long long int strideV, int *lwork, int batchSize); cusolverStatus_t cusolverDnZgesvdaStridedBatched_bufferSize( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int rank, int m, int n, const cuDoubleComplex *A, int lda, long long int strideA, const double *S, long long int strideS, const cuDoubleComplex *U, int ldu, long long int strideU, const cuDoubleComplex *V, int ldv, long long int strideV, int *lwork, int batchSize);
cusolverStatus_t cusolverDnSgesvdaStridedBatched( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int rank, int m, int n, const float *A, int lda, long long int strideA, float *S, long long int strideS, float *U, int ldu, long long int strideU, float *V, int ldv, long long int strideV, float *work, int lwork, int *info, double *h_R_nrmF, int batchSize); cusolverStatus_t cusolverDnDgesvdaStridedBatched( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int rank, int m, int n, const double *A, int lda, long long int strideA, double *S, long long int strideS, double *U, int ldu, long long int strideU, double *V, int ldv, long long int strideV, double *work, int lwork, int *info, double *h_R_nrmF, int batchSize);
cusolverStatus_t cusolverDnCgesvdaStridedBatched( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int rank, int m, int n, const cuComplex *A, int lda, long long int strideA, float *S, long long int strideS, cuComplex *U, int ldu, long long int strideU, cuComplex *V, int ldv, long long int strideV, cuComplex *work, int lwork, int *info, double *h_R_nrmF, int batchSize); cusolverStatus_t cusolverDnZgesvdaStridedBatched( cusolverDnHandle_t handle, cusolverEigMode_t jobz, int rank, int m, int n, const cuDoubleComplex *A, int lda, long long int strideA, double *S, long long int strideS, cuDoubleComplex *U, int ldu, long long int strideU, cuDoubleComplex *V, int ldv, long long int strideV, cuDoubleComplex *work, int lwork, int *info, double *h_R_nrmF, int batchSize);
This function gesvda (a stands for approximate) approximates the singular value decomposition of a tall skinny m×n matrix A and corresponding the left and right singular vectors. The economy form of SVD is written by
$A=U*\Sigma *{V}^{H}$ 
where Σ is an n×n matrix. U is an m×n unitary matrix, and V is an n×n unitary matrix. The diagonal elements of Σ are the singular values of A; they are real and nonnegative, and are returned in descending order. U and V are the left and right singular vectors of A.
gesvda computes eigenvalues of A**T*A to approximate singular values and singular vectors. It generates matrices U and V and transforms the matrix A to the following form
${U}^{H}*A*V=S+E$ 
where S is diagonal and E depends on rounding errors. To certain conditions, U, V and S approximate singular values and singular vectors up to machine zero of single precision. In general, V is unitary, S is more accurate than U. If singular value is far from zero, then left singular vector U is accurate. In other words, the accuracy of singular values and left singular vectors depend on the distance between singular value and zero.
The input parameter rank decides the number of singualr values and singular vectors are computed in parameter S, U and V.
The output parameter h_RnrmF computes Frobenius norm of residual.
$AU*S*{V}^{H}$ 
if the paramter rank is equal n. Otherwise, h_RnrmF reports
$\mathrm{}U*S*{V}^{H}\mathrm{}\mathrm{S}$ 
in Frobenius norm sense. That is, how far U is from unitary.
gesvdaStridedBatched performs gesvda on each matrix. It requires that all matrices are of the same size m,n and are packed in contiguous way,
$A=\left(\begin{array}{ccc}\mathrm{A0}& \mathrm{A1}& \cdots \end{array}\right)$ 
Each matrix is columnmajor with leading dimension lda, so the formula for random access is ${A}_{k}(i,j)=\mathrm{A[\; i\; +\; lda*j\; +\; strideA*k]}$ . Similarly, the formula for random access of S is ${S}_{k}(j)=\mathrm{S[\; j\; +\; StrideS*k]}$ , the formula for random access of U is ${U}_{k}(i,j)=\mathrm{U[\; i\; +\; ldu*j\; +\; strideU*k]}$ and the formula for random access of V is ${V}_{k}(i,j)=\mathrm{V[\; i\; +\; ldv*j\; +\; strideV*k]}$ .
The user has to provide working space which is pointed by input parameter work. The input parameter lwork is the size of the working space, and it is returned by gesvdaStridedBatched_bufferSize().
The output parameter info is an integer array of size batchSize. If the function returns CUSOLVER_STATUS_INVALID_VALUE, the first element info[0] = i (less than zero) indicates ith parameter is wrong (not counting handle). Otherwise, if info[i] = min(m,n)+1, gesvdaStridedBatched does not converge on ith matrix under given tolerance.
Appendix G.4 provides a simple example of gesvda.
Remark 1: the routine returns V, not ${V}^{H}$ . This is different from gesvd.
Remark 2: if the user is confident on the accuracy of singular values and singular vectors, for example, certain conditions hold (required singular value is far from zero), then the performance can be improved by passing null pointer to h_RnrmF, i.e. no computation of residual norm.
parameter  Memory  In/out  Meaning 
handle  host  input  handle to the cuSolverDN library context. 
jobz  host  input  specifies options to either compute singular value only or singular vectors as well: jobz = CUSOLVER_EIG_MODE_NOVECTOR : Compute singular values only; jobz = CUSOLVER_EIG_MODE_VECTOR : Compute singular values and singular vectors. 
rank  host  input  number of singular values (from largest to smallest). 
m  host  input  number of rows of matrix Aj. 
n  host  input  number of columns of matrix Aj. 
A  device  input  <type> array of dimension strideA * batchSize with lda is not less than max(1,m). Aj is of dimension m * n. 
lda  host  input  leading dimension of twodimensional array used to store matrix Aj. 
strideA  host  input  value of type long long int that gives the address offset between A[i] and A[i+1]. strideA is not less than lda*n. 
S  device  output  a real array of dimension strideS*batchSize. It stores the singular values of Aj in descending order. Sj is of dimension rank * 1 
strideS  host  input  value of type long long int that gives the address offset between S[i] and S[i+1]. strideS is not less than rank. 
U  device  output  <type> array of dimension strideU * batchSize. Uj contains the left singular vectors of Aj. Uj is of dimension m * rank. 
ldu  host  input  leading dimension of twodimensional array used to store matrix Uj. ldu is not less than max(1,m). 
strideU  host  input  value of type long long int that gives the address offset between U[i] and U[i+1]. strideU is not less than ldu*rank. 
V  device  output  <type> array of dimension strideV * batchSize. Vj contains the right singular vectors of Aj. Vj is of dimension n * rank. 
ldv  host  input  leading dimension of twodimensional array used to store matrix Vj. ldv is not less than max(1,n). 
strideV  host  input  value of type long long int that gives the address offset between V[i] and V[i+1]. strideV is not less than ldv*rank. 
work  device  in/out  <type> array of size lwork, working space. 
lwork  host  input  size of work, returned by gesvdaStridedBatched_bufferSize. 
info  device  output  an integer array of dimension batchSize. If CUSOLVER_STATUS_INVALID_VALUE is returned, info[0] = i (less than zero) indicates ith parameter is wrong (not counting handle). Otherwise, if info[i] = 0, the operation is successful. if info[i] = min(m,n)+1, gesvdaStridedBatched dose not converge on ith matrix. 
h_RnrmF  host  output  <double> array of size batchSize. h_RnrmF[i] is norm of residual of ith matrix. 
batchSize  host  input  number of matrices. batchSize is not less than 1. 
CUSOLVER_STATUS_SUCCESS  the operation completed successfully. 
CUSOLVER_STATUS_NOT_INITIALIZED  the library was not initialized. 
CUSOLVER_STATUS_INVALID_VALUE  invalid parameters were passed (m,n<0 or lda<max(1,m) or ldu<max(1,m) or ldv<max(1,n) or strideA<lda*n or strideS<rank or strideU<ldu*rank or strideV<ldv*rank or batchSize<1 or jobz is not CUSOLVER_EIG_MODE_NOVECTOR or CUSOLVER_EIG_MODE_VECTOR ). 
CUSOLVER_STATUS_INTERNAL_ERROR  an internal operation failed. 