## 1. Introduction

The cuSolver library is a high-level package based on the cuBLAS and cuSPARSE libraries. It combines three separate libraries under a single umbrella, each of which can be used independently or in concert with other toolkit libraries.

The intent of cuSolver is to provide useful LAPACK-like features, such as common matrix factorization and triangular solve routines for dense matrices, a sparse least-squares solver and an eigenvalue solver. In addition cuSolver provides a new refactorization library useful for solving sequences of matrices with a shared sparsity pattern.

The first part of cuSolver is called cuSolverDN, and deals with dense matrix factorization and solve routines such as LU, QR, SVD and LDLT, as well as useful utilities such as matrix and vector permutations.

Next, cuSolverSP provides a new set of sparse routines based on a sparse QR factorization. Not all matrices have a good sparsity pattern for parallelism in factorization, so the cuSolverSP library also provides a CPU path to handle those sequential-like matrices. For those matrices with abundant parallelism, the GPU path will deliver higher performance. The library is designed to be called from C and C++.

The final part is cuSolverRF, a sparse re-factorization package that can provide very good performance when solving a sequence of matrices where only the coefficients are changed but the sparsity pattern remains the same.

The GPU path of the cuSolver library assumes data is already in the device memory. It is the responsibility of the developer to allocate memory and to copy data between GPU memory and CPU memory using standard CUDA runtime API routines, such as cudaMalloc(), cudaFree(), cudaMemcpy(), and cudaMemcpyAsync().

Note: The cuSolver library requires hardware with a CUDA compute capability (CC) of 2.0 or higher. Please see the NVIDIA CUDA C Programming Guide, Appendix A for a list of the compute capabilities corresponding to all NVIDIA GPUs.

### 1.1. cuSolverDN: Dense LAPACK

The cuSolverDN library was designed to solve dense linear systems of the form

 $Ax=b$

where the coefficient matrix $A\in \mathbb{R}^{n\times n}$, the right-hand-side vector $b\in \mathbb{R}^{n}$ and the solution vector $x\in \mathbb{R}^{n}$.

The cuSolverDN library provides QR factorization and LU with partial pivoting to handle a general matrix A, which may be non-symmetric. Cholesky factorization is also provided for symmetric/Hermitian matrices. For symmetric indefinite matrices, we provide Bunch-Kaufman (LDL) factorization.

The cuSolverDN library also provides a helpful bidiagonalization routine and singular value decomposition (SVD).

The cuSolverDN library targets computationally-intensive and popular routines in LAPACK, and provides an API compatible with LAPACK. The user can accelerate these time-consuming routines with cuSolverDN and keep others in LAPACK without a major change to existing code.

### 1.2. cuSolverSP: Sparse LAPACK

The cuSolverSP library was mainly designed to solve a sparse linear system

 $Ax=b$

and the least-squares problem

 $x=\mathrm{argmin}_{z}\,\|Az-b\|$

where the sparse matrix $A\in \mathbb{R}^{m\times n}$, the right-hand-side vector $b\in \mathbb{R}^{m}$ and the solution vector $x\in \mathbb{R}^{n}$. For a linear system, we require m=n.

The core algorithm is based on sparse QR factorization. The matrix A is accepted in CSR format. If matrix A is symmetric/Hermitian, the user has to provide the full matrix, i.e., fill in the missing lower or upper part.
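
As a reminder of why one QR factorization serves both problems, consider the textbook setting in which $A$ has full column rank and $A=QR$ with ${Q}^{T}Q=I$ and $R$ upper triangular. This is an illustrative assumption; it ignores any reordering the library may apply internally. Splitting $b$ into its components inside and outside the range of $Q$ gives

 $\|Az-b\|^{2}=\|Rz-{Q}^{T}b\|^{2}+\|(I-Q{Q}^{T})b\|^{2}$

The second term does not depend on $z$, so the minimizer solves the triangular system

 $Rx={Q}^{T}b$

When m=n and $A$ is nonsingular, the second term vanishes and the same triangular solve yields the solution of $Ax=b$.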

If matrix A is symmetric positive definite and the user only needs to solve $Ax=b$ , Cholesky factorization can work and the user only needs to provide the lower triangular part of A.

On top of the linear and least-squares solvers, the cuSolverSP library provides a simple eigenvalue solver based on the shift-inverse power method, and a function to count the number of eigenvalues contained in a box in the complex plane.

### 1.3. cuSolverRF: Refactorization

The cuSolverRF library was designed to accelerate solution of sets of linear systems by fast re-factorization when given new coefficients in the same sparsity pattern

 ${A}_{i}{x}_{i}={f}_{i}$

where a sequence of coefficient matrices ${A}_{i}\in \mathbb{R}^{n\times n}$, right-hand-sides ${f}_{i}\in \mathbb{R}^{n}$ and solutions ${x}_{i}\in \mathbb{R}^{n}$ are given for i=1,...,k.

The cuSolverRF library is applicable when the sparsity pattern of the coefficient matrices ${A}_{i}$, the reordering used to minimize fill-in, and the pivoting used during the LU factorization all remain the same across these linear systems. In that case, the first linear system (i=1) requires a full LU factorization, while the subsequent linear systems (i=2,...,k) require only the LU re-factorization. The latter can be performed using the cuSolverRF library.

Notice that because the sparsity pattern of the coefficient matrices, the reordering and pivoting remain the same, the sparsity pattern of the resulting triangular factors ${L}_{i}$ and ${U}_{i}$ also remains the same. Therefore, the real difference between the full LU factorization and LU re-factorization is that the required memory is known ahead of time.

### 1.4. Naming Conventions

The cuSolverDN library functions are available for data types float, double, cuComplex, and cuDoubleComplex. The naming convention is as follows:

 cusolverDn<t><operation>

where <t> can be S, D, C, Z, or X, corresponding to the data types float, double, cuComplex, cuDoubleComplex, and the generic type, respectively. <operation> can be Cholesky factorization (potrf), LU with partial pivoting (getrf), QR factorization (geqrf) and Bunch-Kaufman factorization (sytrf).

The cuSolverSP library functions are available for data types float, double, cuComplex, and cuDoubleComplex. The naming convention is as follows:

 cusolverSp[Host]<t>[<matrix data format>]<operation>[<output matrix data format>]<based on>

where cusolverSp is the GPU path and cusolverSpHost is the corresponding CPU path. <t> can be S, D, C, Z, or X, corresponding to the data types float, double, cuComplex, cuDoubleComplex, and the generic type, respectively.

The <matrix data format> is csr, compressed sparse row format.

The <operation> can be ls, lsq, eig, eigs, corresponding to linear solver, least-square solver, eigenvalue solver and number of eigenvalues in a box, respectively.

The <output matrix data format> can be v or m, corresponding to a vector or a matrix.

<based on> describes which algorithm is used. For example, qr (sparse QR factorization) is used in linear solver and least-square solver.

All of the functions have the return type cusolverStatus_t and are explained in more detail in the chapters that follow.

cuSolverSP API
| routine | data format | operation | output format | based on |
| --- | --- | --- | --- | --- |
| csrlsvlu | csr | linear solver (ls) | vector (v) | LU (lu) with partial pivoting |
| csrlsvqr | csr | linear solver (ls) | vector (v) | QR factorization (qr) |
| csrlsvchol | csr | linear solver (ls) | vector (v) | Cholesky factorization (chol) |
| csrlsqvqr | csr | least-square solver (lsq) | vector (v) | QR factorization (qr) |
| csreigvsi | csr | eigenvalue solver (eig) | vector (v) | shift-inverse |
| csreigs | csr | number of eigenvalues in a box (eigs) | | |
| csrsymrcm | csr | Symmetric Reverse Cuthill-McKee (symrcm) | | |

The cuSolverRF library routines are available for data type double. Most of the routines follow the naming convention:

 cusolverRf_<operation>_[[Host]](...)

where the trailing optional Host qualifier indicates the data is accessed on the host versus on the device, which is the default. The <operation> can be Setup, Analyze, Refactor, Solve, ResetValues, AccessBundledFactors and ExtractSplitFactors.

Finally, the return type of the cuSolverRF library routines is cusolverStatus_t.

### 1.5. Asynchronous Execution

The cuSolver library functions execute asynchronously whenever possible. Developers can always call cudaDeviceSynchronize() to ensure that the execution of a particular cuSolver library routine has completed.

A developer can also use the cudaMemcpy() routine to copy data from the device to the host and vice versa, using the cudaMemcpyDeviceToHost and cudaMemcpyHostToDevice parameters, respectively. In this case there is no need to add a call to cudaDeviceSynchronize() because the call to cudaMemcpy() with the above parameters is blocking and completes only when the results are ready on the host.

### 1.6. Library Property

The libraryPropertyType data type is an enumeration of library property types. For example, CUDA version X.Y.Z yields MAJOR_VERSION=X, MINOR_VERSION=Y, PATCH_LEVEL=Z.

```
typedef enum libraryPropertyType_t
{
    MAJOR_VERSION,
    MINOR_VERSION,
    PATCH_LEVEL
} libraryPropertyType;
```

The following code prints the version of the cuSolver library.

```
/* Each call fills one component of the library version. */
int major = -1, minor = -1, patch = -1;
cusolverGetProperty(MAJOR_VERSION, &major);
cusolverGetProperty(MINOR_VERSION, &minor);
cusolverGetProperty(PATCH_LEVEL, &patch);
printf("CUSOLVER Version (Major,Minor,PatchLevel): %d.%d.%d\n", major, minor, patch);
```

## 2. Using the cuSolver API

This chapter describes how to use the cuSolver library API. It is not a reference for the cuSolver API data types and functions; that is provided in subsequent chapters.

### 2.1. Thread Safety

The library is thread-safe, and its functions can be called from multiple host threads.

### 2.2. Scalar Parameters

In the cuSolver API, the scalar parameters can be passed by reference on the host.

### 2.3. Parallelism with Streams

If the application performs several small independent computations, or if it makes data transfers in parallel with the computation, then CUDA streams can be used to overlap these tasks.

The application can conceptually associate a stream with each task. To achieve the overlap of computation between the tasks, the developer should:
1. Create CUDA streams using the function cudaStreamCreate(), and
2. Set the stream to be used by each individual cuSolver library routine by calling, for example, cusolverDnSetStream(), just prior to calling the actual cuSolverDN routine.

The computations performed in separate streams would then be overlapped automatically on the GPU, when possible. This approach is especially useful when the computation performed by a single task is relatively small, and is not enough to fill the GPU with work, or when there is a data transfer that can be performed in parallel with the computation.

## 3. cuSolver Types Reference

### 3.1. cuSolverDN Types

The float, double, cuComplex, and cuDoubleComplex data types are supported. The first two are standard C data types, while the last two are exported from cuComplex.h. In addition, cuSolverDN uses some familiar types from cuBLAS.

### 3.1.1. cusolverDnHandle_t

This is a pointer type to an opaque cuSolverDN context, which the user must initialize by calling cusolverDnCreate() prior to calling any other library function. An uninitialized handle will lead to unexpected behavior, including crashes of cuSolverDN. The handle created and returned by cusolverDnCreate() must be passed to every cuSolverDN function.

### 3.1.2. cublasFillMode_t

The type indicates which part (lower or upper) of the dense matrix was filled and consequently should be used by the function. Its values correspond to Fortran characters ‘L’ or ‘l’ (lower) and ‘U’ or ‘u’ (upper) that are often used as parameters to legacy BLAS implementations.

| Value | Meaning |
| --- | --- |
| CUBLAS_FILL_MODE_LOWER | the lower part of the matrix is filled |
| CUBLAS_FILL_MODE_UPPER | the upper part of the matrix is filled |

### 3.1.3. cublasOperation_t

The cublasOperation_t type indicates which operation needs to be performed with the dense matrix. Its values correspond to Fortran characters ‘N’ or ‘n’ (non-transpose), ‘T’ or ‘t’ (transpose) and ‘C’ or ‘c’ (conjugate transpose) that are often used as parameters to legacy BLAS implementations.

| Value | Meaning |
| --- | --- |
| CUBLAS_OP_N | the non-transpose operation is selected |
| CUBLAS_OP_T | the transpose operation is selected |
| CUBLAS_OP_C | the conjugate transpose operation is selected |

### 3.1.4. cusolverEigType_t

The cusolverEigType_t type indicates which type of generalized eigenvalue problem is solved. Its values correspond to the Fortran integers 1 (A*x = lambda*B*x), 2 (A*B*x = lambda*x) and 3 (B*A*x = lambda*x), used as parameters to legacy LAPACK implementations.

| Value | Meaning |
| --- | --- |
| CUSOLVER_EIG_TYPE_1 | A*x = lambda*B*x |
| CUSOLVER_EIG_TYPE_2 | A*B*x = lambda*x |
| CUSOLVER_EIG_TYPE_3 | B*A*x = lambda*x |

### 3.1.5. cusolverEigMode_t

The cusolverEigMode_t type indicates whether or not eigenvectors are computed. Its values correspond to Fortran character 'N' (only eigenvalues are computed), 'V' (both eigenvalues and eigenvectors are computed) used as parameters to legacy LAPACK implementations.

| Value | Meaning |
| --- | --- |
| CUSOLVER_EIG_MODE_NOVECTOR | only eigenvalues are computed |
| CUSOLVER_EIG_MODE_VECTOR | both eigenvalues and eigenvectors are computed |

### 3.1.6. cusolverStatus_t

This is the same as cusolverStatus_t in the sparse LAPACK section.

### 3.2. cuSolverSP Types

The float, double, cuComplex, and cuDoubleComplex data types are supported. The first two are standard C data types, while the last two are exported from cuComplex.h.

### 3.2.1. cusolverSpHandle_t

This is a pointer type to an opaque cuSolverSP context, which the user must initialize by calling cusolverSpCreate() prior to calling any other library function. An uninitialized handle will lead to unexpected behavior, including crashes of cuSolverSP. The handle created and returned by cusolverSpCreate() must be passed to every cuSolverSP function.

### 3.2.2. cusparseMatDescr_t

We have chosen to keep the same structure as exists in cuSPARSE to describe the shape and properties of a matrix. This enables calls to either cuSPARSE or cuSolver using the same matrix description.

```
typedef struct {
    cusparseMatrixType_t MatrixType;
    cusparseFillMode_t FillMode;
    cusparseDiagType_t DiagType;
    cusparseIndexBase_t IndexBase;
} cusparseMatDescr_t;
```

Please read the documentation of the cuSPARSE library to understand each field of cusparseMatDescr_t.

### 3.2.3. cusolverStatus_t

This is a status type returned by the library functions and it can have the following values.

| Value | Meaning |
| --- | --- |
| CUSOLVER_STATUS_SUCCESS | The operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | The cuSolver library was not initialized. This is usually caused by the lack of a prior call, an error in the CUDA Runtime API called by the cuSolver routine, or an error in the hardware setup. To correct: call the matching create routine (cusolverDnCreate(), cusolverSpCreate() or cusolverRfCreate()) prior to the function call; and check that the hardware, an appropriate version of the driver, and the cuSolver library are correctly installed. |
| CUSOLVER_STATUS_ALLOC_FAILED | Resource allocation failed inside the cuSolver library. This is usually caused by a cudaMalloc() failure. To correct: prior to the function call, deallocate previously allocated memory as much as possible. |
| CUSOLVER_STATUS_INVALID_VALUE | An unsupported value or parameter was passed to the function (a negative vector size, for example). To correct: ensure that all the parameters being passed have valid values. |
| CUSOLVER_STATUS_ARCH_MISMATCH | The function requires a feature absent from the device architecture; usually caused by the lack of support for atomic operations or double precision. To correct: compile and run the application on a device with compute capability 2.0 or above. |
| CUSOLVER_STATUS_EXECUTION_FAILED | The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can have multiple causes. To correct: check that the hardware, an appropriate version of the driver, and the cuSolver library are correctly installed. |
| CUSOLVER_STATUS_INTERNAL_ERROR | An internal cuSolver operation failed. This error is usually caused by a cudaMemcpyAsync() failure. To correct: check that the hardware, an appropriate version of the driver, and the cuSolver library are correctly installed. Also, check that the memory passed as a parameter to the routine is not being deallocated prior to the routine's completion. |
| CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | The matrix type is not supported by this function. This is usually caused by passing an invalid matrix descriptor to the function. To correct: check that the fields in descrA were set correctly. |

### 3.3. cuSolverRF Types

cuSolverRF only supports double.

### 3.3.1. cusolverRfHandle_t

The cusolverRfHandle_t is a pointer to an opaque data structure that contains the cuSolverRF library handle. The user must initialize the handle by calling cusolverRfCreate() prior to any other cuSolverRF library calls. The handle is passed to all other cuSolverRF library calls.

### 3.3.2. cusolverRfMatrixFormat_t

The cusolverRfMatrixFormat_t is an enum that indicates the input/output matrix format assumed by the cusolverRfSetupDevice(), cusolverRfSetupHost(), cusolverRfResetValues(), cusolverRfExtractBundledFactorsHost() and cusolverRfExtractSplitFactorsHost() routines.

| Value | Meaning |
| --- | --- |
| CUSOLVER_MATRIX_FORMAT_CSR | matrix format CSR is assumed. (default) |
| CUSOLVER_MATRIX_FORMAT_CSC | matrix format CSC is assumed. |

### 3.3.3. cusolverRfNumericBoostReport_t

The cusolverRfNumericBoostReport_t is an enum that indicates whether numeric boosting (of the pivot) was used during the cusolverRfRefactor() and cusolverRfSolve() routines. The numeric boosting is disabled by default.

| Value | Meaning |
| --- | --- |
| CUSOLVER_NUMERIC_BOOST_NOT_USED | numeric boosting not used. (default) |
| CUSOLVER_NUMERIC_BOOST_USED | numeric boosting used. |

### 3.3.4. cusolverRfResetValuesFastMode_t

The cusolverRfResetValuesFastMode_t is an enum that indicates the mode used for the cusolverRfResetValues() routine. The fast mode requires extra memory and is recommended only if very fast calls to cusolverRfResetValues() are needed.

| Value | Meaning |
| --- | --- |
| CUSOLVER_RESET_VALUES_FAST_MODE_OFF | fast mode disabled. (default) |
| CUSOLVER_RESET_VALUES_FAST_MODE_ON | fast mode enabled. |

### 3.3.5. cusolverRfFactorization_t

The cusolverRfFactorization_t is an enum that indicates which (internal) algorithm is used for refactorization in the cusolverRfRefactor() routine.

| Value | Meaning |
| --- | --- |
| CUSOLVER_FACTORIZATION_ALG0 | algorithm 0. (default) |
| CUSOLVER_FACTORIZATION_ALG1 | algorithm 1. |
| CUSOLVER_FACTORIZATION_ALG2 | algorithm 2. Domino-based scheme. |

### 3.3.6. cusolverRfTriangularSolve_t

The cusolverRfTriangularSolve_t is an enum that indicates which (internal) algorithm is used for triangular solve in the cusolverRfSolve() routine.

| Value | Meaning |
| --- | --- |
| CUSOLVER_TRIANGULAR_SOLVE_ALG0 | algorithm 0. |
| CUSOLVER_TRIANGULAR_SOLVE_ALG1 | algorithm 1. (default) |
| CUSOLVER_TRIANGULAR_SOLVE_ALG2 | algorithm 2. Domino-based scheme. |
| CUSOLVER_TRIANGULAR_SOLVE_ALG3 | algorithm 3. Domino-based scheme. |

### 3.3.7. cusolverRfUnitDiagonal_t

The cusolverRfUnitDiagonal_t is an enum that indicates whether and where the unit diagonal is stored in the input/output triangular factors in the cusolverRfSetupDevice(), cusolverRfSetupHost() and cusolverRfExtractSplitFactorsHost() routines.

| Value | Meaning |
| --- | --- |
| CUSOLVER_UNIT_DIAGONAL_STORED_L | unit diagonal is stored in lower triangular factor. (default) |
| CUSOLVER_UNIT_DIAGONAL_STORED_U | unit diagonal is stored in upper triangular factor. |
| CUSOLVER_UNIT_DIAGONAL_ASSUMED_L | unit diagonal is assumed in lower triangular factor. |
| CUSOLVER_UNIT_DIAGONAL_ASSUMED_U | unit diagonal is assumed in upper triangular factor. |

### 3.3.8. cusolverStatus_t

The cusolverStatus_t is an enum that indicates success or failure of the cuSolverRF library call. It is returned by all the cuSolver library routines, and it uses the same enumerated values as the sparse and dense LAPACK routines.

## 4. cuSolver Formats Reference

### 4.1. Index Base Format

The CSR or CSC format requires either zero-based or one-based index for a sparse matrix A. The GLU library supports only zero-based indexing. Otherwise, both one-based and zero-based indexing are supported in cuSolver.
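
Switching between the two index bases only shifts every stored index by one. The following C sketch rebases CSR index arrays in place; `csr_rebase` is a hypothetical helper written for illustration and is not part of cuSolver. It assumes the usual convention that a one-based row pointer array starts at 1 and ends at nnz+1.

```c
#include <assert.h>

/* Hypothetical helper (not a cuSolver routine): convert the index arrays of
 * an n-row CSR matrix from old_base to new_base (each either 0 or 1).
 * The value array is unaffected by the index base and is not needed here. */
void csr_rebase(int n, int *csrRowPtr, int *csrColInd,
                int old_base, int new_base)
{
    assert(old_base == 0 || old_base == 1);
    assert(new_base == 0 || new_base == 1);

    int shift = new_base - old_base;
    /* Read nnz before the row pointers are modified. */
    int nnz = csrRowPtr[n] - old_base;

    for (int k = 0; k < nnz; ++k) {
        csrColInd[k] += shift;
    }
    for (int row = 0; row <= n; ++row) {
        csrRowPtr[row] += shift;
    }
}
```

The same routine applies unchanged to CSC arrays by passing cscColPtr and cscRowInd instead.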

### 4.2. Vector (Dense) Format

The vectors are assumed to be stored linearly in memory. For example, the vector

 $x=\left(\begin{array}{c}{x}_{1}\\ {x}_{2}\\ \vdots \\ {x}_{n}\end{array}\right)$

is represented as

 $\left(\begin{array}{cccc}{x}_{1}& {x}_{2}& \dots & {x}_{n}\end{array}\right)$

### 4.3. Matrix (Dense) Format

The dense matrices are assumed to be stored in column-major order in memory. A sub-matrix can be accessed using the leading dimension of the original matrix. For example, the m×n (sub-)matrix

 $\left(\begin{array}{ccc}{a}_{1,1}& \dots & {a}_{1,n}\\ {a}_{2,1}& \dots & {a}_{2,n}\\ \vdots & \ddots & \vdots \\ {a}_{m,1}& \dots & {a}_{m,n}\end{array}\right)$

is represented as

 $\left(\begin{array}{ccc}{a}_{1,1}& \dots & {a}_{1,n}\\ {a}_{2,1}& \dots & {a}_{2,n}\\ \vdots & \ddots & \vdots \\ {a}_{m,1}& \dots & {a}_{m,n}\\ \vdots & \ddots & \vdots \\ {a}_{\mathrm{lda},1}& \dots & {a}_{\mathrm{lda},n}\end{array}\right)$

with its elements arranged linearly in memory as

 $\left(\begin{array}{ccccccccccccc}{a}_{1,1}& {a}_{2,1}& \dots & {a}_{m,1}& \dots & {a}_{\mathrm{lda},1}& \dots & {a}_{1,n}& {a}_{2,n}& \dots & {a}_{m,n}& \dots & {a}_{\mathrm{lda},n}\end{array}\right)$

where lda ≥ m is the leading dimension of A.
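
The layout above amounts to a single index formula. As a minimal C sketch, the hypothetical helper `colmajor_offset` (an illustrative name, not a cuSolver routine) maps the one-based element $a_{i,j}$ to its zero-based position in the linear array:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a cuSolver routine): zero-based linear offset of
 * the one-based element a(i,j) in a column-major array whose leading
 * dimension is lda. Consecutive columns start lda elements apart. */
size_t colmajor_offset(int i, int j, int lda)
{
    assert(i >= 1 && j >= 1 && i <= lda);
    return (size_t)(j - 1) * (size_t)lda + (size_t)(i - 1);
}
```

For instance, with lda = 5 the element $a_{2,3}$ sits at offset (3-1)*5 + (2-1) = 11, regardless of how many of the 5 rows the sub-matrix actually uses.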

### 4.4. Matrix (CSR) Format

In CSR format the matrix is represented by the following parameters

| parameter | type | size | Meaning |
| --- | --- | --- | --- |
| n | (int) | | the number of rows (and columns) in the matrix. |
| nnz | (int) | | the number of non-zero elements in the matrix. |
| csrRowPtr | (int *) | n+1 | the array of offsets corresponding to the start of each row in the arrays csrColInd and csrVal. This array also has an extra entry at the end that stores the number of non-zero elements in the matrix. |
| csrColInd | (int *) | nnz | the array of column indices corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. |
| csrVal | (S\|D\|C\|Z)* | nnz | the array of values corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by row and by column within each row. |

Note that in our CSR format sparse matrices are assumed to be stored in row-major order, in other words, the index arrays are first sorted by row indices and then within each row by column indices. Also it is assumed that each pair of row and column indices appears only once.

For example, the 4x4 matrix

 $A=\left(\begin{array}{cccc}\mathrm{1.0}& \mathrm{3.0}& \mathrm{0.0}& \mathrm{0.0}\\ \mathrm{0.0}& \mathrm{4.0}& \mathrm{6.0}& \mathrm{0.0}\\ \mathrm{2.0}& \mathrm{5.0}& \mathrm{7.0}& \mathrm{8.0}\\ \mathrm{0.0}& \mathrm{0.0}& \mathrm{0.0}& \mathrm{9.0}\end{array}\right)$

is represented as

 $\mathrm{csrRowPtr}=\left(\begin{array}{ccccc}0& 2& 4& 8& 9\end{array}\right)$
 $\mathrm{csrColInd}=\left(\begin{array}{ccccccccc}0& 1& 1& 2& 0& 1& 2& 3& 3\end{array}\right)$
 $\mathrm{csrVal}=\left(\begin{array}{ccccccccc}1.0& 3.0& 4.0& 6.0& 2.0& 5.0& 7.0& 8.0& 9.0\end{array}\right)$
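
To make the three arrays concrete, the following C sketch builds zero-based CSR arrays from a dense column-major matrix. `dense_to_csr` is a hypothetical helper written for illustration, not a cuSolver routine; it assumes the caller has allocated csrRowPtr with n+1 entries and csrColInd and csrVal with room for every non-zero.

```c
#include <assert.h>

/* Hypothetical helper (not a cuSolver routine): convert the dense n-by-n
 * matrix A (column-major, leading dimension lda) to zero-based CSR.
 * Returns the number of non-zero elements. */
int dense_to_csr(int n, const double *A, int lda,
                 int *csrRowPtr, int *csrColInd, double *csrVal)
{
    assert(n >= 0 && lda >= n);

    int nnz = 0;
    for (int row = 0; row < n; ++row) {
        csrRowPtr[row] = nnz;                /* offset of the first entry of this row */
        for (int col = 0; col < n; ++col) {  /* ascending columns keep each row sorted */
            double value = A[col * lda + row];
            if (value != 0.0) {
                csrColInd[nnz] = col;
                csrVal[nnz] = value;
                ++nnz;
            }
        }
    }
    csrRowPtr[n] = nnz;                      /* extra entry: total number of non-zeros */
    return nnz;
}
```

Applied to the 4x4 matrix above, stored column by column, the routine produces the csrRowPtr, csrColInd and csrVal arrays listed in this section.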

### 4.5. Matrix (CSC) Format

In CSC format the matrix is represented by the following parameters

| parameter | type | size | Meaning |
| --- | --- | --- | --- |
| n | (int) | | the number of rows (and columns) in the matrix. |
| nnz | (int) | | the number of non-zero elements in the matrix. |
| cscColPtr | (int *) | n+1 | the array of offsets corresponding to the start of each column in the arrays cscRowInd and cscVal. This array also has an extra entry at the end that stores the number of non-zero elements in the matrix. |
| cscRowInd | (int *) | nnz | the array of row indices corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by column and by row within each column. |
| cscVal | (S\|D\|C\|Z)* | nnz | the array of values corresponding to the non-zero elements in the matrix. It is assumed that this array is sorted by column and by row within each column. |

Note that in our CSC format sparse matrices are assumed to be stored in column-major order, in other words, the index arrays are first sorted by column indices and then within each column by row indices. Also it is assumed that each pair of row and column indices appears only once.

For example, the 4x4 matrix

 $A=\left(\begin{array}{cccc}\mathrm{1.0}& \mathrm{3.0}& \mathrm{0.0}& \mathrm{0.0}\\ \mathrm{0.0}& \mathrm{4.0}& \mathrm{6.0}& \mathrm{0.0}\\ \mathrm{2.0}& \mathrm{5.0}& \mathrm{7.0}& \mathrm{8.0}\\ \mathrm{0.0}& \mathrm{0.0}& \mathrm{0.0}& \mathrm{9.0}\end{array}\right)$

is represented as

 $\mathrm{cscColPtr}=\left(\begin{array}{ccccc}0& 2& 5& 7& 9\end{array}\right)$
 $\mathrm{cscRowInd}=\left(\begin{array}{ccccccccc}0& 2& 0& 1& 2& 1& 2& 2& 3\end{array}\right)$
 $\mathrm{cscVal}=\left(\begin{array}{ccccccccc}1.0& 2.0& 3.0& 4.0& 5.0& 6.0& 7.0& 8.0& 9.0\end{array}\right)$
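
CSC is the CSR representation of the transposed matrix, so one format can be obtained from the other with a counting pass followed by a scatter pass. The C sketch below shows one way to do this for zero-based indices; `csr_to_csc` is a hypothetical helper for illustration, not a cuSolver routine, and the caller is assumed to have allocated the three output arrays.

```c
#include <assert.h>

/* Hypothetical helper (not a cuSolver routine): convert a zero-based CSR
 * matrix with n rows and columns to zero-based CSC. */
void csr_to_csc(int n, int nnz,
                const int *csrRowPtr, const int *csrColInd, const double *csrVal,
                int *cscColPtr, int *cscRowInd, double *cscVal)
{
    assert(n >= 0 && nnz == csrRowPtr[n]);

    /* Count the entries of each column into cscColPtr[col + 1]. */
    for (int col = 0; col <= n; ++col) cscColPtr[col] = 0;
    for (int k = 0; k < nnz; ++k) cscColPtr[csrColInd[k] + 1]++;

    /* A prefix sum turns the counts into column start offsets. */
    for (int col = 0; col < n; ++col) cscColPtr[col + 1] += cscColPtr[col];

    /* Scatter: rows are visited in ascending order, so the entries of
     * each column end up sorted by row index. */
    for (int row = 0; row < n; ++row) {
        for (int k = csrRowPtr[row]; k < csrRowPtr[row + 1]; ++k) {
            int dst = cscColPtr[csrColInd[k]]++;
            cscRowInd[dst] = row;
            cscVal[dst] = csrVal[k];
        }
    }

    /* Each cscColPtr[col] now points at the start of column col + 1;
     * shift the array back by one position to restore the offsets. */
    for (int col = n; col > 0; --col) cscColPtr[col] = cscColPtr[col - 1];
    cscColPtr[0] = 0;
}
```

Feeding it the CSR arrays of the 4x4 example matrix reproduces the cscColPtr, cscRowInd and cscVal arrays listed in this section.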

## 5. cuSolverDN: Dense LAPACK Function Reference

This chapter describes the API of cuSolverDN, which provides a subset of dense LAPACK functions.

### 5.1. cuSolverDN Helper Function Reference

The cuSolverDN helper functions are described in this section.

### 5.1.1. cusolverDnCreate()

```
cusolverStatus_t
cusolverDnCreate(cusolverDnHandle_t *handle);
```

This function initializes the cuSolverDN library and creates a handle on the cuSolverDN context. It must be called before any other cuSolverDN API function is invoked. It allocates hardware resources necessary for accessing the GPU.

| parameter | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | host | output | the pointer to the handle to the cuSolverDN context. |

Status Returned

| Status | Meaning |
| --- | --- |
| CUSOLVER_STATUS_SUCCESS | the initialization succeeded. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the CUDA Runtime initialization failed. |
| CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
| CUSOLVER_STATUS_ARCH_MISMATCH | the library only supports devices with compute capability 2.0 and above. |

### 5.1.2. cusolverDnDestroy()

```
cusolverStatus_t
cusolverDnDestroy(cusolverDnHandle_t handle);
```

This function releases CPU-side resources used by the cuSolverDN library.

| parameter | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | host | input | handle to the cuSolverDN library context. |

Status Returned

| Status | Meaning |
| --- | --- |
| CUSOLVER_STATUS_SUCCESS | the shutdown succeeded. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |

### 5.1.3. cusolverDnSetStream()

```
cusolverStatus_t
cusolverDnSetStream(cusolverDnHandle_t handle, cudaStream_t streamId);
```

This function sets the stream to be used by the cuSolverDN library to execute its routines.

| parameter | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | host | input | handle to the cuSolverDN library context. |
| streamId | host | input | the stream to be used by the library. |

Status Returned

| Status | Meaning |
| --- | --- |
| CUSOLVER_STATUS_SUCCESS | the stream was set successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |

### 5.1.4. cusolverDnGetStream()

```
cusolverStatus_t
cusolverDnGetStream(cusolverDnHandle_t handle, cudaStream_t *streamId);
```

This function gets the stream used by the cuSolverDN library to execute its routines.

| parameter | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | host | input | handle to the cuSolverDN library context. |
| streamId | host | output | the stream used by the library. |

Status Returned

| Status | Meaning |
| --- | --- |
| CUSOLVER_STATUS_SUCCESS | the stream was returned successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |

### 5.1.5. cusolverDnCreateSyevjInfo()

```
cusolverStatus_t
cusolverDnCreateSyevjInfo(
    syevjInfo_t *info);
```

This function creates and initializes the structure of syevj, syevjBatched and sygvj to default values.

| parameter | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| info | host | output | the pointer to the structure of syevj. |

Status Returned

| Status | Meaning |
| --- | --- |
| CUSOLVER_STATUS_SUCCESS | the structure was initialized successfully. |
| CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |

### 5.1.6. cusolverDnDestroySyevjInfo()

```
cusolverStatus_t
cusolverDnDestroySyevjInfo(
    syevjInfo_t info);
```

This function destroys and releases any memory required by the structure.

| parameter | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| info | host | input | the structure of syevj. |

Status Returned

| Status | Meaning |
| --- | --- |
| CUSOLVER_STATUS_SUCCESS | the resources are released successfully. |

### 5.1.7. cusolverDnXsyevjSetTolerance()

```
cusolverStatus_t
cusolverDnXsyevjSetTolerance(
    syevjInfo_t info,
    double tolerance);
```

This function configures the tolerance of syevj.

| parameter | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| info | host | in/out | the pointer to the structure of syevj. |
| tolerance | host | input | accuracy of numerical eigenvalues. |

Status Returned

| Status | Meaning |
| --- | --- |
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |

### 5.1.8. cusolverDnXsyevjSetMaxSweeps()

```
cusolverStatus_t
cusolverDnXsyevjSetMaxSweeps(
    syevjInfo_t info,
    int max_sweeps);
```

This function configures the maximum number of sweeps in syevj. The default value is 100.

| parameter | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| info | host | in/out | the pointer to the structure of syevj. |
| max_sweeps | host | input | maximum number of sweeps. |

Status Returned

| Status | Meaning |
| --- | --- |
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |

### 5.1.9. cusolverDnXsyevjSetSortEig()

```
cusolverStatus_t
cusolverDnXsyevjSetSortEig(
    syevjInfo_t info,
    int sort_eig);
```

If sort_eig is zero, the eigenvalues are not sorted. This function only works for syevjBatched; syevj and sygvj always sort eigenvalues in ascending order. By default, eigenvalues are sorted in ascending order.

| parameter | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| info | host | in/out | the pointer to the structure of syevj. |
| sort_eig | host | input | if sort_eig is zero, the eigenvalues are not sorted. |

Status Returned

| Status | Meaning |
| --- | --- |
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |

### 5.1.10. cusolverDnXsyevjGetResidual()

```
cusolverStatus_t
cusolverDnXsyevjGetResidual(
    cusolverDnHandle_t handle,
    syevjInfo_t info,
    double *residual);
```

This function reports the residual of syevj or sygvj. It does not support syevjBatched. If the user calls this function after syevjBatched, the error CUSOLVER_STATUS_NOT_SUPPORTED is returned.

| parameter | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | host | input | handle to the cuSolverDN library context. |
| info | host | input | the pointer to the structure of syevj. |
| residual | host | output | residual of syevj. |

Status Returned

| Status | Meaning |
| --- | --- |
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_SUPPORTED | the batched version is not supported. |

### 5.1.11. cusolverDnXsyevjGetSweeps()

```
cusolverStatus_t
cusolverDnXsyevjGetSweeps(
    cusolverDnHandle_t handle,
    syevjInfo_t info,
    int *executed_sweeps);
```

This function reports the number of executed sweeps of syevj or sygvj. It does not support syevjBatched. If the user calls this function after syevjBatched, the error CUSOLVER_STATUS_NOT_SUPPORTED is returned.

| parameter | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| handle | host | input | handle to the cuSolverDN library context. |
| info | host | input | the pointer to the structure of syevj. |
| executed_sweeps | host | output | number of executed sweeps. |

Status Returned

| Status | Meaning |
| --- | --- |
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_SUPPORTED | the batched version is not supported. |

### 5.1.12. cusolverDnCreateGesvdjInfo()

```
cusolverStatus_t
cusolverDnCreateGesvdjInfo(
    gesvdjInfo_t *info);
```

This function creates and initializes the structure of gesvdj and gesvdjBatched to default values.

| parameter | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| info | host | output | the pointer to the structure of gesvdj. |

Status Returned

| Status | Meaning |
| --- | --- |
| CUSOLVER_STATUS_SUCCESS | the structure was initialized successfully. |
| CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |

### 5.1.13. cusolverDnDestroyGesvdjInfo()

```
cusolverStatus_t
cusolverDnDestroyGesvdjInfo(
    gesvdjInfo_t info);
```

This function destroys and releases any memory required by the structure.

| parameter | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| info | host | input | the structure of gesvdj. |

Status Returned

| Status | Meaning |
| --- | --- |
| CUSOLVER_STATUS_SUCCESS | the resources are released successfully. |

### 5.1.14. cusolverDnXgesvdjSetTolerance()

```
cusolverStatus_t
cusolverDnXgesvdjSetTolerance(
    gesvdjInfo_t info,
    double tolerance);
```

This function configures the tolerance of gesvdj.

| parameter | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| info | host | in/out | the pointer to the structure of gesvdj. |
| tolerance | host | input | accuracy of numerical singular values. |

Status Returned

| Status | Meaning |
| --- | --- |
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |

### 5.1.15. cusolverDnXgesvdjSetMaxSweeps()

```
cusolverStatus_t
cusolverDnXgesvdjSetMaxSweeps(
    gesvdjInfo_t info,
    int max_sweeps);
```

This function configures the maximum number of sweeps in gesvdj. The default value is 100.

| parameter | Memory | In/out | Meaning |
| --- | --- | --- | --- |
| info | host | in/out | the pointer to the structure of gesvdj. |
| max_sweeps | host | input | maximum number of sweeps. |

Status Returned

| Status | Meaning |
| --- | --- |
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |

### 5.1.16. cusolverDnXgesvdjSetSortEig()

```
cusolverStatus_t
cusolverDnXgesvdjSetSortEig(
gesvdjInfo_t info,
int sort_svd);
```

If sort_svd is zero, the singular values are not sorted. This setting only affects gesvdjBatched; gesvdj always sorts singular values in descending order. By default, singular values are sorted in descending order.

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| info | host | in/out | the pointer to the structure of gesvdj. |
| sort_svd | host | input | if sort_svd is zero, the singular values are not sorted. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |

### 5.1.17. cusolverDnXgesvdjGetResidual()

```
cusolverStatus_t
cusolverDnXgesvdjGetResidual(
cusolverDnHandle_t handle,
gesvdjInfo_t info,
double *residual);
```

This function reports the residual of gesvdj. It does not support gesvdjBatched. If the user calls this function after gesvdjBatched, the error CUSOLVER_STATUS_NOT_SUPPORTED is returned.

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| info | host | input | the pointer to the structure of gesvdj. |
| residual | host | output | residual of gesvdj. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_SUPPORTED | the batched version is not supported. |

### 5.1.18. cusolverDnXgesvdjGetSweeps()

```
cusolverStatus_t
cusolverDnXgesvdjGetSweeps(
cusolverDnHandle_t handle,
gesvdjInfo_t info,
int *executed_sweeps);
```

This function reports the number of executed sweeps of gesvdj. It does not support gesvdjBatched. If the user calls this function after gesvdjBatched, the error CUSOLVER_STATUS_NOT_SUPPORTED is returned.

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| info | host | input | the pointer to the structure of gesvdj. |
| executed_sweeps | host | output | number of executed sweeps. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_SUPPORTED | the batched version is not supported. |

### Dense Linear Solver Reference

This chapter describes the linear solver API of cuSolverDN, including Cholesky factorization, LU with partial pivoting, QR factorization and Bunch-Kaufman (LDLT) factorization.

### cusolverDn<t>potrf()

These helper functions calculate the necessary size of work buffers.
```
cusolverStatus_t
cusolverDnSpotrf_bufferSize(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
float *A,
int lda,
int *Lwork );

cusolverStatus_t
cusolverDnDpotrf_bufferSize(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
double *A,
int lda,
int *Lwork );

cusolverStatus_t
cusolverDnCpotrf_bufferSize(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuComplex *A,
int lda,
int *Lwork );

cusolverStatus_t
cusolverDnZpotrf_bufferSize(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *A,
int lda,
int *Lwork);
```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSpotrf(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
float *A,
int lda,
float *Workspace,
int Lwork,
int *devInfo );

cusolverStatus_t
cusolverDnDpotrf(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
double *A,
int lda,
double *Workspace,
int Lwork,
int *devInfo );
```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCpotrf(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuComplex *A,
int lda,
cuComplex *Workspace,
int Lwork,
int *devInfo );

cusolverStatus_t
cusolverDnZpotrf(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *A,
int lda,
cuDoubleComplex *Workspace,
int Lwork,
int *devInfo );
```

This function computes the Cholesky factorization of a Hermitian positive-definite matrix.

A is an n×n Hermitian matrix; only its lower or upper part is meaningful. The input parameter uplo indicates which part of the matrix is used. The function leaves the other part untouched.

If the input parameter uplo is CUBLAS_FILL_MODE_LOWER, only the lower triangular part of A is processed and replaced by the lower triangular Cholesky factor L.

 $A=L*{L}^{H}$

If the input parameter uplo is CUBLAS_FILL_MODE_UPPER, only the upper triangular part of A is processed and replaced by the upper triangular Cholesky factor U.

 $A={U}^{H}*U$

The user has to provide the working space pointed to by the input parameter Workspace. The input parameter Lwork is the size of the working space, and it is returned by potrf_bufferSize().

If the Cholesky factorization fails, i.e. some leading minor of A is not positive definite, or equivalently some diagonal element of L or U is not a real number, the output parameter devInfo indicates the smallest leading minor of A that is not positive definite.

If the output parameter devInfo = -i (less than zero), the i-th parameter is wrong.

API of potrf

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| uplo | host | input | indicates whether the lower or upper part of matrix A is stored; the other part is not referenced. |
| n | host | input | number of rows and columns of matrix A. |
| A | device | in/out | array of dimension lda * n, with lda not less than max(1,n). |
| lda | host | input | leading dimension of the two-dimensional array used to store matrix A. |
| Workspace | device | in/out | working space, array of size Lwork. |
| Lwork | host | input | size of Workspace, returned by potrf_bufferSize. |
| devInfo | device | output | if devInfo = 0, the Cholesky factorization is successful. If devInfo = -i, the i-th parameter is wrong. If devInfo = i, the leading minor of order i is not positive definite. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0 or lda |
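
The storage convention described for potrf can be checked on the host. The following is a minimal sketch and not part of cuSolver: the helper name `ref_llt_lower`, the restriction to real double-precision data, and the column-major indexing helper `idx` are assumptions made for illustration. It rebuilds A = L*L^T while reading only the lower triangle of the factor array, consistent with the statement that the other triangle is left untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (i,j) with leading dimension ld. */
static size_t idx(int i, int j, int ld)
{
    return (size_t)i + (size_t)j * (size_t)ld;
}

/* Hypothetical host-side helper (not a cuSolver routine).
 * Computes A = L * L^T for a real n-by-n lower factor L stored
 * column-major with leading dimension ldl. Only entries on or below
 * the diagonal of L are read. */
void ref_llt_lower(int n, const double *L, int ldl, double *A, int lda)
{
    assert(n >= 0 && ldl >= (n > 1 ? n : 1) && lda >= (n > 1 ? n : 1));
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            int kmax = i < j ? i : j;   /* L(i,k) = 0 for k > i */
            double s = 0.0;
            for (int k = 0; k <= kmax; ++k)
                s += L[idx(i, k, ldl)] * L[idx(j, k, ldl)];
            A[idx(i, j, lda)] = s;
        }
    }
}
```

Comparing the rebuilt matrix against the original input gives a simple residual check for a factor obtained with CUBLAS_FILL_MODE_LOWER.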

### cusolverDn<t>potrs()

```
cusolverStatus_t
cusolverDnSpotrs(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
int nrhs,
const float *A,
int lda,
float *B,
int ldb,
int *devInfo);

cusolverStatus_t
cusolverDnDpotrs(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
int nrhs,
const double *A,
int lda,
double *B,
int ldb,
int *devInfo);

cusolverStatus_t
cusolverDnCpotrs(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
int nrhs,
const cuComplex *A,
int lda,
cuComplex *B,
int ldb,
int *devInfo);

cusolverStatus_t
cusolverDnZpotrs(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
int nrhs,
const cuDoubleComplex *A,
int lda,
cuDoubleComplex *B,
int ldb,
int *devInfo);

```

This function solves a system of linear equations

 $A*X=B$

where A is an n×n Hermitian matrix; only its lower or upper part is meaningful. The input parameter uplo indicates which part of the matrix is used. The function leaves the other part untouched.

The user has to call potrf first to factorize matrix A. If the input parameter uplo is CUBLAS_FILL_MODE_LOWER, A is the lower triangular Cholesky factor L corresponding to $A=L*{L}^{H}$ . If the input parameter uplo is CUBLAS_FILL_MODE_UPPER, A is the upper triangular Cholesky factor U corresponding to $A={U}^{H}*U$ .

The operation is in-place, i.e. matrix X overwrites matrix B with the same leading dimension ldb.

If the output parameter devInfo = -i (less than zero), the i-th parameter is wrong.

API of potrs

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| uplo | host | input | indicates whether the lower or upper part of matrix A is stored; the other part is not referenced. |
| n | host | input | number of rows and columns of matrix A. |
| nrhs | host | input | number of columns of matrix X and B. |
| A | device | input | array of dimension lda * n, with lda not less than max(1,n). A is either the lower Cholesky factor L or the upper Cholesky factor U. |
| lda | host | input | leading dimension of the two-dimensional array used to store matrix A. |
| B | device | in/out | array of dimension ldb * nrhs, with ldb not less than max(1,n). As an input, B is the right-hand side matrix. As an output, B is the solution matrix. |
| devInfo | device | output | if devInfo = 0, the operation is successful. If devInfo = -i, the i-th parameter is wrong. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0, nrhs<0, lda |
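
To illustrate what the in-place solve with a Cholesky factor amounts to, here is a small host-side sketch. It is not the cuSolver implementation: `ref_potrs_lower` is a hypothetical name, and the code assumes real double-precision data, a single right-hand side, and the CUBLAS_FILL_MODE_LOWER case. It performs forward substitution with L followed by back substitution with L^T, overwriting b with the solution.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (i,j) with leading dimension ld. */
static size_t idx(int i, int j, int ld)
{
    return (size_t)i + (size_t)j * (size_t)ld;
}

/* Hypothetical host-side reference (not a cuSolver routine).
 * Solves A*x = b in place for one right-hand side, given the real lower
 * Cholesky factor L (A = L*L^T) stored column-major with leading
 * dimension lda. On exit b holds x. */
void ref_potrs_lower(int n, const double *L, int lda, double *b)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    /* Forward substitution: L*y = b, y overwrites b. */
    for (int i = 0; i < n; ++i) {
        for (int k = 0; k < i; ++k)
            b[i] -= L[idx(i, k, lda)] * b[k];
        b[i] /= L[idx(i, i, lda)];
    }
    /* Back substitution: L^T*x = y, x overwrites b. */
    for (int i = n - 1; i >= 0; --i) {
        for (int k = i + 1; k < n; ++k)
            b[i] -= L[idx(k, i, lda)] * b[k];   /* (L^T)(i,k) = L(k,i) */
        b[i] /= L[idx(i, i, lda)];
    }
}
```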

### cusolverDn<t>getrf()

These helper functions calculate the size of work buffers needed.
```
cusolverStatus_t
cusolverDnSgetrf_bufferSize(cusolverDnHandle_t handle,
int m,
int n,
float *A,
int lda,
int *Lwork );

cusolverStatus_t
cusolverDnDgetrf_bufferSize(cusolverDnHandle_t handle,
int m,
int n,
double *A,
int lda,
int *Lwork );

cusolverStatus_t
cusolverDnCgetrf_bufferSize(cusolverDnHandle_t handle,
int m,
int n,
cuComplex *A,
int lda,
int *Lwork );

cusolverStatus_t
cusolverDnZgetrf_bufferSize(cusolverDnHandle_t handle,
int m,
int n,
cuDoubleComplex *A,
int lda,
int *Lwork );
```
The S and D data types are real single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSgetrf(cusolverDnHandle_t handle,
int m,
int n,
float *A,
int lda,
float *Workspace,
int *devIpiv,
int *devInfo );

cusolverStatus_t
cusolverDnDgetrf(cusolverDnHandle_t handle,
int m,
int n,
double *A,
int lda,
double *Workspace,
int *devIpiv,
int *devInfo );
```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCgetrf(cusolverDnHandle_t handle,
int m,
int n,
cuComplex *A,
int lda,
cuComplex *Workspace,
int *devIpiv,
int *devInfo );

cusolverStatus_t
cusolverDnZgetrf(cusolverDnHandle_t handle,
int m,
int n,
cuDoubleComplex *A,
int lda,
cuDoubleComplex *Workspace,
int *devIpiv,
int *devInfo );
```

This function computes the LU factorization of an m×n matrix

 $P*A=L*U$

where A is an m×n matrix, P is a permutation matrix, L is a lower triangular matrix with unit diagonal, and U is an upper triangular matrix.

The user has to provide the working space pointed to by the input parameter Workspace. Its required size, Lwork, is returned by getrf_bufferSize().

If the LU factorization fails, i.e. matrix A (U) is singular, the output parameter devInfo = i indicates that U(i,i) = 0.

If the output parameter devInfo = -i (less than zero), the i-th parameter is wrong.

If devIpiv is null, no pivoting is performed. The factorization is then A=L*U, which is not numerically stable.

Whether or not the LU factorization fails, the output parameter devIpiv contains the pivoting sequence: row i is interchanged with row devIpiv(i).

The user can combine getrf and getrs to complete a linear solver. Please refer to appendix D.1.

API of getrf

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| m | host | input | number of rows of matrix A. |
| n | host | input | number of columns of matrix A. |
| A | device | in/out | array of dimension lda * n, with lda not less than max(1,m). |
| lda | host | input | leading dimension of the two-dimensional array used to store matrix A. |
| Workspace | device | in/out | working space, array of size Lwork. |
| devIpiv | device | output | array of size at least min(m,n), containing pivot indices. |
| devInfo | device | output | if devInfo = 0, the LU factorization is successful. If devInfo = -i, the i-th parameter is wrong. If devInfo = i, U(i,i) = 0. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0 or lda |
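
The pivoting and devInfo conventions of getrf can be illustrated with a small host-side reference. This is a sketch under stated assumptions, not the library's algorithm: `ref_getrf` is a hypothetical name, the data is real double precision in column-major storage, and pivot indices are assumed to be 1-based as in LAPACK. The return value mimics devInfo: 0 on success, or i if U(i,i) is exactly zero.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (i,j) with leading dimension ld. */
static size_t idx(int i, int j, int ld)
{
    return (size_t)i + (size_t)j * (size_t)ld;
}

static double absd(double x) { return x < 0.0 ? -x : x; }

/* Hypothetical host-side reference (not a cuSolver routine).
 * In-place LU with partial pivoting, P*A = L*U. L (unit diagonal) is
 * stored below the diagonal of A, U on and above it.
 * ipiv[k] (1-based) is the row interchanged with row k+1 at step k. */
int ref_getrf(int m, int n, double *A, int lda, int *ipiv)
{
    assert(m >= 0 && n >= 0 && lda >= (m > 1 ? m : 1));
    int info = 0;
    int steps = m < n ? m : n;
    for (int k = 0; k < steps; ++k) {
        int p = k;                               /* row of largest |A(i,k)| */
        for (int i = k + 1; i < m; ++i)
            if (absd(A[idx(i, k, lda)]) > absd(A[idx(p, k, lda)]))
                p = i;
        ipiv[k] = p + 1;
        if (A[idx(p, k, lda)] == 0.0) {          /* U(k+1,k+1) = 0 */
            if (info == 0)
                info = k + 1;
            continue;
        }
        for (int j = 0; j < n && p != k; ++j) {  /* swap rows k and p */
            double t = A[idx(k, j, lda)];
            A[idx(k, j, lda)] = A[idx(p, j, lda)];
            A[idx(p, j, lda)] = t;
        }
        for (int i = k + 1; i < m; ++i) {        /* eliminate below pivot */
            double l = A[idx(i, k, lda)] / A[idx(k, k, lda)];
            A[idx(i, k, lda)] = l;
            for (int j = k + 1; j < n; ++j)
                A[idx(i, j, lda)] -= l * A[idx(k, j, lda)];
        }
    }
    return info;
}
```

As in the description above, the pivot array is filled even when a zero pivot is encountered.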

### cusolverDn<t>getrs()

```
cusolverStatus_t
cusolverDnSgetrs(cusolverDnHandle_t handle,
cublasOperation_t trans,
int n,
int nrhs,
const float *A,
int lda,
const int *devIpiv,
float *B,
int ldb,
int *devInfo );

cusolverStatus_t
cusolverDnDgetrs(cusolverDnHandle_t handle,
cublasOperation_t trans,
int n,
int nrhs,
const double *A,
int lda,
const int *devIpiv,
double *B,
int ldb,
int *devInfo );

cusolverStatus_t
cusolverDnCgetrs(cusolverDnHandle_t handle,
cublasOperation_t trans,
int n,
int nrhs,
const cuComplex *A,
int lda,
const int *devIpiv,
cuComplex *B,
int ldb,
int *devInfo );

cusolverStatus_t
cusolverDnZgetrs(cusolverDnHandle_t handle,
cublasOperation_t trans,
int n,
int nrhs,
const cuDoubleComplex *A,
int lda,
const int *devIpiv,
cuDoubleComplex *B,
int ldb,
int *devInfo );

```

This function solves a linear system with multiple right-hand sides

 $\mathrm{op\left(A\right)}*X=B$

where A is an n×n matrix that was LU-factored by getrf, that is, the lower triangular part of A is L, and the upper triangular part (including diagonal elements) of A is U. B is an n×nrhs right-hand side matrix.

The input parameter trans is defined by

 $\mathrm{op\left(A\right)}=\begin{cases}A & \text{if trans == CUBLAS\_OP\_N}\\ {A}^{T} & \text{if trans == CUBLAS\_OP\_T}\\ {A}^{H} & \text{if trans == CUBLAS\_OP\_C}\end{cases}$

The input parameter devIpiv is an output of getrf. It contains pivot indices, which are used to permute the right-hand sides.

If the output parameter devInfo = -i (less than zero), the i-th parameter is wrong.

The user can combine getrf and getrs to complete a linear solver. Please refer to appendix D.1.

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| trans | host | input | operation op(A) that is non- or (conj.) transpose. |
| n | host | input | number of rows and columns of matrix A. |
| nrhs | host | input | number of right-hand sides. |
| A | device | input | array of dimension lda * n, with lda not less than max(1,n). |
| lda | host | input | leading dimension of the two-dimensional array used to store matrix A. |
| devIpiv | device | input | array of size at least n, containing pivot indices. |
| B | device | in/out | array of dimension ldb * nrhs, with ldb not less than max(1,n). On entry, B is the right-hand side matrix; on exit, it is overwritten by the solution. |
| ldb | host | input | leading dimension of the two-dimensional array used to store matrix B. |
| devInfo | device | output | if devInfo = 0, the operation is successful. If devInfo = -i, the i-th parameter is wrong. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0 or lda |
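
How the pivot indices and the packed L and U factors are used in the non-transposed solve can be sketched on the host as follows. The function `ref_getrs_notrans` is hypothetical, handles only real double-precision data with one right-hand side, and assumes 1-based pivot indices applied in order, as in LAPACK. It is an illustration, not the cuSolver code path.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (i,j) with leading dimension ld. */
static size_t idx(int i, int j, int ld)
{
    return (size_t)i + (size_t)j * (size_t)ld;
}

/* Hypothetical host-side reference (not a cuSolver routine).
 * Solves A*x = b in place, where LU holds the packed factors of
 * P*A = L*U (unit-diagonal L below the diagonal, U on and above it)
 * and ipiv holds 1-based pivot indices. */
void ref_getrs_notrans(int n, const double *LU, int lda,
                       const int *ipiv, double *b)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    /* Apply the row interchanges to b in the order they were made. */
    for (int i = 0; i < n; ++i) {
        int p = ipiv[i] - 1;
        if (p != i) {
            double t = b[i];
            b[i] = b[p];
            b[p] = t;
        }
    }
    /* Forward substitution with unit-diagonal L. */
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < i; ++k)
            b[i] -= LU[idx(i, k, lda)] * b[k];
    /* Back substitution with U. */
    for (int i = n - 1; i >= 0; --i) {
        for (int k = i + 1; k < n; ++k)
            b[i] -= LU[idx(i, k, lda)] * b[k];
        b[i] /= LU[idx(i, i, lda)];
    }
}
```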

### cusolverDn<t>geqrf()

These helper functions calculate the size of work buffers needed.
```
cusolverStatus_t
cusolverDnSgeqrf_bufferSize(cusolverDnHandle_t handle,
int m,
int n,
float *A,
int lda,
int *Lwork );

cusolverStatus_t
cusolverDnDgeqrf_bufferSize(cusolverDnHandle_t handle,
int m,
int n,
double *A,
int lda,
int *Lwork );

cusolverStatus_t
cusolverDnCgeqrf_bufferSize(cusolverDnHandle_t handle,
int m,
int n,
cuComplex *A,
int lda,
int *Lwork );

cusolverStatus_t
cusolverDnZgeqrf_bufferSize(cusolverDnHandle_t handle,
int m,
int n,
cuDoubleComplex *A,
int lda,
int *Lwork );
```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSgeqrf(cusolverDnHandle_t handle,
int m,
int n,
float *A,
int lda,
float *TAU,
float *Workspace,
int Lwork,
int *devInfo );

cusolverStatus_t
cusolverDnDgeqrf(cusolverDnHandle_t handle,
int m,
int n,
double *A,
int lda,
double *TAU,
double *Workspace,
int Lwork,
int *devInfo );
```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCgeqrf(cusolverDnHandle_t handle,
int m,
int n,
cuComplex *A,
int lda,
cuComplex *TAU,
cuComplex *Workspace,
int Lwork,
int *devInfo );

cusolverStatus_t
cusolverDnZgeqrf(cusolverDnHandle_t handle,
int m,
int n,
cuDoubleComplex *A,
int lda,
cuDoubleComplex *TAU,
cuDoubleComplex *Workspace,
int Lwork,
int *devInfo );
```

This function computes the QR factorization of an m×n matrix

 $A=Q*R$

where A is an m×n matrix, Q is an m×n matrix, and R is an n×n upper triangular matrix.

The user has to provide the working space pointed to by the input parameter Workspace. The input parameter Lwork is the size of the working space, and it is returned by geqrf_bufferSize().

The matrix R is overwritten in the upper triangular part of A, including diagonal elements.

The matrix Q is not formed explicitly; instead, a sequence of Householder vectors is stored in the lower triangular part of A. The leading nonzero element of each Householder vector is assumed to be 1, such that the output parameter TAU contains the scaling factor τ. If v is the original Householder vector and q is the new Householder vector corresponding to τ, they satisfy the following relation

 $I-2*v*{v}^{H}=I-\tau *q*{q}^{H}$
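
The relation between v and q can be made concrete with a short worked derivation. It is a sketch under two assumptions that the text does not state explicitly: v has unit 2-norm (so that $I-2*v*{v}^{H}$ is a reflector) and its leading element ${v}_{1}$ is nonzero. Scaling v so that its leading element becomes 1 gives

 $q=\frac{v}{{v}_{1}},\qquad \tau =2*{\left|{v}_{1}\right|}^{2}\quad \Rightarrow \quad \tau *q*{q}^{H}=2*{\left|{v}_{1}\right|}^{2}*\frac{v*{v}^{H}}{{\left|{v}_{1}\right|}^{2}}=2*v*{v}^{H}$

For example, with the real vector v = (0.6, 0.8), this gives q = (1, 4/3) and τ = 0.72, and both sides of the relation equal the matrix with rows (0.28, -0.96) and (-0.96, -0.28).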

If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.

API of geqrf

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| m | host | input | number of rows of matrix A. |
| n | host | input | number of columns of matrix A. |
| A | device | in/out | array of dimension lda * n, with lda not less than max(1,m). |
| lda | host | input | leading dimension of the two-dimensional array used to store matrix A. |
| TAU | device | output | array of dimension at least min(m,n). |
| Workspace | device | in/out | working space, array of size Lwork. |
| Lwork | host | input | size of working array Workspace. |
| devInfo | device | output | if devInfo = 0, the QR factorization is successful. If devInfo = -i, the i-th parameter is wrong. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0 or lda |

### cusolverDn<t>ormqr()

These helper functions calculate the size of work buffers needed.
```
cusolverStatus_t
cusolverDnSormqr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasOperation_t trans,
int m,
int n,
int k,
const float *A,
int lda,
const float *C,
int ldc,
int *lwork);

cusolverStatus_t
cusolverDnDormqr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasOperation_t trans,
int m,
int n,
int k,
const double *A,
int lda,
const double *C,
int ldc,
int *lwork);

cusolverStatus_t
cusolverDnCunmqr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasOperation_t trans,
int m,
int n,
int k,
const cuComplex *A,
int lda,
const cuComplex *C,
int ldc,
int *lwork);

cusolverStatus_t
cusolverDnZunmqr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasOperation_t trans,
int m,
int n,
int k,
const cuDoubleComplex *A,
int lda,
const cuDoubleComplex *C,
int ldc,
int *lwork);

```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSormqr(cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasOperation_t trans,
int m,
int n,
int k,
const float *A,
int lda,
const float *tau,
float *C,
int ldc,
float *work,
int lwork,
int *devInfo);

cusolverStatus_t
cusolverDnDormqr(cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasOperation_t trans,
int m,
int n,
int k,
const double *A,
int lda,
const double *tau,
double *C,
int ldc,
double *work,
int lwork,
int *devInfo);
```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCunmqr(cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasOperation_t trans,
int m,
int n,
int k,
const cuComplex *A,
int lda,
const cuComplex *tau,
cuComplex *C,
int ldc,
cuComplex *work,
int lwork,
int *devInfo);

cusolverStatus_t
cusolverDnZunmqr(cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasOperation_t trans,
int m,
int n,
int k,
const cuDoubleComplex *A,
int lda,
const cuDoubleComplex *tau,
cuDoubleComplex *C,
int ldc,
cuDoubleComplex *work,
int lwork,
int *devInfo);
```

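This function overwrites the m×n matrix C by

 $C=\begin{cases}\mathrm{op\left(Q\right)}*C & \text{if side == CUBLAS\_SIDE\_LEFT}\\ C*\mathrm{op\left(Q\right)} & \text{if side == CUBLAS\_SIDE\_RIGHT}\end{cases}$

The operation of Q is defined by

 $\mathrm{op\left(Q\right)}=\begin{cases}Q & \text{if trans == CUBLAS\_OP\_N}\\ {Q}^{T} & \text{if trans == CUBLAS\_OP\_T}\\ {Q}^{H} & \text{if trans == CUBLAS\_OP\_C}\end{cases}$

Q is a unitary matrix formed by a sequence of elementary reflection vectors from the QR factorization (geqrf) of A.

 $Q=\mathrm{H\left(1\right)}*\mathrm{H\left(2\right)}*\mathrm{...}*\mathrm{H\left(k\right)}$

Q is of order m if side = CUBLAS_SIDE_LEFT and of order n if side = CUBLAS_SIDE_RIGHT.

The user has to provide the working space pointed to by the input parameter work. The input parameter lwork is the size of the working space, and it is returned by geqrf_bufferSize() or ormqr_bufferSize().

If the output parameter devInfo = -i (less than zero), the i-th parameter is wrong.

The user can combine geqrf, ormqr and trsm to complete a linear solver or a least-squares solver. Please refer to appendix C.1.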

API of ormqr

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| side | host | input | indicates if matrix Q is on the left or right of C. |
| trans | host | input | operation op(Q) that is non- or (conj.) transpose. |
| m | host | input | number of rows of matrix C. |
| n | host | input | number of columns of matrix C. |
| k | host | input | number of elementary reflections. |
| A | device | input | array of dimension lda * k, with lda not less than max(1,m). The matrix A is from geqrf, so the i-th column contains an elementary reflection vector. |
| lda | host | input | leading dimension of the two-dimensional array used to store matrix A. If side is CUBLAS_SIDE_LEFT, lda >= max(1,m); if side is CUBLAS_SIDE_RIGHT, lda >= max(1,n). |
| tau | device | input | array of dimension at least min(m,n). The vector tau is from geqrf, so tau(i) is the scalar of the i-th elementary reflection vector. |
| C | device | in/out | array of size ldc * n. On exit, C is overwritten by op(Q)*C. |
| ldc | host | input | leading dimension of the two-dimensional array of matrix C. ldc >= max(1,m). |
| work | device | in/out | working space, array of size lwork. |
| lwork | host | input | size of working array work. |
| devInfo | device | output | if devInfo = 0, the ormqr is successful. If devInfo = -i, the i-th parameter is wrong. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0 or wrong lda or ldc). |
| CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
| CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
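
The way Q is applied without ever being formed can be sketched on the host. The helper below is hypothetical (`ref_apply_qt` is not a cuSolver routine) and assumes real double-precision data, a single column c, and the LAPACK-style storage in which reflector i has an implicit leading 1 on the diagonal and its remaining entries below the diagonal in column i of A. It computes Q^T*c, i.e. the side = CUBLAS_SIDE_LEFT, trans = CUBLAS_OP_T case.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (i,j) with leading dimension ld. */
static size_t idx(int i, int j, int ld)
{
    return (size_t)i + (size_t)j * (size_t)ld;
}

/* Hypothetical host-side reference (not a cuSolver routine), real data.
 * Overwrites the length-m vector c with Q^T * c, where
 * Q = H(1)*H(2)*...*H(k) and H(i) = I - tau[i] * q_i * q_i^T.
 * q_i has zeros above row i, an implicit 1 in row i, and the rest of
 * its entries stored below the diagonal in column i of A (0-based i). */
void ref_apply_qt(int m, int k, const double *A, int lda,
                  const double *tau, double *c)
{
    assert(m >= 0 && k >= 0 && k <= m && lda >= (m > 1 ? m : 1));
    /* Q^T = H(k)*...*H(1) because each H(i) is symmetric,
     * so H(1) is applied to c first. */
    for (int i = 0; i < k; ++i) {
        double dot = c[i];                      /* implicit leading 1 */
        for (int r = i + 1; r < m; ++r)
            dot += A[idx(r, i, lda)] * c[r];
        dot *= tau[i];
        c[i] -= dot;
        for (int r = i + 1; r < m; ++r)
            c[r] -= dot * A[idx(r, i, lda)];
    }
}
```

Each reflector costs one dot product and one vector update, which is why the factored form is usually preferred over forming Q explicitly.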

### cusolverDn<t>orgqr()

These helper functions calculate the size of work buffers needed.
```
cusolverStatus_t
cusolverDnSorgqr_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int k,
const float *A,
int lda,
int *lwork);

cusolverStatus_t
cusolverDnDorgqr_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int k,
const double *A,
int lda,
int *lwork);

cusolverStatus_t
cusolverDnCungqr_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int k,
const cuComplex *A,
int lda,
int *lwork);

cusolverStatus_t
cusolverDnZungqr_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int k,
const cuDoubleComplex *A,
int lda,
int *lwork);

```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSorgqr(
cusolverDnHandle_t handle,
int m,
int n,
int k,
float *A,
int lda,
const float *tau,
float *work,
int lwork,
int *devInfo);

cusolverStatus_t
cusolverDnDorgqr(
cusolverDnHandle_t handle,
int m,
int n,
int k,
double *A,
int lda,
const double *tau,
double *work,
int lwork,
int *devInfo);

```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCungqr(
cusolverDnHandle_t handle,
int m,
int n,
int k,
cuComplex *A,
int lda,
const cuComplex *tau,
cuComplex *work,
int lwork,
int *devInfo);

cusolverStatus_t
cusolverDnZungqr(
cusolverDnHandle_t handle,
int m,
int n,
int k,
cuDoubleComplex *A,
int lda,
const cuDoubleComplex *tau,
cuDoubleComplex *work,
int lwork,
int *devInfo);

```

This function overwrites the m×n matrix A by

 $Q=\mathrm{H\left(1\right)}*\mathrm{H\left(2\right)}*\mathrm{...}*\mathrm{H\left(k\right)}$

where Q is a unitary matrix formed by a sequence of elementary reflection vectors stored in A.

The user has to provide the working space pointed to by the input parameter work. The input parameter lwork is the size of the working space, and it is returned by orgqr_bufferSize().

If the output parameter devInfo = -i (less than zero), the i-th parameter is wrong.

The user can combine geqrf and orgqr to complete orthogonalization. Please refer to appendix C.2.

API of orgqr

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| m | host | input | number of rows of matrix Q. m >= 0. |
| n | host | input | number of columns of matrix Q. m >= n >= 0. |
| k | host | input | number of elementary reflections whose product defines the matrix Q. n >= k >= 0. |
| A | device | in/out | array of dimension lda * n, with lda not less than max(1,m). The i-th column of A contains an elementary reflection vector. |
| lda | host | input | leading dimension of the two-dimensional array used to store matrix A. lda >= max(1,m). |
| tau | device | input | array of dimension k. tau(i) is the scalar of the i-th elementary reflection vector. |
| work | device | in/out | working space, array of size lwork. |
| lwork | host | input | size of working array work. |
| devInfo | device | output | if devInfo = 0, the orgqr is successful. If devInfo = -i, the i-th parameter is wrong. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n,k<0, n>m, k>n or lda |

### cusolverDn<t>sytrf()

These helper functions calculate the size of the needed buffers.
```
cusolverStatus_t
cusolverDnSsytrf_bufferSize(cusolverDnHandle_t handle,
int n,
float *A,
int lda,
int *Lwork );

cusolverStatus_t
cusolverDnDsytrf_bufferSize(cusolverDnHandle_t handle,
int n,
double *A,
int lda,
int *Lwork );

cusolverStatus_t
cusolverDnCsytrf_bufferSize(cusolverDnHandle_t handle,
int n,
cuComplex *A,
int lda,
int *Lwork );

cusolverStatus_t
cusolverDnZsytrf_bufferSize(cusolverDnHandle_t handle,
int n,
cuDoubleComplex *A,
int lda,
int *Lwork );
```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSsytrf(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
float *A,
int lda,
int *ipiv,
float *work,
int lwork,
int *devInfo );

cusolverStatus_t
cusolverDnDsytrf(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
double *A,
int lda,
int *ipiv,
double *work,
int lwork,
int *devInfo );
```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCsytrf(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuComplex *A,
int lda,
int *ipiv,
cuComplex *work,
int lwork,
int *devInfo );

cusolverStatus_t
cusolverDnZsytrf(cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *A,
int lda,
int *ipiv,
cuDoubleComplex *work,
int lwork,
int *devInfo );
```

This function computes the Bunch-Kaufman factorization of an n×n symmetric indefinite matrix.

A is an n×n symmetric matrix; only its lower or upper part is meaningful. The input parameter uplo indicates which part of the matrix is used. The function leaves the other part untouched.

If the input parameter uplo is CUBLAS_FILL_MODE_LOWER, only the lower triangular part of A is processed and replaced by the lower triangular factor L and the block diagonal matrix D. Each block of D is either a 1x1 or a 2x2 block, depending on pivoting.

 $P*A*{P}^{T}=L*D*{L}^{T}$

If the input parameter uplo is CUBLAS_FILL_MODE_UPPER, only the upper triangular part of A is processed and replaced by the upper triangular factor U and the block diagonal matrix D.

 $P*A*{P}^{T}=U*D*{U}^{T}$

The user has to provide the working space pointed to by the input parameter work. The input parameter lwork is the size of the working space, and it is returned by sytrf_bufferSize().

If the Bunch-Kaufman factorization fails, i.e. A is singular, the output parameter devInfo = i indicates that D(i,i)=0.

If the output parameter devInfo = -i (less than zero), the i-th parameter is wrong.

The output parameter ipiv contains the pivoting sequence. If ipiv(i) = k > 0, D(i,i) is a 1x1 block, and the i-th row/column of A is interchanged with the k-th row/column of A. If uplo is CUBLAS_FILL_MODE_UPPER and ipiv(i-1) = ipiv(i) = -m < 0, D(i-1:i,i-1:i) is a 2x2 block, and the (i-1)-th row/column is interchanged with the m-th row/column. If uplo is CUBLAS_FILL_MODE_LOWER and ipiv(i+1) = ipiv(i) = -m < 0, D(i:i+1,i:i+1) is a 2x2 block, and the (i+1)-th row/column is interchanged with the m-th row/column.

API of sytrf

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| uplo | host | input | indicates whether the lower or upper part of matrix A is stored; the other part is not referenced. |
| n | host | input | number of rows and columns of matrix A. |
| A | device | in/out | array of dimension lda * n, with lda not less than max(1,n). |
| lda | host | input | leading dimension of the two-dimensional array used to store matrix A. |
| ipiv | device | output | array of size at least n, containing pivot indices. |
| work | device | in/out | working space, array of size lwork. |
| lwork | host | input | size of working space work. |
| devInfo | device | output | if devInfo = 0, the Bunch-Kaufman factorization is successful. If devInfo = -i, the i-th parameter is wrong. If devInfo = i, D(i,i) = 0. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0 or lda |
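
The pivot encoding described for sytrf can be decoded with a short loop. The sketch below is illustrative only: `count_2x2_blocks_lower` is a hypothetical helper, it handles the CUBLAS_FILL_MODE_LOWER case, and it assumes the pivot array has already been copied to host memory. A positive entry marks a 1x1 block, and a pair of equal negative entries marks a 2x2 block.

```c
#include <assert.h>

/* Hypothetical host-side helper (not a cuSolver routine).
 * Walks a pivot array produced for the lower-triangular case and
 * returns the number of 2x2 diagonal blocks in D, or -1 if the array
 * does not follow the encoding (positive entry = 1x1 block, two equal
 * negative entries in a row = one 2x2 block). */
int count_2x2_blocks_lower(int n, const int *ipiv)
{
    assert(n >= 0);
    int count = 0;
    int i = 0;
    while (i < n) {
        if (ipiv[i] > 0) {
            i += 1;                              /* 1x1 block */
        } else if (ipiv[i] < 0 && i + 1 < n && ipiv[i + 1] == ipiv[i]) {
            count += 1;                          /* 2x2 block D(i:i+1,i:i+1) */
            i += 2;
        } else {
            return -1;                           /* malformed pivot array */
        }
    }
    return count;
}
```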

### cusolverDn<t>potrfBatched()

The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSpotrfBatched(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
float *Aarray[],
int lda,
int *infoArray,
int batchSize);

cusolverStatus_t
cusolverDnDpotrfBatched(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
double *Aarray[],
int lda,
int *infoArray,
int batchSize);
```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCpotrfBatched(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuComplex *Aarray[],
int lda,
int *infoArray,
int batchSize);

cusolverStatus_t
cusolverDnZpotrfBatched(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *Aarray[],
int lda,
int *infoArray,
int batchSize);

```

This function computes the Cholesky factorization of a sequence of Hermitian positive-definite matrices.

Each Aarray[i] for i=0,1,..., batchSize-1 is an n×n Hermitian matrix; only its lower or upper part is meaningful. The input parameter uplo indicates which part of the matrix is used.

If the input parameter uplo is CUBLAS_FILL_MODE_LOWER, only the lower triangular part of A is processed and replaced by the lower triangular Cholesky factor L.

 $A=L*{L}^{H}$

If the input parameter uplo is CUBLAS_FILL_MODE_UPPER, only the upper triangular part of A is processed and replaced by the upper triangular Cholesky factor U.

 $A={U}^{H}*U$

If the Cholesky factorization fails, i.e. some leading minor of A is not positive definite, or equivalently some diagonal element of L or U is not a real number, the output parameter infoArray indicates the smallest leading minor of A that is not positive definite.

infoArray is an integer array of size batchSize. If potrfBatched returns CUSOLVER_STATUS_INVALID_VALUE, infoArray[0] = -i (less than zero), meaning that the i-th parameter is wrong. If potrfBatched returns CUSOLVER_STATUS_SUCCESS but infoArray[i] = k is positive, then the i-th matrix is not positive definite and the Cholesky factorization failed at row k.

Remark: the other part of A is used as a workspace. For example, if uplo is CUBLAS_FILL_MODE_UPPER, the upper triangle of A contains the Cholesky factor U and the lower triangle of A is destroyed after potrfBatched.

API of potrfBatched

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| uplo | host | input | indicates whether the lower or upper part is stored; the other part is used as a workspace. |
| n | host | input | number of rows and columns of matrix A. |
| Aarray | device | in/out | array of pointers to arrays of dimension lda * n, with lda not less than max(1,n). |
| lda | host | input | leading dimension of the two-dimensional array used to store each matrix Aarray[i]. |
| infoArray | device | output | array of size batchSize. infoArray[i] contains information about the factorization of Aarray[i]. If infoArray[i] = 0, the Cholesky factorization is successful. If infoArray[i] = k, the leading minor of order k is not positive definite. |
| batchSize | host | input | number of pointers in Aarray. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0 or lda |
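
Two bookkeeping steps around a batched call can be sketched in plain C: building the array of per-matrix pointers and scanning the per-matrix status values. Both helpers are hypothetical and only illustrate the layout. They assume the matrices are packed back to back in one buffer, each occupying lda * n elements. With cuSolver, the base address would be a device address, the pointer array would itself have to be copied to device memory (the table lists Aarray as a device array), and infoArray would have to be copied back to the host before it is inspected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a cuSolver routine).
 * Fills Aarray[b] with the address of the b-th matrix, assuming the
 * matrices are stored contiguously, lda * n elements each. */
void fill_batch_pointers(double *base, int n, int lda, int batchSize,
                         double **Aarray)
{
    assert(n >= 0 && batchSize >= 0 && lda >= (n > 1 ? n : 1));
    for (int b = 0; b < batchSize; ++b)
        Aarray[b] = base + (size_t)b * (size_t)lda * (size_t)n;
}

/* Hypothetical helper (not a cuSolver routine).
 * Returns the index of the first matrix whose status is positive,
 * i.e. that was reported as not positive definite, or -1 if none. */
int first_failed_matrix(const int *infoArray, int batchSize)
{
    for (int i = 0; i < batchSize; ++i)
        if (infoArray[i] > 0)
            return i;
    return -1;
}
```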

### cusolverDn<t>potrsBatched()

```
cusolverStatus_t
cusolverDnSpotrsBatched(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
int nrhs,
float *Aarray[],
int lda,
float *Barray[],
int ldb,
int *info,
int batchSize);

cusolverStatus_t
cusolverDnDpotrsBatched(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
int nrhs,
double *Aarray[],
int lda,
double *Barray[],
int ldb,
int *info,
int batchSize);

cusolverStatus_t
cusolverDnCpotrsBatched(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
int nrhs,
cuComplex *Aarray[],
int lda,
cuComplex *Barray[],
int ldb,
int *info,
int batchSize);

cusolverStatus_t
cusolverDnZpotrsBatched(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
int nrhs,
cuDoubleComplex *Aarray[],
int lda,
cuDoubleComplex *Barray[],
int ldb,
int *info,
int batchSize);

```

This function solves a sequence of linear systems

 $A[i]*X[i]=B[i]$

where each Aarray[i] for i=0,1,..., batchSize-1 is an n×n Hermitian matrix, of which only the lower or upper part is meaningful. The input parameter uplo indicates which part of the matrix is used.

The user has to call potrfBatched first to factorize matrix Aarray[i]. If input parameter uplo is CUBLAS_FILL_MODE_LOWER, A is the lower triangular Cholesky factor L corresponding to $A=L*{L}^{H}$ . If input parameter uplo is CUBLAS_FILL_MODE_UPPER, A is the upper triangular Cholesky factor U corresponding to $A={U}^{H}*U$ .

The operation is in-place, i.e. matrix X overwrites matrix B with the same leading dimension ldb.

The output parameter info is a scalar. If info = -i (less than zero), the i-th parameter is wrong.

Remark 1: only nrhs=1 is supported.

Remark 2: infoArray from potrfBatched indicates if the matrix is positive definite. info from potrsBatched only shows which input parameter is wrong.

Remark 3: the other part of A is used as a workspace. For example, if uplo is CUBLAS_FILL_MODE_UPPER, the upper triangle of A contains the Cholesky factor U and the lower triangle of A is destroyed after potrsBatched.
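To make the in-place solve concrete, the following host-side sketch shows what one system of the batch amounts to mathematically in the real, lower-triangular, nrhs=1 case: a forward substitution with L followed by a back substitution with its transpose. It is a plain C reference under those assumptions, not the cuSolver implementation, and the name ref_potrs_lower is hypothetical.

```c
#include <assert.h>

/* Solves A*x = b in place for one right-hand side, given the lower
 * Cholesky factor L (A = L * L^T, real double precision) stored
 * column-major in the lower triangle of L with leading dimension ldl.
 * On exit b holds the solution x. */
void ref_potrs_lower(int n, const double *L, int ldl, double *b)
{
    assert(n >= 0 && ldl >= (n > 1 ? n : 1));
    /* Forward substitution: L * y = b. */
    for (int i = 0; i < n; ++i) {
        double s = b[i];
        for (int p = 0; p < i; ++p)
            s -= L[i + p * ldl] * b[p];
        b[i] = s / L[i + i * ldl];
    }
    /* Back substitution: L^T * x = y, reading L^T(i,p) as L(p,i). */
    for (int i = n - 1; i >= 0; --i) {
        double s = b[i];
        for (int p = i + 1; p < n; ++p)
            s -= L[p + i * ldl] * b[p];
        b[i] = s / L[i + i * ldl];
    }
}
```

For example, with L = [[2,0],[1,2]] (so A = [[4,2],[2,5]]) and b = (8,12), the routine overwrites b with x = (1,2).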

API of potrsBatched

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| uplo | host | input | indicates if matrix A lower or upper part is stored. |
| n | host | input | number of rows and columns of matrix A. |
| nrhs | host | input | number of columns of matrix X and B. |
| Aarray | device | in/out | array of pointers to array of dimension lda * n with lda is not less than max(1,n). Aarray[i] is either lower Cholesky factor L or upper Cholesky factor U. |
| lda | host | input | leading dimension of two-dimensional array used to store each matrix Aarray[i]. |
| Barray | device | in/out | array of pointers to array of dimension ldb * nrhs. ldb is not less than max(1,n). As an input, Barray[i] is right hand side matrix. As an output, Barray[i] is the solution matrix. |
| ldb | host | input | leading dimension of two-dimensional array used to store each matrix Barray[i]. |
| info | device | output | if info = 0, all parameters are correct. if info = -i, the i-th parameter is wrong. |
| batchSize | host | input | number of pointers in Aarray. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0, nrhs<0, lda |

### Dense Eigenvalue Solver Reference

This chapter describes the eigenvalue solver API of cuSolverDN, including bidiagonalization and SVD.

### cusolverDn<t>gebrd()

These helper functions calculate the size of work buffers needed.
```
cusolverStatus_t
cusolverDnSgebrd_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int *Lwork );

cusolverStatus_t
cusolverDnDgebrd_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int *Lwork );

cusolverStatus_t
cusolverDnCgebrd_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int *Lwork );

cusolverStatus_t
cusolverDnZgebrd_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int *Lwork );
```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSgebrd(cusolverDnHandle_t handle,
int m,
int n,
float *A,
int lda,
float *D,
float *E,
float *TAUQ,
float *TAUP,
float *Work,
int Lwork,
int *devInfo );

cusolverStatus_t
cusolverDnDgebrd(cusolverDnHandle_t handle,
int m,
int n,
double *A,
int lda,
double *D,
double *E,
double *TAUQ,
double *TAUP,
double *Work,
int Lwork,
int *devInfo );
```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCgebrd(cusolverDnHandle_t handle,
int m,
int n,
cuComplex *A,
int lda,
float *D,
float *E,
cuComplex *TAUQ,
cuComplex *TAUP,
cuComplex *Work,
int Lwork,
int *devInfo );

cusolverStatus_t
cusolverDnZgebrd(cusolverDnHandle_t handle,
int m,
int n,
cuDoubleComplex *A,
int lda,
double *D,
double *E,
cuDoubleComplex *TAUQ,
cuDoubleComplex *TAUP,
cuDoubleComplex *Work,
int Lwork,
int *devInfo );
```

This function reduces a general m×n matrix A to a real upper or lower bidiagonal form B by an orthogonal transformation: ${Q}^{H}*A*P=B$

If m>=n, B is upper bidiagonal; if m<n, B is lower bidiagonal.

The matrices Q and P are overwritten into matrix A in the following sense:

If m>=n, the diagonal and the first superdiagonal are overwritten with the upper bidiagonal matrix B; the elements below the diagonal, with the array TAUQ, represent the orthogonal matrix Q as a product of elementary reflectors, and the elements above the first superdiagonal, with the array TAUP, represent the orthogonal matrix P as a product of elementary reflectors.

If m<n, the diagonal and the first subdiagonal are overwritten with the lower bidiagonal matrix B; the elements below the first subdiagonal, with the array TAUQ, represent the orthogonal matrix Q as a product of elementary reflectors, and the elements above the diagonal, with the array TAUP, represent the orthogonal matrix P as a product of elementary reflectors.

The user has to provide working space pointed to by input parameter Work. The input parameter Lwork is the size of the working space, and it is returned by gebrd_bufferSize().

If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.

Remark: gebrd only supports m>=n.
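The storage layout for the supported m>=n case can be illustrated with a small host-side sketch that reads the upper bidiagonal matrix B back out of a column-major copy of A. The function name extract_upper_bidiagonal is hypothetical and the sketch assumes real double precision data already copied to the host; it uses zero-based indices, whereas the text above uses one-based indices.

```c
#include <assert.h>

/* Copies the upper bidiagonal B out of an m-by-n column-major array A
 * (m >= n, leading dimension lda >= m), using zero-based indices:
 *   D[i] = A(i,i)   for i = 0..n-1
 *   E[i] = A(i,i+1) for i = 0..n-2
 * Entries below the diagonal and above the first superdiagonal hold
 * reflector data and are not touched. */
void extract_upper_bidiagonal(int m, int n, const double *A, int lda,
                              double *D, double *E)
{
    assert(m >= n && n >= 0 && lda >= (m > 1 ? m : 1));
    for (int i = 0; i < n; ++i) {
        D[i] = A[i + i * lda];
        if (i + 1 < n)
            E[i] = A[i + (i + 1) * lda];
    }
}
```

The arrays D and E filled this way should match the D and E outputs of gebrd, which is a convenient consistency check when debugging leading-dimension mistakes.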

API of gebrd

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| m | host | input | number of rows of matrix A. |
| n | host | input | number of columns of matrix A. |
| A | device | in/out | array of dimension lda * n with lda is not less than max(1,m). |
| lda | host | input | leading dimension of two-dimensional array used to store matrix A. |
| D | device | output | real array of dimension min(m,n). The diagonal elements of the bidiagonal matrix B: D(i) = A(i,i). |
| E | device | output | real array of dimension min(m,n). The off-diagonal elements of the bidiagonal matrix B: if m>=n, E(i) = A(i,i+1) for i = 1,2,...,n-1; if m<n, E(i) = A(i+1,i). |
| TAUQ | device | output | array of dimension min(m,n). The scalar factors of the elementary reflectors which represent the orthogonal matrix Q. |
| TAUP | device | output | array of dimension min(m,n). The scalar factors of the elementary reflectors which represent the orthogonal matrix P. |
| Work | device | in/out | working space, array of size Lwork. |
| Lwork | host | input | size of Work, returned by gebrd_bufferSize. |
| devInfo | device | output | if devInfo = 0, the operation is successful. if devInfo = -i, the i-th parameter is wrong. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0, or lda |

### cusolverDn<t>orgbr()

These helper functions calculate the size of work buffers needed.
```
cusolverStatus_t
cusolverDnSorgbr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
int m,
int n,
int k,
const float *A,
int lda,
const float *tau,
int *lwork);

cusolverStatus_t
cusolverDnDorgbr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
int m,
int n,
int k,
const double *A,
int lda,
const double *tau,
int *lwork);

cusolverStatus_t
cusolverDnCungbr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
int m,
int n,
int k,
const cuComplex *A,
int lda,
const cuComplex *tau,
int *lwork);

cusolverStatus_t
cusolverDnZungbr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
int m,
int n,
int k,
const cuDoubleComplex *A,
int lda,
const cuDoubleComplex *tau,
int *lwork);

```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSorgbr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
int m,
int n,
int k,
float *A,
int lda,
const float *tau,
float *work,
int lwork,
int *devInfo);

cusolverStatus_t
cusolverDnDorgbr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
int m,
int n,
int k,
double *A,
int lda,
const double *tau,
double *work,
int lwork,
int *devInfo);

```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCungbr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
int m,
int n,
int k,
cuComplex *A,
int lda,
const cuComplex *tau,
cuComplex *work,
int lwork,
int *devInfo);

cusolverStatus_t
cusolverDnZungbr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
int m,
int n,
int k,
cuDoubleComplex *A,
int lda,
const cuDoubleComplex *tau,
cuDoubleComplex *work,
int lwork,
int *devInfo);

```

This function generates one of the unitary matrices Q or ${P}^{H}$ determined by gebrd when reducing a matrix A to bidiagonal form: ${Q}^{H}*A*P=B$

Q and ${P}^{H}$ are defined as products of elementary reflectors H(i) or G(i) respectively.

The user has to provide working space pointed to by input parameter work. The input parameter lwork is the size of the working space, and it is returned by orgbr_bufferSize().

If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.
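As an illustration of what "a product of elementary reflectors" means, the sketch below forms Q explicitly on the host for real data. It assumes the LAPACK storage convention for the m>=n case, namely H(i) = I - tau(i) * v * v^T where v is zero above row i, has an implicit one in row i, and has its remaining entries stored below the diagonal in column i; whether cuSolver follows this convention in every detail is an assumption here. The name ref_form_q is hypothetical.

```c
#include <assert.h>

/* Builds the m-by-m matrix Q = H(0)*H(1)*...*H(k-1) in column-major Q
 * (ldq >= m), with zero-based reflector indices. Each reflector is
 * H(i) = I - tau[i] * v_i * v_i^T, where v_i has zeros above row i, an
 * implicit one in row i, and rows i+1..m-1 stored in V(i+1..m-1, i). */
void ref_form_q(int m, int k, const double *V, int ldv, const double *tau,
                double *Q, int ldq)
{
    assert(m >= k && k >= 0 && ldv >= m && ldq >= m);
    /* Start from the identity. */
    for (int c = 0; c < m; ++c)
        for (int r = 0; r < m; ++r)
            Q[r + c * ldq] = (r == c) ? 1.0 : 0.0;
    /* Apply H(k-1) first and H(0) last, so that Q = H(0)*...*H(k-1). */
    for (int i = k - 1; i >= 0; --i) {
        for (int c = 0; c < m; ++c) {
            /* w = v_i^T * Q(:,c), using the implicit v_i(i) = 1. */
            double w = Q[i + c * ldq];
            for (int r = i + 1; r < m; ++r)
                w += V[r + i * ldv] * Q[r + c * ldq];
            /* Q(:,c) -= tau[i] * w * v_i */
            Q[i + c * ldq] -= tau[i] * w;
            for (int r = i + 1; r < m; ++r)
                Q[r + c * ldq] -= tau[i] * w * V[r + i * ldv];
        }
    }
}
```

With m=2, k=1, v = (1,1) and tau = 1, the single reflector is I - v*v^T, so the result is Q = [[0,-1],[-1,0]]. The diagonal entry of V is never read because the leading one of each reflector vector is implicit.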

API of orgbr

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| side | host | input | if side = CUBLAS_SIDE_LEFT, generate Q. if side = CUBLAS_SIDE_RIGHT, generate ${P}^{H}$. |
| m | host | input | number of rows of matrix Q or ${P}^{H}$. |
| n | host | input | if side = CUBLAS_SIDE_LEFT, m>= n>= min(m,k). if side = CUBLAS_SIDE_RIGHT, n>= m>= min(n,k). |
| k | host | input | if side = CUBLAS_SIDE_LEFT, the number of columns in the original m-by-k matrix reduced by gebrd. if side = CUBLAS_SIDE_RIGHT, the number of rows in the original k-by-n matrix reduced by gebrd. |
| A | device | in/out | array of dimension lda * n. On entry, the vectors which define the elementary reflectors, as returned by gebrd. On exit, the m-by-n matrix Q or ${P}^{H}$. |
| lda | host | input | leading dimension of two-dimensional array used to store matrix A. lda >= max(1,m). |
| tau | device | input | array of dimension min(m,k) if side is CUBLAS_SIDE_LEFT; of dimension min(n,k) if side is CUBLAS_SIDE_RIGHT. tau(i) must contain the scalar factor of the elementary reflector H(i) or G(i), which determines Q or ${P}^{H}$, as returned by gebrd in its array argument TAUQ or TAUP. |
| work | device | in/out | working space, array of size lwork. |
| lwork | host | input | size of working array work. |
| devInfo | device | output | if devInfo = 0, the orgbr is successful. if devInfo = -i, the i-th parameter is wrong. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0 or wrong lda ). |
| CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
| CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |

### cusolverDn<t>sytrd()

These helper functions calculate the size of work buffers needed.
```
cusolverStatus_t
cusolverDnSsytrd_bufferSize(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
const float *A,
int lda,
const float *d,
const float *e,
const float *tau,
int *lwork);

cusolverStatus_t
cusolverDnDsytrd_bufferSize(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
const double *A,
int lda,
const double *d,
const double *e,
const double *tau,
int *lwork);

cusolverStatus_t
cusolverDnChetrd_bufferSize(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
const cuComplex *A,
int lda,
const float *d,
const float *e,
const cuComplex *tau,
int *lwork);

cusolverStatus_t
cusolverDnZhetrd_bufferSize(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
const cuDoubleComplex *A,
int lda,
const double *d,
const double *e,
const cuDoubleComplex *tau,
int *lwork);

```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSsytrd(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
float *A,
int lda,
float *d,
float *e,
float *tau,
float *work,
int lwork,
int *devInfo);

cusolverStatus_t
cusolverDnDsytrd(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
double *A,
int lda,
double *d,
double *e,
double *tau,
double *work,
int lwork,
int *devInfo);
```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnChetrd(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuComplex *A,
int lda,
float *d,
float *e,
cuComplex *tau,
cuComplex *work,
int lwork,
int *devInfo);

cusolverStatus_t
cusolverDnZhetrd(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *A,
int lda,
double *d,
double *e,
cuDoubleComplex *tau,
cuDoubleComplex *work,
int lwork,
int *devInfo);
```

This function reduces a general symmetric (Hermitian) n×n matrix A to real symmetric tridiagonal form T by an orthogonal transformation: ${Q}^{H}*A*Q=T$

As an output, A contains T and Householder reflection vectors. If uplo = CUBLAS_FILL_MODE_UPPER, the diagonal and first superdiagonal of A are overwritten by the corresponding elements of the tridiagonal matrix T, and the elements above the first superdiagonal, with the array tau, represent the orthogonal matrix Q as a product of elementary reflectors. If uplo = CUBLAS_FILL_MODE_LOWER, the diagonal and first subdiagonal of A are overwritten by the corresponding elements of the tridiagonal matrix T, and the elements below the first subdiagonal, with the array tau, represent the orthogonal matrix Q as a product of elementary reflectors.

The user has to provide working space pointed to by input parameter work. The input parameter lwork is the size of the working space, and it is returned by sytrd_bufferSize().

If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.

API of sytrd

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| uplo | host | input | specifies which part of A is stored. uplo = CUBLAS_FILL_MODE_LOWER: Lower triangle of A is stored. uplo = CUBLAS_FILL_MODE_UPPER: Upper triangle of A is stored. |
| n | host | input | number of rows (columns) of matrix A. |
| A | device | in/out | array of dimension lda * n with lda is not less than max(1,n). If uplo = CUBLAS_FILL_MODE_UPPER, the leading n-by-n upper triangular part of A contains the upper triangular part of the matrix A, and the strictly lower triangular part of A is not referenced. If uplo = CUBLAS_FILL_MODE_LOWER, the leading n-by-n lower triangular part of A contains the lower triangular part of the matrix A, and the strictly upper triangular part of A is not referenced. On exit, A is overwritten by T and Householder reflection vectors. |
| lda | host | input | leading dimension of two-dimensional array used to store matrix A. lda >= max(1,n). |
| d | device | output | real array of dimension n. The diagonal elements of the tridiagonal matrix T: d(i) = A(i,i). |
| e | device | output | real array of dimension (n-1). The off-diagonal elements of the tridiagonal matrix T: if uplo = CUBLAS_FILL_MODE_UPPER, e(i) = A(i,i+1). if uplo = CUBLAS_FILL_MODE_LOWER, e(i) = A(i+1,i). |
| tau | device | output | array of dimension (n-1). The scalar factors of the elementary reflectors which represent the orthogonal matrix Q. |
| work | device | in/out | working space, array of size lwork. |
| lwork | host | input | size of work, returned by sytrd_bufferSize. |
| devInfo | device | output | if devInfo = 0, the operation is successful. if devInfo = -i, the i-th parameter is wrong. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0, or lda |

### cusolverDn<t>ormtr()

These helper functions calculate the size of work buffers needed.
```
cusolverStatus_t
cusolverDnSormtr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
int m,
int n,
const float *A,
int lda,
const float *tau,
const float *C,
int ldc,
int *lwork);

cusolverStatus_t
cusolverDnDormtr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
int m,
int n,
const double *A,
int lda,
const double *tau,
const double *C,
int ldc,
int *lwork);

cusolverStatus_t
cusolverDnCunmtr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
int m,
int n,
const cuComplex *A,
int lda,
const cuComplex *tau,
const cuComplex *C,
int ldc,
int *lwork);

cusolverStatus_t
cusolverDnZunmtr_bufferSize(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
int m,
int n,
const cuDoubleComplex *A,
int lda,
const cuDoubleComplex *tau,
const cuDoubleComplex *C,
int ldc,
int *lwork);

```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSormtr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
int m,
int n,
float *A,
int lda,
float *tau,
float *C,
int ldc,
float *work,
int lwork,
int *info);

cusolverStatus_t
cusolverDnDormtr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
int m,
int n,
double *A,
int lda,
double *tau,
double *C,
int ldc,
double *work,
int lwork,
int *info);

```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCunmtr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
int m,
int n,
cuComplex *A,
int lda,
cuComplex *tau,
cuComplex *C,
int ldc,
cuComplex *work,
int lwork,
int *info);

cusolverStatus_t
cusolverDnZunmtr(
cusolverDnHandle_t handle,
cublasSideMode_t side,
cublasFillMode_t uplo,
cublasOperation_t trans,
int m,
int n,
cuDoubleComplex *A,
int lda,
cuDoubleComplex *tau,
cuDoubleComplex *C,
int ldc,
cuDoubleComplex *work,
int lwork,
int *info);

```

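This function overwrites the m×n matrix C by

 $C=\begin{cases}\mathrm{op}(Q)*C & \text{if side == CUBLAS\_SIDE\_LEFT}\\ C*\mathrm{op}(Q) & \text{if side == CUBLAS\_SIDE\_RIGHT}\end{cases}$

where Q is a unitary matrix formed by a sequence of elementary reflection vectors from sytrd.

The operation on Q is defined by

 $\mathrm{op}(Q)=\begin{cases}Q & \text{if trans == CUBLAS\_OP\_N}\\ {Q}^{T} & \text{if trans == CUBLAS\_OP\_T}\\ {Q}^{H} & \text{if trans == CUBLAS\_OP\_C}\end{cases}$

The user has to provide working space pointed to by input parameter work. The input parameter lwork is the size of the working space, and it is returned by ormtr_bufferSize().

If output parameter info = -i (less than zero), the i-th parameter is wrong.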
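The roles of side and trans can be illustrated with a host-side reference that applies an explicit dense Q to C. This is only a sketch for real data: ormtr itself works from the reflector representation and never forms Q, and the function ref_apply_q with its integer flags left and transpose is hypothetical, standing in for the side and trans arguments.

```c
#include <assert.h>
#include <stdlib.h>

/* Overwrites the m-by-n column-major matrix C with op(Q)*C (left != 0)
 * or C*op(Q) (left == 0), where op(Q) = Q^T if transpose != 0, else Q.
 * Q is an explicit dense square matrix of order m (left) or n (right).
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int ref_apply_q(int left, int transpose, int m, int n,
                const double *Q, int ldq, double *C, int ldc)
{
    assert(m >= 0 && n >= 0 && ldc >= (m > 1 ? m : 1));
    if (m == 0 || n == 0)
        return 0;
    double *T = malloc((size_t)m * (size_t)n * sizeof *T);
    if (T == NULL)
        return -1;
    int order = left ? m : n;
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double s = 0.0;
            for (int p = 0; p < order; ++p) {
                if (left) {
                    /* op(Q)(i,p) times C(p,j) */
                    double q = transpose ? Q[p + i * ldq] : Q[i + p * ldq];
                    s += q * C[p + j * ldc];
                } else {
                    /* C(i,p) times op(Q)(p,j) */
                    double q = transpose ? Q[j + p * ldq] : Q[p + j * ldq];
                    s += C[i + p * ldc] * q;
                }
            }
            T[i + j * m] = s;
        }
    }
    /* Copy the product back over C. */
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            C[i + j * ldc] = T[i + j * m];
    free(T);
    return 0;
}
```

A reference like this is mainly useful for checking small cases: form Q explicitly (for instance with orgtr), apply it here, and compare with the result of ormtr on the same inputs.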

API of ormtr

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| side | host | input | side = CUBLAS_SIDE_LEFT, apply Q or ${Q}^{T}$ from the Left; side = CUBLAS_SIDE_RIGHT, apply Q or ${Q}^{T}$ from the Right. |
| uplo | host | input | uplo = CUBLAS_FILL_MODE_LOWER: Lower triangle of A contains elementary reflectors from sytrd. uplo = CUBLAS_FILL_MODE_UPPER: Upper triangle of A contains elementary reflectors from sytrd. |
| trans | host | input | operation op(Q) that is non- or (conj.) transpose. |
| m | host | input | number of rows of matrix C. |
| n | host | input | number of columns of matrix C. |
| A | device | in/out | array of dimension lda * m if side = CUBLAS_SIDE_LEFT; lda * n if side = CUBLAS_SIDE_RIGHT. The matrix A from sytrd contains the elementary reflectors. |
| lda | host | input | leading dimension of two-dimensional array used to store matrix A. if side is CUBLAS_SIDE_LEFT, lda >= max(1,m); if side is CUBLAS_SIDE_RIGHT, lda >= max(1,n). |
| tau | device | input | array of dimension (m-1) if side is CUBLAS_SIDE_LEFT; of dimension (n-1) if side is CUBLAS_SIDE_RIGHT. The vector tau is from sytrd, so tau(i) is the scalar of i-th elementary reflection vector. |
| C | device | in/out | array of size ldc * n. On exit, C is overwritten by op(Q)*C or C*op(Q). |
| ldc | host | input | leading dimension of two-dimensional array of matrix C. ldc >= max(1,m). |
| work | device | in/out | working space, array of size lwork. |
| lwork | host | input | size of working array work. |
| info | device | output | if info = 0, the ormtr is successful. if info = -i, the i-th parameter is wrong. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0 or wrong lda or ldc). |
| CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
| CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |

### cusolverDn<t>orgtr()

These helper functions calculate the size of work buffers needed.
```
cusolverStatus_t
cusolverDnSorgtr_bufferSize(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
const float *A,
int lda,
const float *tau,
int *lwork);

cusolverStatus_t
cusolverDnDorgtr_bufferSize(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
const double *A,
int lda,
const double *tau,
int *lwork);

cusolverStatus_t
cusolverDnCungtr_bufferSize(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
const cuComplex *A,
int lda,
const cuComplex *tau,
int *lwork);

cusolverStatus_t
cusolverDnZungtr_bufferSize(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
const cuDoubleComplex *A,
int lda,
const cuDoubleComplex *tau,
int *lwork);

```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSorgtr(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
float *A,
int lda,
const float *tau,
float *work,
int lwork,
int *devInfo);

cusolverStatus_t
cusolverDnDorgtr(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
double *A,
int lda,
const double *tau,
double *work,
int lwork,
int *devInfo);

```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCungtr(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuComplex *A,
int lda,
const cuComplex *tau,
cuComplex *work,
int lwork,
int *devInfo);

cusolverStatus_t
cusolverDnZungtr(
cusolverDnHandle_t handle,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *A,
int lda,
const cuDoubleComplex *tau,
cuDoubleComplex *work,
int lwork,
int *devInfo);

```

This function generates a unitary matrix Q which is defined as the product of n-1 elementary reflectors of order n, as returned by sytrd.

The user has to provide working space pointed to by input parameter work. The input parameter lwork is the size of the working space, and it is returned by orgtr_bufferSize().

If output parameter devInfo = -i (less than zero), the i-th parameter is wrong.

API of orgtr

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| uplo | host | input | uplo = CUBLAS_FILL_MODE_LOWER: Lower triangle of A contains elementary reflectors from sytrd. uplo = CUBLAS_FILL_MODE_UPPER: Upper triangle of A contains elementary reflectors from sytrd. |
| n | host | input | number of rows (columns) of matrix Q. |
| A | device | in/out | array of dimension lda * n. On entry, matrix A from sytrd contains the elementary reflectors. On exit, matrix A contains the n-by-n orthogonal matrix Q. |
| lda | host | input | leading dimension of two-dimensional array used to store matrix A. lda >= max(1,n). |
| tau | device | input | array of dimension (n-1). tau(i) is the scalar of i-th elementary reflection vector. |
| work | device | in/out | working space, array of size lwork. |
| lwork | host | input | size of working array work. |
| devInfo | device | output | if devInfo = 0, the orgtr is successful. if devInfo = -i, the i-th parameter is wrong. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0 or wrong lda ). |
| CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
| CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |

### cusolverDn<t>gesvd()

The helper functions below can calculate the sizes needed for pre-allocated buffer.
```
cusolverStatus_t
cusolverDnSgesvd_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int *lwork );

cusolverStatus_t
cusolverDnDgesvd_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int *lwork );

cusolverStatus_t
cusolverDnCgesvd_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int *lwork );

cusolverStatus_t
cusolverDnZgesvd_bufferSize(
cusolverDnHandle_t handle,
int m,
int n,
int *lwork );

```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSgesvd (
cusolverDnHandle_t handle,
signed char jobu,
signed char jobvt,
int m,
int n,
float *A,
int lda,
float *S,
float *U,
int ldu,
float *VT,
int ldvt,
float *work,
int lwork,
float *rwork,
int *devInfo);

cusolverStatus_t
cusolverDnDgesvd (
cusolverDnHandle_t handle,
signed char jobu,
signed char jobvt,
int m,
int n,
double *A,
int lda,
double *S,
double *U,
int ldu,
double *VT,
int ldvt,
double *work,
int lwork,
double *rwork,
int *devInfo);

```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCgesvd (
cusolverDnHandle_t handle,
signed char jobu,
signed char jobvt,
int m,
int n,
cuComplex *A,
int lda,
float *S,
cuComplex *U,
int ldu,
cuComplex *VT,
int ldvt,
cuComplex *work,
int lwork,
float *rwork,
int *devInfo);

cusolverStatus_t
cusolverDnZgesvd (
cusolverDnHandle_t handle,
signed char jobu,
signed char jobvt,
int m,
int n,
cuDoubleComplex *A,
int lda,
double *S,
cuDoubleComplex *U,
int ldu,
cuDoubleComplex *VT,
int ldvt,
cuDoubleComplex *work,
int lwork,
double *rwork,
int *devInfo);
```

This function computes the singular value decomposition (SVD) of an m×n matrix A and the corresponding left and/or right singular vectors. The SVD is written

 $A=U*\Sigma *{V}^{H}$

where Σ is an m×n matrix which is zero except for its min(m,n) diagonal elements, U is an m×m unitary matrix, and V is an n×n unitary matrix. The diagonal elements of Σ are the singular values of A; they are real and non-negative, and are returned in descending order. The first min(m,n) columns of U and V are the left and right singular vectors of A.

The user has to provide working space pointed to by input parameter work. The input parameter lwork is the size of the working space, and it is returned by gesvd_bufferSize().

If output parameter devInfo = -i (less than zero), the i-th parameter is wrong. If bdsqr did not converge, devInfo specifies how many superdiagonals of an intermediate bidiagonal form did not converge to zero.

The rwork is a real array of dimension (min(m,n)-1). If devInfo>0 and rwork is not NULL, rwork contains the unconverged superdiagonal elements of an upper bidiagonal matrix. This is slightly different from LAPACK, which puts unconverged superdiagonal elements in work if the type is real and in rwork if the type is complex. rwork can be a NULL pointer if the user does not want the information from the superdiagonal.

Appendix F.1 provides a simple example of gesvd.

Remark 1: gesvd only supports m>=n.

Remark 2: the routine returns ${V}^{H}$ , not V.
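Because the routine returns ${V}^{H}$ rather than V, it is easy to index VT the wrong way round when checking results. The host-side sketch below rebuilds A from the three outputs for real data, where ${V}^{H}$ is simply the transpose. It assumes the outputs have been copied to host memory in column-major layout, and the name ref_svd_reconstruct is hypothetical.

```c
#include <assert.h>

/* Rebuilds the m-by-n matrix A = U * diag(S) * VT from gesvd-style outputs
 * (real double precision, column-major). VT holds V^T, so row k of VT is
 * the k-th right singular vector. Only the first r = min(m,n) columns of U
 * and the first r rows of VT contribute. */
void ref_svd_reconstruct(int m, int n, const double *U, int ldu,
                         const double *S, const double *VT, int ldvt,
                         double *A, int lda)
{
    assert(m >= 0 && n >= 0);
    int r = (m < n) ? m : n;
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double s = 0.0;
            for (int k = 0; k < r; ++k)
                s += U[i + k * ldu] * S[k] * VT[k + j * ldvt];
            A[i + j * lda] = s;
        }
    }
}
```

Comparing the rebuilt matrix with a saved copy of the original A (gesvd destroys its input) gives a simple residual check on the decomposition.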

API of gesvd

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| jobu | host | input | specifies options for computing all or part of the matrix U: = 'A': all m columns of U are returned in array U; = 'S': the first min(m,n) columns of U (the left singular vectors) are returned in the array U; = 'O': the first min(m,n) columns of U (the left singular vectors) are overwritten on the array A; = 'N': no columns of U (no left singular vectors) are computed. |
| jobvt | host | input | specifies options for computing all or part of the matrix ${V}^{H}$: = 'A': all n rows of ${V}^{H}$ are returned in the array VT; = 'S': the first min(m,n) rows of ${V}^{H}$ (the right singular vectors) are returned in the array VT; = 'O': the first min(m,n) rows of ${V}^{H}$ (the right singular vectors) are overwritten on the array A; = 'N': no rows of ${V}^{H}$ (no right singular vectors) are computed. |
| m | host | input | number of rows of matrix A. |
| n | host | input | number of columns of matrix A. |
| A | device | in/out | array of dimension lda * n with lda is not less than max(1,m). On exit, the contents of A are destroyed. |
| lda | host | input | leading dimension of two-dimensional array used to store matrix A. |
| S | device | output | real array of dimension min(m,n). The singular values of A, sorted so that S(i) >= S(i+1). |
| U | device | output | array of dimension ldu * m with ldu is not less than max(1,m). U contains the m×m unitary matrix U. |
| ldu | host | input | leading dimension of two-dimensional array used to store matrix U. |
| VT | device | output | array of dimension ldvt * n with ldvt is not less than max(1,n). VT contains the n×n unitary matrix ${V}^{H}$. |
| ldvt | host | input | leading dimension of two-dimensional array used to store matrix VT. |
| work | device | in/out | working space, array of size lwork. |
| lwork | host | input | size of work, returned by gesvd_bufferSize. |
| rwork | device | input | real array of dimension min(m,n)-1. It contains the unconverged superdiagonal elements of an upper bidiagonal matrix if devInfo > 0. |
| devInfo | device | output | if devInfo = 0, the operation is successful. if devInfo = -i, the i-th parameter is wrong. if devInfo > 0, devInfo indicates how many superdiagonals of an intermediate bidiagonal form did not converge to zero. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0 or lda |

### cusolverDn<t>gesvdj()

The helper functions below can calculate the sizes needed for pre-allocated buffer.
```
cusolverStatus_t
cusolverDnSgesvdj_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
int econ,
int m,
int n,
const float *A,
int lda,
const float *S,
const float *U,
int ldu,
const float *V,
int ldv,
int *lwork,
gesvdjInfo_t params);

cusolverStatus_t
cusolverDnDgesvdj_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
int econ,
int m,
int n,
const double *A,
int lda,
const double *S,
const double *U,
int ldu,
const double *V,
int ldv,
int *lwork,
gesvdjInfo_t params);

cusolverStatus_t
cusolverDnCgesvdj_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
int econ,
int m,
int n,
const cuComplex *A,
int lda,
const float *S,
const cuComplex *U,
int ldu,
const cuComplex *V,
int ldv,
int *lwork,
gesvdjInfo_t params);

cusolverStatus_t
cusolverDnZgesvdj_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
int econ,
int m,
int n,
const cuDoubleComplex *A,
int lda,
const double *S,
const cuDoubleComplex *U,
int ldu,
const cuDoubleComplex *V,
int ldv,
int *lwork,
gesvdjInfo_t params);

```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSgesvdj(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
int econ,
int m,
int n,
float *A,
int lda,
float *S,
float *U,
int ldu,
float *V,
int ldv,
float *work,
int lwork,
int *info,
gesvdjInfo_t params);

cusolverStatus_t
cusolverDnDgesvdj(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
int econ,
int m,
int n,
double *A,
int lda,
double *S,
double *U,
int ldu,
double *V,
int ldv,
double *work,
int lwork,
int *info,
gesvdjInfo_t params);

```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCgesvdj(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
int econ,
int m,
int n,
cuComplex *A,
int lda,
float *S,
cuComplex *U,
int ldu,
cuComplex *V,
int ldv,
cuComplex *work,
int lwork,
int *info,
gesvdjInfo_t params);

cusolverStatus_t
cusolverDnZgesvdj(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
int econ,
int m,
int n,
cuDoubleComplex *A,
int lda,
double *S,
cuDoubleComplex *U,
int ldu,
cuDoubleComplex *V,
int ldv,
cuDoubleComplex *work,
int lwork,
int *info,
gesvdjInfo_t params);

```

This function computes the singular value decomposition (SVD) of an m×n matrix A and the corresponding left and/or right singular vectors. The SVD is written

 $A=U*\Sigma *{V}^{H}$

where Σ is an m×n matrix which is zero except for its min(m,n) diagonal elements, U is an m×m unitary matrix, and V is an n×n unitary matrix. The diagonal elements of Σ are the singular values of A; they are real and non-negative, and are returned in descending order. The first min(m,n) columns of U and V are the left and right singular vectors of A.

gesvdj has the same functionality as gesvd. The difference is that gesvd uses the QR algorithm and gesvdj uses the Jacobi method. The parallelism of the Jacobi method gives the GPU better performance on small and medium-size matrices. Moreover, the user can configure gesvdj to compute an approximation up to a chosen accuracy.

gesvdj iteratively generates a sequence of unitary matrices to transform matrix A to the following form

 ${U}^{H}*A*V=S+E$

where S is diagonal and the diagonal of E is zero.

During the iterations, the Frobenius norm of E decreases monotonically. As E goes to zero, the diagonal of S converges to the singular values. In practice, the Jacobi method stops if

 ${\mathrm{||E||}}_{F}\le eps*{\mathrm{||A||}}_{F}$

where eps is the given tolerance.
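The stopping test above can be checked on the host with a few lines of C. This is a minimal sketch, not the library's implementation: it assumes real double-precision, column-major data that has already been copied to host memory, and the helper names svd_frob_sq, svd_offdiag_sq and jacobi_converged are hypothetical. It compares squared norms, so no square root is needed.

```c
#include <assert.h>

/* Squared Frobenius norm of a column-major m-by-n matrix (leading dimension ld). */
double svd_frob_sq(const double *M, int m, int n, int ld)
{
    double sum = 0.0;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            sum += M[i + ld * j] * M[i + ld * j];
    return sum;
}

/* Squared Frobenius norm of the off-diagonal part E of T = S + E. */
double svd_offdiag_sq(const double *T, int m, int n, int ld)
{
    double sum = 0.0;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            if (i != j)
                sum += T[i + ld * j] * T[i + ld * j];
    return sum;
}

/* Returns 1 if ||E||_F <= eps * ||A||_F, tested in squared form. */
int jacobi_converged(const double *T, int ldt, const double *A, int lda,
                     int m, int n, double eps)
{
    assert(m >= 0 && n >= 0 && eps >= 0.0);
    return svd_offdiag_sq(T, m, n, ldt) <= eps * eps * svd_frob_sq(A, m, n, lda);
}
```

Here T stands for the transformed matrix U^H*A*V and A for the original input, so the caller would need to keep a copy of A, since gesvdj destroys its contents.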

gesvdj has two parameters to control the accuracy. The first parameter is the tolerance (eps). The default value is machine accuracy, but the user can call cusolverDnXgesvdjSetTolerance to set a tolerance a priori. The second parameter is the maximum number of sweeps, which bounds the number of iterations of the Jacobi method. The default value is 100, but the user can call cusolverDnXgesvdjSetMaxSweeps to set a proper bound. Experiments show that 15 sweeps are enough to converge to machine accuracy. gesvdj stops when either the tolerance is met or the maximum number of sweeps is reached.

The Jacobi method has quadratic convergence, so the accuracy is not proportional to the number of sweeps. To guarantee a certain accuracy, the user should configure the tolerance only.

The user has to provide the working space pointed to by the input parameter work. The input parameter lwork is the size of the working space, and it is returned by gesvdj_bufferSize().

If output parameter info = -i (less than zero), the i-th parameter is wrong. If info = min(m,n)+1, gesvdj did not converge under the given tolerance and maximum number of sweeps.
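The info codes described above can be decoded on the host as follows. This is a sketch under assumptions: info is taken to have been copied from device to host already (for example with cudaMemcpy), and the enum and function names are hypothetical, not part of cuSolver.

```c
#include <assert.h>

typedef enum {
    GESVDJ_INFO_OK,            /* info == 0 */
    GESVDJ_INFO_BAD_PARAM,     /* info == -i: the i-th parameter is wrong */
    GESVDJ_INFO_NOT_CONVERGED, /* info == min(m,n)+1 */
    GESVDJ_INFO_UNEXPECTED     /* any value the text does not describe */
} gesvdj_info_kind;

/* Classifies a host copy of info; *bad_param receives i when info == -i, else 0. */
gesvdj_info_kind classify_gesvdj_info(int info, int m, int n, int *bad_param)
{
    assert(bad_param != 0);
    int min_mn = (m < n) ? m : n;
    *bad_param = 0;
    if (info == 0)
        return GESVDJ_INFO_OK;
    if (info < 0) {
        *bad_param = -info;
        return GESVDJ_INFO_BAD_PARAM;
    }
    if (info == min_mn + 1)
        return GESVDJ_INFO_NOT_CONVERGED;
    return GESVDJ_INFO_UNEXPECTED;
}
```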

If the user sets an improper tolerance, gesvdj may not converge. For example, the tolerance should not be smaller than machine accuracy.

Appendix F.2 provides a simple example of gesvdj.

Remark 1: gesvdj supports any combination of m and n.

Remark 2: the routine returns V, not ${V}^{H}$ . This is different from gesvd.
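Because the routine returns V rather than its conjugate transpose, a host-side reconstruction of A has to index V by (j,k), not (k,j). The sketch below illustrates this for the real case, where the conjugate transpose reduces to the transpose. It assumes column-major host copies of U, S and V; the function name reconstruct_from_svd is hypothetical.

```c
#include <assert.h>

/* Rebuilds A(i,j) = sum_k U(i,k) * S(k) * V(j,k) for real data.
 * All matrices are column-major; only the first min(m,n) columns of U and V
 * are used, so the routine works for both full and economy-size outputs. */
void reconstruct_from_svd(const double *U, int ldu, const double *S,
                          const double *V, int ldv,
                          double *A, int lda, int m, int n)
{
    assert(lda >= m && ldu >= m && ldv >= n);
    int min_mn = (m < n) ? m : n;
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            for (int k = 0; k < min_mn; ++k)
                sum += U[i + ldu * k] * S[k] * V[j + ldv * k]; /* V(j,k) = V^T(k,j) */
            A[i + lda * j] = sum;
        }
    }
}
```

With gesvd, which returns the transposed factor, the inner product would read the (k,j) entry instead.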

API of gesvdj

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| jobz | host | input | specifies whether to compute singular values only or singular vectors as well: jobz = CUSOLVER_EIG_MODE_NOVECTOR: compute singular values only; jobz = CUSOLVER_EIG_MODE_VECTOR: compute singular values and singular vectors. |
| econ | host | input | econ = 1 for economy size for U and V. |
| m | host | input | number of rows of matrix A. |
| n | host | input | number of columns of matrix A. |
| A | device | in/out | array of dimension lda * n, with lda not less than max(1,m). On exit, the contents of A are destroyed. |
| lda | host | input | leading dimension of the two-dimensional array used to store matrix A. |
| S | device | output | real array of dimension min(m,n). The singular values of A, sorted so that S(i) >= S(i+1). |
| U | device | output | array of dimension ldu * m if econ is zero. If econ is nonzero, the dimension is ldu * min(m,n). U contains the left singular vectors. |
| ldu | host | input | leading dimension of the two-dimensional array used to store matrix U. ldu is not less than max(1,m). |
| V | device | output | array of dimension ldv * n if econ is zero. If econ is nonzero, the dimension is ldv * min(m,n). V contains the right singular vectors. |
| ldv | host | input | leading dimension of the two-dimensional array used to store matrix V. ldv is not less than max(1,n). |
| work | device | in/out | array of size lwork, working space. |
| lwork | host | input | size of work, returned by gesvdj_bufferSize. |
| info | device | output | if info = 0, the operation is successful. If info = -i, the i-th parameter is wrong. If info = min(m,n)+1, gesvdj did not converge under the given tolerance and maximum number of sweeps. |
| params | host | in/out | structure filled with parameters of the Jacobi algorithm and results of gesvdj. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0 or lda |

### cusolverDn<t>gesvdjBatched()

The helper functions below calculate the size needed for the pre-allocated buffer.
```
cusolverStatus_t
cusolverDnSgesvdjBatched_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
int m,
int n,
const float *A,
int lda,
const float *S,
const float *U,
int ldu,
const float *V,
int ldv,
int *lwork,
gesvdjInfo_t params,
int batchSize);

cusolverStatus_t
cusolverDnDgesvdjBatched_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
int m,
int n,
const double *A,
int lda,
const double *S,
const double *U,
int ldu,
const double *V,
int ldv,
int *lwork,
gesvdjInfo_t params,
int batchSize);

cusolverStatus_t
cusolverDnCgesvdjBatched_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
int m,
int n,
const cuComplex *A,
int lda,
const float *S,
const cuComplex *U,
int ldu,
const cuComplex *V,
int ldv,
int *lwork,
gesvdjInfo_t params,
int batchSize);

cusolverStatus_t
cusolverDnZgesvdjBatched_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
int m,
int n,
const cuDoubleComplex *A,
int lda,
const double *S,
const cuDoubleComplex *U,
int ldu,
const cuDoubleComplex *V,
int ldv,
int *lwork,
gesvdjInfo_t params,
int batchSize);

```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSgesvdjBatched(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
int m,
int n,
float *A,
int lda,
float *S,
float *U,
int ldu,
float *V,
int ldv,
float *work,
int lwork,
int *info,
gesvdjInfo_t params,
int batchSize);

cusolverStatus_t
cusolverDnDgesvdjBatched(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
int m,
int n,
double *A,
int lda,
double *S,
double *U,
int ldu,
double *V,
int ldv,
double *work,
int lwork,
int *info,
gesvdjInfo_t params,
int batchSize);

```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCgesvdjBatched(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
int m,
int n,
cuComplex *A,
int lda,
float *S,
cuComplex *U,
int ldu,
cuComplex *V,
int ldv,
cuComplex *work,
int lwork,
int *info,
gesvdjInfo_t params,
int batchSize);

cusolverStatus_t
cusolverDnZgesvdjBatched(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
int m,
int n,
cuDoubleComplex *A,
int lda,
double *S,
cuDoubleComplex *U,
int ldu,
cuDoubleComplex *V,
int ldv,
cuDoubleComplex *work,
int lwork,
int *info,
gesvdjInfo_t params,
int batchSize);

```

This function computes singular values and singular vectors of a sequence of general m×n matrices

 ${A}_{j}={U}_{j}*{\Sigma }_{j}*{V}_{j}^{H}$

where ${\Sigma }_{j}$ is a real m×n diagonal matrix which is zero except for its min(m,n) diagonal elements. ${U}_{j}$ (left singular vectors) is an m×m unitary matrix and ${V}_{j}$ (right singular vectors) is an n×n unitary matrix. The diagonal elements of ${\Sigma }_{j}$ are the singular values of ${A}_{j}$, either in descending order or unsorted.

gesvdjBatched performs gesvdj on each matrix. It requires that all matrices have the same size m×n, with m and n no greater than 32, and that they are packed contiguously,

 $A=\left(\begin{array}{ccc}{A}_{0}& {A}_{1}& \cdots \end{array}\right)$

Each matrix is column-major with leading dimension lda, so the formula for random access is ${A}_{k}\left(i,j\right)=\mathrm{A\left[ i + lda*j + lda*n*k\right]}$ .

The parameter S also stores the singular values of each matrix contiguously,

 $S=\left(\begin{array}{ccc}{S}_{0}& {S}_{1}& \cdots \end{array}\right)$

The formula for random access of S is ${S}_{k}\left(j\right)=\mathrm{S\left[ j + min\left(m,n\right)*k\right]}$ .
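The two random-access formulas translate directly into index helpers. The following is a small sketch; the function names batched_a_index and batched_s_index are hypothetical, and size_t is used only to avoid overflow of int for large batches.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of A_k(i,j) in the packed array A: i + lda*j + lda*n*k. */
size_t batched_a_index(int i, int j, int k, int lda, int n)
{
    assert(i >= 0 && j >= 0 && k >= 0 && lda >= 1 && n >= 0);
    return (size_t)i + (size_t)lda * (size_t)j
         + (size_t)lda * (size_t)n * (size_t)k;
}

/* Offset of S_k(j) in the packed array S: j + min(m,n)*k. */
size_t batched_s_index(int j, int k, int m, int n)
{
    int min_mn = (m < n) ? m : n;
    assert(j >= 0 && j < min_mn && k >= 0);
    return (size_t)j + (size_t)min_mn * (size_t)k;
}
```

For example, with lda = 4 and n = 3, element (1,2) of matrix 3 sits at offset 1 + 4*2 + 4*3*3 = 45.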

In addition to the tolerance and maximum number of sweeps, gesvdjBatched can either sort the singular values in descending order (default) or leave them as-is (without sorting); the choice is made with the function cusolverDnXgesvdjSetSortEig. If the user packs several tiny matrices into the diagonal blocks of one matrix, the non-sorting option can separate the singular values of those tiny matrices.

gesvdjBatched cannot report the residual and the executed sweeps through cusolverDnXgesvdjGetResidual and cusolverDnXgesvdjGetSweeps. Any call to these two functions returns CUSOLVER_STATUS_NOT_SUPPORTED. The user needs to compute the residual explicitly.

The user has to provide the working space pointed to by the input parameter work. The input parameter lwork is the size of the working space, and it is returned by gesvdjBatched_bufferSize().

The output parameter info is an integer array of size batchSize. If the function returns CUSOLVER_STATUS_INVALID_VALUE, the first element info[0] = -i (less than zero) indicates that the i-th parameter is wrong. Otherwise, if info[i] = min(m,n)+1, gesvdjBatched did not converge on the i-th matrix under the given tolerance and maximum number of sweeps.
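A host-side scan of the info array might look like the sketch below. It assumes the array has been copied back from device memory and that the call did not return CUSOLVER_STATUS_INVALID_VALUE (in that case only info[0] is meaningful). The function name count_not_converged is hypothetical.

```c
#include <assert.h>

/* Counts the matrices with info[i] == min(m,n)+1, i.e. the ones that did not
 * converge. *first receives the index of the first such matrix, or -1. */
int count_not_converged(const int *info, int batchSize, int m, int n, int *first)
{
    assert(info != 0 && first != 0 && batchSize >= 0);
    int min_mn = (m < n) ? m : n;
    int count = 0;
    *first = -1;
    for (int i = 0; i < batchSize; ++i) {
        if (info[i] == min_mn + 1) {
            if (*first < 0)
                *first = i;
            ++count;
        }
    }
    return count;
}
```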

Appendix F.3 provides a simple example of gesvdjBatched.

API of gesvdjBatched

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| jobz | host | input | specifies whether to compute singular values only or singular vectors as well: jobz = CUSOLVER_EIG_MODE_NOVECTOR: compute singular values only; jobz = CUSOLVER_EIG_MODE_VECTOR: compute singular values and singular vectors. |
| m | host | input | number of rows of matrix Aj. m is no greater than 32. |
| n | host | input | number of columns of matrix Aj. n is no greater than 32. |
| A | device | in/out | array of dimension lda * n * batchSize, with lda not less than max(1,n). On exit, the contents of Aj are destroyed. |
| lda | host | input | leading dimension of the two-dimensional array used to store matrix Aj. |
| S | device | output | a real array of dimension min(m,n) * batchSize. It stores the singular values of Aj, in descending order or unsorted. |
| U | device | output | array of dimension ldu * m * batchSize. Uj contains the left singular vectors of Aj. |
| ldu | host | input | leading dimension of the two-dimensional array used to store matrix Uj. ldu is not less than max(1,m). |
| V | device | output | array of dimension ldv * n * batchSize. Vj contains the right singular vectors of Aj. |
| ldv | host | input | leading dimension of the two-dimensional array used to store matrix Vj. ldv is not less than max(1,n). |
| work | device | in/out | array of size lwork, working space. |
| lwork | host | input | size of work, returned by gesvdjBatched_bufferSize. |
| info | device | output | an integer array of dimension batchSize. If CUSOLVER_STATUS_INVALID_VALUE is returned, info[0] = -i (less than zero) indicates that the i-th parameter is wrong. Otherwise, if info[i] = 0, the operation is successful. If info[i] = min(m,n)+1, gesvdjBatched did not converge on the i-th matrix under the given tolerance and maximum number of sweeps. |
| params | host | in/out | structure filled with parameters of the Jacobi algorithm. |
| batchSize | host | input | number of matrices. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,n<0 or lda |

### cusolverDn<t>syevd()

The helper functions below calculate the size needed for the pre-allocated buffer.
```
cusolverStatus_t
cusolverDnSsyevd_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const float *A,
int lda,
const float *W,
int *lwork);

cusolverStatus_t
cusolverDnDsyevd_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const double *A,
int lda,
const double *W,
int *lwork);

cusolverStatus_t
cusolverDnCheevd_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const cuComplex *A,
int lda,
const float *W,
int *lwork);

cusolverStatus_t
cusolverDnZheevd_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const cuDoubleComplex *A,
int lda,
const double *W,
int *lwork);

```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSsyevd(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
float *A,
int lda,
float *W,
float *work,
int lwork,
int *devInfo);

cusolverStatus_t
cusolverDnDsyevd(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
double *A,
int lda,
double *W,
double *work,
int lwork,
int *devInfo);

```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCheevd(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
cuComplex *A,
int lda,
float *W,
cuComplex *work,
int lwork,
int *devInfo);

cusolverStatus_t
cusolverDnZheevd(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *A,
int lda,
double *W,
cuDoubleComplex *work,
int lwork,
int *devInfo);

```

This function computes eigenvalues and eigenvectors of a symmetric (Hermitian) n×n matrix A. The standard symmetric eigenvalue problem is

 $A*V=V*\Lambda$

where Λ is a real n×n diagonal matrix. V is an n×n unitary matrix. The diagonal elements of Λ are the eigenvalues of A in ascending order.
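One way to sanity-check a computed eigen-decomposition on the host is to evaluate the residual of the defining relation A*V = V*Λ. The sketch below does this for the real case and returns the squared Frobenius norm. It assumes a full copy of A (both triangles) saved before the call, since the routine overwrites A, plus host copies of the eigenvectors and eigenvalues. The function name eig_residual_sq is hypothetical.

```c
#include <assert.h>

/* Returns ||A*V - V*diag(W)||_F^2 for real, column-major n-by-n A and V.
 * Column j of V is the eigenvector paired with eigenvalue W[j]. */
double eig_residual_sq(const double *A, int lda, const double *V, int ldv,
                       const double *W, int n)
{
    assert(n >= 0 && lda >= n && ldv >= n);
    double sum = 0.0;
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            double r = -W[j] * V[i + ldv * j];      /* -(V*Lambda)(i,j) */
            for (int k = 0; k < n; ++k)
                r += A[i + lda * k] * V[k + ldv * j]; /* +(A*V)(i,j) */
            sum += r * r;
        }
    }
    return sum;
}
```

The residual does not depend on how the eigenvectors are scaled, so it checks the eigen-relation only, not orthonormality of V.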

The user has to provide the working space pointed to by the input parameter work. The input parameter lwork is the size of the working space, and it is returned by syevd_bufferSize().

If output parameter devInfo = -i (less than zero), the i-th parameter is wrong. If devInfo = i (greater than zero), i off-diagonal elements of an intermediate tridiagonal form did not converge to zero.

If jobz = CUSOLVER_EIG_MODE_VECTOR, A contains the orthonormal eigenvectors of the matrix A. The eigenvectors are computed by a divide and conquer algorithm.

Appendix E.1 provides a simple example of syevd.

API of syevd

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| jobz | host | input | specifies whether to compute eigenvalues only or eigen-pairs: jobz = CUSOLVER_EIG_MODE_NOVECTOR: compute eigenvalues only; jobz = CUSOLVER_EIG_MODE_VECTOR: compute eigenvalues and eigenvectors. |
| uplo | host | input | specifies which part of A is stored. uplo = CUBLAS_FILL_MODE_LOWER: lower triangle of A is stored. uplo = CUBLAS_FILL_MODE_UPPER: upper triangle of A is stored. |
| n | host | input | number of rows (or columns) of matrix A. |
| A | device | in/out | array of dimension lda * n, with lda not less than max(1,n). If uplo = CUBLAS_FILL_MODE_UPPER, the leading n-by-n upper triangular part of A contains the upper triangular part of the matrix A. If uplo = CUBLAS_FILL_MODE_LOWER, the leading n-by-n lower triangular part of A contains the lower triangular part of the matrix A. On exit, if jobz = CUSOLVER_EIG_MODE_VECTOR and devInfo = 0, A contains the orthonormal eigenvectors of the matrix A. If jobz = CUSOLVER_EIG_MODE_NOVECTOR, the contents of A are destroyed. |
| lda | host | input | leading dimension of the two-dimensional array used to store matrix A. |
| W | device | output | a real array of dimension n. The eigenvalues of A, in ascending order, i.e. sorted so that W(i) <= W(i+1). |
| work | device | in/out | working space, array of size lwork. |
| lwork | host | input | size of work, returned by syevd_bufferSize. |
| devInfo | device | output | if devInfo = 0, the operation is successful. If devInfo = -i, the i-th parameter is wrong. If devInfo = i (> 0), i off-diagonal elements of an intermediate tridiagonal form did not converge to zero. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0, or lda |

### cusolverDn<t>sygvd()

The helper functions below calculate the size needed for the pre-allocated buffer.
```
cusolverStatus_t
cusolverDnSsygvd_bufferSize(
cusolverDnHandle_t handle,
cusolverEigType_t itype,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const float *A,
int lda,
const float *B,
int ldb,
const float *W,
int *lwork);

cusolverStatus_t
cusolverDnDsygvd_bufferSize(
cusolverDnHandle_t handle,
cusolverEigType_t itype,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const double *A,
int lda,
const double *B,
int ldb,
const double *W,
int *lwork);

cusolverStatus_t
cusolverDnChegvd_bufferSize(
cusolverDnHandle_t handle,
cusolverEigType_t itype,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const cuComplex *A,
int lda,
const cuComplex *B,
int ldb,
const float *W,
int *lwork);

cusolverStatus_t
cusolverDnZhegvd_bufferSize(
cusolverDnHandle_t handle,
cusolverEigType_t itype,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const cuDoubleComplex *A,
int lda,
const cuDoubleComplex *B,
int ldb,
const double *W,
int *lwork);

```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSsygvd(
cusolverDnHandle_t handle,
cusolverEigType_t itype,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
float *A,
int lda,
float *B,
int ldb,
float *W,
float *work,
int lwork,
int *devInfo);

cusolverStatus_t
cusolverDnDsygvd(
cusolverDnHandle_t handle,
cusolverEigType_t itype,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
double *A,
int lda,
double *B,
int ldb,
double *W,
double *work,
int lwork,
int *devInfo);

```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnChegvd(
cusolverDnHandle_t handle,
cusolverEigType_t itype,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
cuComplex *A,
int lda,
cuComplex *B,
int ldb,
float *W,
cuComplex *work,
int lwork,
int *devInfo);

cusolverStatus_t
cusolverDnZhegvd(
cusolverDnHandle_t handle,
cusolverEigType_t itype,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *A,
int lda,
cuDoubleComplex *B,
int ldb,
double *W,
cuDoubleComplex *work,
int lwork,
int *devInfo);

```

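This function computes eigenvalues and eigenvectors of a symmetric (Hermitian) n×n matrix-pair (A,B). The generalized symmetric-definite eigenvalue problem is

 $\mathrm{eig}\left(A,B\right)=\left\{\begin{array}{ll}A*V=B*V*\Lambda & \text{if itype = CUSOLVER\_EIG\_TYPE\_1}\\ A*B*V=V*\Lambda & \text{if itype = CUSOLVER\_EIG\_TYPE\_2}\\ B*A*V=V*\Lambda & \text{if itype = CUSOLVER\_EIG\_TYPE\_3}\end{array}\right.$

where the matrix B is positive definite. Λ is a real n×n diagonal matrix. The diagonal elements of Λ are the eigenvalues of (A, B) in ascending order. V is an n×n orthogonal matrix. The eigenvectors are normalized as follows:

 $\left\{\begin{array}{ll}{V}^{H}*B*V=I & \text{if itype = CUSOLVER\_EIG\_TYPE\_1 or CUSOLVER\_EIG\_TYPE\_2}\\ {V}^{H}*{B}^{-1}*V=I & \text{if itype = CUSOLVER\_EIG\_TYPE\_3}\end{array}\right.$

The user has to provide the working space pointed to by the input parameter work. The input parameter lwork is the size of the working space, and it is returned by sygvd_bufferSize().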

If output parameter devInfo = -i (less than zero), the i-th parameter is wrong. If devInfo = i (i > 0 and i <= n) and jobz = CUSOLVER_EIG_MODE_NOVECTOR, i off-diagonal elements of an intermediate tridiagonal form did not converge to zero. If devInfo = n + i (i > 0), then the leading minor of order i of B is not positive definite. The factorization of B could not be completed and no eigenvalues or eigenvectors were computed.
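The devInfo ranges above can be separated with a small host-side helper. This is a sketch with hypothetical names: it assumes devInfo has been copied to the host, and it treats every value in 1..n as an eigen-solver failure regardless of jobz, which is a simplification of the text above.

```c
#include <assert.h>

typedef enum {
    SYGVD_INFO_OK,           /* devInfo == 0 */
    SYGVD_INFO_BAD_PARAM,    /* devInfo == -i: the i-th parameter is wrong */
    SYGVD_INFO_EIG_FAILED,   /* 1 <= devInfo <= n: i off-diagonals did not converge */
    SYGVD_INFO_B_NOT_POSDEF, /* devInfo == n + i: leading minor of order i of B */
    SYGVD_INFO_UNEXPECTED    /* any value the text does not describe */
} sygvd_info_kind;

/* Classifies a host copy of devInfo; *detail receives the index i, else 0. */
sygvd_info_kind classify_sygvd_info(int devInfo, int n, int *detail)
{
    assert(detail != 0 && n >= 0);
    *detail = 0;
    if (devInfo == 0)
        return SYGVD_INFO_OK;
    if (devInfo < 0) {
        *detail = -devInfo;
        return SYGVD_INFO_BAD_PARAM;
    }
    if (devInfo <= n) {
        *detail = devInfo;
        return SYGVD_INFO_EIG_FAILED;
    }
    if (devInfo <= 2 * n) {
        *detail = devInfo - n;
        return SYGVD_INFO_B_NOT_POSDEF;
    }
    return SYGVD_INFO_UNEXPECTED;
}
```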

If jobz = CUSOLVER_EIG_MODE_VECTOR, A contains the orthogonal eigenvectors of the matrix A. The eigenvectors are computed by a divide and conquer algorithm.

Appendix E.2 provides a simple example of sygvd.

API of sygvd

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| itype | host | input | specifies the problem type to be solved: itype = CUSOLVER_EIG_TYPE_1: A * x = (lambda) * B * x. itype = CUSOLVER_EIG_TYPE_2: A * B * x = (lambda) * x. itype = CUSOLVER_EIG_TYPE_3: B * A * x = (lambda) * x. |
| jobz | host | input | specifies whether to compute eigenvalues only or eigen-pairs: jobz = CUSOLVER_EIG_MODE_NOVECTOR: compute eigenvalues only; jobz = CUSOLVER_EIG_MODE_VECTOR: compute eigenvalues and eigenvectors. |
| uplo | host | input | specifies which part of A and B is stored. uplo = CUBLAS_FILL_MODE_LOWER: lower triangles of A and B are stored. uplo = CUBLAS_FILL_MODE_UPPER: upper triangles of A and B are stored. |
| n | host | input | number of rows (or columns) of matrices A and B. |
| A | device | in/out | array of dimension lda * n, with lda not less than max(1,n). If uplo = CUBLAS_FILL_MODE_UPPER, the leading n-by-n upper triangular part of A contains the upper triangular part of the matrix A. If uplo = CUBLAS_FILL_MODE_LOWER, the leading n-by-n lower triangular part of A contains the lower triangular part of the matrix A. On exit, if jobz = CUSOLVER_EIG_MODE_VECTOR and devInfo = 0, A contains the orthonormal eigenvectors of the matrix A. If jobz = CUSOLVER_EIG_MODE_NOVECTOR, the contents of A are destroyed. |
| lda | host | input | leading dimension of the two-dimensional array used to store matrix A. lda is not less than max(1,n). |
| B | device | in/out | array of dimension ldb * n. If uplo = CUBLAS_FILL_MODE_UPPER, the leading n-by-n upper triangular part of B contains the upper triangular part of the matrix B. If uplo = CUBLAS_FILL_MODE_LOWER, the leading n-by-n lower triangular part of B contains the lower triangular part of the matrix B. On exit, if devInfo is less than n, B is overwritten by the triangular factor U or L from the Cholesky factorization of B. |
| ldb | host | input | leading dimension of the two-dimensional array used to store matrix B. ldb is not less than max(1,n). |
| W | device | output | a real array of dimension n. The eigenvalues of (A, B), in ascending order, i.e. sorted so that W(i) <= W(i+1). |
| work | device | in/out | working space, array of size lwork. |
| lwork | host | input | size of work, returned by sygvd_bufferSize. |
| devInfo | device | output | if devInfo = 0, the operation is successful. If devInfo = -i, the i-th parameter is wrong. If devInfo = i (> 0), either potrf or syevd failed. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0, or lda |

### cusolverDn<t>syevj()

The helper functions below calculate the size needed for the pre-allocated buffer.
```
cusolverStatus_t
cusolverDnSsyevj_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const float *A,
int lda,
const float *W,
int *lwork,
syevjInfo_t params);

cusolverStatus_t
cusolverDnDsyevj_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const double *A,
int lda,
const double *W,
int *lwork,
syevjInfo_t params);

cusolverStatus_t
cusolverDnCheevj_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const cuComplex *A,
int lda,
const float *W,
int *lwork,
syevjInfo_t params);

cusolverStatus_t
cusolverDnZheevj_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const cuDoubleComplex *A,
int lda,
const double *W,
int *lwork,
syevjInfo_t params);
```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSsyevj(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
float *A,
int lda,
float *W,
float *work,
int lwork,
int *info,
syevjInfo_t params);

cusolverStatus_t
cusolverDnDsyevj(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
double *A,
int lda,
double *W,
double *work,
int lwork,
int *info,
syevjInfo_t params);
```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCheevj(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
cuComplex *A,
int lda,
float *W,
cuComplex *work,
int lwork,
int *info,
syevjInfo_t params);

cusolverStatus_t
cusolverDnZheevj(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *A,
int lda,
double *W,
cuDoubleComplex *work,
int lwork,
int *info,
syevjInfo_t params);

```

This function computes eigenvalues and eigenvectors of a symmetric (Hermitian) n×n matrix A. The standard symmetric eigenvalue problem is

 $A*Q=Q*\Lambda$

where Λ is a real n×n diagonal matrix. Q is an n×n unitary matrix. The diagonal elements of Λ are the eigenvalues of A in ascending order.

syevj has the same functionality as syevd. The difference is that syevd uses the QR algorithm and syevj uses the Jacobi method. The parallelism of the Jacobi method gives the GPU better performance on small and medium-size matrices. Moreover, the user can configure syevj to compute an approximation up to a chosen accuracy.

How does it work?

syevj iteratively generates a sequence of unitary matrices to transform matrix A to the following form

 ${V}^{H}*A*V=W+E$

where W is diagonal and E is symmetric with zero diagonal.

During the iterations, the Frobenius norm of E decreases monotonically. As E goes to zero, the diagonal of W converges to the eigenvalues. In practice, the Jacobi method stops if

 ${\mathrm{||E||}}_{F}\le eps*{\mathrm{||A||}}_{F}$

where eps is the given tolerance.

syevj has two parameters to control the accuracy. The first parameter is the tolerance (eps). The default value is machine accuracy, but the user can call cusolverDnXsyevjSetTolerance to set a tolerance a priori. The second parameter is the maximum number of sweeps, which bounds the number of iterations of the Jacobi method. The default value is 100, but the user can call cusolverDnXsyevjSetMaxSweeps to set a proper bound. Experiments show that 15 sweeps are enough to converge to machine accuracy. syevj stops when either the tolerance is met or the maximum number of sweeps is reached.

The Jacobi method has quadratic convergence, so the accuracy is not proportional to the number of sweeps. To guarantee a certain accuracy, the user should configure the tolerance only.
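To make the roles of the tolerance and the sweep limit concrete, here is a host-side sketch of a classical cyclic Jacobi iteration for a real symmetric matrix. It is an illustration of the textbook method under stated assumptions, not the algorithm cuSolver runs on the GPU: the matrix is stored in full (both triangles), column-major, and all names (sketch_sqrt, sym_offdiag_sq, jacobi_eig_sketch) are hypothetical. The Newton square root exists only to keep the sketch free of dependencies beyond the C standard headers.

```c
#include <assert.h>

/* Newton iteration for sqrt(x), x >= 0 (hypothetical helper, see lead-in). */
double sketch_sqrt(double x)
{
    if (x <= 0.0) return 0.0;
    double r = (x > 1.0) ? x : 1.0;          /* start at or above sqrt(x) */
    for (int it = 0; it < 200; ++it) {
        double next = 0.5 * (r + x / r);
        if (next >= r) break;                /* no further decrease: done */
        r = next;
    }
    return r;
}

/* Squared Frobenius norm of the off-diagonal part of an n-by-n matrix. */
double sym_offdiag_sq(const double *A, int n, int lda)
{
    double sum = 0.0;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            if (i != j) sum += A[i + lda * j] * A[i + lda * j];
    return sum;
}

/* Cyclic Jacobi: A is overwritten, W receives its final diagonal (unsorted).
 * Returns 1 if ||E||_F <= eps*||A||_F was met, 0 if max_sweeps ran out. */
int jacobi_eig_sketch(double *A, int n, int lda, double eps, int max_sweeps,
                      double *W, int *sweeps)
{
    assert(n >= 1 && lda >= n && eps >= 0.0 && max_sweeps >= 0);
    double norm_sq = 0.0;                    /* ||A||_F^2, invariant under rotations */
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) norm_sq += A[i + lda * j] * A[i + lda * j];
    double thresh = eps * eps * norm_sq;

    *sweeps = 0;
    while (sym_offdiag_sq(A, n, lda) > thresh && *sweeps < max_sweeps) {
        for (int p = 0; p < n - 1; ++p) {
            for (int q = p + 1; q < n; ++q) {
                double apq = A[p + lda * q];
                /* Skip negligible entries; they cannot keep the loop alive. */
                if (apq * apq <= thresh / ((double)n * n)) continue;
                double theta = (A[q + lda * q] - A[p + lda * p]) / (2.0 * apq);
                double abs_theta = (theta >= 0.0) ? theta : -theta;
                double t = ((theta >= 0.0) ? 1.0 : -1.0)
                         / (abs_theta + sketch_sqrt(theta * theta + 1.0));
                double c = 1.0 / sketch_sqrt(t * t + 1.0);
                double s = t * c;
                for (int k = 0; k < n; ++k) {        /* A <- A * J */
                    double akp = A[k + lda * p], akq = A[k + lda * q];
                    A[k + lda * p] = c * akp - s * akq;
                    A[k + lda * q] = s * akp + c * akq;
                }
                for (int k = 0; k < n; ++k) {        /* A <- J^T * A */
                    double apk = A[p + lda * k], aqk = A[q + lda * k];
                    A[p + lda * k] = c * apk - s * aqk;
                    A[q + lda * k] = s * apk + c * aqk;
                }
            }
        }
        ++*sweeps;
    }
    for (int i = 0; i < n; ++i) W[i] = A[i + lda * i];
    return sym_offdiag_sq(A, n, lda) <= thresh;
}
```

The loop condition mirrors the two controls described above: it exits as soon as the tolerance test passes, and otherwise gives up after max_sweeps sweeps. Unlike syevj, this sketch does not sort W.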

After syevj, the user can query the residual with cusolverDnXsyevjGetResidual and the number of executed sweeps with cusolverDnXsyevjGetSweeps. However, the user needs to be aware that the residual is the Frobenius norm of E, not the accuracy of an individual eigenvalue, i.e.

 $\mathrm{residual}={\mathrm{||E||}}_{F}={\mathrm{||}\Lambda -W\mathrm{||}}_{F}$

As with syevd, the user has to provide the working space pointed to by the input parameter work. The input parameter lwork is the size of the working space, and it is returned by syevj_bufferSize().

If output parameter info = -i (less than zero), the i-th parameter is wrong. If info = n+1, syevj did not converge under the given tolerance and maximum number of sweeps.

If the user sets an improper tolerance, syevj may not converge. For example, the tolerance should not be smaller than machine accuracy.

If jobz = CUSOLVER_EIG_MODE_VECTOR, A contains the orthonormal eigenvectors V.

Appendix E.3 provides a simple example of syevj.

API of syevj

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| jobz | host | input | specifies whether to compute eigenvalues only or eigen-pairs: jobz = CUSOLVER_EIG_MODE_NOVECTOR: compute eigenvalues only; jobz = CUSOLVER_EIG_MODE_VECTOR: compute eigenvalues and eigenvectors. |
| uplo | host | input | specifies which part of A is stored. uplo = CUBLAS_FILL_MODE_LOWER: lower triangle of A is stored. uplo = CUBLAS_FILL_MODE_UPPER: upper triangle of A is stored. |
| n | host | input | number of rows (or columns) of matrix A. |
| A | device | in/out | array of dimension lda * n, with lda not less than max(1,n). If uplo = CUBLAS_FILL_MODE_UPPER, the leading n-by-n upper triangular part of A contains the upper triangular part of the matrix A. If uplo = CUBLAS_FILL_MODE_LOWER, the leading n-by-n lower triangular part of A contains the lower triangular part of the matrix A. On exit, if jobz = CUSOLVER_EIG_MODE_VECTOR and info = 0, A contains the orthonormal eigenvectors of the matrix A. If jobz = CUSOLVER_EIG_MODE_NOVECTOR, the contents of A are destroyed. |
| lda | host | input | leading dimension of the two-dimensional array used to store matrix A. |
| W | device | output | a real array of dimension n. The eigenvalues of A, in ascending order, i.e. sorted so that W(i) <= W(i+1). |
| work | device | in/out | working space, array of size lwork. |
| lwork | host | input | size of work, returned by syevj_bufferSize. |
| info | device | output | if info = 0, the operation is successful. If info = -i, the i-th parameter is wrong. If info = n+1, syevj did not converge under the given tolerance and maximum number of sweeps. |
| params | host | in/out | structure filled with parameters of the Jacobi algorithm and results of syevj. |

Status Returned

| Status | Meaning |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0, or lda |

### cusolverDn<t>sygvj()

The helper functions below calculate the size needed for the pre-allocated buffer.
```
cusolverStatus_t
cusolverDnSsygvj_bufferSize(
cusolverDnHandle_t handle,
cusolverEigType_t itype,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const float *A,
int lda,
const float *B,
int ldb,
const float *W,
int *lwork,
syevjInfo_t params);

cusolverStatus_t
cusolverDnDsygvj_bufferSize(
cusolverDnHandle_t handle,
cusolverEigType_t itype,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const double *A,
int lda,
const double *B,
int ldb,
const double *W,
int *lwork,
syevjInfo_t params);

cusolverStatus_t
cusolverDnChegvj_bufferSize(
cusolverDnHandle_t handle,
cusolverEigType_t itype,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const cuComplex *A,
int lda,
const cuComplex *B,
int ldb,
const float *W,
int *lwork,
syevjInfo_t params);

cusolverStatus_t
cusolverDnZhegvj_bufferSize(
cusolverDnHandle_t handle,
cusolverEigType_t itype,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const cuDoubleComplex *A,
int lda,
const cuDoubleComplex *B,
int ldb,
const double *W,
int *lwork,
syevjInfo_t params);

```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSsygvj(
cusolverDnHandle_t handle,
cusolverEigType_t itype,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
float *A,
int lda,
float *B,
int ldb,
float *W,
float *work,
int lwork,
int *info,
syevjInfo_t params);

cusolverStatus_t
cusolverDnDsygvj(
cusolverDnHandle_t handle,
cusolverEigType_t itype,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
double *A,
int lda,
double *B,
int ldb,
double *W,
double *work,
int lwork,
int *info,
syevjInfo_t params);

```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnChegvj(
cusolverDnHandle_t handle,
cusolverEigType_t itype,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
cuComplex *A,
int lda,
cuComplex *B,
int ldb,
float *W,
cuComplex *work,
int lwork,
int *info,
syevjInfo_t params);

cusolverStatus_t
cusolverDnZhegvj(
cusolverDnHandle_t handle,
cusolverEigType_t itype,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *A,
int lda,
cuDoubleComplex *B,
int ldb,
double *W,
cuDoubleComplex *work,
int lwork,
int *info,
syevjInfo_t params);

```

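This function computes eigenvalues and eigenvectors of a symmetric (Hermitian) n×n matrix-pair (A,B). The generalized symmetric-definite eigenvalue problem is

 $\mathrm{eig}\left(A,B\right)=\left\{\begin{array}{ll}A*V=B*V*\Lambda & \text{if itype = CUSOLVER\_EIG\_TYPE\_1}\\ A*B*V=V*\Lambda & \text{if itype = CUSOLVER\_EIG\_TYPE\_2}\\ B*A*V=V*\Lambda & \text{if itype = CUSOLVER\_EIG\_TYPE\_3}\end{array}\right.$

where the matrix B is positive definite. Λ is a real n×n diagonal matrix. The diagonal elements of Λ are the eigenvalues of (A, B) in ascending order. V is an n×n orthogonal matrix. The eigenvectors are normalized as follows:

 $\left\{\begin{array}{ll}{V}^{H}*B*V=I & \text{if itype = CUSOLVER\_EIG\_TYPE\_1 or CUSOLVER\_EIG\_TYPE\_2}\\ {V}^{H}*{B}^{-1}*V=I & \text{if itype = CUSOLVER\_EIG\_TYPE\_3}\end{array}\right.$

This function has the same functionality as sygvd except that syevd in sygvd is replaced by syevj in sygvj. Therefore, sygvj inherits the properties of syevj: the user can use cusolverDnXsyevjSetTolerance and cusolverDnXsyevjSetMaxSweeps to configure the tolerance and the maximum number of sweeps.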

However, the meaning of the residual is different from that of syevj. sygvj first computes the Cholesky factorization of matrix B,

 $B=L*{L}^{H}$

transforms the problem to a standard eigenvalue problem, and then calls syevj.

For example, the standard eigenvalue problem of type I is

 $M*Q=Q*\Lambda$

where matrix M is symmetric

 $M={L}^{\mathrm{-1}}*A*{L}^{\mathrm{-H}}$

The residual is the result of syevj on matrix M, not A.

The user has to provide working space, which is pointed to by the input parameter work. The input parameter lwork is the size of the working space, and it is returned by sygvj_bufferSize().

If output parameter info = -i (less than zero), the i-th parameter is wrong. If info = i (i > 0 and i<=n), B is not positive definite, the factorization of B could not be completed, and no eigenvalues or eigenvectors were computed. If info = n+1, syevj did not converge under the given tolerance and maximum sweeps. In this case, the eigenvalues and eigenvectors are still computed, because non-convergence comes from an improper tolerance or maximum number of sweeps.
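The three cases of info can be told apart with a small host-side helper. The following is only a sketch: the enum and the function name `classify_sygvj_info` are hypothetical and not part of cuSolver, and the value of info is assumed to have already been copied from device memory to the host (for example with cudaMemcpy()).

```c
/* Hypothetical classification of the sygvj output parameter info. */
typedef enum {
    SYGVJ_OK,             /* info == 0: success                          */
    SYGVJ_BAD_PARAMETER,  /* info == -i: the i-th parameter is wrong     */
    SYGVJ_B_NOT_POSDEF,   /* 1 <= info <= n: B is not positive definite  */
    SYGVJ_NOT_CONVERGED,  /* info == n+1: syevj did not converge         */
    SYGVJ_UNKNOWN         /* any other value (not documented above)      */
} sygvj_outcome;

/* info: host copy of the device-side info; n: matrix dimension. */
sygvj_outcome classify_sygvj_info(int info, int n)
{
    if (info == 0) return SYGVJ_OK;
    if (info < 0) return SYGVJ_BAD_PARAMETER;
    if (info <= n) return SYGVJ_B_NOT_POSDEF;
    if (info == n + 1) return SYGVJ_NOT_CONVERGED;
    return SYGVJ_UNKNOWN;
}
```

Only SYGVJ_OK and SYGVJ_NOT_CONVERGED leave usable eigenvalues in W; in the latter case the caller may want to relax the tolerance or raise the maximum number of sweeps and retry.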

If jobz = CUSOLVER_EIG_MODE_VECTOR, A contains the orthogonal eigenvectors V.

Appendix E.4 provides a simple example of sygvj.

API of sygvj

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| itype | host | input | Specifies the problem type to be solved: itype=CUSOLVER_EIG_TYPE_1: `A*x = (lambda)*B*x`. itype=CUSOLVER_EIG_TYPE_2: `A*B*x = (lambda)*x`. itype=CUSOLVER_EIG_TYPE_3: `B*A*x = (lambda)*x`. |
| jobz | host | input | specifies options to either compute eigenvalues only or compute eigen-pairs: jobz = CUSOLVER_EIG_MODE_NOVECTOR: Compute eigenvalues only; jobz = CUSOLVER_EIG_MODE_VECTOR: Compute eigenvalues and eigenvectors. |
| uplo | host | input | specifies which part of A and B is stored. uplo = CUBLAS_FILL_MODE_LOWER: Lower triangles of A and B are stored. uplo = CUBLAS_FILL_MODE_UPPER: Upper triangles of A and B are stored. |
| n | host | input | number of rows (or columns) of matrices A and B. |
| A | device | in/out | array of dimension `lda * n` with lda not less than max(1,n). If uplo = CUBLAS_FILL_MODE_UPPER, the leading n-by-n upper triangular part of A contains the upper triangular part of the matrix A. If uplo = CUBLAS_FILL_MODE_LOWER, the leading n-by-n lower triangular part of A contains the lower triangular part of the matrix A. On exit, if jobz = CUSOLVER_EIG_MODE_VECTOR and info = 0, A contains the orthonormal eigenvectors of the matrix A. If jobz = CUSOLVER_EIG_MODE_NOVECTOR, the contents of A are destroyed. |
| lda | host | input | leading dimension of two-dimensional array used to store matrix A. lda is not less than max(1,n). |
| B | device | in/out | array of dimension `ldb * n`. If uplo = CUBLAS_FILL_MODE_UPPER, the leading n-by-n upper triangular part of B contains the upper triangular part of the matrix B. If uplo = CUBLAS_FILL_MODE_LOWER, the leading n-by-n lower triangular part of B contains the lower triangular part of the matrix B. On exit, if info is less than n, B is overwritten by the triangular factor U or L from the Cholesky factorization of B. |
| ldb | host | input | leading dimension of two-dimensional array used to store matrix B. ldb is not less than max(1,n). |
| W | device | output | a real array of dimension n. The eigenvalues, sorted in ascending order so that W(i) <= W(i+1). |
| work | device | in/out | working space, array of size lwork. |
| lwork | host | input | size of work, returned by sygvj_bufferSize. |
| info | device | output | if info = 0, the operation is successful. If info = -i, the i-th parameter is wrong. If info = i (> 0), info indicates that either B is not positive definite or syevj (called by sygvj) does not converge. |

Status Returned

| status | description |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0, or lda |

### cusolverDn<t>syevjBatched()

The helper functions below can calculate the sizes needed for pre-allocated buffer.
```
cusolverStatus_t
cusolverDnSsyevjBatched_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const float *A,
int lda,
const float *W,
int *lwork,
syevjInfo_t params,
int batchSize
);

cusolverStatus_t
cusolverDnDsyevjBatched_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const double *A,
int lda,
const double *W,
int *lwork,
syevjInfo_t params,
int batchSize
);

cusolverStatus_t
cusolverDnCheevjBatched_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const cuComplex *A,
int lda,
const float *W,
int *lwork,
syevjInfo_t params,
int batchSize
);

cusolverStatus_t
cusolverDnZheevjBatched_bufferSize(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
const cuDoubleComplex *A,
int lda,
const double *W,
int *lwork,
syevjInfo_t params,
int batchSize
);

```
The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnSsyevjBatched(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
float *A,
int lda,
float *W,
float *work,
int lwork,
int *info,
syevjInfo_t params,
int batchSize
);

cusolverStatus_t
cusolverDnDsyevjBatched(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
double *A,
int lda,
double *W,
double *work,
int lwork,
int *info,
syevjInfo_t params,
int batchSize
);

```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverDnCheevjBatched(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
cuComplex *A,
int lda,
float *W,
cuComplex *work,
int lwork,
int *info,
syevjInfo_t params,
int batchSize
);

cusolverStatus_t
cusolverDnZheevjBatched(
cusolverDnHandle_t handle,
cusolverEigMode_t jobz,
cublasFillMode_t uplo,
int n,
cuDoubleComplex *A,
int lda,
double *W,
cuDoubleComplex *work,
int lwork,
int *info,
syevjInfo_t params,
int batchSize
);

```

This function computes eigenvalues and eigenvectors of a sequence of symmetric (Hermitian) n×n matrices

 ${A}_{j}*{Q}_{j}={Q}_{j}*{\Lambda }_{j}$

where ${\Lambda }_{j}$ is a real n×n diagonal matrix. ${Q}_{j}$ is an n×n unitary matrix. The diagonal elements of ${\Lambda }_{j}$ are the eigenvalues of ${A}_{j}$ in either ascending order or non-sorting order.

syevjBatched performs syevj on each matrix. It requires that all matrices have the same size n, with n no greater than 32, and that they are packed contiguously,

 $A=\left(\begin{array}{ccc}\mathrm{A0}& \mathrm{A1}& \cdots \end{array}\right)$

Each matrix is column-major with leading dimension lda, so the formula for random access is ${A}_{k}\left(i,j\right)=\mathrm{A\left[ i + lda*j + lda*n*k\right]}$ .

The parameter W likewise stores the eigenvalues of each matrix contiguously,

 $W=\left(\begin{array}{ccc}\mathrm{W0}& \mathrm{W1}& \cdots \end{array}\right)$

The formula for random access of W is ${W}_{k}\left(j\right)=\mathrm{W\left[ j + n*k\right]}$ .
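The two access formulas translate directly into index helpers. The sketch below is plain host-side C; the function names `batched_A_index`, `batched_W_index` and `pack_matrix` are hypothetical and only illustrate how a packed buffer could be filled on the host before it is copied to the device.

```c
#include <stddef.h>

/* Offset of A_k(i,j) in the packed array A: i + lda*j + lda*n*k. */
size_t batched_A_index(int i, int j, int k, int lda, int n)
{
    return (size_t)i + (size_t)lda * (size_t)j
         + (size_t)lda * (size_t)n * (size_t)k;
}

/* Offset of W_k(j) in the packed array W: j + n*k. */
size_t batched_W_index(int j, int k, int n)
{
    return (size_t)j + (size_t)n * (size_t)k;
}

/* Copy one n-by-n column-major matrix (leading dimension n) into slot k
 * of the packed buffer A, which uses leading dimension lda >= n. */
void pack_matrix(const double *src, int n, int lda, int k, double *A)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            A[batched_A_index(i, j, k, lda, n)] = src[i + (size_t)n * j];
}
```

Note that each matrix occupies lda*n elements of A, so padding rows (lda > n) are skipped per column but no extra padding exists between consecutive matrices.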

In addition to the tolerance and maximum sweeps, syevjBatched can either sort the eigenvalues in ascending order (default) or leave them as-is (without sorting); the choice is made with the function cusolverDnXsyevjSetSortEig. If the user packs several tiny matrices into diagonal blocks of one matrix, the non-sorting option can separate the spectra of those tiny matrices.

syevjBatched cannot report the residual and the number of executed sweeps through the functions cusolverDnXsyevjGetResidual and cusolverDnXsyevjGetSweeps. Any call to these two functions returns CUSOLVER_STATUS_NOT_SUPPORTED. The user needs to compute the residual explicitly.

The user has to provide working space pointed by input parameter work. The input parameter lwork is the size of the working space, and it is returned by syevjBatched_bufferSize().

The output parameter info is an integer array of size batchSize. If the function returns CUSOLVER_STATUS_INVALID_VALUE, the first element info[0] = -i (less than zero) indicates i-th parameter is wrong. Otherwise, if info[i] = n+1, syevjBatched does not converge on i-th matrix under given tolerance and maximum sweeps.
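Because info holds one entry per matrix, the caller typically scans it after the call. The helper below is a hypothetical host-side sketch (the name `count_unconverged` is not a cuSolver function); it assumes the info array has already been copied from the device to the host and that the function did not return CUSOLVER_STATUS_INVALID_VALUE.

```c
/* Count the matrices on which syevjBatched did not converge, i.e. the
 * entries with info[i] == n+1. The index of the first such matrix is
 * stored in *first, or -1 if every matrix converged. */
int count_unconverged(const int *info, int n, int batchSize, int *first)
{
    int count = 0;
    *first = -1;
    for (int i = 0; i < batchSize; ++i) {
        if (info[i] == n + 1) {
            if (*first < 0) *first = i;
            ++count;
        }
    }
    return count;
}
```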

If jobz = CUSOLVER_EIG_MODE_VECTOR, ${A}_{j}$ contains the orthonormal eigenvectors ${V}_{j}$ .

Appendix E.5 provides a simple example of syevjBatched.

API of syevjBatched

| parameter | Memory | In/out | Meaning |
|---|---|---|---|
| handle | host | input | handle to the cuSolverDN library context. |
| jobz | host | input | specifies options to either compute eigenvalues only or compute eigen-pairs: jobz = CUSOLVER_EIG_MODE_NOVECTOR: Compute eigenvalues only; jobz = CUSOLVER_EIG_MODE_VECTOR: Compute eigenvalues and eigenvectors. |
| uplo | host | input | specifies which part of Aj is stored. uplo = CUBLAS_FILL_MODE_LOWER: Lower triangle of Aj is stored. uplo = CUBLAS_FILL_MODE_UPPER: Upper triangle of Aj is stored. |
| n | host | input | number of rows (or columns) of each matrix Aj. n is no greater than 32. |
| A | device | in/out | array of dimension `lda * n * batchSize` with lda not less than max(1,n). If uplo = CUBLAS_FILL_MODE_UPPER, the leading n-by-n upper triangular part of Aj contains the upper triangular part of the matrix Aj. If uplo = CUBLAS_FILL_MODE_LOWER, the leading n-by-n lower triangular part of Aj contains the lower triangular part of the matrix Aj. On exit, if jobz = CUSOLVER_EIG_MODE_VECTOR and info[j] = 0, Aj contains the orthonormal eigenvectors of the matrix Aj. If jobz = CUSOLVER_EIG_MODE_NOVECTOR, the contents of Aj are destroyed. |
| lda | host | input | leading dimension of two-dimensional array used to store matrix Aj. |
| W | device | output | a real array of dimension `n*batchSize`. It stores the eigenvalues of Aj in ascending order or non-sorting order. |
| work | device | in/out | array of size lwork, workspace. |
| lwork | host | input | size of work, returned by syevjBatched_bufferSize. |
| info | device | output | an integer array of dimension batchSize. If CUSOLVER_STATUS_INVALID_VALUE is returned, info[0] = -i (less than zero) indicates that the i-th parameter is wrong. Otherwise, if info[i] = 0, the operation is successful. If info[i] = n+1, syevjBatched does not converge on the i-th matrix under the given tolerance and maximum sweeps. |
| params | host | in/out | structure filled with parameters of the Jacobi algorithm. |
| batchSize | host | input | number of matrices. |

Status Returned

| status | description |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n<0, n>32 or lda |

## cuSolverSP: sparse LAPACK Function Reference

This chapter describes the API of cuSolverSP, which provides a subset of LAPACK functions for sparse matrices in CSR or CSC format.

### cusolverSpCreate()

```
cusolverStatus_t
cusolverSpCreate(cusolverSpHandle_t *handle)
```

This function initializes the cuSolverSP library and creates a handle on the cuSolver context. It must be called before any other cuSolverSP API function is invoked. It allocates hardware resources necessary for accessing the GPU.

Output

| parameter | description |
|---|---|
| handle | the pointer to the handle to the cuSolverSP context. |

Status Returned

| status | description |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the initialization succeeded. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the CUDA Runtime initialization failed. |
| CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
| CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |

### cusolverSpDestroy()

```
cusolverStatus_t
cusolverSpDestroy(cusolverSpHandle_t handle)
```

This function releases CPU-side resources used by the cuSolverSP library.

Input

| parameter | description |
|---|---|
| handle | the handle to the cuSolverSP context. |

Status Returned

| status | description |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the shutdown succeeded. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |

### cusolverSpSetStream()

```
cusolverStatus_t
cusolverSpSetStream(cusolverSpHandle_t handle, cudaStream_t streamId)
```

This function sets the stream to be used by the cuSolverSP library to execute its routines.

Input

| parameter | description |
|---|---|
| handle | the handle to the cuSolverSP context. |
| streamId | the stream to be used by the library. |

Status Returned

| status | description |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the stream was set successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |

### cusolverSpXcsrissym()

```
cusolverStatus_t
cusolverSpXcsrissymHost(cusolverSpHandle_t handle,
int m,
int nnzA,
const cusparseMatDescr_t descrA,
const int *csrRowPtrA,
const int *csrEndPtrA,
const int *csrColIndA,
int *issym);

```

This function checks whether A has a symmetric pattern. The output parameter issym reports 1 if A is symmetric; otherwise, it reports 0.

The matrix A is an m×m sparse matrix that is defined in CSR storage format by the four arrays csrValA, csrRowPtrA, csrEndPtrA and csrColIndA.

The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL.

csrlsvlu and csrlsvqr do not accept non-general matrices. The user has to extend the matrix into its missing upper/lower part; otherwise the result is not as expected. The user can call csrissym to check whether the matrix has a symmetric pattern.

Remark 1: only CPU path is provided.

Remark 2: the user has to check the returned status to get valid information. The function converts A to CSC format and compares the CSR and CSC formats. If the CSC conversion fails because of insufficient resources, issym is undefined, and this state can only be detected through the returned status code.
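To make the notion of a symmetric pattern concrete, the following host-side sketch tests it directly on the CSR arrays. It is not the library's algorithm (which, as stated above, converts to CSC and compares): for every stored entry (i,j) it simply searches row j for column i. The function name `csr_pattern_is_symmetric` and the explicit `base` argument (0 or 1, standing in for the index base of descrA) are assumptions of this sketch.

```c
/* Return 1 if row `row` stores an entry in (0-based) column `target`. */
static int row_has_col(const int *start, const int *end, const int *col,
                       int row, int target, int base)
{
    for (int p = start[row] - base; p < end[row] - base; ++p)
        if (col[p] - base == target) return 1;
    return 0;
}

/* Check whether the sparsity pattern of an m-by-m CSR matrix is symmetric.
 * csrRowPtrA[i] is the start of row i, csrEndPtrA[i] is one past its end.
 * Returns 1 if symmetric, 0 if not, -1 if a column index is out of range. */
int csr_pattern_is_symmetric(int m, const int *csrRowPtrA,
                             const int *csrEndPtrA, const int *csrColIndA,
                             int base)
{
    for (int i = 0; i < m; ++i) {
        for (int p = csrRowPtrA[i] - base; p < csrEndPtrA[i] - base; ++p) {
            int j = csrColIndA[p] - base;
            if (j < 0 || j >= m) return -1;
            /* Entry (i,j) is stored, so entry (j,i) must be stored too. */
            if (!row_has_col(csrRowPtrA, csrEndPtrA, csrColIndA, j, i, base))
                return 0;
        }
    }
    return 1;
}
```

The direct search needs no extra memory but costs more than a CSR/CSC comparison when rows are long; it is meant only to illustrate what issym reports.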

Input

| parameter | MemorySpace | description |
|---|---|---|
| handle | host | handle to the cuSolverSP library context. |
| m | host | number of rows and columns of matrix A. |
| nnzA | host | number of nonzeros of matrix A. It is the size of csrValA and csrColIndA. |
| descrA | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
| csrRowPtrA | host | integer array of m elements that contains the start of every row. |
| csrEndPtrA | host | integer array of m elements that contains the end of every row plus one. |
| csrColIndA | host | integer array of nnzA column indices of the nonzero elements of matrix A. |

Output

| parameter | MemorySpace | description |
|---|---|---|
| issym | host | 1 if A is symmetric; 0 otherwise. |

Status Returned

| status | description |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,nnzA<=0), base index is not 0 or 1. |
| CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
| CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
| CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |

### High Level Function Reference

This section describes the high-level API of cuSolverSP, including the linear solvers, the least-square solver and the eigenvalue solver. The high-level API is designed for ease of use, so it automatically allocates any required memory under the hood. If the host or GPU system memory is not enough, an error is returned.

### 6.2.1. cusolverSp<t>csrlsvlu()

```
cusolverStatus_t
cusolverSpScsrlsvlu[Host](cusolverSpHandle_t handle,
int n,
int nnzA,
const cusparseMatDescr_t descrA,
const float *csrValA,
const int *csrRowPtrA,
const int *csrColIndA,
const float *b,
float tol,
int reorder,
float *x,
int *singularity);

cusolverStatus_t
cusolverSpDcsrlsvlu[Host](cusolverSpHandle_t handle,
int n,
int nnzA,
const cusparseMatDescr_t descrA,
const double *csrValA,
const int *csrRowPtrA,
const int *csrColIndA,
const double *b,
double tol,
int reorder,
double *x,
int *singularity);

cusolverStatus_t
cusolverSpCcsrlsvlu[Host](cusolverSpHandle_t handle,
int n,
int nnzA,
const cusparseMatDescr_t descrA,
const cuComplex *csrValA,
const int *csrRowPtrA,
const int *csrColIndA,
const cuComplex *b,
float tol,
int reorder,
cuComplex *x,
int *singularity);

cusolverStatus_t
cusolverSpZcsrlsvlu[Host](cusolverSpHandle_t handle,
int n,
int nnzA,
const cusparseMatDescr_t descrA,
const cuDoubleComplex *csrValA,
const int *csrRowPtrA,
const int *csrColIndA,
const cuDoubleComplex *b,
double tol,
int reorder,
cuDoubleComplex *x,
int *singularity);

```

This function solves the linear system

 $A*x=b$

A is an n×n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. b is the right-hand-side vector of size n, and x is the solution vector of size n.

The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. If matrix A is symmetric/Hermitian and only lower/upper part is used or meaningful, the user has to extend the matrix into its missing upper/lower part, otherwise the result would be wrong.
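Extending a symmetric matrix that stores only one triangle can be done on the host before calling the solver. The sketch below is one possible way, under stated assumptions: the helper name `csr_lower_to_full` is hypothetical, the input is a real symmetric matrix whose lower triangle (including the diagonal) is stored in 0-based CSR with nnz > 0, and the Hermitian case (which would need conjugation of the mirrored values) is not handled.

```c
#include <stdlib.h>

/* Expand the lower triangle of a real symmetric n-by-n CSR matrix into a
 * full (general) CSR matrix. The three output arrays are allocated here
 * and must be released by the caller with free().
 * Returns the number of nonzeros of the full matrix, or -1 on failure. */
int csr_lower_to_full(int n, const double *val, const int *rowPtr,
                      const int *colInd, double **fullVal,
                      int **fullRowPtr, int **fullColInd)
{
    int *rp = calloc((size_t)n + 1, sizeof *rp);
    if (!rp) return -1;
    /* Count the entries of every row of the full matrix. */
    for (int i = 0; i < n; ++i) {
        for (int p = rowPtr[i]; p < rowPtr[i + 1]; ++p) {
            int j = colInd[p];
            rp[i + 1]++;               /* stored entry (i,j)   */
            if (j != i) rp[j + 1]++;   /* mirrored entry (j,i) */
        }
    }
    for (int i = 0; i < n; ++i) rp[i + 1] += rp[i];   /* prefix sum */
    int nnz = rp[n];
    double *v = malloc((size_t)nnz * sizeof *v);
    int *c = malloc((size_t)nnz * sizeof *c);
    int *next = malloc((size_t)n * sizeof *next);
    if (!v || !c || !next) {
        free(rp); free(v); free(c); free(next);
        return -1;
    }
    for (int i = 0; i < n; ++i) next[i] = rp[i];
    /* Scatter every stored entry and its mirror image. */
    for (int i = 0; i < n; ++i) {
        for (int p = rowPtr[i]; p < rowPtr[i + 1]; ++p) {
            int j = colInd[p];
            c[next[i]] = j; v[next[i]++] = val[p];
            if (j != i) { c[next[j]] = i; v[next[j]++] = val[p]; }
        }
    }
    free(next);
    *fullVal = v; *fullRowPtr = rp; *fullColInd = c;
    return nnz;
}
```

If the column indices of each input row are sorted, the rows of the result are sorted as well, because mirrored entries are appended to a row in increasing column order after its own lower-triangular entries.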

The linear system is solved by sparse LU with partial pivoting,

 $P*A=L*U$

The cuSolver library provides two reordering schemes, symrcm and symamd, to reduce zero fill-in, which dramatically affects the performance of LU factorization. The input parameter reorder enables symrcm if reorder is 1, or symamd if reorder is 2; otherwise, no reordering is performed.

If reorder is nonzero, csrlsvlu does

 $P*A*{Q}^{T}=L*U$

where $Q=\mathrm{symrcm}\left(A+{A}^{T}\right)$ .

If A is singular under the given tolerance (max(tol,0)), then some diagonal elements of U are zero, i.e.

 $|U\left(j,j\right)|<\mathrm{tol}\ \text{for some}\ j$

The output parameter singularity is the smallest index of such j. If A is non-singular, singularity is -1. The index is base-0, independent of base index of A. For example, if 2nd column of A is the same as first column, then A is singular and singularity = 1 which means U(1,1)≈0.
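The duplicated-column example can be checked with a small dense analogue. The sketch below is not csrlsvlu: it is a plain dense LU with partial pivoting on a column-major array, with a hypothetical function name `lu_singularity`. As an assumption of this sketch, an exactly zero pivot is also reported when tol is 0, so that the elimination never divides by zero.

```c
#include <math.h>

/* Dense in-place LU with partial pivoting of an n-by-n column-major
 * matrix (leading dimension n). Returns the smallest 0-based index j
 * with |U(j,j)| < max(tol,0), or -1 if no such pivot is found. */
int lu_singularity(int n, double *A, double tol)
{
    if (tol < 0.0) tol = 0.0;
    for (int j = 0; j < n; ++j) {
        /* Choose the pivot row from the j-th column of the current matrix. */
        int piv = j;
        for (int i = j + 1; i < n; ++i)
            if (fabs(A[i + n * j]) > fabs(A[piv + n * j])) piv = i;
        if (piv != j) {
            for (int c = 0; c < n; ++c) {
                double t = A[j + n * c];
                A[j + n * c] = A[piv + n * c];
                A[piv + n * c] = t;
            }
        }
        double d = A[j + n * j];          /* this is U(j,j) */
        if (fabs(d) < tol || d == 0.0) return j;
        /* Eliminate the entries below the pivot. */
        for (int i = j + 1; i < n; ++i) {
            double l = A[i + n * j] / d;
            A[i + n * j] = l;
            for (int c = j + 1; c < n; ++c)
                A[i + n * c] -= l * A[j + n * c];
        }
    }
    return -1;
}
```

For a 3×3 matrix whose second column equals its first, elimination of column 0 cancels column 1 below the diagonal (up to rounding), so with a small positive tolerance such as 1e-12 the function returns 1, in line with the base-0 convention described above.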

Remark 1: csrlsvlu performs traditional LU with partial pivoting; the pivot of the k-th column is determined dynamically based on the k-th column of the intermediate matrix. csrlsvlu follows Gilbert and Peierls's algorithm [4], which uses depth-first search and topological ordering to solve the triangular system (Davis also describes this algorithm in detail in his book [1]). Before performing LU factorization, csrlsvlu over-estimates the size of L and U, and allocates a buffer to contain the factors L and U. George and Ng [5] prove that the sparsity pattern of the Cholesky factor of $A*{A}^{T}$ is a superset of the sparsity pattern of L and U. Furthermore, they propose an algorithm to find the sparsity pattern of the QR factorization, which is a superset of that of LU [6]. csrlsvlu uses QR factorization to estimate the size of LU in the analysis phase. The cost of the analysis phase lies mainly in figuring out the sparsity pattern of the Householder vectors in the QR factorization. The idea to avoid computing $A*{A}^{T}$ in [7] is adopted. If system memory is insufficient to keep the sparsity pattern of QR, csrlsvlu returns CUSOLVER_STATUS_ALLOC_FAILED. If the matrix is not banded, it is better to enable reordering to avoid CUSOLVER_STATUS_ALLOC_FAILED.

Remark 2: approximate minimum degree ordering (symamd) is a well-known technique to reduce zero fill-in of QR factorization. However, in most cases, symrcm still performs well.

Remark 3: only CPU (Host) path is provided.

Remark 4: multithreaded csrlsvlu is not available yet. If QR does not incur much zero fill-in, csrlsvqr would be faster than csrlsvlu.

Input

| parameter | cusolverSp MemSpace | \*Host MemSpace | description |
|---|---|---|---|
| handle | host | host | handle to the cuSolverSP library context. |
| n | host | host | number of rows and columns of matrix A. |
| nnzA | host | host | number of nonzeros of matrix A. |
| descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
| csrValA | device | host | array of nnzA (= csrRowPtrA(n) - csrRowPtrA(0)) nonzero elements of matrix A. |
| csrRowPtrA | device | host | integer array of n+1 elements that contains the start of every row and the end of the last row plus one. |
| csrColIndA | device | host | integer array of nnzA (= csrRowPtrA(n) - csrRowPtrA(0)) column indices of the nonzero elements of matrix A. |
| b | device | host | right hand side vector of size n. |
| tol | host | host | tolerance to decide if singular or not. |
| reorder | host | host | no ordering if reorder=0. Otherwise, symrcm is used to reduce zero fill-in. |

Output

| parameter | cusolverSp MemSpace | \*Host MemSpace | description |
|---|---|---|---|
| x | device | host | solution vector of size n, `x = inv(A)*b`. |
| singularity | host | host | -1 if A is invertible. Otherwise, first index j such that U(j,j)≈0 |

Status Returned

| status | description |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (n,nnzA<=0), base index is not 0 or 1. |
| CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
| CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
| CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |

### 6.2.2. cusolverSp<t>csrlsvqr()

```
cusolverStatus_t
cusolverSpScsrlsvqr[Host](cusolverSpHandle_t handle,
int m,
int nnz,
const cusparseMatDescr_t descrA,
const float *csrValA,
const int *csrRowPtrA,
const int *csrColIndA,
const float *b,
float tol,
int reorder,
float *x,
int *singularity);

cusolverStatus_t
cusolverSpDcsrlsvqr[Host](cusolverSpHandle_t handle,
int m,
int nnz,
const cusparseMatDescr_t descrA,
const double *csrValA,
const int *csrRowPtrA,
const int *csrColIndA,
const double *b,
double tol,
int reorder,
double *x,
int *singularity);

cusolverStatus_t
cusolverSpCcsrlsvqr[Host](cusolverSpHandle_t handle,
int m,
int nnz,
const cusparseMatDescr_t descrA,
const cuComplex *csrValA,
const int *csrRowPtrA,
const int *csrColIndA,
const cuComplex *b,
float tol,
int reorder,
cuComplex *x,
int *singularity);

cusolverStatus_t
cusolverSpZcsrlsvqr[Host](cusolverSpHandle_t handle,
int m,
int nnz,
const cusparseMatDescr_t descrA,
const cuDoubleComplex *csrValA,
const int *csrRowPtrA,
const int *csrColIndA,
const cuDoubleComplex *b,
double tol,
int reorder,
cuDoubleComplex *x,
int *singularity);

```

This function solves the linear system

 $A*x=b$

A is an m×m sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. b is the right-hand-side vector of size m, and x is the solution vector of size m.

The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. If matrix A is symmetric/Hermitian and only lower/upper part is used or meaningful, the user has to extend the matrix into its missing upper/lower part, otherwise the result would be wrong.

The linear system is solved by sparse QR factorization,

 $\mathrm{A = Q*R}$

If A is singular under the given tolerance (max(tol,0)), then some diagonal elements of R are zero, i.e.

 $|R\left(j,j\right)|<\mathrm{tol}\ \text{for some}\ j$

The output parameter singularity is the smallest index of such j. If A is non-singular, singularity is -1. The singularity is base-0, independent of base index of A. For example, if 2nd column of A is the same as first column, then A is singular and singularity = 1 which means R(1,1)≈0.

The cuSolver library provides two reordering schemes, symrcm and symamd, to reduce zero fill-in, which dramatically affects the performance of QR factorization. The input parameter reorder enables symrcm if reorder is 1, or symamd if reorder is 2; otherwise, no reordering is performed.

Input

| parameter | cusolverSp MemSpace | \*Host MemSpace | description |
|---|---|---|---|
| handle | host | host | handle to the cuSolverSP library context. |
| m | host | host | number of rows and columns of matrix A. |
| nnz | host | host | number of nonzeros of matrix A. |
| descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
| csrValA | device | host | array of nnz (= csrRowPtrA(m) - csrRowPtrA(0)) nonzero elements of matrix A. |
| csrRowPtrA | device | host | integer array of m+1 elements that contains the start of every row and the end of the last row plus one. |
| csrColIndA | device | host | integer array of nnz (= csrRowPtrA(m) - csrRowPtrA(0)) column indices of the nonzero elements of matrix A. |
| b | device | host | right hand side vector of size m. |
| tol | host | host | tolerance to decide if singular or not. |
| reorder | host | host | no effect. |

Output

| parameter | cusolverSp MemSpace | \*Host MemSpace | description |
|---|---|---|---|
| x | device | host | solution vector of size m, `x = inv(A)*b`. |
| singularity | host | host | -1 if A is invertible. Otherwise, first index j such that R(j,j)≈0 |

Status Returned

| status | description |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,nnz<=0), base index is not 0 or 1. |
| CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
| CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
| CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |

### 6.2.3. cusolverSp<t>csrlsvchol()

```
cusolverStatus_t
cusolverSpScsrlsvchol[Host](cusolverSpHandle_t handle,
int m,
int nnz,
const cusparseMatDescr_t descrA,
const float *csrVal,
const int *csrRowPtr,
const int *csrColInd,
const float *b,
float tol,
int reorder,
float *x,
int *singularity);

cusolverStatus_t
cusolverSpDcsrlsvchol[Host](cusolverSpHandle_t handle,
int m,
int nnz,
const cusparseMatDescr_t descrA,
const double *csrVal,
const int *csrRowPtr,
const int *csrColInd,
const double *b,
double tol,
int reorder,
double *x,
int *singularity);

cusolverStatus_t
cusolverSpCcsrlsvchol[Host](cusolverSpHandle_t handle,
int m,
int nnz,
const cusparseMatDescr_t descrA,
const cuComplex *csrVal,
const int *csrRowPtr,
const int *csrColInd,
const cuComplex *b,
float tol,
int reorder,
cuComplex *x,
int *singularity);

cusolverStatus_t
cusolverSpZcsrlsvchol[Host](cusolverSpHandle_t handle,
int m,
int nnz,
const cusparseMatDescr_t descrA,
const cuDoubleComplex *csrVal,
const int *csrRowPtr,
const int *csrColInd,
const cuDoubleComplex *b,
double tol,
int reorder,
cuDoubleComplex *x,
int *singularity);

```

This function solves the linear system

 $A*x=b$

A is an m×m symmetric positive definite sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. b is the right-hand-side vector of size m, and x is the solution vector of size m.

The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL and upper triangular part of A is ignored (if parameter reorder is zero). In other words, suppose input matrix A is decomposed as $A=L+D+U$ , where L is lower triangular, D is diagonal and U is upper triangular. The function would ignore U and regard A as a symmetric matrix with the formula $A=L+D+{L}^{H}$ . If parameter reorder is nonzero, the user has to extend A to a full matrix, otherwise the solution would be wrong.

The linear system is solved by sparse Cholesky factorization,

 $A=G*{G}^{H}$

where G is the Cholesky factor, a lower triangular matrix.

The output parameter singularity has two meanings:

• If A is not positive definite, there exists some integer k such that A(0:k, 0:k) is not positive definite. singularity is the minimum of such k.
• If A is positive definite but near singular under tolerance (max(tol,0)), there exists some integer k such that $G\left(k,k\right)\le \mathrm{tol}$ . singularity is the minimum of such k.

singularity is base-0. If A is positive definite and not near singular under tolerance, singularity is -1. If the user only wants to know whether A is positive definite, tol=0 is enough.

The cuSolver library provides two reordering schemes, symrcm and symamd, to reduce zero fill-in, which dramatically affects the performance of Cholesky factorization. The input parameter reorder enables symrcm if reorder is 1, or symamd if reorder is 2; otherwise, no reordering is performed.

Remark 1: the function works for in-place (x and b point to the same memory block) and out-of-place.

Remark 2: the function only works with 32-bit indices. If matrix G has so much zero fill-in that its number of nonzeros is bigger than ${2}^{31}$ , then CUSOLVER_STATUS_ALLOC_FAILED is returned.

Input

| parameter | cusolverSp MemSpace | \*Host MemSpace | description |
|---|---|---|---|
| handle | host | host | handle to the cuSolverSP library context. |
| m | host | host | number of rows and columns of matrix A. |
| nnz | host | host | number of nonzeros of matrix A. |
| descrA | host | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. |
| csrValA | device | host | array of nnz (= csrRowPtrA(m) - csrRowPtrA(0)) nonzero elements of matrix A. |
| csrRowPtrA | device | host | integer array of m+1 elements that contains the start of every row and the end of the last row plus one. |
| csrColIndA | device | host | integer array of nnz (= csrRowPtrA(m) - csrRowPtrA(0)) column indices of the nonzero elements of matrix A. |
| b | device | host | right hand side vector of size m. |
| tol | host | host | tolerance to decide singularity. |
| reorder | host | host | no effect. |

Output

| parameter | cusolverSp MemSpace | \*Host MemSpace | description |
|---|---|---|---|
| x | device | host | solution vector of size m, `x = inv(A)*b`. |
| singularity | host | host | -1 if A is symmetric positive definite. |

Status Returned

| status | description |
|---|---|
| CUSOLVER_STATUS_SUCCESS | the operation completed successfully. |
| CUSOLVER_STATUS_NOT_INITIALIZED | the library was not initialized. |
| CUSOLVER_STATUS_ALLOC_FAILED | the resources could not be allocated. |
| CUSOLVER_STATUS_INVALID_VALUE | invalid parameters were passed (m,nnz<=0), base index is not 0 or 1. |
| CUSOLVER_STATUS_ARCH_MISMATCH | the device only supports compute capability 2.0 and above. |
| CUSOLVER_STATUS_INTERNAL_ERROR | an internal operation failed. |
| CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED | the matrix type is not supported. |

### 6.2.4. cusolverSp<t>csrlsqvqr()

The S and D data types are real valued single and double precision, respectively.
```
cusolverStatus_t
cusolverSpScsrlsqvqr[Host](cusolverSpHandle_t handle,
int m,
int n,
int nnz,
const cusparseMatDescr_t descrA,
const float *csrValA,
const int *csrRowPtrA,
const int *csrColIndA,
const float *b,
float tol,
int *rankA,
float *x,
int *p,
float *min_norm);

cusolverStatus_t
cusolverSpDcsrlsqvqr[Host](cusolverSpHandle_t handle,
int m,
int n,
int nnz,
const cusparseMatDescr_t descrA,
const double *csrValA,
const int *csrRowPtrA,
const int *csrColIndA,
const double *b,
double tol,
int *rankA,
double *x,
int *p,
double *min_norm);
```
The C and Z data types are complex valued single and double precision, respectively.
```
cusolverStatus_t
cusolverSpCcsrlsqvqr[Host](cusolverSpHandle_t handle,
int m,
int n,
int nnz,
const cusparseMatDescr_t descrA,
const cuComplex *csrValA,
const int *csrRowPtrA,
const int *csrColIndA,
const cuComplex *b,
float tol,
int *rankA,
cuComplex *x,
int *p,
float *min_norm);

cusolverStatus_t
cusolverSpZcsrlsqvqr[Host](cusolverSpHandle_t handle,
int m,
int n,
int nnz,
const cusparseMatDescr_t descrA,
const cuDoubleComplex *csrValA,
const int *csrRowPtrA,
const int *csrColIndA,
const cuDoubleComplex *b,
double tol,
int *rankA,
cuDoubleComplex *x,
int *p,
double *min_norm);
```

This function solves the following least-square problem

 $x=\mathrm{argmin}\mathrm{||}A*z-b\mathrm{||}$

A is an m×n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA. b is the right-hand-side vector of size m, and x is the least-square solution vector of size n.

The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. If A is square, symmetric/Hermitian and only lower/upper part is used or meaningful, the user has to extend the matrix into its missing upper/lower part, otherwise the result is wrong.

This function only works if m is greater than or equal to n; in other words, A is a tall matrix.

The least-square problem is solved by sparse QR factorization with column pivoting,

 $A*{P}^{T}=Q*R$

If A is of full rank (i.e. all columns of A are linearly independent), then matrix P is an identity. Suppose the rank of A is k, less than n; then the permutation matrix P reorders the columns of A in the following sense:

 $A*{P}^{T}=\left(\begin{array}{cc}{A}_{1}& {A}_{2}\end{array}\right)=\left(\begin{array}{cc}{Q}_{1}& {Q}_{2}\end{array}\right)\left(\begin{array}{cc}{R}_{11}& {R}_{12}\\ & {R}_{22}\end{array}\right)$

where ${R}_{11}$ and A have the same rank, but ${R}_{22}$ is almost zero, i.e. every column of ${A}_{2}$ is a linear combination of the columns of ${A}_{1}$ .

The input parameter tol decides numerical rank. The absolute value of every entry in ${R}_{22}$ is less than or equal to tolerance=max(tol,0).

The output parameter rankA denotes numerical rank of A.

Suppose $y=P*x$ and $c={Q}^{H}*b$ ; the least-square problem can then be reformulated as

 $\mathrm{min}||A*x-b||=\mathrm{min}||R*y-c||$

or in matrix form

 $\left(\begin{array}{cc}{R}_{11}& {R}_{12}\\ & {R}_{22}\end{array}\right)\left(\begin{array}{c}{y}_{1}\\ {y}_{2}\end{array}\right)=\left(\begin{array}{c}{c}_{1}\\ {c}_{2}\end{array}\right)$

The output parameter min_norm is $||{c}_{2}||$ , which is the minimum value of the least-square problem.

If A is not of full rank, the above equation does not have a unique solution. The least-square problem is equivalent to

 $\begin{array}{c}\mathrm{min}||y||\\ \text{subject to}\ {R}_{11}*{y}_{1}+{R}_{12}*{y}_{2}={c}_{1}\end{array}$

Or equivalently another least-square problem

 $\mathrm{min}\left\|\left(\begin{array}{c}{R}_{11}\backslash {R}_{12}\\ I\end{array}\right)*{y}_{2}-\left(\begin{array}{c}{R}_{11}\backslash {c}_{1}\\ O\end{array}\right)\right\|$

The output parameter x is ${P}^{T}*y$ , the solution of the least-square problem.

The output parameter p is a vector of size n. It corresponds to a permutation matrix P. p(i)=j means (P*x)(i) = x(j). If A is of full rank, p=0:n-1.
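The convention for p can be written as two short loops. The helper names `apply_P` and `apply_PT` below are hypothetical; the sketch only encodes the stated rule p(i)=j ⇔ (P*x)(i) = x(j), with p base 0.

```c
/* y = P*x, following (P*x)(i) = x(p(i)). */
void apply_P(int n, const int *p, const double *x, double *y)
{
    for (int i = 0; i < n; ++i) y[i] = x[p[i]];
}

/* x = P^T*y, the inverse operation: x(p(i)) = y(i). */
void apply_PT(int n, const int *p, const double *y, double *x)
{
    for (int i = 0; i < n; ++i) x[p[i]] = y[i];
}
```

Since P is a permutation matrix, applying apply_P and then apply_PT returns the original vector; for a full-rank A, p = 0:n-1 and both routines reduce to a plain copy.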

Remark 1: p is always base 0, independent of base index of A.

Remark 2: only CPU (Host) path is provided.

Input
 parameter cusolverSp MemSpace *Host MemSpace description handle host host handle to the cuSolver library context. m host host number of rows of matrix A. n host host number of columns of matrix A.