## cuSPARSE

The API reference guide for cuSPARSE, the CUDA sparse matrix library.

## 1. Introduction

The cuSPARSE library contains a set of basic linear algebra subroutines used for handling sparse matrices. The library targets matrices with a number of (structural) zero elements which represent > 95% of the total entries.

Provide Feedback: Math-Libs-Feedback@nvidia.com

cuSPARSE Release Notes: cuda-toolkit-release-notes

It is implemented on top of the NVIDIA® CUDA™ runtime (which is part of the CUDA Toolkit) and is designed to be called from C and C++.

The library routines can be classified into four categories:

• Level 1: operations between a vector in sparse format and a vector in dense format
• Level 2: operations between a matrix in sparse format and a vector in dense format
• Level 3: operations between a matrix in sparse format and a set of vectors in dense format (which can also usually be viewed as a dense tall matrix)
• Conversion: operations that allow conversion between different matrix formats, and compression of csr matrices.

The cuSPARSE library allows developers to access the computational resources of the NVIDIA graphics processing unit (GPU), although it does not auto-parallelize across multiple GPUs. The cuSPARSE API assumes that input and output data reside in GPU (device) memory, unless it is explicitly indicated otherwise by the string DevHostPtr in a function parameter's name.

It is the responsibility of the developer to allocate memory and to copy data between GPU memory and CPU memory using standard CUDA runtime API routines, such as cudaMalloc(), cudaFree(), cudaMemcpy(), and cudaMemcpyAsync().

### 1.1. Naming Conventions

The cuSPARSE library functions are available for data types float, double, cuComplex, and cuDoubleComplex. The sparse Level 1, Level 2, and Level 3 functions follow this naming convention:

cusparse<t>[<matrix data format>]<operation>[<output matrix data format>]

where <t> can be S, D, C, Z, or X, corresponding to the data types float, double, cuComplex, cuDoubleComplex, and the generic type, respectively.

The <matrix data format> can be dense, coo, csr, or csc, corresponding to the dense, coordinate, compressed sparse row, and compressed sparse column formats, respectively.

Finally, the <operation> can be axpyi, gthr, gthrz, roti, or sctr, corresponding to the Level 1 functions; it also can be mv or sv, corresponding to the Level 2 functions, as well as mm or sm, corresponding to the Level 3 functions.

All of the functions have the return type cusparseStatus_t and are explained in more detail in the chapters that follow.

### 1.2. Asynchronous Execution

The cuSPARSE library functions are executed asynchronously with respect to the host and may return control to the application on the host before the result is ready. Developers can use the cudaDeviceSynchronize() function to ensure that the execution of a particular cuSPARSE library routine has completed.

A developer can also use the cudaMemcpy() routine to copy data from the device to the host and vice versa, using the cudaMemcpyDeviceToHost and cudaMemcpyHostToDevice parameters, respectively. In this case there is no need to add a call to cudaDeviceSynchronize() because the call to cudaMemcpy() with the above parameters is blocking and completes only when the results are ready on the host.

### 1.3. Static Library Support

Starting with release 6.5, the cuSPARSE Library is also delivered in a static form as libcusparse_static.a on Linux and Mac OSes. The static cuSPARSE library and all others static maths libraries depend on a common thread abstraction layer library called libculibos.a on Linux and Mac and culibos.lib on Windows.

For example, on linux, to compile a small application using cuSPARSE against the dynamic library, the following command can be used:

nvcc myCusparseApp.c  -lcusparse  -o myCusparseApp>

Whereas to compile against the static cuSPARSE library, the following command has to be used:

nvcc myCusparseApp.c  -lcusparse_static   -lculibos -o myCusparseApp>

It is also possible to use the native Host C++ compiler. Depending on the Host Operating system, some additional libraries like pthread or dl might be needed on the linking line. The following command on Linux is suggested :

g++ myCusparseApp.c  -lcusparse_static   -lculibos -lcudart_static -lpthread -ldl -I <cuda-toolkit-path>/include -L <cuda-toolkit-path>/lib64 -o myCusparseApp

Note that in the latter case, the library cuda is not needed. The CUDA Runtime will try to open explicitly the cuda library if needed. In the case of a system which does not have the CUDA driver installed, this allows the application to gracefully manage this issue and potentially run if a CPU-only path is available.

## 2. Using the cuSPARSE API

This chapter describes how to use the cuSPARSE library API. It is not a reference for the cuSPARSE API data types and functions; that is provided in subsequent chapters.

The library is thread safe and its functions can be called from multiple host threads. However, simultaneous read/writes of the same objects (or of the same handle) are not safe. Hence the handle must be private per thread, i.e., only one handle per thread is safe.

### 2.2. Scalar Parameters

In the cuSPARSE API, the scalar parameters $\alpha$ and $\beta$ can be passed by reference on the host or the device.

The few functions that return a scalar result, such as nnz(), return the resulting value by reference on the host or the device. Even though these functions return immediately, similarly to those that return matrix and vector results, the scalar result is not ready until execution of the routine on the GPU completes. This requires proper synchronization be used when reading the result from the host.

This feature allows the cuSPARSE library functions to execute completely asynchronously using streams, even when $\alpha$ and $\beta$ are generated by a previous kernel. This situation arises, for example, when the library is used to implement iterative methods for the solution of linear systems and eigenvalue problems [3].

### 2.3. Parallelism with Streams

If the application performs several small independent computations, or if it makes data transfers in parallel with the computation, CUDA streams can be used to overlap these tasks.

The application can conceptually associate a stream with each task. To achieve the overlap of computation between the tasks, the developer should create CUDA streams using the function cudaStreamCreate() and set the stream to be used by each individual cuSPARSE library routine by calling cusparseSetStream() just before calling the actual cuSPARSE routine. Then, computations performed in separate streams would be overlapped automatically on the GPU, when possible. This approach is especially useful when the computation performed by a single task is relatively small and is not enough to fill the GPU with work, or when there is a data transfer that can be performed in parallel with the computation.

When streams are used, we recommend using the new cuSPARSE API with scalar parameters and results passed by reference in the device memory to achieve maximum computational overlap.

Although a developer can create many streams, in practice it is not possible to have more than 16 concurrent kernels executing at the same time.

### 2.4. Compatibility and Versioning

The cuSPARSE APIs are intended to be backward compatible at the source level with future releases (unless stated otherwise in the release notes of a specific future release). In other words, if a program uses cuSPARSE, it should continue to compile and work correctly with newer versions of cuSPARSE without source code changes. cuSPARSE is not guaranteed to be backward compatible at the binary level. Using different versions of the cusparse.h header file and the shared library is not supported. Using different versions of cuSPARSE and the CUDA runtime is not supported. The APIs should be backward compatible at the source level for public functions in most cases

### 2.5. Optimization Notes

Most of the cuSPARSE routines can be optimized by exploiting CUDA Graphs capture and Hardware Memory Compression features.

More in details, a single cuSPARSE call or a sequence of calls can be captured by a CUDA Graph and executed in a second moment. This minimizes kernels launch overhead and allows the CUDA runtime to optimize the whole workflow. A full example of CUDA graphs capture applied to a cuSPARSE routine can be found in cuSPARSE Library Samples - CUDA Graph.

Secondly, the data types and functionalities involved in cuSPARSE are suitable for Hardware Memory Compression available in Ampere GPU devices (compute capability 8.0) or above. The feature allows memory compression for data with enough zero bytes without no loss of information. The device memory must be allocation with the CUDA driver APIs. A full example of Hardware Memory Compression applied to a cuSPARSE routine can be found in cuSPARSE Library Samples - Memory Compression.

## 3. cuSPARSE Indexing and Data Formats

The cuSPARSE library supports dense and sparse vector, and dense and sparse matrix formats.

### 3.1. Index Base Format

The library supports zero- and one-based indexing. The index base is selected through the cusparseIndexBase_t type, which is passed as a standalone parameter or as a field in the matrix descriptor cusparseMatDescr_t type.

### 3.1.1. Vector Formats

This section describes dense and sparse vector formats.

### 3.1.1.1. Dense Format

Dense vectors are represented with a single data array that is stored linearly in memory, such as the following $7×1$ dense vector.

 $\left[\begin{array}{ccccccc}1.0& 0.0& 0.0& 2.0& 3.0& 0.0& 4.0\end{array}\right]$

(This vector is referenced again in the next section.)

### 3.1.1.2. Sparse Format

Sparse vectors are represented with two arrays.

• The data array has the nonzero values from the equivalent array in dense format.

• The integer index array has the positions of the corresponding nonzero values in the equivalent array in dense format.

For example, the dense vector in section 3.2.1 can be stored as a sparse vector with one-based indexing.

 $\begin{array}{rcl}& & \left[\begin{array}{cccc}1.0& 2.0& 3.0& 4.0\end{array}\right]\\ & & \left[\begin{array}{cccc}1\phantom{.0}& 4\phantom{.0}& 5\phantom{.0}& 7\phantom{.0}\end{array}\right]\end{array}$

It can also be stored as a sparse vector with zero-based indexing.

 $\begin{array}{rcl}& & \left[\begin{array}{cccc}1.0& 2.0& 3.0& 4.0\end{array}\right]\\ & & \left[\begin{array}{cccc}0\phantom{.0}& 3\phantom{.0}& 4\phantom{.0}& 6\phantom{.0}\end{array}\right]\end{array}$

In each example, the top row is the data array and the bottom row is the index array, and it is assumed that the indices are provided in increasing order and that each index appears only once.

### 3.2. Matrix Formats

Dense and several sparse formats for matrices are discussed in this section.

### 3.2.1. Dense Format

The dense matrix X is assumed to be stored in column-major format in memory and is represented by the following parameters.

 m (integer) The number of rows in the matrix. n (integer) The number of columns in the matrix. ldX (integer) The leading dimension of X, which must be greater than or equal to m. If ldX is greater than m, then X represents a sub-matrix of a larger matrix stored in memory X (pointer) Points to the data array containing the matrix elements. It is assumed that enough storage is allocated for X to hold all of the matrix elements and that cuSPARSE library functions may access values outside of the sub-matrix, but will never overwrite them.

For example, m×n dense matrix X with leading dimension ldX can be stored with one-based indexing as shown.

 $\left[\begin{array}{cccc}{X}_{1,1}& {X}_{1,2}& \cdots & {X}_{1,n}\\ {X}_{2,1}& {X}_{2,2}& \cdots & {X}_{2,n}\\ ⋮& ⋮& \ddots & ⋮\\ {X}_{m,1}& {X}_{m,2}& \cdots & {X}_{m,n}\\ ⋮& ⋮& \ddots & ⋮\\ {X}_{ldX,1}& {X}_{ldX,2}& \cdots & {X}_{ldX,n}\end{array}\right]$

Its elements are arranged linearly in memory in the order below.

 $\left[\begin{array}{ccccccccccccc}{X}_{1,1}& {X}_{2,1}& \cdots & {X}_{m,1}& \cdots & {X}_{ldX,1}& \cdots & {X}_{1,n}& {X}_{2,n}& \cdots & {X}_{m,n}& \cdots & {X}_{ldX,n}\end{array}\right]$
Note: This format and notation are similar to those used in the NVIDIA CUDA cuBLAS library.

### 3.2.2. Coordinate Format (COO)

The m×n sparse matrix A is represented in COO format by the following parameters.

 nnz (integer) The number of nonzero elements in the matrix. cooValA (pointer) Points to the data array of length nnz that holds all nonzero values of A in row-major format. cooRowIndA (pointer) Points to the integer array of length nnz that contains the row indices of the corresponding elements in array cooValA. cooColIndA (pointer) Points to the integer array of length nnz that contains the column indices of the corresponding elements in array cooValA.

A sparse matrix in COO format is assumed to be stored in row-major format. Each COO entry consists of a row, column pair. The COO format is assumed to be sorted by row. Both sorted and unsorted column indices are supported.

For example, consider the following $4×5$ matrix A.

 $\left[\begin{array}{ccccc}1.0& 4.0& 0.0& 0.0& 0.0\\ 0.0& 2.0& 3.0& 0.0& 0.0\\ 5.0& 0.0& 0.0& 7.0& 8.0\\ 0.0& 0.0& 9.0& 0.0& 6.0\end{array}\right]$

It is stored in COO format with zero-based indexing this way.

 $\begin{array}{rcl}\text{cooValA}& =& \left[\begin{array}{ccccccccc}1.0& 4.0& 2.0& 3.0& 5.0& 7.0& 8.0& 9.0& 6.0\end{array}\right]\\ \text{cooRowIndA}& =& \left[\begin{array}{ccccccccc}0\phantom{.0}& 0\phantom{.0}& 1\phantom{.0}& 1\phantom{.0}& 2\phantom{.0}& 2\phantom{.0}& 2\phantom{.0}& 3\phantom{.0}& 3\phantom{.0}\end{array}\right]\\ \text{cooColIndA}& =& \left[\begin{array}{ccccccccc}0\phantom{.0}& 1\phantom{.0}& 1\phantom{.0}& 2\phantom{.0}& 0\phantom{.0}& 3\phantom{.0}& 4\phantom{.0}& 2\phantom{.0}& 4\phantom{.0}\end{array}\right]\end{array}$

In the COO format with one-based indexing, it is stored as shown.

 $\begin{array}{rcl}\text{cooValA}& =& \left[\begin{array}{ccccccccc}1.0& 4.0& 2.0& 3.0& 5.0& 7.0& 8.0& 9.0& 6.0\end{array}\right]\\ \text{cooRowIndA}& =& \left[\begin{array}{ccccccccc}1\phantom{.0}& 1\phantom{.0}& 2\phantom{.0}& 2\phantom{.0}& 3\phantom{.0}& 3\phantom{.0}& 3\phantom{.0}& 4\phantom{.0}& 4\phantom{.0}\end{array}\right]\\ \text{cooColIndA}& =& \left[\begin{array}{ccccccccc}1\phantom{.0}& 2\phantom{.0}& 2\phantom{.0}& 3\phantom{.0}& 1\phantom{.0}& 4\phantom{.0}& 5\phantom{.0}& 3\phantom{.0}& 5\phantom{.0}\end{array}\right]\end{array}$

### 3.2.3. Compressed Sparse Row Format (CSR)

The only way the CSR differs from the COO format is that the array containing the row indices is compressed in CSR format. The m×n sparse matrix A is represented in CSR format by the following parameters.

 nnz (integer) The number of nonzero elements in the matrix. csrValA (pointer) Points to the data array of length nnz that holds all nonzero values of A in row-major format. csrRowPtrA (pointer) Points to the integer array of length m+1 that holds indices into the arrays csrColIndA and csrValA. The first m entries of this array contain the indices of the first nonzero element in the ith row for i=i,...,m, while the last entry contains nnz+csrRowPtrA(0). In general, csrRowPtrA(0) is 0 or 1 for zero- and one-based indexing, respectively. csrColIndA (pointer) Points to the integer array of length nnz that contains the column indices of the corresponding elements in array csrValA.

Sparse matrices in CSR format are assumed to be stored in row-major CSR format. Both sorted and unsorted column indices are supported.

Consider again the $4×5$ matrixA.

 $\left[\begin{array}{ccccc}1.0& 4.0& 0.0& 0.0& 0.0\\ 0.0& 2.0& 3.0& 0.0& 0.0\\ 5.0& 0.0& 0.0& 7.0& 8.0\\ 0.0& 0.0& 9.0& 0.0& 6.0\end{array}\right]$

It is stored in CSR format with zero-based indexing as shown.

 $\begin{array}{rcl}\text{csrValA}& =& \left[\begin{array}{ccccccccc}1.0& 4.0& 2.0& 3.0& 5.0& 7.0& 8.0& 9.0& 6.0\end{array}\right]\\ \text{csrRowPtrA}& =& \left[\begin{array}{ccccc}0\phantom{.0}& 2\phantom{.0}& 4\phantom{.0}& 7\phantom{.0}& 9\phantom{.0}\end{array}\right]\\ \text{csrColIndA}& =& \left[\begin{array}{ccccccccc}0\phantom{.0}& 1\phantom{.0}& 1\phantom{.0}& 2\phantom{.0}& 0\phantom{.0}& 3\phantom{.0}& 4\phantom{.0}& 2\phantom{.0}& 4\phantom{.0}\end{array}\right]\end{array}$

This is how it is stored in CSR format with one-based indexing.

 $\begin{array}{rcl}\text{csrValA}& =& \left[\begin{array}{ccccccccc}1.0& 4.0& 2.0& 3.0& 5.0& 7.0& 8.0& 9.0& 6.0\end{array}\right]\\ \text{csrRowPtrA}& =& \left[\begin{array}{ccccc}1\phantom{.0}& 3\phantom{.0}& 5\phantom{.0}& 8\phantom{.0}& 10\phantom{.0}\end{array}\right]\\ \text{csrColIndA}& =& \left[\begin{array}{ccccccccc}1\phantom{.0}& 2\phantom{.0}& 2\phantom{.0}& 3\phantom{.0}& 1\phantom{.0}& 4\phantom{.0}& 5\phantom{.0}& 3\phantom{.0}& 5\phantom{.0}\end{array}\right]\end{array}$

### 3.2.4. Compressed Sparse Column Format (CSC)

The CSC format is different from the COO format in two ways: the matrix is stored in column-major format, and the array containing the column indices is compressed in CSC format. The m×n matrix A is represented in CSC format by the following parameters.

 nnz (integer) The number of nonzero elements in the matrix. cscValA (pointer) Points to the data array of length nnz that holds all nonzero values of A in column-major format. cscRowIndA (pointer) Points to the integer array of length nnz that contains the row indices of the corresponding elements in array cscValA. cscColPtrA (pointer) Points to the integer array of length n+1 that holds indices into the arrays cscRowIndA and cscValA. The first n entries of this array contain the indices of the first nonzero element in the ith row for i=i,...,n, while the last entry contains nnz+cscColPtrA(0). In general, cscColPtrA(0) is 0 or 1 for zero- and one-based indexing, respectively.
Note: The matrix A in CSR format has exactly the same memory layout as its transpose in CSC format (and vice versa).

For example, consider once again the $4×5$ matrix A.

 $\left[\begin{array}{ccccc}1.0& 4.0& 0.0& 0.0& 0.0\\ 0.0& 2.0& 3.0& 0.0& 0.0\\ 5.0& 0.0& 0.0& 7.0& 8.0\\ 0.0& 0.0& 9.0& 0.0& 6.0\end{array}\right]$

It is stored in CSC format with zero-based indexing this way.

 $\begin{array}{rcl}\text{cscValA}& =& \left[\begin{array}{ccccccccc}1.0& 5.0& 4.0& 2.0& 3.0& 9.0& 7.0& 8.0& 6.0\end{array}\right]\\ \text{cscRowIndA}& =& \left[\begin{array}{ccccccccc}0\phantom{.0}& 2\phantom{.0}& 0\phantom{.0}& 1\phantom{.0}& 1\phantom{.0}& 3\phantom{.0}& 2\phantom{.0}& 2\phantom{.0}& 3\phantom{.0}\end{array}\right]\\ \text{cscColPtrA}& =& \left[\begin{array}{cccccc}0\phantom{.0}& 2\phantom{.0}& 4\phantom{.0}& 6\phantom{.0}& 7\phantom{.0}& 9\phantom{.0}\end{array}\right]\end{array}$

In CSC format with one-based indexing, this is how it is stored.

 $\begin{array}{rcl}\text{cscValA}& =& \left[\begin{array}{ccccccccc}1.0& 5.0& 4.0& 2.0& 3.0& 9.0& 7.0& 8.0& 6.0\end{array}\right]\\ \text{cscRowIndA}& =& \left[\begin{array}{ccccccccc}1\phantom{.0}& 3\phantom{.0}& 1\phantom{.0}& 2\phantom{.0}& 2\phantom{.0}& 4\phantom{.0}& 3\phantom{.0}& 3\phantom{.0}& 4\phantom{.0}\end{array}\right]\\ \text{cscColPtrA}& =& \left[\begin{array}{cccccc}1\phantom{.0}& 3\phantom{.0}& 5\phantom{.0}& 7\phantom{.0}& 8\phantom{.0}& 10\phantom{.0}\end{array}\right]\end{array}$

Each pair of row and column indices appears only once.

### 3.2.5. Block Compressed Sparse Row Format (BSR)

The only difference between the CSR and BSR formats is the format of the storage element. The former stores primitive data types (single, double, cuComplex, and cuDoubleComplex) whereas the latter stores a two-dimensional square block of primitive data types. The dimension of the square block is $blockDim$ . The m×n sparse matrix A is equivalent to a block sparse matrix ${A}_{b}$ with $mb=\frac{m+blockDim-1}{blockDim}$ block rows and $nb=\frac{n+blockDim-1}{blockDim}$ block columns. If $m$ or $n$ is not multiple of $blockDim$ , then zeros are filled into ${A}_{b}$ .

A is represented in BSR format by the following parameters.

 blockDim (integer) Block dimension of matrix A. mb (integer) The number of block rows of A. nb (integer) The number of block columns of A. nnzb (integer) The number of nonzero blocks in the matrix. bsrValA (pointer) Points to the data array of length $nnzb\ast blockDi{m}^{2}$ that holds all elements of nonzero blocks of A. The block elements are stored in either column-major order or row-major order. bsrRowPtrA (pointer) Points to the integer array of length mb+1 that holds indices into the arrays bsrColIndA and bsrValA. The first mb entries of this array contain the indices of the first nonzero block in the ith block row for i=1,...,mb, while the last entry contains nnzb+bsrRowPtrA(0). In general, bsrRowPtrA(0) is 0 or 1 for zero- and one-based indexing, respectively. bsrColIndA (pointer) Points to the integer array of length nnzb that contains the column indices of the corresponding blocks in array bsrValA.

As with CSR format, (row, column) indices of BSR are stored in row-major order. The index arrays are first sorted by row indices and then within the same row by column indices.

For example, consider again the 4×5 matrix A.

 $\left[\begin{array}{ccccc}1.0& 4.0& 0.0& 0.0& 0.0\\ 0.0& 2.0& 3.0& 0.0& 0.0\\ 5.0& 0.0& 0.0& 7.0& 8.0\\ 0.0& 0.0& 9.0& 0.0& 6.0\end{array}\right]$

If $blockDim$ is equal to 2, then $mb$ is 2, $nb$ is 3, and matrix A is split into 2×3 block matrix ${A}_{b}$ . The dimension of ${A}_{b}$ is 4×6, slightly bigger than matrix $A$ , so zeros are filled in the last column of ${A}_{b}$ . The element-wise view of ${A}_{b}$ is this.

 $\left[\begin{array}{cccccc}1.0& 4.0& 0.0& 0.0& 0.0& 0.0\\ 0.0& 2.0& 3.0& 0.0& 0.0& 0.0\\ 5.0& 0.0& 0.0& 7.0& 8.0& 0.0\\ 0.0& 0.0& 9.0& 0.0& 6.0& 0.0\end{array}\right]$

Based on zero-based indexing, the block-wise view of ${A}_{b}$ can be represented as follows.

 ${A}_{b}=\left[\begin{array}{ccc}{A}_{00}& {A}_{01}& {A}_{02}\\ {A}_{10}& {A}_{11}& {A}_{12}\end{array}\right]$

The basic element of BSR is a nonzero ${A}_{ij}$ block, one that contains at least one nonzero element of A. Five of six blocks are nonzero in ${A}_{b}$ .

 ${A}_{00}=\left[\begin{array}{cc}1& 4\\ 0& 2\end{array}\right],{A}_{01}=\left[\begin{array}{cc}0& 0\\ 3& 0\end{array}\right],{A}_{10}=\left[\begin{array}{cc}5& 0\\ 0& 0\end{array}\right],{A}_{11}=\left[\begin{array}{cc}0& 7\\ 9& 0\end{array}\right],{A}_{12}=\left[\begin{array}{cc}8& 0\\ 6& 0\end{array}\right]$

BSR format only stores the information of nonzero blocks, including block indices $\left(i,j\right)$ and values ${A}_{ij}$ . Also row indices are compressed in CSR format.

 $\begin{array}{rcl}\text{bsrValA}& =& \left[\begin{array}{ccccc}{A}_{00}& {A}_{01}& {A}_{10}& {A}_{11}& {A}_{12}\end{array}\right]\\ \text{bsrRowPtrA}& =& \left[\begin{array}{ccc}0\phantom{.0}& 2\phantom{.0}& 5\end{array}\right]\\ \text{bsrColIndA}& =& \left[\begin{array}{ccccc}0\phantom{.0}& 1\phantom{.0}& 0\phantom{.0}& 1\phantom{.0}& 2\end{array}\right]\end{array}$

There are two ways to arrange the data element of block ${A}_{ij}$ : row-major order and column-major order. Under column-major order, the physical storage of bsrValA is this.

 $\begin{array}{ccccc}bsrValA=\left[1\phantom{.0}0\phantom{.0}4\phantom{.0}2\phantom{.0}& 0\phantom{.0}3\phantom{.0}0\phantom{.0}0\phantom{.0}& 5\phantom{.0}0\phantom{.0}0\phantom{.0}0\phantom{.0}& 0\phantom{.0}9\phantom{.0}7\phantom{.0}0\phantom{.0}& 8\phantom{.0}6\phantom{.0}0\phantom{.0}0\phantom{.0}\right]\end{array}$

Under row-major order, the physical storage of bsrValA is this.

 $\begin{array}{ccccc}bsrValA=\left[1\phantom{.0}4\phantom{.0}0\phantom{.0}2\phantom{.0}& 0\phantom{.0}0\phantom{.0}3\phantom{.0}0\phantom{.0}& 5\phantom{.0}0\phantom{.0}0\phantom{.0}0\phantom{.0}& 0\phantom{.0}7\phantom{.0}9\phantom{.0}0\phantom{.0}& 8\phantom{.0}0\phantom{.0}6\phantom{.0}0\phantom{.0}\right]\end{array}$

Similarly, in BSR format with one-based indexing and column-major order, A can be represented by the following.

 ${A}_{b}=\left[\begin{array}{ccc}{A}_{11}& {A}_{12}& {A}_{13}\\ {A}_{21}& {A}_{22}& {A}_{23}\end{array}\right]$
 $\begin{array}{ccccc}bsrValA=\left[1\phantom{.0}0\phantom{.0}4\phantom{.0}2\phantom{.0}& 0\phantom{.0}3\phantom{.0}0\phantom{.0}0\phantom{.0}& 5\phantom{.0}0\phantom{.0}0\phantom{.0}0\phantom{.0}& 0\phantom{.0}9\phantom{.0}7\phantom{.0}0\phantom{.0}& 8\phantom{.0}6\phantom{.0}0\phantom{.0}0\phantom{.0}\right]\end{array}$
 $\begin{array}{rcl}\text{bsrRowPtrA}& =& \left[\begin{array}{ccc}1\phantom{.0}& 3\phantom{.0}& 6\end{array}\right]\\ \text{bsrColIndA}& =& \left[\begin{array}{ccccc}1\phantom{.0}& 2\phantom{.0}& 1\phantom{.0}& 2\phantom{.0}& 3\end{array}\right]\end{array}$
Note: The general BSR format has two parameters, rowBlockDim and colBlockDim. rowBlockDim is number of rows within a block and colBlockDim is number of columns within a block. If rowBlockDim=colBlockDim, general BSR format is the same as BSR format. If rowBlockDim=colBlockDim=1, general BSR format is the same as CSR format. The conversion routine gebsr2gebsr is used to do conversion among CSR, BSR and general BSR.
Note: In the cuSPARSE Library, the storage format of blocks in BSR format can be column-major or row-major, independently of the base index. However, if the developer uses BSR format from the Math Kernel Library (MKL) and wants to directly interface with the cuSPARSE Library, then cusparseDirection_tCUSPARSE_DIRECTION_COLUMN should be used if the base index is one; otherwise, cusparseDirection_tCUSPARSE_DIRECTION_ROW should be used.

### 3.2.6. Extended BSR Format (BSRX)

BSRX is the same as the BSR format, but the array bsrRowPtrA is separated into two parts. The first nonzero block of each row is still specified by the array bsrRowPtrA, which is the same as in BSR, but the position next to the last nonzero block of each row is specified by the array bsrEndPtrA. Briefly, BSRX format is simply like a 4-vector variant of BSR format.

Matrix A is represented in BSRX format by the following parameters.

 blockDim (integer) Block dimension of matrix A. mb (integer) The number of block rows of A. nb (integer) The number of block columns of A. nnzb (integer) number of nonzero blocks in the matrix A. bsrValA (pointer) Points to the data array of length $nnzb\ast blockDi{m}^{2}$ that holds all the elements of the nonzero blocks of A. The block elements are stored in either column-major order or row-major order. bsrRowPtrA (pointer) Points to the integer array of length mb that holds indices into the arrays bsrColIndA and bsrValA; bsrRowPtrA(i) is the position of the first nonzero block of the ith block row in bsrColIndA and bsrValA. bsrEndPtrA (pointer) Points to the integer array of length mb that holds indices into the arrays bsrColIndA and bsrValA; bsrRowPtrA(i) is the position next to the last nonzero block of the ith block row in bsrColIndA and bsrValA. bsrColIndA (pointer) Points to the integer array of length nnzb that contains the column indices of the corresponding blocks in array bsrValA.

A simple conversion between BSR and BSRX can be done as follows. Suppose the developer has a 2×3 block sparse matrix ${A}_{b}$ represented as shown.

 ${A}_{b}=\left[\begin{array}{ccc}{A}_{00}& {A}_{01}& {A}_{02}\\ {A}_{10}& {A}_{11}& {A}_{12}\end{array}\right]$

Assume it has this BSR format.

 $\begin{array}{rcl}\text{bsrValA of BSR}& =& \left[\begin{array}{ccccc}{A}_{00}& {A}_{01}& {A}_{10}& {A}_{11}& {A}_{12}\end{array}\right]\\ \text{bsrRowPtrA of BSR}& =& \left[\begin{array}{ccc}0\phantom{.0}& 2\phantom{.0}& 5\end{array}\right]\\ \text{bsrColIndA of BSR}& =& \left[\begin{array}{ccccc}0\phantom{.0}& 1\phantom{.0}& 0\phantom{.0}& 1\phantom{.0}& 2\end{array}\right]\end{array}$

The bsrRowPtrA of the BSRX format is simply the first two elements of the bsrRowPtrA BSR format. The bsrEndPtrA of BSRX format is the last two elements of the bsrRowPtrA of BSR format.

 $\begin{array}{rcl}\text{bsrRowPtrA of BSRX}& =& \left[\begin{array}{cc}0\phantom{.0}& 2\end{array}\right]\\ \text{bsrEndPtrA of BSRX}& =& \left[\begin{array}{cc}2\phantom{.0}& 5\end{array}\right]\end{array}$

The advantage of the BSRX format is that the developer can specify a submatrix in the original BSR format by modifying bsrRowPtrA and bsrEndPtrA while keeping bsrColIndA and bsrValA unchanged.

For example, to create another block matrix $\stackrel{˜}{A}=\left[\begin{array}{ccc}O& O& O\\ O& {A}_{11}& O\end{array}\right]$ that is slightly different from $A$ , the developer can keep bsrColIndA and bsrValA, but reconstruct $\stackrel{˜}{A}$ by properly setting of bsrRowPtrA and bsrEndPtrA. The following 4-vector characterizes $\stackrel{˜}{A}$ .

## 4. cuSPARSE Types Reference

### 4.1. Data types

The float, double, cuComplex, and cuDoubleComplex data types are supported. The first two are standard C data types, while the last two are exported from cuComplex.h.

### 4.2. cusparseStatus_t

This data type represents the status returned by the library functions and it can have the following values

Value Description
CUSPARSE_STATUS_SUCCESS

The operation completed successfully

CUSPARSE_STATUS_NOT_INITIALIZED

The cuSPARSE library was not initialized. This is usually caused by the lack of a prior call, an error in the CUDA Runtime API called by the cuSPARSE routine, or an error in the hardware setup

To correct: call cusparseCreate() prior to the function call; and check that the hardware, an appropriate version of the driver, and the cuSPARSE library are correctly installed

The error also applies to generic APIs ( Generic APIs reference) for indicating a matrix/vector descriptor not initialized

CUSPARSE_STATUS_ALLOC_FAILED

Resource allocation failed inside the cuSPARSE library. This is usually caused by a device memory allocation (cudaMalloc()) or by a host memory allocation failure

To correct: prior to the function call, deallocate previously allocated memory as much as possible

CUSPARSE_STATUS_INVALID_VALUE

An unsupported value or parameter was passed to the function (a negative vector size, for example)

To correct: ensure that all the parameters being passed have valid values

CUSPARSE_STATUS_ARCH_MISMATCH

The function requires a feature absent from the device architecture

To correct: compile and run the application on a device with appropriate compute capability

CUSPARSE_STATUS_EXECUTION_FAILED

The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons

To correct: check that the hardware, an appropriate version of the driver, and the cuSPARSE library are correctly installed

CUSPARSE_STATUS_INTERNAL_ERROR

An internal cuSPARSE operation failed

To correct: check that the hardware, an appropriate version of the driver, and the cuSPARSE library are correctly installed. Also, check that the memory passed as a parameter to the routine is not being deallocated prior to the routine completion

CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED

The matrix type is not supported by this function. This is usually caused by passing an invalid matrix descriptor to the function

To correct: check that the fields in cusparseMatDescr_t descrA were set correctly

CUSPARSE_STATUS_NOT_SUPPORTED

The operation or data type combination is currently not supported by the function

CUSPARSE_STATUS_INSUFFICIENT_RESOURCES

The resources for the computation, such as GPU global or shared memory, are not sufficient to complete the operation. The error can also indicate that the current computation mode (e.g. bit size of sparse matrix indices) does not allow to handle the given input

### 4.3. cusparseHandle_t

This is a pointer type to an opaque cuSPARSE context, which the user must initialize by calling prior to calling cusparseCreate() any other library function. The handle created and returned by cusparseCreate() must be passed to every cuSPARSE function.

### 4.4. cusparsePointerMode_t

This type indicates whether the scalar values are passed by reference on the host or device. It is important to point out that if several scalar values are passed by reference in the function call, all of them will conform to the same single pointer mode. The pointer mode can be set and retrieved using cusparseSetPointerMode() and cusparseGetPointerMode() routines, respectively.

Value Meaning

CUSPARSE_POINTER_MODE_HOST

the scalars are passed by reference on the host.

CUSPARSE_POINTER_MODE_DEVICE

the scalars are passed by reference on the device.

### 4.5. cusparseOperation_t

This type indicates which operations need to be performed with the sparse matrix.

Value Meaning

CUSPARSE_OPERATION_NON_TRANSPOSE

the non-transpose operation is selected.

CUSPARSE_OPERATION_TRANSPOSE

the transpose operation is selected.

CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE

the conjugate transpose operation is selected.

### 4.6. cusparseAction_t

This type indicates whether the operation is performed only on indices or on data and indices.

Value Meaning

CUSPARSE_ACTION_SYMBOLIC

the operation is performed only on indices.

CUSPARSE_ACTION_NUMERIC

the operation is performed on data and indices.

### 4.7. cusparseDirection_t

This type indicates whether the elements of a dense matrix should be parsed by rows or by columns (assuming column-major storage in memory of the dense matrix) in function cusparse[S|D|C|Z]nnz. Besides storage format of blocks in BSR format is also controlled by this type.

Value Meaning

CUSPARSE_DIRECTION_ROW

the matrix should be parsed by rows.

CUSPARSE_DIRECTION_COLUMN

the matrix should be parsed by columns.

### 4.8. cusparseMatDescr_t

This structure is used to describe the shape and properties of a matrix.

typedef struct {
cusparseMatrixType_t MatrixType;
cusparseFillMode_t FillMode;
cusparseDiagType_t DiagType;
cusparseIndexBase_t IndexBase;
} cusparseMatDescr_t;

### 4.8.1. cusparseDiagType_t

This type indicates if the matrix diagonal entries are unity. The diagonal elements are always assumed to be present, but if CUSPARSE_DIAG_TYPE_UNIT is passed to an API routine, then the routine assumes that all diagonal entries are unity and will not read or modify those entries. Note that in this case the routine assumes the diagonal entries are equal to one, regardless of what those entries are actually set to in memory.

Value Meaning

CUSPARSE_DIAG_TYPE_NON_UNIT

the matrix diagonal has non-unit elements.

CUSPARSE_DIAG_TYPE_UNIT

the matrix diagonal has unit elements.

### 4.8.2. cusparseFillMode_t

This type indicates if the lower or upper part of a matrix is stored in sparse storage.

Value Meaning

CUSPARSE_FILL_MODE_LOWER

the lower triangular part is stored.

CUSPARSE_FILL_MODE_UPPER

the upper triangular part is stored.

### 4.8.3. cusparseIndexBase_t

This type indicates if the base of the matrix indices is zero or one.

Value Meaning

CUSPARSE_INDEX_BASE_ZERO

the base index is zero.

CUSPARSE_INDEX_BASE_ONE

the base index is one.

### 4.8.4. cusparseMatrixType_t

This type indicates the type of matrix stored in sparse storage. Notice that for symmetric, Hermitian and triangular matrices only their lower or upper part is assumed to be stored.

The whole idea of matrix type and fill mode is to keep minimum storage for symmetric/Hermitian matrix, and also to take advantage of symmetric property on SpMV (Sparse Matrix Vector multiplication). To compute y=A*x when A is symmetric and only lower triangular part is stored, two steps are needed. First step is to compute y=(L+D)*x and second step is to compute y=L^T*x + y. Given the fact that the transpose operation y=L^T*x is 10x slower than non-transpose version y=L*x, the symmetric property does not show up any performance gain. It is better for the user to extend the symmetric matrix to a general matrix and apply y=A*x with matrix type CUSPARSE_MATRIX_TYPE_GENERAL.

In general, SpMV, preconditioners (incomplete Cholesky or incomplete LU) and triangular solver are combined together in iterative solvers, for example PCG and GMRES. If the user always uses general matrix (instead of symmetric matrix), there is no need to support other than general matrix in preconditioners. Therefore the new routines, [bsr|csr]sv2 (triangular solver), [bsr|csr]ilu02 (incomplete LU) and [bsr|csr]ic02 (incomplete Cholesky), only support matrix type CUSPARSE_MATRIX_TYPE_GENERAL.

Value Meaning

CUSPARSE_MATRIX_TYPE_GENERAL

the matrix is general.

CUSPARSE_MATRIX_TYPE_SYMMETRIC

the matrix is symmetric.

CUSPARSE_MATRIX_TYPE_HERMITIAN

the matrix is Hermitian.

CUSPARSE_MATRIX_TYPE_TRIANGULAR

the matrix is triangular.

### 4.9. cusparseAlgMode_t [DEPRECATED]

[[DEPRECATED]] The routine will be removed in the next major release

This is type for algorithm parameter to cusparseCsrmvEx() and cusparseCsrmvEx_bufferSize() functions.

Value Meaning

CUSPARSE_ALG_MERGE_PATH

Use load-balancing algorithm that suits better for irregular nonzero-patterns.

### 4.10. cusparseColorInfo_t

This is a pointer type to an opaque structure holding the information used in csrcolor().

### 4.11. cusparseSolvePolicy_t

This type indicates whether level information is generated and used in csrsv2, csric02, csrilu02, bsrsv2, bsric02 and bsrilu02.

Value Meaning

CUSPARSE_SOLVE_POLICY_NO_LEVEL

no level information is generated and used.

CUSPARSE_SOLVE_POLICY_USE_LEVEL

generate and use level information.

### 4.12. bsric02Info_t

This is a pointer type to an opaque structure holding the information used in bsric02_bufferSize(), bsric02_analysis(), and bsric02().

### 4.13. bsrilu02Info_t

This is a pointer type to an opaque structure holding the information used in bsrilu02_bufferSize(), bsrilu02_analysis(), and bsrilu02().

### 4.14. bsrsm2Info_t

This is a pointer type to an opaque structure holding the information used in bsrsm2_bufferSize(), bsrsm2_analysis(), and bsrsm2_solve().

### 4.15. bsrsv2Info_t

This is a pointer type to an opaque structure holding the information used in bsrsv2_bufferSize(), bsrsv2_analysis(), and bsrsv2_solve().

### 4.16. csrgemm2Info_t [DEPRECATED]

[[DEPRECATED]] The routine will be removed in the next major release

This is a pointer type to an opaque structure holding the information used in csrgemm2_bufferSizeExt(), and csrgemm2().

### 4.17. csric02Info_t

This is a pointer type to an opaque structure holding the information used in csric02_bufferSize(), csric02_analysis(), and csric02().

### 4.18. csrilu02Info_t

This is a pointer type to an opaque structure holding the information used in csrilu02_bufferSize(), csrilu02_analysis(), and csrilu02().

### 4.19. csrsm2Info_t [DEPRECATED]

[[DEPRECATED]] The routine will be removed in the next major release

This is a pointer type to an opaque structure holding the information used in csrsm2_bufferSize(), csrsm2_analysis(), and csrsm2_solve().

### 4.20. csrsv2Info_t [DEPRECATED]

[[DEPRECATED]] The routine will be removed in the next major release

This is a pointer type to an opaque structure holding the information used in csrsv2_bufferSize(), csrsv2_analysis(), and csrsv2_solve().

## 5. cuSPARSE Management Function Reference

The cuSPARSE functions for managing the library are described in this section.

### 5.1. cusparseCreate()

cusparseStatus_t
cusparseCreate(cusparseHandle_t *handle)


This function initializes the cuSPARSE library and creates a handle on the cuSPARSE context. It must be called before any other cuSPARSE API function is invoked. It allocates hardware resources necessary for accessing the GPU.

Param. In/out Meaning
handle IN The pointer to the handle to the cuSPARSE context

See cusparseStatus_t for the description of the return status

### 5.2. cusparseDestroy()

cusparseStatus_t
cusparseDestroy(cusparseHandle_t handle)


This function releases CPU-side resources used by the cuSPARSE library. The release of GPU-side resources may be deferred until the application shuts down.

Param. In/out Meaning
handle IN The handle to the cuSPARSE context

See cusparseStatus_t for the description of the return status

### 5.3. cusparseGetErrorName()

const char*
cusparseGetErrorString(cusparseStatus_t status)


The function returns the string representation of an error code enum name. If the error code is not recognized, "unrecognized error code" is returned.

Param. In/out Meaning
status IN Error code to convert to string
const char* OUT Pointer to a NULL-terminated string

### 5.4. cusparseGetErrorString()

const char*
cusparseGetErrorString(cusparseStatus_t status)


Returns the description string for an error code. If the error code is not recognized, "unrecognized error code" is returned.

Param. In/out Meaning
status IN Error code to convert to string
const char* OUT Pointer to a NULL-terminated string

### 5.5. cusparseGetProperty()

cusparseStatus_t
cusparseGetProperty(libraryPropertyType type,
int*                value)


The function returns the value of the requested property. Refer to libraryPropertyType for supported types.

Param. In/out Meaning
type IN Requested property
value OUT Value of the requested property

libraryPropertyType (defined in library_types.h):

Value Meaning
MAJOR_VERSION Enumerator to query the major version
MINOR_VERSION Enumerator to query the minor version
PATCH_LEVEL Number to identify the patch level

See cusparseStatus_t for the description of the return status

### 5.6. cusparseGetVersion()

cusparseStatus_t
cusparseGetVersion(cusparseHandle_t handle,
int*             version)


This function returns the version number of the cuSPARSE library.

Param. In/out Meaning
handle IN cuSPARSE handle
version OUT The version number of the library

See cusparseStatus_t for the description of the return status

### 5.7. cusparseGetPointerMode()

cusparseStatus_t
cusparseGetPointerMode(cusparseHandlet handle,
cusparsePointerMode_t *mode)


This function obtains the pointer mode used by the cuSPARSE library. Please see the section on the cusparsePointerMode_t type for more details.

Param. In/out Meaning
handle IN The handle to the cuSPARSE context
mode OUT One of the enumerated pointer mode types

See cusparseStatus_t for the description of the return status

### 5.8. cusparseSetPointerMode()

cusparseStatus_t
cusparseSetPointerMode(cusparseHandle_t handle,
cusparsePointerMode_t mode)


This function sets the pointer mode used by the cuSPARSE library. The default is for the values to be passed by reference on the host. Please see the section on the cublasPointerMode_t type for more details.

Param. In/out Meaning
handle IN The handle to the cuSPARSE context
mode IN One of the enumerated pointer mode types

See cusparseStatus_t for the description of the return status

### 5.9. cusparseGetStream()

cusparseStatus_t
cusparseGetStream(cusparseHandle_t handle, cudaStream_t *streamId)


This function gets the cuSPARSE library stream, which is being used to to execute all calls to the cuSPARSE library functions. If the cuSPARSE library stream is not set, all kernels use the default NULL stream.

Param. In/out Meaning
handle IN The handle to the cuSPARSE context
streamId OUT The stream used by the library

See cusparseStatus_t for the description of the return status

### 5.10. cusparseSetStream()

cusparseStatus_t
cusparseSetStream(cusparseHandle_t handle, cudaStream_t streamId)


This function sets the stream to be used by the cuSPARSE library to execute its routines.

Param. In/out Meaning
handle IN The handle to the cuSPARSE context
streamId IN The stream to be used by the library

See cusparseStatus_t for the description of the return status

## 6. cuSPARSE Logging

cuSPARSE logging mechanism can be enabled by setting the following environment variables before launching the target application:

cuSPARSE_LOG_LEVEL=<level> - while level is one of the following levels:
• 0 - Off - logging is disabled (default)
• 1 - Error - only errors will be logged
• 2 - Trace - API calls that launch CUDA kernels will log their parameters and important information
• 3 - Hints - hints that can potentially improve the application's performance
• 4 - Info - provides general information about the library execution, may contain details about heuristic status
• 5 - API Trace - API calls will log their parameter and important information
• 0 - Off
• 1 - Error
• 2 - Trace
• 4 - Hints
• 8 - Info
• 16 - API Trace

cuSPARSE_LOG_FILE=<file_name> - while file name is a path to a logging file. File name may contain %i, that will be replaced with the process id. E.g "<file_name>_%i.log".

If CUSPARSE_LOG_FILE is not defined, the log messages are printed to stdout.

Another option is to use the experimental cuSPARSE logging API. See:

Note: The logging mechanism is not available for the legacy APIs.

### 6.1. cusparseLoggerSetCallback()

cusparseStatus_t
cusparseLoggerSetCallback(cusparseLoggerCallback_t callback)


Experimental: The function sets the logging callback function.

Param. In/out Meaning
callback IN Pointer to a callback function
where cusparseLoggerCallback_t has the following signature:
void (*cusparseLoggerCallback_t)(int         logLevel,
const char* functionName,
const char* message)

Param. In/out Meaning
logLevel IN Selected log level
functionName IN The name of the API that logged this message
message IN The log message

See cusparseStatus_t for the description of the return status

### 6.2. cusparseLoggerSetFile()

cusparseStatus_t
cusparseLoggerSetFile(FILE* file)


Experimental: The function sets the logging output file. Note: once registered using this function call, the provided file handle must not be closed unless the function is called again to switch to a different file handle.

Param. In/out Meaning
file IN Pointer to an open file. File should have write permission

See cusparseStatus_t for the description of the return status

### 6.3. cusparseLoggerOpenFile()

cusparseStatus_t
cusparseLoggerOpenFile(const char* logFile)


Experimental: The function opens a logging output file in the given path.

Param. In/out Meaning
logFile IN Path of the logging output file

See cusparseStatus_t for the description of the return status

### 6.4. cusparseLoggerSetLevel()

cusparseStatus_t
cusparseLoggerSetLevel(int level)


Experimental: The function sets the value of the logging level. path.

Param. In/out Meaning
level IN Value of the logging level

See cusparseStatus_t for the description of the return status

cusparseStatus_t


Experimental: The function sets the value of the logging mask.

Param. In/out Meaning

See cusparseStatus_t for the description of the return status

### 6.6. cublasLtLoggerForceDisable()

cusparseStatus_t
cublasLtLoggerForceDisable()


Experimental: The function disables logging for the entier run.

See cusparseStatus_t for the description of the return status

## 7. cuSPARSE Helper Function Reference

The cuSPARSE helper functions are described in this section.

### 7.1. cusparseCreateColorInfo()

cusparseStatus_t
cusparseCreateColorInfo(cusparseColorInfo_t* info)


This function creates and initializes the cusparseColorInfo_t structure to default values.

Input
 info the pointer to the cusparseColorInfo_t structure

See cusparseStatus_t for the description of the return status

### 7.2. cusparseCreateMatDescr()

cusparseStatus_t
cusparseCreateMatDescr(cusparseMatDescr_t *descrA)

This function initializes the matrix descriptor. It sets the fields MatrixType and IndexBase to the default values CUSPARSE_MATRIX_TYPE_GENERAL and CUSPARSE_INDEX_BASE_ZERO , respectively, while leaving other fields uninitialized.

Input
 descrA the pointer to the matrix descriptor.

See cusparseStatus_t for the description of the return status

### 7.3. cusparseDestroyColorInfo()

cusparseStatus_t
cusparseDestroyColorInfo(cusparseColorInfo_t info)


This function destroys and releases any memory required by the structure.

Input

 info the pointer to the structure of csrcolor()

See cusparseStatus_t for the description of the return status

### 7.4. cusparseDestroyMatDescr()

cusparseStatus_t
cusparseDestroyMatDescr(cusparseMatDescr_t descrA)


This function releases the memory allocated for the matrix descriptor.

Input
 descrA the matrix descriptor.

See cusparseStatus_t for the description of the return status

### 7.5. cusparseGetMatDiagType()

cusparseDiagType_t
cusparseGetMatDiagType(const cusparseMatDescr_t descrA)


This function returns the DiagType field of the matrix descriptor descrA.

Input
 descrA the matrix descriptor.
Returned
 One of the enumerated diagType types.

### 7.6. cusparseGetMatFillMode()

cusparseFillMode_t
cusparseGetMatFillMode(const cusparseMatDescr_t descrA)


This function returns the FillMode field of the matrix descriptor descrA.

Input
 descrA the matrix descriptor.
Returned
 One of the enumerated fillMode types.

### 7.7. cusparseGetMatIndexBase()

cusparseIndexBase_t
cusparseGetMatIndexBase(const cusparseMatDescr_t descrA)


This function returns the IndexBase field of the matrix descriptor descrA.

Input
 descrA the matrix descriptor.
Returned
 One of the enumerated indexBase types.

### 7.8. cusparseGetMatType()

cusparseMatrixType_t
cusparseGetMatType(const cusparseMatDescr_t descrA)


This function returns the MatrixType field of the matrix descriptor descrA.

Input
 descrA the matrix descriptor.
Returned
 One of the enumerated matrix types.

### 7.9. cusparseSetMatDiagType()

cusparseStatus_t
cusparseSetMatDiagType(cusparseMatDescr_t descrA,
cusparseDiagType_t diagType)


This function sets the DiagType field of the matrix descriptor descrA.

Input
 diagType One of the enumerated diagType types.
Output
 descrA the matrix descriptor.

See cusparseStatus_t for the description of the return status

### 7.10. cusparseSetMatFillMode()

cusparseStatus_t
cusparseSetMatFillMode(cusparseMatDescr_t descrA,
cusparseFillMode_t fillMode)


This function sets the FillMode field of the matrix descriptor descrA.

Input
 fillMode One of the enumerated fillMode types.
Output
 descrA the matrix descriptor.

See cusparseStatus_t for the description of the return status

### 7.11. cusparseSetMatIndexBase()

cusparseStatus_t
cusparseSetMatIndexBase(cusparseMatDescr_t descrA,
cusparseIndexBase_t base)


This function sets the IndexBase field of the matrix descriptor descrA.

Input
 base One of the enumerated indexBase types.
Output
 descrA the matrix descriptor.

See cusparseStatus_t for the description of the return status

### 7.12. cusparseSetMatType()

cusparseStatus_t
cusparseSetMatType(cusparseMatDescr_t descrA, cusparseMatrixType_t type)


This function sets the MatrixType field of the matrix descriptor descrA.

Input
 type One of the enumerated matrix types.
Output
 descrA the matrix descriptor.

See cusparseStatus_t for the description of the return status

### 7.13. cusparseCreateCsrsv2Info() [DEPRECATED]

[[DEPRECATED]] The routine will be removed in the next major release

cusparseStatus_t
cusparseCreateCsrsv2Info(csrsv2Info_t *info);


This function creates and initializes the solve and analysis structure of csrsv2 to default values.

Input
 info the pointer to the solve and analysis structure of csrsv2.

See cusparseStatus_t for the description of the return status

### 7.14. cusparseDestroyCsrsv2Info() [DEPRECATED]

[[DEPRECATED]] The routine will be removed in the next major release

cusparseStatus_t
cusparseDestroyCsrsv2Info(csrsv2Info_t info);


This function destroys and releases any memory required by the structure.

Input

 info the solve (csrsv2_solve) and analysis (csrsv2_analysis) structure.

See cusparseStatus_t for the description of the return status

### 7.15. cusparseCreateCsrsm2Info() [DEPRECATED]

[[DEPRECATED]] The routine will be removed in the next major release

cusparseStatus_t
cusparseCreateCsrsm2Info(csrsm2Info_t *info);


This function creates and initializes the solve and analysis structure of csrsm2 to default values.

Input
 info the pointer to the solve and analysis structure of csrsm2.

See cusparseStatus_t for the description of the return status

### 7.16. cusparseDestroyCsrsm2Info() [DEPRECATED]

[[DEPRECATED]] The routine will be removed in the next major release

cusparseStatus_t
cusparseDestroyCsrsm2Info(csrsm2Info_t info);


This function destroys and releases any memory required by the structure.

Input

 info the solve (csrsm2_solve) and analysis (csrsm2_analysis) structure.

See cusparseStatus_t for the description of the return status

### 7.17. cusparseCreateCsric02Info()

cusparseStatus_t
cusparseCreateCsric02Info(csric02Info_t *info);


This function creates and initializes the solve and analysis structure of incomplete Cholesky to default values.

Input
 info the pointer to the solve and analysis structure of incomplete Cholesky.

See cusparseStatus_t for the description of the return status

### 7.18. cusparseDestroyCsric02Info()

cusparseStatus_t
cusparseDestroyCsric02Info(csric02Info_t info);


This function destroys and releases any memory required by the structure.

Input

 info the solve (csric02_solve) and analysis (csric02_analysis) structure.

See cusparseStatus_t for the description of the return status

### 7.19. cusparseCreateCsrilu02Info()

cusparseStatus_t
cusparseCreateCsrilu02Info(csrilu02Info_t *info);


This function creates and initializes the solve and analysis structure of incomplete LU to default values.

Input
 info the pointer to the solve and analysis structure of incomplete LU.

See cusparseStatus_t for the description of the return status

### 7.20. cusparseDestroyCsrilu02Info()

cusparseStatus_t
cusparseDestroyCsrilu02Info(csrilu02Info_t info);


This function destroys and releases any memory required by the structure.

Input

 info the solve (csrilu02_solve) and analysis (csrilu02_analysis) structure.

See cusparseStatus_t for the description of the return status

### 7.21. cusparseCreateBsrsv2Info()

cusparseStatus_t
cusparseCreateBsrsv2Info(bsrsv2Info_t *info);


This function creates and initializes the solve and analysis structure of bsrsv2 to default values.

Input
 info the pointer to the solve and analysis structure of bsrsv2.

See cusparseStatus_t for the description of the return status

### 7.22. cusparseDestroyBsrsv2Info()

cusparseStatus_t
cusparseDestroyBsrsv2Info(bsrsv2Info_t info);


This function destroys and releases any memory required by the structure.

Input

 info the solve (bsrsv2_solve) and analysis (bsrsv2_analysis) structure.

See cusparseStatus_t for the description of the return status

### 7.23. cusparseCreateBsrsm2Info()

cusparseStatus_t
cusparseCreateBsrsm2Info(bsrsm2Info_t *info);


This function creates and initializes the solve and analysis structure of bsrsm2 to default values.

Input
 info the pointer to the solve and analysis structure of bsrsm2.

See cusparseStatus_t for the description of the return status

### 7.24. cusparseDestroyBsrsm2Info()

cusparseStatus_t
cusparseDestroyBsrsm2Info(bsrsm2Info_t info);


This function destroys and releases any memory required by the structure.

Input

 info the solve (bsrsm2_solve) and analysis (bsrsm2_analysis) structure.

See cusparseStatus_t for the description of the return status

### 7.25. cusparseCreateBsric02Info()

cusparseStatus_t
cusparseCreateBsric02Info(bsric02Info_t *info);


This function creates and initializes the solve and analysis structure of block incomplete Cholesky to default values.

Input
 info the pointer to the solve and analysis structure of block incomplete Cholesky.

See cusparseStatus_t for the description of the return status

### 7.26. cusparseDestroyBsric02Info()

cusparseStatus_t
cusparseDestroyBsric02Info(bsric02Info_t info);


This function destroys and releases any memory required by the structure.

Input

 info the solve (bsric02_solve) and analysis (bsric02_analysis) structure.

See cusparseStatus_t for the description of the return status

### 7.27. cusparseCreateBsrilu02Info()

cusparseStatus_t
cusparseCreateBsrilu02Info(bsrilu02Info_t *info);


This function creates and initializes the solve and analysis structure of block incomplete LU to default values.

Input
 info the pointer to the solve and analysis structure of block incomplete LU.

See cusparseStatus_t for the description of the return status

### 7.28. cusparseDestroyBsrilu02Info()

cusparseStatus_t
cusparseDestroyBsrilu02Info(bsrilu02Info_t info);


This function destroys and releases any memory required by the structure.

Input

 info the solve (bsrilu02_solve) and analysis (bsrilu02_analysis) structure.

See cusparseStatus_t for the description of the return status

### 7.29. cusparseCreateCsrgemm2Info() [DEPRECATED]

[[DEPRECATED]] The routine will be removed in the next major release

cusparseStatus_t
cusparseCreateCsrgemm2Info(csrgemm2Info_t *info);


This function creates and initializes analysis structure of general sparse matrix-matrix multiplication.

Input
 info the pointer to the analysis structure of general sparse matrix-matrix multiplication.

See cusparseStatus_t for the description of the return status

### 7.30. cusparseDestroyCsrgemm2Info() [DEPRECATED]

[[DEPRECATED]] The routine will be removed in the next major release

cusparseStatus_t
cusparseDestroyCsrgemm2Info(csrgemm2Info_t info);


This function destroys and releases any memory required by the structure.

Input

 info opaque structure of csrgemm2.

See cusparseStatus_t for the description of the return status

### 7.31. cusparseCreatePruneInfo()

cusparseStatus_t
cusparseCreatePruneInfo(pruneInfo_t *info);


This function creates and initializes structure of prune to default values.

Input
 info the pointer to the structure of prune.

See cusparseStatus_t for the description of the return status

### 7.32. cusparseDestroyPruneInfo()

cusparseStatus_t
cusparseDestroyPruneInfo(pruneInfo_t info);


This function destroys and releases any memory required by the structure.

Input

 info the structure of prune.

See cusparseStatus_t for the description of the return status

## 8. cuSPARSE Level 1 Function Reference

This chapter describes sparse linear algebra functions that perform operations between dense and sparse vectors.

### 8.1. cusparse<t>axpyi() [DEPRECATED]

[[DEPRECATED]] use cusparseAxpby() instead. The routine will be removed in the next major release

usparseStatus_t
cusparseSaxpyi(cusparseHandle_t    handle,
int                 nnz,
const float*        alpha,
const float*        xVal,
const int*          xInd,
float*              y,
cusparseIndexBase_t idxBase)

cusparseStatus_t
cusparseDaxpyi(cusparseHandle_t    handle,
int                 nnz,
const double*       alpha,
const double*       xVal,
const int*          xInd,
double*             y,
cusparseIndexBase_t idxBase)

cusparseStatus_t
cusparseCaxpyi(cusparseHandle_t    handle,
int                 nnz,
const cuComplex*    alpha,
const cuComplex*    xVal,
const int*          xInd,
cuComplex*          y,
cusparseIndexBase_t idxBase)

cusparseStatus_t
cusparseZaxpyi(cusparseHandle_t       handle,
int                    nnz,
const cuDoubleComplex* alpha,
const cuDoubleComplex* xVal,
const int*             xInd,
cuDoubleComplex*       y,
cusparseIndexBase_t    idxBase)


This function multiplies the vector x in sparse format by the constant $\alpha$ and adds the result to the vector y in dense format. This operation can be written as

 $y=y+\alpha \ast x$

In other words,

for i=0 to nnz-1
y[xInd[i]-idxBase] = y[xInd[i]-idxBase] + alpha*xVal[i]
• The routine requires no extra storage
• The routine supports asynchronous execution
• The routine supports CUDA graph capture
Input
 handle handle to the cuSPARSE library context. nnz number of elements in vector x. alpha scalar used for multiplication. xVal vector with nnz nonzero values of vector x. xInd integer vector with nnz indices of the nonzero values of vector x. y vector in dense format. idxBase CUSPARSE_INDEX_BASE_ZERO or CUSPARSE_INDEX_BASE_ONE.
Output
 y updated vector in dense format (that is unchanged if nnz == 0).

See cusparseStatus_t for the description of the return status

### 8.2. cusparse<t>gthr() [DEPRECATED]

[[DEPRECATED]] use cusparseGather() instead. The routine will be removed in the next major release

cusparseStatus_t
cusparseSgthr(cusparseHandle_t    handle,
int                 nnz,
const float*        y,
float*              xVal,
const int*          xInd,
cusparseIndexBase_t idxBase)

cusparseStatus_t
cusparseDgthr(cusparseHandle_t    handle,
int                 nnz,
const double*       y,
double*             xVal,
const int*          xInd,
cusparseIndexBase_t idxBase)

cusparseStatus_t
cusparseCgthr(cusparseHandle_t    handle,
int                 nnz,
const cuComplex*    y,
cuComplex*          xVal,
const int*          xInd,
cusparseIndexBase_t idxBase)

cusparseStatus_t
cusparseZgthr(cusparseHandle_t        handle,
int                     nnz,
const cuDoubleComplex*  y,
cuDoubleComplex*        xVal,
const int*              xInd,
cusparseIndexBase_t     idxBase)


This function gathers the elements of the vector y listed in the index array xInd into the data array xVal.

• The routine requires no extra storage
• The routine supports asynchronous execution
• The routine supports CUDA graph capture
Input
 handle handle to the cuSPARSE library context. nnz number of elements in vector x. y vector in dense format (of size≥max(xInd)-idxBase+1). xInd integer vector with nnz indices of the nonzero values of vector x. idxBase CUSPARSE_INDEX_BASE_ZERO or CUSPARSE_INDEX_BASE_ONE.
Output
 xVal vector with nnz nonzero values that were gathered from vector y (that is unchanged if nnz == 0).

See cusparseStatus_t for the description of the return status

### 8.3. cusparse<t>gthrz() [DEPRECATED]

[[DEPRECATED]] use cusparseGather() instead. The routine will be removed in the next major release

cusparseStatus_t
cusparseSgthrz(cusparseHandle_t    handle,
int                 nnz,
float*              y,
float*              xVal,
const int*          xInd,
cusparseIndexBase_t idxBase)

cusparseStatus_t
cusparseDgthrz(cusparseHandle_t    handle,
int                 nnz,
double*             y,
double*             xVal,
const int*          xInd,
cusparseIndexBase_t idxBase)

cusparseStatus_t
cusparseCgthrz(cusparseHandle_t    handle,
int                 nnz,
cuComplex*          y,
cuComplex*          xVal,
const int*          xInd,
cusparseIndexBase_t idxBase)

cusparseStatus_t
cusparseZgthrz(cusparseHandle_t    handle,
int                 nnz,
cuDoubleComplex*    y,
cuDoubleComplex*    xVal,
const int*          xInd,
cusparseIndexBase_t idxBase)


This function gathers the elements of the vector y listed in the index array xInd into the data array xVal. Also, it zeros out the gathered elements in the vector y.

• The routine requires no extra storage
• The routine supports asynchronous execution
• The routine supports CUDA graph capture
Input
 handle handle to the cuSPARSE library context. nnz number of elements in vector x. y vector in dense format (of size≥max(xInd)-idxBase+1). xInd integer vector with nnz indices of the nonzero values of vector x. idxBase CUSPARSE_INDEX_BASE_ZERO or CUSPARSE_INDEX_BASE_ONE.
Output
 xVal vector with nnz nonzero values that were gathered from vector y (that is unchanged if nnz == 0). y vector in dense format with elements indexed by xInd set to zero (it is unchanged if nnz == 0).

See cusparseStatus_t for the description of the return status

### 8.4. cusparse<t>roti() [DEPRECATED]

[[DEPRECATED]] use cusparseRot() instead. The routine will be removed in the next major release

cusparseStatus_t
cusparseSroti(cusparseHandle_t    handle,
int                 nnz,
float*              xVal,
const int*          xInd,
float*              y,
const float*        c,
const float*        s,
cusparseIndexBase_t idxBase)

cusparseStatus_t
cusparseDroti(cusparseHandle_t    handle,
int                 nnz,
double*             xVal,
const int*          xInd,
double*             y,
const double*       c,
const double*       s,
cusparseIndexBase_t idxBase)


This function applies the Givens rotation matrix

 $G=\left(\begin{array}{rr}c& s\\ -s& c\end{array}\right)$

to sparse x and dense y vectors. In other words,

for i=0 to nnz-1
y[xInd[i]-idxBase] = c * y[xInd[i]-idxBase] - s*xVal[i]
x[i]               = c * xVal[i]            + s * y[xInd[i]-idxBase]

• The routine requires no extra storage
• The routine supports asynchronous execution
• The routine supports CUDA graph capture
Input
 handle handle to the cuSPARSE library context. nnz number of elements in vector x. xVal vector with nnz nonzero values of vector x. xInd integer vector with nnz indices of the nonzero values of vector x. y vector in dense format. c cosine element of the rotation matrix. s sine element of the rotation matrix. idxBase CUSPARSE_INDEX_BASE_ZERO or CUSPARSE_INDEX_BASE_ONE.
Output
 xVal updated vector in sparse format (that is unchanged if nnz == 0). y updated vector in dense format (that is unchanged if nnz == 0).

See cusparseStatus_t for the description of the return status

### 8.5. cusparse<t>sctr() [DEPRECATED]

[[DEPRECATED]] use cusparseScatter() instead. The routine will be removed in the next major release

cusparseStatus_t
cusparseSsctr(cusparseHandle_t    handle,
int                 nnz,
const float*        xVal,
const int*          xInd,
float*              y,
cusparseIndexBase_t idxBase)

cusparseStatus_t
cusparseDsctr(cusparseHandle_t    handle,
int                 nnz,
const double*       xVal,
const int*          xInd,
double*             y,
cusparseIndexBase_t idxBase)

cusparseStatus_t
cusparseCsctr(cusparseHandle_t    handle,
int                 nnz,
const cuComplex*    xVal,
const int*          xInd,
cuComplex*          y,
cusparseIndexBase_t idxBase)

cusparseStatus_t
cusparseZsctr(cusparseHandle_t       handle,
int                    nnz,
const cuDoubleComplex* xVal,
const int*             xInd,
cuDoubleComplex*       y,
cusparseIndexBase_t    idxBase)


This function scatters the elements of the vector x in sparse format into the vector y in dense format. It modifies only the elements of y whose indices are listed in the array xInd.

• The routine requires no extra storage
• The routine supports asynchronous execution
• The routine supports CUDA graph capture
Input
 handle handle to the cuSPARSE library context. nnz number of elements in vector x. xVal vector with nnz nonzero values of vector x. xInd integer vector with nnz indices of the nonzero values of vector x. y dense vector (of size≥max(xInd)-idxBase+1). idxBase CUSPARSE_INDEX_BASE_ZERO or CUSPARSE_INDEX_BASE_ONE.
Output
 y vector with nnz nonzero values that were scattered from vector x (that is unchanged if nnz == 0).

See cusparseStatus_t for the description of the return status

## 9. cuSPARSE Level 2 Function Reference

This chapter describes the sparse linear algebra functions that perform operations between sparse matrices and dense vectors.

In particular, the solution of sparse triangular linear systems is implemented in two phases. First, during the analysis phase, the sparse triangular matrix is analyzed to determine the dependencies between its elements by calling the appropriate csrsv2_analysis() function. The analysis is specific to the sparsity pattern of the given matrix and to the selected cusparseOperation_t type. The information from the analysis phase is stored in the parameter of type csrsv2Info_t that has been initialized previously with a call to cusparseCreateCsrsv2Info().

Second, during the solve phase, the given sparse triangular linear system is solved using the information stored in the csrsv2Info_t parameter by calling the appropriate csrsv2_solve() function. The solve phase may be performed multiple times with different right-hand sides, while the analysis phase needs to be performed only once. This is especially useful when a sparse triangular linear system must be solved for a set of different right-hand sides one at a time, while its coefficient matrix remains the same.

Finally, once all the solves have completed, the opaque data structure pointed to by the csrsv2Info_t parameter can be released by calling cusparseDestroyCsrsv2Info()

### 9.1. cusparse<t>bsrmv()

cusparseStatus_t
cusparseSbsrmv(cusparseHandle_t         handle,
cusparseDirection_t      dir,
cusparseOperation_t      trans,
int                      mb,
int                      nb,
int                      nnzb,
const float*             alpha,
const cusparseMatDescr_t descr,
const float*             bsrVal,
const int*               bsrRowPtr,
const int*               bsrColInd,
int                      blockDim,
const float*             x,
const float*             beta,
float*                   y)

cusparseStatus_t
cusparseDbsrmv(cusparseHandle_t         handle,
cusparseDirection_t      dir,
cusparseOperation_t      trans,
int                      mb,
int                      nb,
int                      nnzb,
const double*            alpha,
const cusparseMatDescr_t descr,
const double*            bsrVal,
const int*               bsrRowPtr,
const int*               bsrColInd,
int                      blockDim,
const double*            x,
const double*            beta,
double*                  y)

cusparseStatus_t
cusparseCbsrmv(cusparseHandle_t         handle,
cusparseDirection_t      dir,
cusparseOperation_t      trans,
int                      mb,
int                      nb,
int                      nnzb,
const cuComplex*         alpha,
const cusparseMatDescr_t descr,
const cuComplex*         bsrVal,
const int*               bsrRowPtr,
const int*               bsrColInd,
int                      blockDim,
const cuComplex*         x,
const cuComplex*         beta,
cuComplex*               y)

cusparseStatus_t
cusparseZbsrmv(cusparseHandle_t         handle,
cusparseDirection_t      dir,
cusparseOperation_t      trans,
int                      mb,
int                      nb,
int                      nnzb,
const cuDoubleComplex*   alpha,
const cusparseMatDescr_t descr,
const cuDoubleComplex*   bsrVal,
const int*               bsrRowPtr,
const int*               bsrColInd,
int                      blockDim,
const cuDoubleComplex*   x,
const cuDoubleComplex*   beta,
cuDoubleComplex*         y)


This function performs the matrix-vector operation

 $\text{y}=\alpha \ast \text{op}\left(A\right)\ast \text{x}+\beta \ast \text{y}$

where sparse matrix that is defined in BSR storage format by the three arrays bsrVal, bsrRowPtr, and bsrColInd); x and y are vectors; are scalars; and

$\text{op}\left(A\right)=\left\{\begin{array}{ll}A& \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE}\\ {A}^{T}& \text{if trans == CUSPARSE_OPERATION_TRANSPOSE}\\ {A}^{H}& \text{if trans == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE}\end{array}\right\$
bsrmv() has the following properties:
• The routine requires no extra storage
• The routine supports asynchronous execution
• The routine supports CUDA graph capture
• Only blockDim > 1 is supported
• Only CUSPARSE_OPERATION_NON_TRANSPOSE is supported, that is

 $\text{y}=\alpha \ast A\ast \text{x}+\beta \ast \text{y}$
• Only CUSPARSE_MATRIX_TYPE_GENERAL is supported.
• The size of vector x should be $\left(nb\ast blockDim\right)$ at least, and the size of vector y should be $\left(mb\ast blockDim\right)$ at least; otherwise, the kernel may return CUSPARSE_STATUS_EXECUTION_FAILED because of an out-of-bounds array.

For example, suppose the user has a CSR format and wants to try bsrmv(), the following code demonstrates how to use csr2bsr() conversion and bsrmv() multiplication in single precision.

// Suppose that A is m x n sparse matrix represented by CSR format,
// hx is a host vector of size n, and hy is also a host vector of size m.
// m and n are not multiple of blockDim.
// step 1: transform CSR to BSR with column-major order
int base, nnz;
int nnzb;
cusparseDirection_t dirA = CUSPARSE_DIRECTION_COLUMN;
int mb = (m + blockDim-1)/blockDim;
int nb = (n + blockDim-1)/blockDim;
cudaMalloc((void**)&bsrRowPtrC, sizeof(int) *(mb+1));
cusparseXcsr2bsrNnz(handle, dirA, m, n,
descrA, csrRowPtrA, csrColIndA, blockDim,
descrC, bsrRowPtrC, &nnzb);
cudaMalloc((void**)&bsrColIndC, sizeof(int)*nnzb);
cudaMalloc((void**)&bsrValC, sizeof(float)*(blockDim*blockDim)*nnzb);
cusparseScsr2bsr(handle, dirA, m, n,
descrA, csrValA, csrRowPtrA, csrColIndA, blockDim,
descrC, bsrValC, bsrRowPtrC, bsrColIndC);
// step 2: allocate vector x and vector y large enough for bsrmv
cudaMalloc((void**)&x, sizeof(float)*(nb*blockDim));
cudaMalloc((void**)&y, sizeof(float)*(mb*blockDim));
cudaMemcpy(x, hx, sizeof(float)*n, cudaMemcpyHostToDevice);
cudaMemcpy(y, hy, sizeof(float)*m, cudaMemcpyHostToDevice);
// step 3: perform bsrmv
cusparseSbsrmv(handle, dirA, transA, mb, nb, nnzb, &alpha,
descrC, bsrValC, bsrRowPtrC, bsrColIndC, blockDim, x, &beta, y);

Input
 handle handle to the cuSPARSE library context. dir storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN. trans the operation $\text{op}\left(A\right)$ . Only CUSPARSE_OPERATION_NON_TRANSPOSE is supported. mb number of block rows of matrix $A$. nb number of block columns of matrix $A$. nnzb number of nonzero blocks of matrix $A$. alpha scalar used for multiplication. descr the descriptor of matrix $A$. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. bsrVal array of nnz$\left(=$csrRowPtrA(mb)$-$csrRowPtrA(0)$\right)$ nonzero blocks of matrix $A$. bsrRowPtr integer array of mb$+1$ elements that contains the start of every block row and the end of the last block row plus one. bsrColInd integer array of nnz$\left(=$csrRowPtrA(mb)$-$csrRowPtrA(0)$\right)$ column indices of the nonzero blocks of matrix $A$. blockDim block dimension of sparse matrix $A$, larger than zero. x vector of $nb\ast blockDim$ elements. beta scalar used for multiplication. If beta is zero, y does not have to be a valid input. y vector of $mb\ast blockDim$ elements.
Output
 y updated vector.

See cusparseStatus_t for the description of the return status.

### 9.2. cusparse<t>bsrxmv()

cusparseStatus_t
cusparseSbsrxmv(cusparseHandle_t         handle,
cusparseDirection_t      dir,
cusparseOperation_t      trans,
int                      mb,
int                      nb,
int                      nnzb,
const float*             alpha,
const cusparseMatDescr_t descr,
const float*             bsrVal,
const int*               bsrRowPtr,
const int*               bsrEndPtr,
const int*               bsrColInd,
int                      blockDim,
const float*             x,
const float*             beta,
float*                   y)

cusparseStatus_t
cusparseDbsrxmv(cusparseHandle_t         handle,
cusparseDirection_t      dir,
cusparseOperation_t      trans,
int                      mb,
int                      nb,
int                      nnzb,
const double*            alpha,
const cusparseMatDescr_t descr,
const double*            bsrVal,
const int*               bsrRowPtr,
const int*               bsrEndPtr,
const int*               bsrColInd,
int                      blockDim,
const double*            x,
const double*            beta,
double*                  y)

cusparseStatus_t
cusparseCbsrxmv(cusparseHandle_t         handle,
cusparseDirection_t      dir,
cusparseOperation_t      trans,
int                      mb,
int                      nb,
int                      nnzb,
const cuComplex*         alpha,
const cusparseMatDescr_t descr,
const cuComplex*         bsrVal,
const int*               bsrRowPtr,
const int*               bsrEndPtr,
const int*               bsrColInd,
int                      blockDim,
const cuComplex*         x,
const cuComplex*         beta,
cuComplex*               y)

cusparseStatus_t
cusparseZbsrxmv(cusparseHandle_t         handle,
cusparseDirection_t      dir,
cusparseOperation_t      trans,
int                      mb,
int                      nb,
int                      nnzb,
const cuDoubleComplex*   alpha,
const cusparseMatDescr_t descr,
const cuDoubleComplex*   bsrVal,
const int*               bsrRowPtr,
const int*               bsrEndPtr,
const int*               bsrColInd,
int                      blockDim,
const cuDoubleComplex*   x,
const cuDoubleComplex*   beta,
cuDoubleComplex*         y)


This function performs a bsrmv and a mask operation

 $\text{y(mask)}=\left(\alpha \ast \text{op}\left(A\right)\ast \text{x}+\beta \ast \text{y}\right)\text{(mask)}$

where sparse matrix that is defined in BSRX storage format by the four arrays bsrVal, bsrRowPtr, bsrEndPtr, and bsrColInd); x and y are vectors; are scalars; and

$\text{op}\left(A\right)=\left\{\begin{array}{ll}A& \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE}\\ {A}^{T}& \text{if trans == CUSPARSE_OPERATION_TRANSPOSE}\\ {A}^{H}& \text{if trans == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE}\end{array}\right\$

The mask operation is defined by array bsrMaskPtr which contains updated block row indices of $y$ . If row $i$ is not specified in bsrMaskPtr, then bsrxmv() does not touch row block $i$ of $A$ and $y$ .

For example, consider the $2×3$ block matrix $A$:

 $\begin{array}{r}A=\left[\begin{array}{ccc}{A}_{11}& {A}_{12}& O\\ {A}_{21}& {A}_{22}& {A}_{23}\end{array}\right]\end{array}$

and its one-based BSR format (three vector form) is

 $\begin{array}{rcl}\text{bsrVal}& =& \left[\begin{array}{ccccc}{A}_{11}& {A}_{12}& {A}_{21}& {A}_{22}& {A}_{23}\end{array}\right]\\ \text{bsrRowPtr}& =& \left[\begin{array}{ccc}1\phantom{.0}& 3\phantom{.0}& 6\end{array}\right]\\ \text{bsrColInd}& =& \left[\begin{array}{ccccc}1\phantom{.0}& 2\phantom{.0}& 1\phantom{.0}& 2\phantom{.0}& 3\end{array}\right]\end{array}$

Suppose we want to do the following bsrmv operation on a matrix $\overline{A}$ which is slightly different from $A$ .

 $\left[\begin{array}{c}{y}_{1}\\ {y}_{2}\end{array}\right]:=alpha\ast \left(\stackrel{˜}{A}=\left[\begin{array}{ccc}O& O& O\\ O& {A}_{22}& O\end{array}\right]\right)\ast \left[\begin{array}{c}{x}_{1}\\ {x}_{2}\\ {x}_{3}\end{array}\right]+\left[\begin{array}{c}{y}_{1}\\ beta\ast {y}_{2}\end{array}\right]$

We don’t need to create another BSR format for the new matrix $\overline{A}$ , all that we should do is to keep bsrVal and bsrColInd unchanged, but modify bsrRowPtr and add an additional array bsrEndPtr which points to the last nonzero elements per row of $\overline{A}$ plus 1.

For example, the following bsrRowPtr and bsrEndPtr can represent matrix $\overline{A}$ :

 $\begin{array}{rcl}\text{bsrRowPtr}& =& \left[\begin{array}{cc}1\phantom{.0}& 4\end{array}\right]\\ \text{bsrEndPtr}& =& \left[\begin{array}{cc}1\phantom{.0}& 5\end{array}\right]\end{array}$

Further we can use a mask operator (specified by array bsrMaskPtr) to update particular block row indices of $y$ only because ${y}_{1}$ is never changed. In this case, bsrMaskPtr$=$ [2] and sizeOfMask=1.

The mask operator is equivalent to the following operation:

 $\left[\begin{array}{c}?\\ {y}_{2}\end{array}\right]:=alpha\ast \left[\begin{array}{ccc}?& ?& ?\\ O& {A}_{22}& O\end{array}\right]\ast \left[\begin{array}{c}{x}_{1}\\ {x}_{2}\\ {x}_{3}\end{array}\right]+beta\ast \left[\begin{array}{c}?\\ {y}_{2}\end{array}\right]$

If a block row is not present in the bsrMaskPtr, then no calculation is performed on that row, and the corresponding value in y is unmodified. The question mark "?" is used to inidcate row blocks not in bsrMaskPtr.

In this case, first row block is not present in bsrMaskPtr, so bsrRowPtr[0] and bsrEndPtr[0] are not touched also.

 $\begin{array}{rcl}\text{bsrRowPtr}& =& \left[\begin{array}{cc}?\phantom{.0}& 4\end{array}\right]\\ \text{bsrEndPtr}& =& \left[\begin{array}{cc}?\phantom{.0}& 5\end{array}\right]\end{array}$
bsrxmv() has the following properties:
• The routine requires no extra storage
• The routine supports asynchronous execution
• The routine supports CUDA graph capture
A couple of comments on bsrxmv():
• Only blockDim > 1 is supported
• Only CUSPARSE_OPERATION_NON_TRANSPOSE and CUSPARSE_MATRIX_TYPE_GENERAL are supported.
• Parameters bsrMaskPtr, bsrRowPtr, bsrEndPtr and bsrColInd are consistent with base index, either one-based or zero-based. The above example is one-based.

Input
 handle handle to the cuSPARSE library context. dir storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN. trans the operation $\text{op}\left(A\right)$ . Only CUSPARSE_OPERATION_NON_TRANSPOSE is supported. sizeOfMask number of updated block rows of $y$. mb number of block rows of matrix $A$. nb number of block columns of matrix $A$. nnzb number of nonzero blocks of matrix $A$. alpha scalar used for multiplication. descr the descriptor of matrix $A$. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. bsrVal array of nnz nonzero blocks of matrix $A$. bsrMaskPtr integer array of sizeOfMask elements that contains the indices corresponding to updated block rows. bsrRowPtr integer array of mb elements that contains the start of every block row. bsrEndPtr integer array of mb elements that contains the end of the every block row plus one. bsrColInd integer array of nnzb column indices of the nonzero blocks of matrix $A$. blockDim block dimension of sparse matrix $A$, larger than zero. x vector of $nb\ast blockDim$ elements. beta scalar used for multiplication. If beta is zero, y does not have to be a valid input. y vector of $mb\ast blockDim$ elements.

See cusparseStatus_t for the description of the return status

### 9.3. cusparse<t>bsrsv2_bufferSize()

cusparseStatus_t
cusparseSbsrsv2_bufferSize(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
int                      mb,
int                      nnzb,
const cusparseMatDescr_t descrA,
float*                   bsrValA,
const int*               bsrRowPtrA,
const int*               bsrColIndA,
int                      blockDim,
bsrsv2Info_t             info,
int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseDbsrsv2_bufferSize(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
int                      mb,
int                      nnzb,
const cusparseMatDescr_t descrA,
double*                  bsrValA,
const int*               bsrRowPtrA,
const int*               bsrColIndA,
int                      blockDim,
bsrsv2Info_t             info,
int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseCbsrsv2_bufferSize(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
int                      mb,
int                      nnzb,
const cusparseMatDescr_t descrA,
cuComplex*               bsrValA,
const int*               bsrRowPtrA,
const int*               bsrColIndA,
int                      blockDim,
bsrsv2Info_t             info,
int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseZbsrsv2_bufferSize(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
int                      mb,
int                      nnzb,
const cusparseMatDescr_t descrA,
cuDoubleComplex*         bsrValA,
const int*               bsrRowPtrA,
const int*               bsrColIndA,
int                      blockDim,
bsrsv2Info_t             info,
int*                     pBufferSizeInBytes)


This function returns size of the buffer used in bsrsv2, a new sparse triangular linear system op(A)*y =$\alpha$x.

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA); x and y are the right-hand-side and the solution vectors; $\alpha$ is a scalar; and

$\text{op}\left(A\right)=\left\{\begin{array}{ll}A& \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE}\\ {A}^{T}& \text{if trans == CUSPARSE_OPERATION_TRANSPOSE}\\ {A}^{H}& \text{if trans == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE}\end{array}\right\$

Although there are six combinations in terms of parameter trans and the upper (lower) triangular part of A, bsrsv2_bufferSize() returns the maximum size buffer among these combinations. The buffer size depends on the dimensions mb, blockDim, and the number of nonzero blocks of the matrix nnzb. If the user changes the matrix, it is necessary to call bsrsv2_bufferSize() again to have the correct buffer size; otherwise a segmentation fault may occur.

• The routine requires no extra storage
• The routine supports asynchronous execution
• The routine supports CUDA graph capture
Input
 handle handle to the cuSPARSE library context. dirA storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN. transA the operation $\text{op}\left(A\right)$ . mb number of block rows of matrix A. nnzb number of nonzero blocks of matrix A. descrA the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT. bsrValA array of nnzb$\left(=$bsrRowPtrA(mb)$-$bsrRowPtrA(0)$\right)$ nonzero blocks of matrix A. bsrRowPtrA integer array of mb$+1$ elements that contains the start of every block row and the end of the last block row plus one. bsrColIndA integer array of nnzb$\left(=$bsrRowPtrA(mb)$-$bsrRowPtrA(0)$\right)$ column indices of the nonzero blocks of matrix A. blockDim block dimension of sparse matrix A; must be larger than zero.
Output
 info record of internal states based on different algorithms. pBufferSizeInBytes number of bytes of the buffer used in the bsrsv2_analysis() and bsrsv2_solve().

See cusparseStatus_t for the description of the return status.

### 9.4. cusparse<t>bsrsv2_analysis()

cusparseStatus_t
cusparseSbsrsv2_analysis(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
int                      mb,
int                      nnzb,
const cusparseMatDescr_t descrA,
const float*             bsrValA,
const int*               bsrRowPtrA,
const int*               bsrColIndA,
int                      blockDim,
bsrsv2Info_t             info,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)

cusparseStatus_t
cusparseDbsrsv2_analysis(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
int                      mb,
int                      nnzb,
const cusparseMatDescr_t descrA,
const double*            bsrValA,
const int*               bsrRowPtrA,
const int*               bsrColIndA,
int                      blockDim,
bsrsv2Info_t             info,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)

cusparseStatus_t
cusparseDbsrsv2_analysis(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
int                      mb,
int                      nnzb,
const cusparseMatDescr_t descrA,
const cuComplex*         bsrValA,
const int*               bsrRowPtrA,
const int*               bsrColIndA,
int                      blockDim,
bsrsv2Info_t             info,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)

cusparseStatus_t
cusparseZbsrsv2_analysis(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
int                      mb,
int                      nnzb,
const cusparseMatDescr_t descrA,
const cuDoubleComplex*   bsrValA,
const int*               bsrRowPtrA,
const int*               bsrColIndA,
int                      blockDim,
bsrsv2Info_t             info,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)


This function performs the analysis phase of bsrsv2, a new sparse triangular linear system op(A)*y =$\alpha$x.

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA); x and y are the right-hand side and the solution vectors; $\alpha$ is a scalar; and

$\text{op}\left(A\right)=\left\{\begin{array}{ll}A& \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE}\\ {A}^{T}& \text{if trans == CUSPARSE_OPERATION_TRANSPOSE}\\ {A}^{H}& \text{if trans == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE}\end{array}\right\$

The block of BSR format is of size blockDim*blockDim, stored as column-major or row-major as determined by parameter dirA, which is either CUSPARSE_DIRECTION_COLUMN or CUSPARSE_DIRECTION_ROW. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored.

It is expected that this function will be executed only once for a given matrix and a particular operation type.

This function requires a buffer size returned by bsrsv2_bufferSize(). The address of pBuffer must be multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function bsrsv2_analysis() reports a structural zero and computes level information, which stored in the opaque structure info. The level information can extract more parallelism for a triangular solver. However bsrsv2_solve() can be done without level information. To disable level information, the user needs to specify the policy of the triangular solver as CUSPARSE_SOLVE_POLICY_NO_LEVEL.

Function bsrsv2_analysis() always reports the first structural zero, even when parameter policy is CUSPARSE_SOLVE_POLICY_NO_LEVEL. No structural zero is reported if CUSPARSE_DIAG_TYPE_UNIT is specified, even if block A(j,j) is missing for some j. The user needs to call cusparseXbsrsv2_zeroPivot() to know where the structural zero is.

It is the user's choice whether to call bsrsv2_solve() if bsrsv2_analysis() reports a structural zero. In this case, the user can still call bsrsv2_solve(), which will return a numerical zero at the same position as a structural zero. However the result x is meaningless.

• This function requires temporary extra storage that is allocated internally
• The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available
• The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available
Input
 handle handle to the cuSPARSE library context. dirA storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN. transA the operation $\text{op}\left(A\right)$ . mb number of block rows of matrix A. nnzb number of nonzero blocks of matrix A. descrA the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT. bsrValA array of nnzb$\left(=$bsrRowPtrA(mb)$-$bsrRowPtrA(0)$\right)$ nonzero blocks of matrix A. bsrRowPtrA integer array of mb$+1$ elements that contains the start of every block row and the end of the last block row plus one. bsrColIndA integer array of nnzb$\left(=$bsrRowPtrA(mb)$-$bsrRowPtrA(0)$\right)$ column indices of the nonzero blocks of matrix A. blockDim block dimension of sparse matrix A, larger than zero. info structure initialized using cusparseCreateBsrsv2Info(). policy the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL. pBuffer buffer allocated by the user, the size is return by bsrsv2_bufferSize().
Output
 info structure filled with information collected during the analysis phase (that should be passed to the solve phase unchanged).

See cusparseStatus_t for the description of the return status.

### 9.5. cusparse<t>bsrsv2_solve()

cusparseStatus_t
cusparseSbsrsv2_solve(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
int                      mb,
int                      nnzb,
const float*             alpha,
const cusparseMatDescr_t descrA,
const float*             bsrValA,
const int*               bsrRowPtrA,
const int*               bsrColIndA,
int                      blockDim,
bsrsv2Info_t             info,
const float*             x,
float*                   y,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)

cusparseStatus_t
cusparseDbsrsv2_solve(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
int                      mb,
int                      nnzb,
const double*            alpha,
const cusparseMatDescr_t descrA,
const double*            bsrValA,
const int*               bsrRowPtrA,
const int*               bsrColIndA,
int                      blockDim,
bsrsv2Info_t             info,
const double*            x,
double*                  y,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)

cusparseStatus_t
cusparseCbsrsv2_solve(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
int                      mb,
int                      nnzb,
const cuComplex*         alpha,
const cusparseMatDescr_t descrA,
const cuComplex*         bsrValA,
const int*               bsrRowPtrA,
const int*               bsrColIndA,
int                      blockDim,
bsrsv2Info_t             info,
const cuComplex*         x,
cuComplex*               y,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)

cusparseStatus_t
cusparseZbsrsv2_solve(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
int                      mb,
int                      nnzb,
const cuDoubleComplex*   alpha,
const cusparseMatDescr_t descrA,
const cuDoubleComplex*   bsrValA,
const int*               bsrRowPtrA,
const int*               bsrColIndA,
int                      blockDim,
bsrsv2Info_t             info,
const cuDoubleComplex*   x,
cuDoubleComplex*         y,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)


This function performs the solve phase of bsrsv2, a new sparse triangular linear system op(A)*y =$\alpha$x.

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA); x and y are the right-hand-side and the solution vectors; $\alpha$ is a scalar; and

$\text{op}\left(A\right)=\left\{\begin{array}{ll}A& \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE}\\ {A}^{T}& \text{if trans == CUSPARSE_OPERATION_TRANSPOSE}\\ {A}^{H}& \text{if trans == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE}\end{array}\right\$

The block in BSR format is of size blockDim*blockDim, stored as column-major or row-major as determined by parameter dirA, which is either CUSPARSE_DIRECTION_COLUMN or CUSPARSE_DIRECTION_ROW. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored. Function bsrsv02_solve() can support an arbitrary blockDim.

This function may be executed multiple times for a given matrix and a particular operation type.

This function requires a buffer size returned by bsrsv2_bufferSize(). The address of pBuffer must be multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Although bsrsv2_solve() can be done without level information, the user still needs to be aware of consistency. If bsrsv2_analysis() is called with policy CUSPARSE_SOLVE_POLICY_USE_LEVEL, bsrsv2_solve() can be run with or without levels. On the other hand, if bsrsv2_analysis() is called with CUSPARSE_SOLVE_POLICY_NO_LEVEL, bsrsv2_solve() can only accept CUSPARSE_SOLVE_POLICY_NO_LEVEL; otherwise, CUSPARSE_STATUS_INVALID_VALUE is returned.

The level information may not improve the performance, but may spend extra time doing analysis. For example, a tridiagonal matrix has no parallelism. In this case, CUSPARSE_SOLVE_POLICY_NO_LEVEL performs better than CUSPARSE_SOLVE_POLICY_USE_LEVEL. If the user has an iterative solver, the best approach is to do bsrsv2_analysis() with CUSPARSE_SOLVE_POLICY_USE_LEVEL once. Then do bsrsv2_solve() with CUSPARSE_SOLVE_POLICY_NO_LEVEL in the first run, and with CUSPARSE_SOLVE_POLICY_USE_LEVEL in the second run, and pick the fastest one to perform the remaining iterations.

Function bsrsv02_solve() has the same behavior as csrsv02_solve(). That is, bsr2csr(bsrsv02(A)) = csrsv02(bsr2csr(A)). The numerical zero of csrsv02_solve() means there exists some zero A(j,j). The numerical zero of bsrsv02_solve() means there exists some block A(j,j) that is not invertible.

Function bsrsv2_solve() reports the first numerical zero, including a structural zero. No numerical zero is reported if CUSPARSE_DIAG_TYPE_UNIT is specified, even if A(j,j) is not invertible for some j. The user needs to call cusparseXbsrsv2_zeroPivot() to know where the numerical zero is.

The function supports the following properties if pBuffer != NULL
• The routine requires no extra storage
• The routine supports asynchronous execution
• The routine supports CUDA graph capture

For example, suppose L is a lower triangular matrix with unit diagonal, then the following code solves L*y=x by level information.

// Suppose that L is m x m sparse matrix represented by BSR format,
// The number of block rows/columns is mb, and
// the number of nonzero blocks is nnzb.
// L is lower triangular with unit diagonal.
// Assumption:
// - dimension of matrix L is m(=mb*blockDim),
// - matrix L has nnz(=nnzb*blockDim*blockDim) nonzero elements,
// - handle is already created by cusparseCreate(),
// - (d_bsrRowPtr, d_bsrColInd, d_bsrVal) is BSR of L on device memory,
// - d_x is right hand side vector on device memory.
// - d_y is solution vector on device memory.
// - d_x and d_y are of size m.
cusparseMatDescr_t descr = 0;
bsrsv2Info_t info = 0;
int pBufferSize;
void *pBuffer = 0;
int structural_zero;
int numerical_zero;
const double alpha = 1.;
const cusparseSolvePolicy_t policy = CUSPARSE_SOLVE_POLICY_USE_LEVEL;
const cusparseOperation_t trans = CUSPARSE_OPERATION_NON_TRANSPOSE;
const cusparseDirection_t dir = CUSPARSE_DIRECTION_COLUMN;

// step 1: create a descriptor which contains
// - matrix L is base-1
// - matrix L is lower triangular
// - matrix L has unit diagonal, specified by parameter CUSPARSE_DIAG_TYPE_UNIT
//   (L may not have all diagonal elements.)
cusparseCreateMatDescr(&descr);
cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatFillMode(descr, CUSPARSE_FILL_MODE_LOWER);
cusparseSetMatDiagType(descr, CUSPARSE_DIAG_TYPE_UNIT);

// step 2: create a empty info structure
cusparseCreateBsrsv2Info(&info);

// step 3: query how much memory used in bsrsv2, and allocate the buffer
cusparseDbsrsv2_bufferSize(handle, dir, trans, mb, nnzb, descr,
d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, &pBufferSize);

// pBuffer returned by cudaMalloc is automatically aligned to 128 bytes.
cudaMalloc((void**)&pBuffer, pBufferSize);

// step 4: perform analysis
cusparseDbsrsv2_analysis(handle, dir, trans, mb, nnzb, descr,
d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim,
info, policy, pBuffer);
// L has unit diagonal, so no structural zero is reported.
status = cusparseXbsrsv2_zeroPivot(handle, info, &structural_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
printf("L(%d,%d) is missing\n", structural_zero, structural_zero);
}

// step 5: solve L*y = x
cusparseDbsrsv2_solve(handle, dir, trans, mb, nnzb, &alpha, descr,
d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info,
d_x, d_y, policy, pBuffer);
// L has unit diagonal, so no numerical zero is reported.
status = cusparseXbsrsv2_zeroPivot(handle, info, &numerical_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
printf("L(%d,%d) is zero\n", numerical_zero, numerical_zero);
}

// step 6: free resources
cudaFree(pBuffer);
cusparseDestroyBsrsv2Info(info);
cusparseDestroyMatDescr(descr);
cusparseDestroy(handle);

Input
 handle handle to the cuSPARSE library context. dirA storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN. transA the operation $\text{op}\left(A\right)$ . mb number of block rows and block columns of matrix A. alpha scalar used for multiplication. descrA the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT. bsrValA array of nnzb$\left(=$bsrRowPtrA(mb)$-$bsrRowPtrA(0)$\right)$ nonzero blocks of matrix A. bsrRowPtrA integer array of mb$+1$ elements that contains the start of every block row and the end of the last block row plus one. bsrColIndA integer array of nnzb$\left(=$bsrRowPtrA(mb)$-$bsrRowPtrA(0)$\right)$ column indices of the nonzero blocks of matrix A. blockDim block dimension of sparse matrix A, larger than zero. info structure with information collected during the analysis phase (that should have been passed to the solve phase unchanged). x right-hand-side vector of size m. policy the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL. pBuffer buffer allocated by the user, the size is returned by bsrsv2_bufferSize().
Output
 y solution vector of size m.

See cusparseStatus_t for the description of the return status.

### 9.6. cusparseXbsrsv2_zeroPivot()

cusparseStatus_t
cusparseXbsrsv2_zeroPivot(cusparseHandle_t handle,
bsrsv2Info_t     info,
int*             position)


If the returned error code is CUSPARSE_STATUS_ZERO_PIVOT, position=j means A(j,j) is either structural zero or numerical zero (singular block). Otherwise position=-1.

The position can be 0-based or 1-based, the same as the matrix.

Function cusparseXbsrsv2_zeroPivot() is a blocking call. It calls cudaDeviceSynchronize() to make sure all previous kernels are done.

The position can be in the host memory or device memory. The user can set the proper mode with cusparseSetPointerMode().

• The routine requires no extra storage
• The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available
• The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available
Input
 handle handle to the cuSPARSE library context. info info contains a structural zero or numerical zero if the user already called bsrsv2_analysis() or bsrsv2_solve().
Output
 position if no structural or numerical zero, position is -1; otherwise if A(j,j) is missing or U(j,j) is zero, position=j.

See cusparseStatus_t for the description of the return status

### 9.7. cusparseCsrmvEx() [DEPRECATED]

[[DEPRECATED]] use cusparseSpMV() instead. The routine will be removed in the next major release

cusparseStatus_t
cusparseCsrmvEx_bufferSize(cusparseHandle_t         handle,
cusparseAlgMode_t        alg,
cusparseOperation_t      transA,
int                      m,
int                      n,
int                      nnz,
const void*              alpha,
const cusparseMatDescr_t descrA,
const void*              csrValA,
const int*               csrRowPtrA,
const int*               csrColIndA,
const void*              x,
const void*              beta,
void*                    y,
size_t*                  bufferSizeInBytes)

cusparseStatus_t
cusparseCsrmvEx(cusparseHandle_t         handle,
cusparseAlgMode_t        alg,
cusparseOperation_t      transA,
int                      m,
int                      n,
int                      nnz,
const void*              alpha,
const cusparseMatDescr_t descrA,
const void*              csrValA,
const int*               csrRowPtrA,
const int*               csrColIndA,
const void*              x,
const void*              beta,
void*                    y,
void*                    buffer)


This function performs the matrix-vector operation

 $\text{y}=\alpha \ast \text{op}\left(A\right)\ast \text{x}+\beta \ast \text{y}$

A is an m×n sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA); x and y are vectors;

The function cusparseCsrmvEx_bufferSize returns the size of the workspace needed by cusparseCsrmvEx.

The function has the following limitations:
• All pointers should be aligned with 128 bytes
• Only CUSPARSE_OPERATION_NON_TRANSPOSE operation is supported
• Only CUSPARSE_MATRIX_TYPE_GENERAL matrix type is supported
• Only CUSPARSE_INDEX_BASE_ZERO indexing is supported
• Half-precision is not supported
• The minimum GPU architecture supported is SM_53
The function has the following properties:
• The routine requires no extra storage
• The routine supports asynchronous execution
• The routine supports CUDA graph capture
Input specifically required by cusparseCsrmvEx
 alg Algorithm implementation for csrmv, see cusparseAlgMode_t for possible values. alphatype Data type of alpha. csrValAtype Data type of csrValA. xtype Data type of x. betatype Data type of beta. ytype Data type of y. executiontype Data type used for computation. bufferSizeInBytes Pointer to a size_t variable, which will be assigned with the size of workspace needed by cusparseCsrmvEx. buffer Pointer to workspace buffer

See cusparseStatus_t for the description of the return status

### 9.8. cusparse<t>csrsv2_bufferSize() [DEPRECATED]

[[DEPRECATED]] use cusparseSpSV() instead. The routine will be removed in the next major release

cusparseStatus_t
cusparseScsrsv2_bufferSize(cusparseHandle_t         handle,
cusparseOperation_t      transA,
int                      m,
int                      nnz,
const cusparseMatDescr_t descrA,
float*                   csrValA,
const int*               csrRowPtrA,
const int*               csrColIndA,
csrsv2Info_t             info,
int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseDcsrsv2_bufferSize(cusparseHandle_t         handle,
cusparseOperation_t      transA,
int                      m,
int                      nnz,
const cusparseMatDescr_t descrA,
double*                  csrValA,
const int*               csrRowPtrA,
const int*               csrColIndA,
csrsv2Info_t             info,
int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseCcsrsv2_bufferSize(cusparseHandle_t         handle,
cusparseOperation_t      transA,
int                      m,
int                      nnz,
const cusparseMatDescr_t descrA,
cuComplex*               csrValA,
const int*               csrRowPtrA,
const int*               csrColIndA,
csrsv2Info_t             info,
int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseZcsrsv2_bufferSize(cusparseHandle_t         handle,
cusparseOperation_t      transA,
int                      m,
int                      nnz,
const cusparseMatDescr_t descrA,
cuDoubleComplex*         csrValA,
const int*               csrRowPtrA,
const int*               csrColIndA,
csrsv2Info_t             info,
int*                     pBufferSizeInBytes)


This function returns the size of the buffer used in csrsv2, a new sparse triangular linear system op(A)*y =$\alpha$x.

A is an m×m sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA); x and y are the right-hand-side and the solution vectors; $\alpha$ is a scalar; and

$\text{op}\left(A\right)=\left\{\begin{array}{ll}A& \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE}\\ {A}^{T}& \text{if trans == CUSPARSE_OPERATION_TRANSPOSE}\\ {A}^{H}& \text{if trans == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE}\end{array}\right\$

Although there are six combinations in terms of the parameter trans and the upper (lower) triangular part of A, csrsv2_bufferSize() returns the maximum size buffer of these combinations. The buffer size depends on the dimension and the number of nonzero elements of the matrix. If the user changes the matrix, it is necessary to call csrsv2_bufferSize() again to have the correct buffer size; otherwise, a segmentation fault may occur.

• The routine requires no extra storage
• The routine supports asynchronous execution
• The routine supports CUDA graph capture
Input
 handle handle to the cuSPARSE library context. transA the operation $\text{op}\left(A\right)$ . m number of rows of matrix A. nnz number of nonzero elements of matrix A. descrA the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT. csrValA array of nnz$\left(=$csrRowPtrA(m)$-$csrRowPtrA(0)$\right)$ nonzero elements of matrix A. csrRowPtrA integer array of m$+1$ elements that contains the start of every row and the end of the last row plus one. csrColIndA integer array of nnz$\left(=$csrRowPtrA(m)$-$csrRowPtrA(0)$\right)$ column indices of the nonzero elements of matrix A.
Output
 info record of internal states based on different algorithms. pBufferSizeInBytes number of bytes of the buffer used in the csrsv2_analysis and csrsv2_solve.

See cusparseStatus_t for the description of the return status

### 9.9. cusparse<t>csrsv2_analysis() [DEPRECATED]

[[DEPRECATED]] use cusparseSpSV() instead. The routine will be removed in the next major release

cusparseStatus_t
cusparseScsrsv2_analysis(cusparseHandle_t         handle,
cusparseOperation_t      transA,
int                      m,
int                      nnz,
const cusparseMatDescr_t descrA,
const float*             csrValA,
const int*               csrRowPtrA,
const int*               csrColIndA,
csrsv2Info_t             info,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)

cusparseStatus_t
cusparseDcsrsv2_analysis(cusparseHandle_t         handle,
cusparseOperation_t      transA,
int                      m,
int                      nnz,
const cusparseMatDescr_t descrA,
const double*            csrValA,
const int*               csrRowPtrA,
const int*               csrColIndA,
csrsv2Info_t             info,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)

cusparseStatus_t
cusparseCcsrsv2_analysis(cusparseHandle_t         handle,
cusparseOperation_t      transA,
int                      m,
int                      nnz,
const cusparseMatDescr_t descrA,
const cuComplex*         csrValA,
const int*               csrRowPtrA,
const int*               csrColIndA,
csrsv2Info_t             info,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)

cusparseStatus_t
cusparseZcsrsv2_analysis(cusparseHandle_t         handle,
cusparseOperation_t      transA,
int                      m,
int                      nnz,
const cusparseMatDescr_t descrA,
const cuDoubleComplex*   csrValA,
const int*               csrRowPtrA,
const int*               csrColIndA,
csrsv2Info_t             info,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)


This function performs the analysis phase of csrsv2, a new sparse triangular linear system op(A)*y =$\alpha$x.

A is an m×m sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA); x and y are the right-hand-side and the solution vectors; $\alpha$ is a scalar; and

$\text{op}\left(A\right)=\left\{\begin{array}{ll}A& \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE}\\ {A}^{T}& \text{if trans == CUSPARSE_OPERATION_TRANSPOSE}\\ {A}^{H}& \text{if trans == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE}\end{array}\right\$

It is expected that this function will be executed only once for a given matrix and a particular operation type.

This function requires a buffer size returned by csrsv2_bufferSize(). The address of pBuffer must be multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function csrsv2_analysis() reports a structural zero and computes level information that is stored in opaque structure info. The level information can extract more parallelism for a triangular solver. However csrsv2_solve() can be done without level information. To disable level information, the user needs to specify the policy of the triangular solver as CUSPARSE_SOLVE_POLICY_NO_LEVEL.

Function csrsv2_analysis() always reports the first structural zero, even if the policy is CUSPARSE_SOLVE_POLICY_NO_LEVEL. No structural zero is reported if CUSPARSE_DIAG_TYPE_UNIT is specified, even if A(j,j) is missing for some j. The user needs to call cusparseXcsrsv2_zeroPivot() to know where the structural zero is.

It is the user's choice whether to call csrsv2_solve() if csrsv2_analysis() reports a structural zero. In this case, the user can still call csrsv2_solve() which will return a numerical zero in the same position as the structural zero. However the result x is meaningless.

• This function requires temporary extra storage that is allocated internally
• The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available
• The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available
Input
 handle handle to the cuSPARSE library context. transA the operation $\text{op}\left(A\right)$ . m number of rows of matrix A. nnz number of nonzero elements of matrix A. descrA the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT. csrValA array of nnz$\left(=$csrRowPtrA(m)$-$csrRowPtrA(0)$\right)$ nonzero elements of matrix A. csrRowPtrA integer array of m$+1$ elements that contains the start of every row and the end of the last row plus one. csrColIndA integer array of nnz$\left(=$csrRowPtrA(m)$-$csrRowPtrA(0)$\right)$ column indices of the nonzero elements of matrix A. info structure initialized using cusparseCreateCsrsv2Info(). policy The supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL. pBuffer buffer allocated by the user, the size is returned by csrsv2_bufferSize().
Output
 info structure filled with information collected during the analysis phase (that should be passed to the solve phase unchanged).

See cusparseStatus_t for the description of the return status

### 9.10. cusparse<t>csrsv2_solve() [DEPRECATED]

[[DEPRECATED]] use cusparseSpSV() instead. The routine will be removed in the next major release

cusparseStatus_t
cusparseScsrsv2_solve(cusparseHandle_t         handle,
cusparseOperation_t      transA,
int                      m,
int                      nnz,
const float*             alpha,
const cusparseMatDescr_t descra,
const float*             csrValA,
const int*               csrRowPtrA,
const int*               csrColIndA,
csrsv2Info_t             info,
const float*             x,
float*                   y,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)

cusparseStatus_t
cusparseDcsrsv2_solve(cusparseHandle_t         handle,
cusparseOperation_t      transA,
int                      m,
int                      nnz,
const double*            alpha,
const cusparseMatDescr_t descra,
const double*            csrValA,
const int*               csrRowPtrA,
const int*               csrColIndA,
csrsv2Info_t             info,
const double*            x,
double*                  y,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)

cusparseStatus_t
cusparseCcsrsv2_solve(cusparseHandle_t         handle,
cusparseOperation_t      transA,
int                      m,
int                      nnz,
const cuComplex*         alpha,
const cusparseMatDescr_t descra,
const cuComplex*         csrValA,
const int*               csrRowPtrA,
const int*               csrColIndA,
csrsv2Info_t             info,
const cuComplex*         x,
cuComplex*               y,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)

cusparseStatus_t
cusparseZcsrsv2_solve(cusparseHandle_t         handle,
cusparseOperation_t      transA,
int                      m,
int                      nnz,
const cuDoubleComplex*   alpha,
const cusparseMatDescr_t descra,
const cuDoubleComplex*   csrValA,
const int*               csrRowPtrA,
const int*               csrColIndA,
csrsv2Info_t             info,
const cuDoubleComplex*   x,
cuDoubleComplex*         y,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)


This function performs the solve phase of csrsv2, a new sparse triangular linear system op(A)*y =$\alpha$x.

A is an m×m sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA); x and y are the right-hand-side and the solution vectors; $\alpha$ is a scalar; and

$\text{op}\left(A\right)=\left\{\begin{array}{ll}A& \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE}\\ {A}^{T}& \text{if trans == CUSPARSE_OPERATION_TRANSPOSE}\\ {A}^{H}& \text{if trans == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE}\end{array}\right\$

This function may be executed multiple times for a given matrix and a particular operation type.

This function requires the buffer size returned by csrsv2_bufferSize(). The address of pBuffer must be multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Although csrsv2_solve() can be done without level information, the user still needs to be aware of consistency. If csrsv2_analysis() is called with policy CUSPARSE_SOLVE_POLICY_USE_LEVEL, csrsv2_solve() can be run with or without levels. On the contrary, if csrsv2_analysis() is called with CUSPARSE_SOLVE_POLICY_NO_LEVEL, csrsv2_solve() can only accept CUSPARSE_SOLVE_POLICY_NO_LEVEL; otherwise, CUSPARSE_STATUS_INVALID_VALUE is returned.

The level information may not improve the performance but spend extra time doing analysis. For example, a tridiagonal matrix has no parallelism. In this case, CUSPARSE_SOLVE_POLICY_NO_LEVEL performs better than CUSPARSE_SOLVE_POLICY_USE_LEVEL. If the user has an iterative solver, the best approach is to do csrsv2_analysis() with CUSPARSE_SOLVE_POLICY_USE_LEVEL once. Then do csrsv2_solve() with CUSPARSE_SOLVE_POLICY_NO_LEVEL in the first run and with CUSPARSE_SOLVE_POLICY_USE_LEVEL in the second run, picking faster one to perform the remaining iterations.

Function csrsv2_solve() reports the first numerical zero, including a structural zero. If status is 0, no numerical zero was found. Furthermore, no numerical zero is reported if CUSPARSE_DIAG_TYPE_UNIT is specified, even if A(j,j) is zero for some j. The user needs to call cusparseXcsrsv2_zeroPivot() to know where the numerical zero is.

For example, suppose L is a lower triangular matrix with unit diagonal, the following code solves L*y=x by level information.

// Suppose that L is m x m sparse matrix represented by CSR format,
// L is lower triangular with unit diagonal.
// Assumption:
// - dimension of matrix L is m,
// - matrix L has nnz number zero elements,
// - handle is already created by cusparseCreate(),
// - (d_csrRowPtr, d_csrColInd, d_csrVal) is CSR of L on device memory,
// - d_x is right hand side vector on device memory,
// - d_y is solution vector on device memory.

cusparseMatDescr_t descr = 0;
csrsv2Info_t info = 0;
int pBufferSize;
void *pBuffer = 0;
int structural_zero;
int numerical_zero;
const double alpha = 1.;
const cusparseSolvePolicy_t policy = CUSPARSE_SOLVE_POLICY_USE_LEVEL;
const cusparseOperation_t trans = CUSPARSE_OPERATION_NON_TRANSPOSE;

// step 1: create a descriptor which contains
// - matrix L is base-1
// - matrix L is lower triangular
// - matrix L has unit diagonal, specified by parameter CUSPARSE_DIAG_TYPE_UNIT
//   (L may not have all diagonal elements.)
cusparseCreateMatDescr(&descr);
cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatFillMode(descr, CUSPARSE_FILL_MODE_LOWER);
cusparseSetMatDiagType(descr, CUSPARSE_DIAG_TYPE_UNIT);

// step 2: create a empty info structure
cusparseCreateCsrsv2Info(&info);

// step 3: query how much memory used in csrsv2, and allocate the buffer
cusparseDcsrsv2_bufferSize(handle, trans, m, nnz, descr,
d_csrVal, d_csrRowPtr, d_csrColInd, info, &pBufferSize);
// pBuffer returned by cudaMalloc is automatically aligned to 128 bytes.
cudaMalloc((void**)&pBuffer, pBufferSize);

// step 4: perform analysis
cusparseDcsrsv2_analysis(handle, trans, m, nnz, descr,
d_csrVal, d_csrRowPtr, d_csrColInd,
info, policy, pBuffer);
// L has unit diagonal, so no structural zero is reported.
status = cusparseXcsrsv2_zeroPivot(handle, info, &structural_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
printf("L(%d,%d) is missing\n", structural_zero, structural_zero);
}

// step 5: solve L*y = x
cusparseDcsrsv2_solve(handle, trans, m, nnz, &alpha, descr,
d_csrVal, d_csrRowPtr, d_csrColInd, info,
d_x, d_y, policy, pBuffer);
// L has unit diagonal, so no numerical zero is reported.
status = cusparseXcsrsv2_zeroPivot(handle, info, &numerical_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
printf("L(%d,%d) is zero\n", numerical_zero, numerical_zero);
}

// step 6: free resources
cudaFree(pBuffer);
cusparseDestroyCsrsv2Info(info);
cusparseDestroyMatDescr(descr);
cusparseDestroy(handle);


Remark: csrsv2_solve() needs more nonzeros per row to achieve good performance. It would perform better if more than 16 nonzeros per row in average.

The function supports the following properties if pBuffer != NULL
• The routine requires no extra storage
• The routine supports asynchronous execution
• The routine supports CUDA graph capture
Input
 handle handle to the cuSPARSE library context. transA the operation $\text{op}\left(A\right)$ . m number of rows and columns of matrix A. alpha scalar used for multiplication. descrA the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT. csrValA array of nnz$\left(=$csrRowPtrA(m)$-$csrRowPtrA(0)$\right)$ nonzero elements of matrix A. csrRowPtrA integer array of m$+1$ elements that contains the start of every row and the end of the last row plus one. csrColIndA integer array of nnz$\left(=$csrRowPtrA(m)$-$csrRowPtrA(0)$\right)$ column indices of the nonzero elements of matrix A. info structure with information collected during the analysis phase (that should have been passed to the solve phase unchanged). x right-hand-side vector of size m. policy The supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL. pBuffer buffer allocated by the user, the size is return by csrsv2_bufferSize.
Output
 y solution vector of size m.

See cusparseStatus_t for the description of the return status

### 9.11. cusparseXcsrsv2_zeroPivot() [DEPRECATED]

[[DEPRECATED]] use cusparseSpSV() instead. The routine will be removed in the next major release

cusparseStatus_t
cusparseXcsrsv2_zeroPivot(cusparseHandle_t handle,
csrsv2Info_t     info,
int*             position)


If the returned error code is CUSPARSE_STATUS_ZERO_PIVOT, position=j means A(j,j) has either a structural zero or a numerical zero. Otherwise position=-1.

The position can be 0-based or 1-based, the same as the matrix.

Function cusparseXcsrsv2_zeroPivot() is a blocking call. It calls cudaDeviceSynchronize() to make sure all previous kernels are done.

The position can be in the host memory or device memory. The user can set the proper mode with cusparseSetPointerMode().

• The routine requires no extra storage
• The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available
• The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available
Input
 handle handle to the cuSPARSE library context. info info contains structural zero or numerical zero if the user already called csrsv2_analysis() or csrsv2_solve().
Output
 position if no structural or numerical zero, position is -1; otherwise, if A(j,j) is missing or U(j,j) is zero, position=j.

See cusparseStatus_t for the description of the return status

### 9.12. cusparse<t>gemvi()

cusparseStatus_t
cusparseSgemvi_bufferSize(cusparseHandle_t    handle,
cusparseOperation_t transA,
int                 m,
int                 n,
int                 nnz,
int*                pBufferSize)

cusparseStatus_t
cusparseDgemvi_bufferSize(cusparseHandle_t    handle,
cusparseOperation_t transA,
int                 m,
int                 n,
int                 nnz,
int*                pBufferSize)

cusparseStatus_t
cusparseCgemvi_bufferSize(cusparseHandle_t    handle,
cusparseOperation_t transA,
int                 m,
int                 n,
int                 nnz,
int*                pBufferSize)

cusparseStatus_t
cusparseZgemvi_bufferSize(cusparseHandle_t    handle,
cusparseOperation_t transA,
int                 m,
int                 n,
int                 nnz,
int*                pBufferSize)

cusparseStatus_t
cusparseSgemvi(cusparseHandle_t     handle,
cusparseOperation_t  transA,
int                  m,
int                  n,
const float*         alpha,
const float*         A,
int                  lda,
int                  nnz,
const float*         x,
const int*           xInd,
const float*         beta,
float*               y,
cusparseIndexBase_t  idxBase,
void*                pBuffer)

cusparseStatus_t
cusparseDgemvi(cusparseHandle_t     handle,
cusparseOperation_t  transA,
int                  m,
int                  n,
const double*        alpha,
const double*        A,
int                  lda,
int                  nnz,
const double*        x,
const int*           xInd,
const double*        beta,
double*              y,
cusparseIndexBase_t  idxBase,
void*                pBuffer)

cusparseStatus_t
cusparseCgemvi(cusparseHandle_t     handle,
cusparseOperation_t  transA,
int                  m,
int                  n,
const cuComplex*     alpha,
const cuComplex*     A,
int                  lda,
int                  nnz,
const cuComplex*     x,
const int*           xInd,
const cuComplex*     beta,
cuComplex*           y,
cusparseIndexBase_t  idxBase,
void*                pBuffer)

cusparseStatus_t
cusparseZgemvi(cusparseHandle_t       handle,
cusparseOperation_t    transA,
int                    m,
int                    n,
const cuDoubleComplex* alpha,
const cuDoubleComplex* A,
int                    lda,
int                    nnz,
const cuDoubleComplex* x,
const int*             xInd,
const cuDoubleComplex* beta,
cuDoubleComplex*       y,
cusparseIndexBase_t    idxBase,
void*                  pBuffer)


This function performs the matrix-vector operation

 $\text{y}=\alpha \ast \text{op}\left(A\right)\ast \text{x}+\beta \ast \text{y}$

A is an m×n dense matrix and a sparse vector x that is defined in a sparse storage format by the two arrays xVal, xInd of length nnz, and y is a dense vector; are scalars; and

$\text{op}\left(A\right)=\left\{\begin{array}{ll}A& \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE}\\ {A}^{T}& \text{if trans == CUSPARSE_OPERATION_TRANSPOSE}\\ {A}^{H}& \text{if trans == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE}\end{array}\right\$

To simplify the implementation, we have not (yet) optimized the transpose multiple case. We recommend the following for users interested in this case.

1. Convert the matrix from CSR to CSC format using one of the csr2csc() functions. Notice that by interchanging the rows and columns of the result you are implicitly transposing the matrix.

2. Call the gemvi() function with the cusparseOperation_t parameter set to CUSPARSE_OPERATION_NON_TRANSPOSE and with the interchanged rows and columns of the matrix stored in CSC format. This (implicitly) multiplies the vector by the transpose of the matrix in the original CSR format.

• The routine requires no extra storage
• The routine supports asynchronous execution
• The routine supports CUDA graph capture

The function cusparse<t>gemvi_bufferSize() returns size of buffer used in cusparse<t>gemvi()

Input
 handle handle to the cuSPARSE library context. trans the operation $\text{op}\left(A\right)$ . m number of rows of matrix A. n number of columns of matrix A. alpha scalar used for multiplication. A the pointer to dense matrix A. lda size of the leading dimension of A. nnz number of nonzero elements of vector x. x sparse vector of nnz elements of size n if $\text{op}\left(A\right)=A$ , and size m if $\text{op}\left(A\right)={A}^{T}$ or $\text{op}\left(A\right)={A}^{H}$ xInd Indices of non-zero values in x beta scalar used for multiplication. If beta is zero, y does not have to be a valid input. y dense vector of m elements if $\text{op}\left(A\right)=A$ , and n elements if $\text{op}\left(A\right)={A}^{T}$ or $\text{op}\left(A\right)={A}^{H}$ idxBase 0 or 1, for 0 based or 1 based indexing, respectively pBufferSize number of elements needed the buffer used in cusparsegemvi(). pBuffer working space buffer
Output
 y updated dense vector.

See cusparseStatus_t for the description of the return status

## 10. cuSPARSE Level 3 Function Reference

This chapter describes sparse linear algebra functions that perform operations between sparse and (usually tall) dense matrices.

In particular, the solution of sparse triangular linear systems with multiple right-hand sides is implemented in two phases. First, during the analysis phase, the sparse triangular matrix is analyzed to determine the dependencies between its elements by calling the appropriate csrsm2_analysis() function. The analysis is specific to the sparsity pattern of the given matrix and to the selected cusparseOperation_t type. The information from the analysis phase is stored in the parameter of type csrsm2Info_t that has been initialized previously with a call to cusparseCreateCsrsm2Info().

Second, during the solve phase, the given sparse triangular linear system is solved using the information stored in the csrsm2Info_t parameter by calling the appropriate csrsm2_solve() function. The solve phase may be performed multiple times with different multiple right-hand sides, while the analysis phase needs to be performed only once. This is especially useful when a sparse triangular linear system must be solved for different sets of multiple right-hand sides one at a time, while its coefficient matrix remains the same.

Finally, once all the solves have completed, the opaque data structure pointed to by the csrsm2Info_t parameter can be released by calling cusparseDestroyCsrsm2Info().

### 10.1. cusparse<t>bsrmm()

cusparseStatus_t
cusparseSbsrmm(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
cusparseOperation_t      transB,
int                      mb,
int                      n,
int                      kb,
int                      nnzb,
const float*             alpha,
const cusparseMatDescr_t descrA,
const float*             bsrValA,
const int*               bsrRowPtrA,
const int*               bsrColIndA,
int                      blockDim,
const float*             B,
int                      ldb,
const float*             beta,
float*                   C,
int                      ldc)

cusparseStatus_t
cusparseDbsrmm(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
cusparseOperation_t      transB,
int                      mb,
int                      n,
int                      kb,
int                      nnzb,
const double*            alpha,
const cusparseMatDescr_t descrA,
const double*            bsrValA,
const int*               bsrRowPtrA,
const int*               bsrColIndA,
int                      blockDim,
const double*            B,
int                      ldb,
const double*            beta,
double*                  C,
int                      ldc)

cusparseStatus_t
cusparseCbsrmm(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
cusparseOperation_t      transB,
int                      mb,
int                      n,
int                      kb,
int                      nnzb,
const cuComplex*         alpha,
const cusparseMatDescr_t descrA,
const cuComplex*         bsrValA,
const int*               bsrRowPtrA,
const int*               bsrColIndA,
int                      blockDim,
const cuComplex*         B,
int                      ldb,
const cuComplex*         beta,
cuComplex*               C,
int                      ldc)

cusparseStatus_t
cusparseZbsrmm(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
cusparseOperation_t      transB,
int                      mb,
int                      n,
int                      kb,
int                      nnzb,
const cuDoubleComplex*   alpha,
const cusparseMatDescr_t descrA,
const cuDoubleComplex*   bsrValA,
const int*               bsrRowPtrA,
const int*               bsrColIndA,
int                      blockDim,
const cuDoubleComplex*   B,
int                      ldb,
const cuDoubleComplex*   beta,
cuDoubleComplex*         C,
int                      ldc)


This function performs one of the following matrix-matrix operations:

 $C=\alpha \ast \text{op}\left(A\right)\ast \text{op}\left(B\right)+\beta \ast C$

A is an mb×kb sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA; B and C are dense matrices; are scalars; and

$\text{op}\left(A\right)=\left\{\begin{array}{ll}A& \text{if transA == CUSPARSE_OPERATION_NON_TRANSPOSE}\\ {A}^{T}& \text{if transA == CUSPARSE_OPERATION_TRANSPOSE (not supported)}\\ {A}^{H}& \text{if transA == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE (not supported)}\end{array}\right\$

and

$\text{op}\left(B\right)=\left\{\begin{array}{ll}B& \text{if transB == CUSPARSE_OPERATION_NON_TRANSPOSE}\\ {B}^{T}& \text{if transB == CUSPARSE_OPERATION_TRANSPOSE}\\ {B}^{H}& \text{if transB == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE (not supported)}\end{array}\right\$
The function has the following limitations:
• Only CUSPARSE_MATRIX_TYPE_GENERAL matrix type is supported
• Only blockDim > 1 is supported
• if blockDim ≤ 4, then max(mb)/max(n) = 524,272
• if 4 < blockDim ≤ 8, then max(mb) = 524,272, max(n) = 262,136
• if blockDim > 8, then m < 65,535 and max(n) = 262,136

The motivation of transpose(B) is to improve memory access of matrix B. The computational pattern of A*transpose(B) with matrix B in column-major order is equivalent to A*B with matrix B in row-major order.

In practice, no operation in an iterative solver or eigenvalue solver uses A*transpose(B). However, we can perform A*transpose(transpose(B)) which is the same as A*B. For example, suppose A is mb*kb, B is k*n and C is m*n, the following code shows usage of cusparseDbsrmm().

// A is mb*kb, B is k*n and C is m*n
const int m = mb*blockSize;
const int k = kb*blockSize;
const int ldb_B = k; // leading dimension of B
const int ldc   = m; // leading dimension of C
// perform C:=alpha*A*B + beta*C
cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL );
cusparseDbsrmm(cusparse_handle,
CUSPARSE_DIRECTION_COLUMN,
CUSPARSE_OPERATION_NON_TRANSPOSE,
CUSPARSE_OPERATION_NON_TRANSPOSE,
mb, n, kb, nnzb, alpha,
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockSize,
B, ldb_B,
beta, C, ldc);


Instead of using A*B, our proposal is to transpose B to Bt by first calling cublas<t>geam(), and then to perform A*transpose(Bt).

// step 1: Bt := transpose(B)
const int m = mb*blockSize;
const int k = kb*blockSize;
double *Bt;
const int ldb_Bt = n; // leading dimension of Bt
cudaMalloc((void**)&Bt, sizeof(double)*ldb_Bt*k);
double one  = 1.0;
double zero = 0.0;
cublasSetPointerMode(cublas_handle, CUBLAS_POINTER_MODE_HOST);
cublasDgeam(cublas_handle, CUBLAS_OP_T, CUBLAS_OP_T,
n, k, &one, B, int ldb_B, &zero, B, int ldb_B, Bt, ldb_Bt);

// step 2: perform C:=alpha*A*transpose(Bt) + beta*C
cusparseDbsrmm(cusparse_handle,
CUSPARSE_DIRECTION_COLUMN,
CUSPARSE_OPERATION_NON_TRANSPOSE,
CUSPARSE_OPERATION_TRANSPOSE,
mb, n, kb, nnzb, alpha,
descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockSize,
Bt, ldb_Bt,
beta, C, ldc);

bsrmm() has the following properties:
• The routine requires no extra storage
• The routine supports asynchronous execution
• The routine supports CUDA graph capture
Input
 handle handle to the cuSPARSE library context. dir storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN. transA the operation op(A). transB the operation op(B). mb number of block rows of sparse matrix A. n number of columns of dense matrix op(B) and A. kb number of block columns of sparse matrix A. nnzb number of non-zero blocks of sparse matrix A. alpha scalar used for multiplication. descrA the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE. bsrValA array of nnzb$\left(=$bsrRowPtrA(mb)$-$bsrRowPtrA(0)$\right)$ nonzero blocks of matrix A. bsrRowPtrA integer array of mb$+1$ elements that contains the start of every block row and the end of the last block row plus one. bsrColIndA integer array of nnzb$\left(=$bsrRowPtrA(mb)$-$bsrRowPtrA(0)$\right)$ column indices of the nonzero blocks of matrix A. blockDim block dimension of sparse matrix A, larger than zero. B array of dimensions (ldb, n) if op(B)=B and (ldb, k) otherwise. ldb leading dimension of B. If op(B)=B, it must be at least $max\text{(1, k)}$ If op(B) != B, it must be at least max(1, n). beta scalar used for multiplication. If beta is zero, C does not have to be a valid input. C array of dimensions (ldc, n). ldc leading dimension of C. It must be at least $max\text{(1, m)}$ if op(A)=A and at least $max\text{(1, k)}$ otherwise.
Output
 C updated array of dimensions (ldc, n).

See cusparseStatus_t for the description of the return status

### 10.2. cusparse<t>bsrsm2_bufferSize()

cusparseStatus_t
cusparseSbsrsm2_bufferSize(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
cusparseOperation_t      transX,
int                      mb,
int                      n,
int                      nnzb,
const cusparseMatDescr_t descrA,
float*                   bsrSortedValA,
const int*               bsrSortedRowPtrA,
const int*               bsrSortedColIndA,
int                      blockDim,
bsrsm2Info_t             info,
int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseDbsrsm2_bufferSize(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
cusparseOperation_t      transX,
int                      mb,
int                      n,
int                      nnzb,
const cusparseMatDescr_t descrA,
double*                  bsrSortedValA,
const int*               bsrSortedRowPtrA,
const int*               bsrSortedColIndA,
int                      blockDim,
bsrsm2Info_t             info,
int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseCbsrsm2_bufferSize(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
cusparseOperation_t      transX,
int                      mb,
int                      n,
int                      nnzb,
const cusparseMatDescr_t descrA,
cuComplex*               bsrSortedValA,
const int*               bsrSortedRowPtrA,
const int*               bsrSortedColIndA,
int                      blockDim,
bsrsm2Info_t             info,
int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseZbsrsm2_bufferSize(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
cusparseOperation_t      transX,
int                      mb,
int                      n,
int                      nnzb,
const cusparseMatDescr_t descrA,
cuDoubleComplex*         bsrSortedValA,
const int*               bsrSortedRowPtrA,
const int*               bsrSortedColIndA,
int                      blockDim,
bsrsm2Info_t             info,
int*                     pBufferSizeInBytes)


This function returns size of buffer used in bsrsm2(), a new sparse triangular linear system op(A)*op(X)=$\alpha$op(B).

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA); B and X are the right-hand-side and the solution matrices; $\alpha$ is a scalar; and

$\text{op}\left(A\right)=\left\{\begin{array}{ll}A& \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE}\\ {A}^{T}& \text{if trans == CUSPARSE_OPERATION_TRANSPOSE}\\ {A}^{H}& \text{if trans == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE}\end{array}\right\$

Although there are six combinations in terms of parameter trans and the upper (and lower) triangular part of A, bsrsm2_bufferSize() returns the maximum size of the buffer among these combinations. The buffer size depends on dimension mb,blockDim and the number of nonzeros of the matrix, nnzb. If the user changes the matrix, it is necessary to call bsrsm2_bufferSize() again to get the correct buffer size, otherwise a segmentation fault may occur.

• The routine requires no extra storage
• The routine supports asynchronous execution
• The routine supports CUDA graph capture
Input
 handle handle to the cuSPARSE library context. dirA storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN. transA the operation op(A). transX the operation op(X). mb number of block rows of matrix A. n number of columns of matrix op(B) and op(X). nnzb number of nonzero blocks of matrix A. descrA the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT. bsrValA array of nnzb$\left(=$bsrRowPtrA(mb)$-$bsrRowPtrA(0)$\right)$ nonzero blocks of matrix A. bsrRowPtrA integer array of mb$+1$ elements that contains the start of every block row and the end of the last block row plus one. bsrColIndA integer array of nnzb$\left(=$bsrRowPtrA(mb)$-$bsrRowPtrA(0)$\right)$ column indices of the nonzero blocks of matrix A. blockDim block dimension of sparse matrix A; larger than zero.
Output
 info record internal states based on different algorithms. pBufferSizeInBytes number of bytes of the buffer used in bsrsm2_analysis() and bsrsm2_solve().

See cusparseStatus_t for the description of the return status

### 10.3. cusparse<t>bsrsm2_analysis()

cusparseStatus_t
cusparseSbsrsm2_analysis(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
cusparseOperation_t      transX,
int                      mb,
int                      n,
int                      nnzb,
const cusparseMatDescr_t descrA,
const float*             bsrSortedVal,
const int*               bsrSortedRowPtr,
const int*               bsrSortedColInd,
int                      blockDim,
bsrsm2Info_t             info,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)

cusparseStatus_t
cusparseDbsrsm2_analysis(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
cusparseOperation_t      transX,
int                      mb,
int                      n,
int                      nnzb,
const cusparseMatDescr_t descrA,
const double*            bsrSortedVal,
const int*               bsrSortedRowPtr,
const int*               bsrSortedColInd,
int                      blockDim,
bsrsm2Info_t             info,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)

cusparseStatus_t
cusparseCbsrsm2_analysis(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
cusparseOperation_t      transX,
int                      mb,
int                      n,
int                      nnzb,
const cusparseMatDescr_t descrA,
const cuComplex*         bsrSortedVal,
const int*               bsrSortedRowPtr,
const int*               bsrSortedColInd,
int                      blockDim,
bsrsm2Info_t             info,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)

cusparseStatus_t
cusparseZbsrsm2_analysis(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
cusparseOperation_t      transX,
int                      mb,
int                      n,
int                      nnzb,
const cusparseMatDescr_t descrA,
const cuDoubleComplex*   bsrSortedVal,
const int*               bsrSortedRowPtr,
const int*               bsrSortedColInd,
int                      blockDim,
bsrsm2Info_t             info,
cusparseSolvePolicy_t    policy,
void*                    pBuffer)


This function performs the analysis phase of bsrsm2(), a new sparse triangular linear system op(A)*op(X) =$\alpha$op(B).

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA); B and X are the right-hand-side and the solution matrices; $\alpha$ is a scalar; and

$\text{op}\left(A\right)=\left\{\begin{array}{ll}A& \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE}\\ {A}^{T}& \text{if trans == CUSPARSE_OPERATION_TRANSPOSE}\\ {A}^{H}& \text{if trans == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE}\end{array}\right\$

and

$\text{op}\left(X\right)=\left\{\begin{array}{ll}X& \text{if transX == CUSPARSE_OPERATION_NON_TRANSPOSE}\\ {X}^{T}& \text{if transX == CUSPARSE_OPERATION_TRANSPOSE}\\ {X}^{H}& \text{if transX == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE (not supported)}\end{array}\right\$

and op(B) and op(X) are equal.

The block of BSR format is of size blockDim*blockDim, stored in column-major or row-major as determined by parameter dirA, which is either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored.

It is expected that this function will be executed only once for a given matrix and a particular operation type.

This function requires the buffer size returned by bsrsm2_bufferSize(). The address of pBuffer must be multiple of 128 bytes. If not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function bsrsm2_analysis() reports a structural zero and computes the level information stored in opaque structure info. The level information can extract more parallelism during a triangular solver. However bsrsm2_solve() can be done without level information. To disable level information, the user needs to specify the policy of the triangular solver as CUSPARSE_SOLVE_POLICY_NO_LEVEL.

Function bsrsm2_analysis() always reports the first structural zero, even if the parameter policy is CUSPARSE_SOLVE_POLICY_NO_LEVEL. Besides, no structural zero is reported if CUSPARSE_DIAG_TYPE_UNIT is specified, even if block A(j,j) is missing for some j. The user must call cusparseXbsrsm2_query_zero_pivot() to know where the structural zero is.

If bsrsm2_analysis() reports a structural zero, the solve will return a numerical zero in the same position as the structural zero but this result X is meaningless.

• This function requires temporary extra storage that is allocated internally
• The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available
• The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available
Input
 handle handle to the cuSPARSE library context. dirA storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN. transA the operation op(A). transX the operation op(B) and op(X). mb number of block rows of matrix A. n number of columns of matrix op(B) and op(X). nnzb number of non-zero blocks of matrix A. descrA the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT. bsrValA array of nnzb$\left(=$bsrRowPtrA(mb)$-$bsrRowPtrA(0)$\right)$ nonzero blocks of matrix A. bsrRowPtrA integer array of mb$+1$ elements that contains the start of every block row and the end of the last block row plus one. bsrColIndA integer array of nnzb$\left(=$bsrRowPtrA(mb)$-$bsrRowPtrA(0)$\right)$ column indices of the nonzero blocks of matrix A. blockDim block dimension of sparse matrix A; larger than zero. info structure initialized using cusparseCreateBsrsm2Info. policy The supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL. pBuffer buffer allocated by the user; the size is return by bsrsm2_bufferSize().
Output
 info structure filled with information collected during the analysis phase (that should be passed to the solve phase unchanged).

See cusparseStatus_t for the description of the return status

### 10.4. cusparse<t>bsrsm2_solve()

cusparseStatus_t
cusparseSbsrsm2_solve(cusparseHandle_t         handle,
cusparseDirection_t      dirA,
cusparseOperation_t      transA,
cusparseOperation_t      transX,
int                      mb,
int                      n,
int                      nnzb,
const float*             alpha,
const cusparseMatDescr_t descrA,
const float*             bsrSortedVal,
const int*               bsrSortedRowPtr,
const int*               bsrSortedColInd,
int                      blockDim,
bsrsm2Info_t             info,