cuSPARSE

The API reference guide for cuSPARSE, the CUDA sparse matrix library.

1. Introduction

The cuSPARSE library contains a set of GPU-accelerated basic linear algebra subroutines used for handling sparse matrices that perform significantly faster than CPU-only alternatives. Depending on the specific operation, the library targets matrices with sparsity ratios in the range between 70%-99.9%.
It is implemented on top of the NVIDIA® CUDA™ runtime (which is part of the CUDA Toolkit) and is designed to be called from C and C++.

cuSPARSE Release Notes: cuda-toolkit-release-notes

cuSPARSE GitHub Samples: CUDALibrarySamples

Nvidia Developer Forum: GPU-Accelerated Libraries

Provide Feedback: Math-Libs-Feedback@nvidia.com

Recent cuSPARSE/cuSPARSELt Blog Posts and GTC presentations:

The library routines provide the following functionalities:

  • Operations between a sparse vector and a dense vector: sum, dot product, scatter, gather

  • Operations between a dense matrix and a sparse vector: multiplication

  • Operations between a sparse matrix and a dense vector: multiplication, triangular solver, tridiagonal solver, pentadiagonal solver

  • Operations between a sparse matrix and a dense matrix: multiplication, triangular solver, tridiagonal solver, pentadiagonal solver

  • Operations between a sparse matrix and a sparse matrix: sum, multiplication

  • Operations between dense matrices with output a sparse matrix: multiplication

  • Sparse matrix preconditioners: Incomplete Cholesky Factorization (level 0), Incomplete LU Factorization (level 0)

  • Reordering and Conversion operations between different sparse matrix storage formats


1.1. Library Organization and Features

The cuSPARSE library is organized in two set of APIs:

  • The Legacy APIs, inspired by the Sparse BLAS standard, provide a limited set of functionalities and will not be improved in future releases, even if standard maintenance is still ensured. Some routines in this category could be deprecated and removed in the short-term. A replacement will be provided for the most important of them during the deprecation process.

  • The Generic APIs provide the standard interface layer of cuSPARSE. They allow computing the most common sparse linear algebra operations, such as sparse matrix-vector (SpMV) and sparse matrix-matrix multiplication (SpMM), in a flexible way. The new APIs have the following capabilities and features:

    • Set matrix data layouts, number of batches, and storage formats (for example, CSR, COO, and so on).

    • Set input/output/compute data types. This also allows mixed data-type computation.

    • Set types of sparse vector/matrix indices (e.g. 32-bit, 64-bit).

    • Choose the algorithm for the computation.

    • Guarantee external device memory for internal operations.

    • Provide extensive consistency checks across input matrices and vectors. This includes the validation of sizes, data types, layout, allowed operations, etc.

    • Provide constant descriptors for vector and matrix inputs to support const-safe interface and guarantee that the APIs do not modify their inputs.


1.2. Static Library Support

Starting with CUDA 6.5, the cuSPARSE library is also delivered in a static form as libcusparse_static.a on Linux.

For example, to compile a small application using cuSPARSE against the dynamic library, the following command can be used:

nvcc my_cusparse_app.cu  -lcusparse  -o my_cusparse_app

Whereas to compile against the static library, the following command has to be used:

nvcc my_cusparse_app.cu  -lcusparse_static -o my_cusparse_app

It is also possible to use the native Host C++ compiler. Depending on the Host Operating system, some additional libraries like pthread or dl might be needed on the linking line. The following command on Linux is suggested:

gcc my_cusparse_app.c -lcusparse_static -lcudart_static -lpthread -ldl -I <cuda-toolkit-path>/include -L <cuda-toolkit-path>/lib64 -o my_cusparse_app

Note that in the latter case, the library cuda is not needed. The CUDA Runtime will try to open explicitly the cuda library if needed. In the case of a system which does not have the CUDA driver installed, this allows the application to gracefully manage this issue and potentially run if a CPU-only path is available.


1.3. Library Dependencies

Starting with CUDA 12.0, cuSPARSE will depend on nvJitLink library for JIT (Just-In-Time) LTO (Link-Time-Optimization) capabilities; refer to the cusparseSpMMOp APIs for more information.

If the user links to the dynamic library, the environment variables for loading the libraries at run-time (such as LD_LIBRARY_PATH on Linux and PATH on Windows) must include the path where libnvjitlink.so is located. If it is in the same directory as cuSPARSE, the user doesn’t need to take any action.

If linking to the static library, the user needs to link with -lnvjitlink and set the environment variables for loading the libraries at compile-time LIBRARY_PATH/PATH accordingly.



2. Using the cuSPARSE API

This chapter describes how to use the cuSPARSE library API. It is not a reference for the cuSPARSE API data types and functions; that is provided in subsequent chapters.

2.1. APIs Usage Notes

The cuSPARSE library allows developers to access the computational resources of the NVIDIA graphics processing unit (GPU).

  • The cuSPARSE APIs assume that input and output data (vectors and matrices) reside in GPU (device) memory.

  • The input and output scalars (e.g. \(\alpha\) and \(\beta\)) can be passed by reference on the host or the device, instead of only being allowed to be passed by value on the host. This allows library functions to execute asynchronously using streams even when they are generated by a previous kernel resulting in maximum parallelism.

  • The handle to the cuSPARSE library context is initialized using the function and is explicitly passed to every subsequent library function call. This allows the user to have more control over the library setup when using multiple host threads and multiple GPUs.

  • The error status cusparseStatus_t is returned by all cuSPARSE library function calls.

It is the responsibility of the developer to allocate memory and to copy data between GPU memory and CPU memory using standard CUDA runtime API routines, such as cudaMalloc(), cudaFree(), cudaMemcpy(), and cudaMemcpyAsync().

The cuSPARSE library functions are executed asynchronously with respect to the host and may return control to the application on the host before the result is ready. Developers can use the cudaDeviceSynchronize() function to ensure that the execution of a particular cuSPARSE library routine has completed.

A developer can also use the cudaMemcpy() routine to copy data from the device to the host and vice versa, using the cudaMemcpyDeviceToHost and cudaMemcpyHostToDevice parameters, respectively. In this case there is no need to add a call to cudaDeviceSynchronize() because the call to cudaMemcpy() with the above parameters is blocking and completes only when the results are ready on the host.


2.2. Deprecated APIs

The cuSPARSE library documentation explicitly indicates the set of APIs/enumerators/data structures that are deprecated. The library policy for deprecated APIs is the following:

  1. An API is marked [[DEPRECATED]] on a release X.Y (e.g. 11.2)

    • The documentation indices a replacement if available

    • Otherwise, the functionality will not be maintained in the future

  2. The API will be removed in the release X+1.0 (e.g. 12.0)

Correctness bugs are still addressed even for deprecated APIs, while performance issues are not always ensured.

In addition to the documentation, deprecated APIs generate a compile-time warning for most platforms when used. Deprecation warnings can be disabled by defining the macro DISABLE_CUSPARSE_DEPRECATED before including cusparse.h or by passing the flag -DDISABLE_CUSPARSE_DEPRECATED to the compiler.

2.3. Thread Safety

The library is thread safe and its functions can be called from multiple host threads, even with the same handle. When multiple threads share the same handle, extreme care needs to be taken when the handle configuration is changed because that change will affect potentially subsequent cuSPARSE calls in all threads. It is even more true for the destruction of the handle. So it is not recommended that multiple thread share the same cuSPARSE handle.


2.4. Result Reproducibility

The design of cuSPARSE prioritizes performance over bit-wise reproducibility.

Operations using transpose or conjugate-transpose cusparseOperation_t have no reproducibility guarantees.

For the remaining operations, performing the same API call twice with the exact same arguments, on the same machine, with the same executable will produce bit-wise identical results. This bit-wise reproducibility can be disrupted by changes to: hardware, CUDA drivers, cuSPARSE version, memory alignment of the data, or algorithm selection.


2.5. NaN and Inf Propagation

Floating-point numbers have special values for NaN (not-a-number) and Inf (infinity). Functions in cuSPARSE make no guarantees about the propagation of NaN and Inf.

The cuSPARSE algorithms evaluate assuming all finite floating-point values. NaN and Inf appear in the output only if the algorithms happen to generate or propagate them. Because the algorithms are subject to change based on toolkit version and runtime considerations, so too are the propagation behaviours of NaN and Inf.

NaN propagation is different in cuSPARSE than in typical dense numerical linear algebra, such as cuBLAS. The dot product between vectors [0, 1, 0] and [1, 1, NaN] is NaN when using typical dense numerical algorithms, but will be 1.0 with typical sparse numerical algorithms.


2.6. Parallelism with Streams

If the application performs several small independent computations, or if it makes data transfers in parallel with the computation, CUDA streams can be used to overlap these tasks.

The application can conceptually associate a stream with each task. To achieve the overlap of computation between the tasks, the developer should create CUDA streams using the function cudaStreamCreate() and set the stream to be used by each individual cuSPARSE library routine by calling cusparseSetStream() just before calling the actual cuSPARSE routine. Then, computations performed in separate streams would be overlapped automatically on the GPU, when possible. This approach is especially useful when the computation performed by a single task is relatively small and is not enough to fill the GPU with work, or when there is a data transfer that can be performed in parallel with the computation.

When streams are used, we recommend using the new cuSPARSE API with scalar parameters and results passed by reference in the device memory to achieve maximum computational overlap.

Although a developer can create many streams, in practice it is not possible to have more than 16 concurrent kernels executing at the same time.


2.7. Compatibility and Versioning

The cuSPARSE APIs are intended to be backward compatible at the source level with future releases (unless stated otherwise in the release notes of a specific future release). In other words, if a program uses cuSPARSE, it should continue to compile and work correctly with newer versions of cuSPARSE without source code changes. cuSPARSE is not guaranteed to be backward compatible at the binary level. Using different versions of the cusparse.h header file and shared library is not supported. Using different versions of cuSPARSE and the CUDA runtime is not supported.


The library uses the standard version semantic convention for identify different releases.

The version takes the form of four fields joined by periods: MAJOR.MINOR.PATCH.BUILD

These version fields are incremented based on the following rules:

  • MAJOR: API breaking changes or new CUDA major version (breaking changes at lower level, e.g. drivers, compilers, libraries)

  • MINOR: new APIs and functionalities

  • PATCH: Bug fixes or performance improvements (or * new CUDA release)

  • BUILD: Internal build number

* Different CUDA toolkit releases ensure distinct library versions even if there are no changes at library level.


2.8. Optimization Notes

Most of the cuSPARSE routines can be optimized by exploiting CUDA Graphs capture and Hardware Memory Compression features.

More in details, a single cuSPARSE call or a sequence of calls can be captured by a CUDA Graph and executed in a second moment. This minimizes kernels launch overhead and allows the CUDA runtime to optimize the whole workflow. A full example of CUDA graphs capture applied to a cuSPARSE routine can be found in cuSPARSE Library Samples - CUDA Graph.

Secondly, the data types and functionalities involved in cuSPARSE are suitable for Hardware Memory Compression available in Ampere GPU devices (compute capability 8.0) or above. The feature allows memory compression for data with enough zero bytes without no loss of information. The device memory must be allocation with the CUDA driver APIs. A full example of Hardware Memory Compression applied to a cuSPARSE routine can be found in cuSPARSE Library Samples - Memory Compression.



3. cuSPARSE Storage Formats

The cuSPARSE library supports dense and sparse vector, and dense and sparse matrix formats.

3.1. Index Base

The library supports zero- and one-based indexing to ensure the compatibility with C/C++ and Fortran languages respectively. The index base is selected through the cusparseIndexBase_t type.


3.2. Vector Formats

This section describes dense and sparse vector formats.

3.2.1. Dense Vector Format

Dense vectors are represented with a single data array that is stored linearly in memory, such as the following \(7 \times 1\) dense vector.

_images/dense_vector.png

Dense vector representation


3.2.2. Sparse Vector Format

Sparse vectors are represented with two arrays.

  • The values array stores the nonzero values from the equivalent array in dense format.

  • The indices array represent the positions of the corresponding nonzero values in the equivalent array in dense format.

For example, the dense vector in section 3.2.1 can be stored as a sparse vector with zero-based or one-based indexing.

_images/sparse_vector.png

Sparse vector representation

Note

The cuSPARSE routines assume that the indices are provided in increasing order and that each index appears only once. In the opposite case, the correctness of the computation is not always ensured.


3.3. Matrix Formats

Dense and several sparse formats for matrices are discussed in this section.

3.3.1. Dense Matrix Format

A dense matrix can be stored in both row-major and column-major memory layout (ordering) and it is represented by the following parameters.

  • The number of rows in the matrix.

  • The number of columns in the matrix.

  • The leading dimension, which must be

    • Greater than or equal to the number of columns in the row-major layout

    • Greater than or equal to the number of rows in the column-major layout

  • The pointers to the values array of length

    • \(rows \times leading\; dimension\) in the row-major layout

    • \(columns \times leading\; dimension\) in the column-major layout

The following figure represents a \(5 \times 2\) dense matrix with both memory layouts

_images/dense_matrix.png

Dense matrix representations

The indices within the matrix represents the contiguous locations in memory.

The leading dimension is useful to represent a sub-matrix within the original one

_images/sub_matrix.png

Sub-matrix representations


3.3.2. Coordinate (COO)

A sparse matrix stored in COO format is represented by the following parameters.

  • The number of rows in the matrix.

  • The number of columns in the matrix.

  • The number of non-zero elements (nnz) in the matrix.

  • The pointers to the row indices array of length nnz that contains the row indices of the corresponding elements in the values array.

  • The pointers to the column indices array of length nnz that contains the column indices of the corresponding elements in the values array.

  • The pointers to the values array of length nnz that holds all nonzero values of the matrix in row-major ordering.

  • Each entry of the COO representation consists of a <row, column> pair.

  • The COO format is assumed to be sorted by row.

The following example shows a \(5 \times 4\) matrix represented in COO format.

_images/coo.png
_images/coo_one_base.png

Note

cuSPARSE supports both sorted and unsorted column indices within a given row.

Note

If the column indices within a given row are not unique, the correctness of the computation is not always ensured.

Given an entry in the COO format (zero-base), the corresponding position in the dense matrix is computed as:

// row-major
rows_indices[i] * leading_dimension + column_indices[i]

// column-major
column_indices[i] * leading_dimension + rows_indices[i]

3.3.3. Compressed Sparse Row (CSR)

The CSR format is similar to COO, where the row indices are compressed and replaced by an array of offsets.

A sparse matrix stored in CSR format is represented by the following parameters.

  • The number of rows in the matrix.

  • The number of columns in the matrix.

  • The number of non-zero elements (nnz) in the matrix.

  • The pointers to the row offsets array of length number of rows + 1 that represents the starting position of each row in the columns and values arrays.

  • The pointers to the column indices array of length nnz that contains the column indices of the corresponding elements in the values array.

  • The pointers to the values array of length nnz that holds all nonzero values of the matrix in row-major ordering.

The following example shows a \(5 \times 4\) matrix represented in CSR format.

_images/csr.png
_images/csr_one_base.png

Note

cuSPARSE supports both sorted and unsorted column indices within a given row.

Note

If the column indices within a given row are not unique, the correctness of the computation is not always ensured.

Given an entry in the CSR format (zero-base), the corresponding position in the dense matrix is computed as:

// row-major
row * leading_dimension + column_indices[row_offsets[row] + k]

// column-major
column_indices[row_offsets[row] + k] * leading_dimension + row

3.3.4. Compressed Sparse Column (CSC)

The CSC format is similar to COO, where the column indices are compressed and replaced by an array of offsets.

A sparse matrix stored in CSC format is represented by the following parameters.

  • The number of rows in the matrix.

  • The number of columns in the matrix.

  • The number of non-zero elements (nnz) in the matrix.

  • The pointers to the column offsets array of length number of column + 1 that represents the starting position of each column in the columns and values arrays.

  • The pointers to the row indices array of length nnz that contains row indices of the corresponding elements in the values array.

  • The pointers to the values array of length nnz that holds all nonzero values of the matrix in column-major ordering.

The following example shows a \(5 \times 4\) matrix represented in CSC format.

_images/csc.png
_images/csc_one_base.png

Note

The CSR format has exactly the same memory layout as its transpose in CSC format (and vice versa).

Note

cuSPARSE supports both sorted and unsorted row indices within a given column.

Note

If the row indices within a given column are not unique, the correctness of the computation is not always ensured.

Given an entry in the CSC format (zero-base), the corresponding position in the dense matrix is computed as:

// row-major
column * leading_dimension + row_indices[column_offsets[column] + k]

// column-major
row_indices[column_offsets[column] + k] * leading_dimension + column

3.3.5. Sliced Ellpack (SELL)

The Sliced Ellpack format is standardized and well-known as the state of the art. This format allows to significantly improve the performance of all problems that involve low variability in the number of nonzero elements per row.

A matrix in the Sliced Ellpack format is divided into slices of an exact number of rows (\(sliceSize\)), defined by the user. The maximum row length (i.e., the maximum non-zeros per row) is found for each slice, and every row in the slice is padded to the maximum row length. The value -1 is used for padding.

A \(m \times n\) sparse matrix \(A\) is equivalent to a sliced sparse matrix \(A_{s}\) with \(nslices = \left \lceil{\frac{m}{sliceSize}}\right \rceil\) slice rows and \(n\) columns. To improve memory coalescing and memory utilization, each slice is stored in column-major order.

A sparse matrix stored in SELL format is represented by the following parameters.

  • The number of slices.

  • The number of rows in the matrix.

  • The number of columns in the matrix.

  • The number of non-zero elements (nnz) in the matrix.

  • The total number elements (sellValuesSize), including non-zero values and padded elements.

  • The pointer to the slice offsets of length \(nslices + 1\) that holds offsets of the slides corresponding to the columns and values arrays.

  • The pointer to the column indices array of length sellValuesSize that contains column indices of the corresponding elements in the values array. The column indices are stored in column-major layout. Value -1 refers to padding.

  • The pointer to the values array of length sellValuesSize that holds all non-zero values and padding in column-major layout.

The following example shows a \(5 \times 4\) matrix represented in SELL format.

_images/sell.png
_images/sell_one_base.png

3.3.6. Block Sparse Row (BSR)

The BSR format is similar to CSR, where the column indices represent two-dimensional blocks instead of a single matrix entry.

A matrix in the Block Sparse Row format is organized into blocks of size \(blockSize\), defined by the user.

A \(m \times n\) sparse matrix \(A\) is equivalent to a block sparse matrix \(A_{B}\): \(mb \times nb\) with \(mb = \frac{m}{blockSize}\) block rows and \(nb = \frac{n}{blockSize}\) block columns. If \(m\) or \(n\) is not multiple of \(blockSize\), the user needs to pad the matrix with zeros.

Note

cuSPARSE currently supports only square blocks.

The BSR format stores the blocks in row-major ordering. However, the internal storage format of blocks can be column-major (cusparseDirection_t=CUSPARSE_DIRECTION_COLUMN) or row-major (cusparseDirection_t=CUSPARSE_DIRECTION_ROW), independently of the base index.

A sparse matrix stored in BSR format is represented by the following parameters.

  • The block size.

  • The number of row blocks in the matrix.

  • The number of column blocks in the matrix.

  • The number of non-zero blocks (nnzb) in the matrix.

  • The pointers to the row block offsets array of length number of row blocks + 1 that represents the starting position of each row block in the columns and values arrays.

  • The pointers to the column block indices array of length nnzb that contains the location of the corresponding elements in the values array.

  • The pointers to the values array of length nnzb that holds all nonzero values of the matrix.

The following example shows a \(4 \times 7\) matrix represented in BSR format.

_images/bsr.png
_images/bsr_one_base.png

3.3.7. Blocked Ellpack (BLOCKED-ELL)

The Blocked Ellpack format is similar to the standard Ellpack, where the column indices represent two-dimensional blocks instead of a single matrix entry.

A matrix in the Blocked Ellpack format is organized into blocks of size \(blockSize\), defined by the user. The number of columns per row \(nEllCols\) is also defined by the user (\(nEllCols \le n\)).

A \(m \times n\) sparse matrix \(A\) is equivalent to a Blocked-ELL matrix \(A_{B}\): \(mb \times nb\) with \(mb = \left \lceil{\frac{m}{blockSize}}\right \rceil\) block rows, and \(nb = \left \lceil{\frac{nEllCols}{blockSize}}\right \rceil\) block columns. If \(m\) or \(n\) is not multiple of \(blockSize\), then the remaining elements are zero.

A sparse matrix stored in Blocked-ELL format is represented by the following parameters.

  • The block size.

  • The number of rows in the matrix.

  • The number of columns in the matrix.

  • The number of columns per row (nEllCols) in the matrix.

  • The pointers to the column block indices array of length \(mb \times nb\) that contains the location of the corresponding elements in the values array. Empty blocks can be represented with -1 index.

  • The pointers to the values array of length \(m \times nEllCols\) that holds all nonzero values of the matrix in row-major ordering.

The following example shows a \(9 \times 9\) matrix represented in Blocked-ELL format.

_images/blockedell.png
_images/blockedell_one_base.png

3.3.8. Extended BSR Format (BSRX) [DEPRECATED]

BSRX is the same as the BSR format, but the array bsrRowPtrA is separated into two parts. The first nonzero block of each row is still specified by the array bsrRowPtrA, which is the same as in BSR, but the position next to the last nonzero block of each row is specified by the array bsrEndPtrA. Briefly, BSRX format is simply like a 4-vector variant of BSR format.

Matrix A is represented in BSRX format by the following parameters.

blockDim

(integer)

Block dimension of matrix A.

mb

(integer)

The number of block rows of A.

nb

(integer)

The number of block columns of A.

nnzb

(integer)

number of nonzero blocks in the matrix A.

bsrValA

(pointer)

Points to the data array of length \(nnzb \ast blockDim^{2}\) that holds all the elements of the nonzero blocks of A. The block elements are stored in either column-major order or row-major order.

bsrRowPtrA

(pointer)

Points to the integer array of length mb that holds indices into the arrays bsrColIndA and bsrValA; bsrRowPtrA(i) is the position of the first nonzero block of the ith block row in bsrColIndA and bsrValA.

bsrEndPtrA

(pointer)

Points to the integer array of length mb that holds indices into the arrays bsrColIndA and bsrValA; bsrRowPtrA(i) is the position next to the last nonzero block of the ith block row in bsrColIndA and bsrValA.

bsrColIndA

(pointer)

Points to the integer array of length nnzb that contains the column indices of the corresponding blocks in array bsrValA.

A simple conversion between BSR and BSRX can be done as follows. Suppose the developer has a 2×3 block sparse matrix \(A_{b}\) represented as shown.

\(A_{b} = \begin{bmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ \end{bmatrix}\)

Assume it has this BSR format.

\(\begin{matrix} \text{bsrValA of BSR} & = & \begin{bmatrix} A_{00} & A_{01} & A_{10} & A_{11} & A_{12} \\ \end{bmatrix} \\ \text{bsrRowPtrA of BSR} & = & \begin{bmatrix} {0\phantom{.0}} & {2\phantom{.0}} & 5 \\ \end{bmatrix} \\ \text{bsrColIndA of BSR} & = & \begin{bmatrix} {0\phantom{.0}} & {1\phantom{.0}} & {0\phantom{.0}} & {1\phantom{.0}} & 2 \\ \end{bmatrix} \\ \end{matrix}\)

The bsrRowPtrA of the BSRX format is simply the first two elements of the bsrRowPtrA BSR format. The bsrEndPtrA of BSRX format is the last two elements of the bsrRowPtrA of BSR format.

\(\begin{matrix} \text{bsrRowPtrA of BSRX} & = & \begin{bmatrix} {0\phantom{.0}} & 2 \\ \end{bmatrix} \\ \text{bsrEndPtrA of BSRX} & = & \begin{bmatrix} {2\phantom{.0}} & 5 \\ \end{bmatrix} \\ \end{matrix}\)

The advantage of the BSRX format is that the developer can specify a submatrix in the original BSR format by modifying bsrRowPtrA and bsrEndPtrA while keeping bsrColIndA and bsrValA unchanged.

For example, to create another block matrix \(\widetilde{A} = \begin{bmatrix} O & O & O \\ O & A_{11} & O \\ \end{bmatrix}\) that is slightly different from \(A\) , the developer can keep bsrColIndA and bsrValA, but reconstruct \(\widetilde{A}\) by properly setting of bsrRowPtrA and bsrEndPtrA. The following 4-vector characterizes \(\widetilde{A}\) .

\(\begin{matrix} {\text{bsrValA of }\widetilde{A}} & = & \begin{bmatrix} A_{00} & A_{01} & A_{10} & A_{11} & A_{12} \\ \end{bmatrix} \\ {\text{bsrColIndA of }\widetilde{A}} & = & \begin{bmatrix} {0\phantom{.0}} & {1\phantom{.0}} & {0\phantom{.0}} & {1\phantom{.0}} & 2 \\ \end{bmatrix} \\ {\text{bsrRowPtrA of }\widetilde{A}} & = & \begin{bmatrix} {0\phantom{.0}} & 3 \\ \end{bmatrix} \\ {\text{bsrEndPtrA of }\widetilde{A}} & = & \begin{bmatrix} {0\phantom{.0}} & 4 \\ \end{bmatrix} \\ \end{matrix}\)



4. cuSPARSE Basic APIs

4.1. cuSPARSE Types Reference

4.1.1. cudaDataType_t

The section describes the types shared by multiple CUDA Libraries and defined in the header file library_types.h. The cudaDataType type is an enumerator to specify the data precision. It is used when the data reference does not carry the type itself (e.g. void*). For example, it is used in the routine cusparseSpMM().

Value

Meaning

Data Type

Header

CUDA_R_16F

The data type is 16-bit IEEE-754 floating-point

__half

cuda_fp16.h

CUDA_C_16F

The data type is 16-bit complex IEEE-754 floating-point

__half2

cuda_fp16.h

[DEPRECATED]

CUDA_R_16BF

The data type is 16-bit bfloat floating-point

__nv_bfloat16

cuda_bf16.h

CUDA_C_16BF

The data type is 16-bit complex bfloat floating-point

__nv_bfloat162

cuda_bf16.h

[DEPRECATED]

CUDA_R_32F

The data type is 32-bit IEEE-754 floating-point

float

CUDA_C_32F

The data type is 32-bit complex IEEE-754 floating-point

cuComplex

cuComplex.h

CUDA_R_64F

The data type is 64-bit IEEE-754 floating-point

double

CUDA_C_64F

The data type is 64-bit complex IEEE-754 floating-point

cuDoubleComplex

cuComplex.h

CUDA_R_8I

The data type is 8-bit integer

int8_t

stdint.h

CUDA_R_32I

The data type is 32-bit integer

int32_t

stdint.h

IMPORTANT: The Generic API routines allow all data types reported in the respective section of the documentation only on GPU architectures with native support for them. If a specific GPU model does not provide native support for a given data type, the routine returns CUSPARSE_STATUS_ARCH_MISMATCH error.

Unsupported data types and Compute Capability (CC):

  • __half on GPUs with CC < 53 (e.g. Kepler)

  • __nv_bfloat16 on GPUs with CC < 80 (e.g. Kepler, Maxwell, Pascal, Volta, Turing)

see https://developer.nvidia.com/cuda-gpus


4.1.2. cusparseStatus_t

This data type represents the status returned by the library functions and it can have the following values

Value

Description

CUSPARSE_STATUS_SUCCESS

The operation completed successfully

CUSPARSE_STATUS_NOT_INITIALIZED

The cuSPARSE library was not initialized. This is usually caused by the lack of a prior call, an error in the CUDA Runtime API called by the cuSPARSE routine, or an error in the hardware setup

To correct: call cusparseCreate() prior to the function call; and check that the hardware, an appropriate version of the driver, and the cuSPARSE library are correctly installed

The error also applies to generic APIs ( Generic APIs reference) for indicating a matrix/vector descriptor not initialized

CUSPARSE_STATUS_ALLOC_FAILED

Resource allocation failed inside the cuSPARSE library. This is usually caused by a device memory allocation (cudaMalloc()) or by a host memory allocation failure

To correct: prior to the function call, deallocate previously allocated memory as much as possible

CUSPARSE_STATUS_INVALID_VALUE

An unsupported value or parameter was passed to the function (a negative vector size, for example)

To correct: ensure that all the parameters being passed have valid values

CUSPARSE_STATUS_ARCH_MISMATCH

The function requires a feature absent from the device architecture

To correct: compile and run the application on a device with appropriate compute capability

CUSPARSE_STATUS_EXECUTION_FAILED

The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons

To correct: check that the hardware, an appropriate version of the driver, and the cuSPARSE library are correctly installed

CUSPARSE_STATUS_INTERNAL_ERROR

An internal cuSPARSE operation failed

To correct: check that the hardware, an appropriate version of the driver, and the cuSPARSE library are correctly installed. Also, check that the memory passed as a parameter to the routine is not being deallocated prior to the routine completion

CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED

The matrix type is not supported by this function. This is usually caused by passing an invalid matrix descriptor to the function

To correct: check that the fields in cusparseMatDescr_t descrA were set correctly

CUSPARSE_STATUS_NOT_SUPPORTED

The operation or data type combination is currently not supported by the function

CUSPARSE_STATUS_INSUFFICIENT_RESOURCES

The resources for the computation, such as GPU global or shared memory, are not sufficient to complete the operation. The error can also indicate that the current computation mode (e.g. bit size of sparse matrix indices) does not allow to handle the given input


4.1.3. cusparseHandle_t

This is a pointer type to an opaque cuSPARSE context, which the user must initialize by calling prior to calling cusparseCreate() any other library function. The handle created and returned by cusparseCreate() must be passed to every cuSPARSE function.


4.1.4. cusparsePointerMode_t

This type indicates whether the scalar values are passed by reference on the host or device. It is important to point out that if several scalar values are passed by reference in the function call, all of them will conform to the same single pointer mode. The pointer mode can be set and retrieved using cusparseSetPointerMode() and cusparseGetPointerMode() routines, respectively.

Value

Meaning

CUSPARSE_POINTER_MODE_HOST

The scalars are passed by reference on the host.

CUSPARSE_POINTER_MODE_DEVICE

The scalars are passed by reference on the device.


4.1.5. cusparseOperation_t

This type indicates which operations is applied to the related input (e.g. sparse matrix, or vector).

Value

Meaning

CUSPARSE_OPERATION_NON_TRANSPOSE

The non-transpose operation is selected.

CUSPARSE_OPERATION_TRANSPOSE

The transpose operation is selected.

CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE

The conjugate transpose operation is selected.


4.1.6. cusparseDiagType_t

This type indicates if the matrix diagonal entries are unity. The diagonal elements are always assumed to be present, but if CUSPARSE_DIAG_TYPE_UNIT is passed to an API routine, then the routine assumes that all diagonal entries are unity and will not read or modify those entries. Note that in this case the routine assumes the diagonal entries are equal to one, regardless of what those entries are actually set to in memory.

Value

Meaning

CUSPARSE_DIAG_TYPE_NON_UNIT

The matrix diagonal has non-unit elements.

CUSPARSE_DIAG_TYPE_UNIT

The matrix diagonal has unit elements.


4.1.7. cusparseFillMode_t

This type indicates if the lower or upper part of a matrix is stored in sparse storage.

Value

Meaning

CUSPARSE_FILL_MODE_LOWER

The lower triangular part is stored.

CUSPARSE_FILL_MODE_UPPER

The upper triangular part is stored.


4.1.8. cusparseIndexBase_t

This type indicates if the base of the matrix indices is zero or one.

Value

Meaning

CUSPARSE_INDEX_BASE_ZERO

The base index is zero (C compatibility).

CUSPARSE_INDEX_BASE_ONE

The base index is one (Fortran compatibility).


4.1.9. cusparseDirection_t

This type indicates whether the elements of a dense matrix should be parsed by rows or by columns (assuming column-major storage in memory of the dense matrix) in function cusparse[S|D|C|Z]nnz. Besides storage format of blocks in BSR format is also controlled by this type.

Value

Meaning

CUSPARSE_DIRECTION_ROW

The matrix should be parsed by rows.

CUSPARSE_DIRECTION_COLUMN

The matrix should be parsed by columns.



4.2. cuSPARSE Management API

The cuSPARSE functions for managing the library are described in this section.


4.2.1. cusparseCreate()

cusparseStatus_t
cusparseCreate(cusparseHandle_t *handle)

This function initializes the cuSPARSE library and creates a handle on the cuSPARSE context. It must be called before any other cuSPARSE API function is invoked. It allocates hardware resources necessary for accessing the GPU.

Param.

In/out

Meaning

handle

IN

The pointer to the handle to the cuSPARSE context

See cusparseStatus_t for the description of the return status


4.2.2. cusparseDestroy()

cusparseStatus_t
cusparseDestroy(cusparseHandle_t handle)

This function releases CPU-side resources used by the cuSPARSE library. The release of GPU-side resources may be deferred until the application shuts down.

Param.

In/out

Meaning

handle

IN

The handle to the cuSPARSE context

See cusparseStatus_t for the description of the return status


4.2.3. cusparseGetErrorName()

const char*
cusparseGetErrorString(cusparseStatus_t status)

The function returns the string representation of an error code enum name. If the error code is not recognized, “unrecognized error code” is returned.

Param.

In/out

Meaning

status

IN

Error code to convert to string

const char*

OUT

Pointer to a NULL-terminated string


4.2.4. cusparseGetErrorString()

const char*
cusparseGetErrorString(cusparseStatus_t status)

Returns the description string for an error code. If the error code is not recognized, “unrecognized error code” is returned.

Param.

In/out

Meaning

status

IN

Error code to convert to string

const char*

OUT

Pointer to a NULL-terminated string


4.2.5. cusparseGetProperty()

cusparseStatus_t
cusparseGetProperty(libraryPropertyType type,
                    int*                value)

The function returns the value of the requested property. Refer to libraryPropertyType for supported types.

Param.

In/out

Meaning

type

IN

Requested property

value

OUT

Value of the requested property

libraryPropertyType (defined in library_types.h):

Value

Meaning

MAJOR_VERSION

Enumerator to query the major version

MINOR_VERSION

Enumerator to query the minor version

PATCH_LEVEL

Number to identify the patch level

See cusparseStatus_t for the description of the return status


4.2.6. cusparseGetVersion()

cusparseStatus_t
cusparseGetVersion(cusparseHandle_t handle,
                   int*             version)

This function returns the version number of the cuSPARSE library.

Param.

In/out

Meaning

handle

IN

cuSPARSE handle

version

OUT

The version number of the library

See cusparseStatus_t for the description of the return status


4.2.7. cusparseGetPointerMode()

cusparseStatus_t
cusparseGetPointerMode(cusparseHandlet handle,
                       cusparsePointerMode_t *mode)

This function obtains the pointer mode used by the cuSPARSE library. Please see the section on the cusparsePointerMode_t type for more details.

Param.

In/out

Meaning

handle

IN

The handle to the cuSPARSE context

mode

OUT

One of the enumerated pointer mode types

See cusparseStatus_t for the description of the return status


4.2.8. cusparseSetPointerMode()

cusparseStatus_t
cusparseSetPointerMode(cusparseHandle_t handle,
                       cusparsePointerMode_t mode)

This function sets the pointer mode used by the cuSPARSE library. The default is for the values to be passed by reference on the host. Please see the section on the cublasPointerMode_t type for more details.

Param.

In/out

Meaning

handle

IN

The handle to the cuSPARSE context

mode

IN

One of the enumerated pointer mode types

See cusparseStatus_t for the description of the return status


4.2.9. cusparseGetStream()

cusparseStatus_t
cusparseGetStream(cusparseHandle_t handle, cudaStream_t *streamId)

This function gets the cuSPARSE library stream, which is being used to to execute all calls to the cuSPARSE library functions. If the cuSPARSE library stream is not set, all kernels use the default NULL stream.

Param.

In/out

Meaning

handle

IN

The handle to the cuSPARSE context

streamId

OUT

The stream used by the library

See cusparseStatus_t for the description of the return status


4.2.10. cusparseSetStream()

cusparseStatus_t
cusparseSetStream(cusparseHandle_t handle, cudaStream_t streamId)

This function sets the stream to be used by the cuSPARSE library to execute its routines.

Param.

In/out

Meaning

handle

IN

The handle to the cuSPARSE context

streamId

IN

The stream to be used by the library

See cusparseStatus_t for the description of the return status



4.3. cuSPARSE Logging API

cuSPARSE logging mechanism can be enabled by setting the following environment variables before launching the target application:

CUSPARSE_LOG_LEVEL=<level> - while level is one of the following levels:

  • 0 - Off - logging is disabled (default)

  • 1 - Error - only errors will be logged

  • 2 - Trace - API calls that launch CUDA kernels will log their parameters and important information

  • 3 - Hints - hints that can potentially improve the application’s performance

  • 4 - Info - provides general information about the library execution, may contain details about heuristic status

  • 5 - API Trace - API calls will log their parameter and important information

CUSPARSE_LOG_MASK=<mask> - while mask is a combination of the following masks:

  • 0 - Off

  • 1 - Error

  • 2 - Trace

  • 4 - Hints

  • 8 - Info

  • 16 - API Trace

CUSPARSE_LOG_FILE=<file_name> - while file name is a path to a logging file. File name may contain %i, that will be replaced with the process id. E.g <file_name>_%i.log.

If CUSPARSE_LOG_FILE is not defined, the log messages are printed to stdout.

Starting from CUDA 12.3, it is also possible to dump sparse matrices (CSR, CSC, COO, SELL, BSR) in binary files during the creation by setting the environment variable CUSPARSE_STORE_INPUT_MATRIX. Later on, the binary files can be send to Math-Libs-Feedback@nvidia.com for debugging and reproducibility purposes of a specific correctness/performance issue.

Another option is to use the experimental cuSPARSE logging API. See:

Note

The logging mechanism is not available for the legacy APIs.


4.3.1. cusparseLoggerSetCallback()

cusparseStatus_t
cusparseLoggerSetCallback(cusparseLoggerCallback_t callback)

Experimental: The function sets the logging callback function.

Param.

In/out

Meaning

callback

IN

Pointer to a callback function

where cusparseLoggerCallback_t has the following signature:

void (*cusparseLoggerCallback_t)(int         logLevel,
                                 const char* functionName,
                                 const char* message)

Param.

In/out

Meaning

logLevel

IN

Selected log level

functionName

IN

The name of the API that logged this message

message

IN

The log message

See cusparseStatus_t for the description of the return status


4.3.2. cusparseLoggerSetFile()

cusparseStatus_t
cusparseLoggerSetFile(FILE* file)

Experimental: The function sets the logging output file. Note: once registered using this function call, the provided file handle must not be closed unless the function is called again to switch to a different file handle.

Param.

In/out

Meaning

file

IN

Pointer to an open file. File should have write permission

See cusparseStatus_t for the description of the return status


4.3.3. cusparseLoggerOpenFile()

cusparseStatus_t
cusparseLoggerOpenFile(const char* logFile)

Experimental: The function opens a logging output file in the given path.

Param.

In/out

Meaning

logFile

IN

Path of the logging output file

See cusparseStatus_t for the description of the return status


4.3.4. cusparseLoggerSetLevel()

cusparseStatus_t
cusparseLoggerSetLevel(int level)

Experimental: The function sets the value of the logging level. path.

Param.

In/out

Meaning

level

IN

Value of the logging level

See cusparseStatus_t for the description of the return status


4.3.5. cusparseLoggerSetMask()

cusparseStatus_t
cusparseLoggerSetMask(int mask)

Experimental: The function sets the value of the logging mask.

Param.

In/out

Meaning

mask

IN

Value of the logging mask

See cusparseStatus_t for the description of the return status



5. cuSPARSE Legacy APIs

5.1. Naming Conventions

The cuSPARSE legacy functions are available for data types float, double, cuComplex, and cuDoubleComplex. The sparse Level 2, and Level 3 functions follow this naming convention:

cusparse<t>[<matrix data format>]<operation>[<output matrix data format>]

where <t> can be S, D, C, Z, or X, corresponding to the data types float, double, cuComplex, cuDoubleComplex, and the generic type, respectively.

The <matrix data format> can be dense, coo, csr, or csc, corresponding to the dense, coordinate, compressed sparse row, and compressed sparse column formats, respectively.

5.2. cuSPARSE Legacy Types Reference

5.2.1. cusparseAction_t

This type indicates whether the operation is performed only on indices or on data and indices.

Value

Meaning

CUSPARSE_ACTION_SYMBOLIC

the operation is performed only on indices.

CUSPARSE_ACTION_NUMERIC

the operation is performed on data and indices.

5.2.2. cusparseMatDescr_t

This structure is used to describe the shape and properties of a matrix.

typedef struct {
    cusparseMatrixType_t MatrixType;
    cusparseFillMode_t FillMode;
    cusparseDiagType_t DiagType;
    cusparseIndexBase_t IndexBase;
} cusparseMatDescr_t;

5.2.3. cusparseMatrixType_t

This type indicates the type of matrix stored in sparse storage. Notice that for symmetric, Hermitian and triangular matrices only their lower or upper part is assumed to be stored.

The whole idea of matrix type and fill mode is to keep minimum storage for symmetric/Hermitian matrix, and also to take advantage of symmetric property on SpMV (Sparse Matrix Vector multiplication). To compute y=A*x when A is symmetric and only lower triangular part is stored, two steps are needed. First step is to compute y=(L+D)*x and second step is to compute y=L^T*x + y. Given the fact that the transpose operation y=L^T*x is 10x slower than non-transpose version y=L*x, the symmetric property does not show up any performance gain. It is better for the user to extend the symmetric matrix to a general matrix and apply y=A*x with matrix type CUSPARSE_MATRIX_TYPE_GENERAL.

In general, SpMV, preconditioners (incomplete Cholesky or incomplete LU) and triangular solver are combined together in iterative solvers, for example PCG and GMRES. If the user always uses general matrix (instead of symmetric matrix), there is no need to support other than general matrix in preconditioners. Therefore the new routines, [bsr|csr]sv2 (triangular solver), [bsr|csr]ilu02 (incomplete LU) and [bsr|csr]ic02 (incomplete Cholesky), only support matrix type CUSPARSE_MATRIX_TYPE_GENERAL.

Value

Meaning

CUSPARSE_MATRIX_TYPE_GENERAL

the matrix is general.

CUSPARSE_MATRIX_TYPE_SYMMETRIC

the matrix is symmetric.

CUSPARSE_MATRIX_TYPE_HERMITIAN

the matrix is Hermitian.

CUSPARSE_MATRIX_TYPE_TRIANGULAR

the matrix is triangular.

5.2.4. cusparseColorInfo_t [DEPRECATED]

This is a pointer type to an opaque structure holding the information used in csrcolor().

5.2.5. cusparseSolvePolicy_t [DEPRECATED]

This type indicates whether level information is generated and used in csrsv2, csric02, csrilu02, bsrsv2, bsric02 and bsrilu02.

Value

Meaning

CUSPARSE_SOLVE_POLICY_NO_LEVEL

no level information is generated and used.

CUSPARSE_SOLVE_POLICY_USE_LEVEL

generate and use level information.

5.2.6. bsric02Info_t [DEPRECATED]

This is a pointer type to an opaque structure holding the information used in bsric02_bufferSize(), bsric02_analysis(), and bsric02().

5.2.7. bsrilu02Info_t [DEPRECATED]

This is a pointer type to an opaque structure holding the information used in bsrilu02_bufferSize(), bsrilu02_analysis(), and bsrilu02().

5.2.8. bsrsm2Info_t [DEPRECATED]

This is a pointer type to an opaque structure holding the information used in bsrsm2_bufferSize(), bsrsm2_analysis(), and bsrsm2_solve().

5.2.9. bsrsv2Info_t [DEPRECATED]

This is a pointer type to an opaque structure holding the information used in bsrsv2_bufferSize(), bsrsv2_analysis(), and bsrsv2_solve().

5.2.10. csric02Info_t [DEPRECATED]

This is a pointer type to an opaque structure holding the information used in csric02_bufferSize(), csric02_analysis(), and csric02().

5.2.11. csrilu02Info_t [DEPRECATED]

This is a pointer type to an opaque structure holding the information used in csrilu02_bufferSize(), csrilu02_analysis(), and csrilu02().

5.3. cuSPARSE Helper Function Reference

The cuSPARSE helper functions are described in this section.

5.3.1. cusparseCreateColorInfo() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseCreateColorInfo(cusparseColorInfo_t* info)

This function creates and initializes the cusparseColorInfo_t structure to default values.

Input

info

the pointer to the cusparseColorInfo_t structure

See cusparseStatus_t for the description of the return status

5.3.2. cusparseCreateMatDescr()

cusparseStatus_t
cusparseCreateMatDescr(cusparseMatDescr_t *descrA)

This function initializes the matrix descriptor. It sets the fields MatrixType and IndexBase to the default values CUSPARSE_MATRIX_TYPE_GENERAL and CUSPARSE_INDEX_BASE_ZERO , respectively, while leaving other fields uninitialized.

Input

descrA

the pointer to the matrix descriptor.

See cusparseStatus_t for the description of the return status

5.3.3. cusparseDestroyColorInfo() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseDestroyColorInfo(cusparseColorInfo_t info)

This function destroys and releases any memory required by the structure.

Input

info

the pointer to the structure of csrcolor()

See cusparseStatus_t for the description of the return status

5.3.4. cusparseDestroyMatDescr()

cusparseStatus_t
cusparseDestroyMatDescr(cusparseMatDescr_t descrA)

This function releases the memory allocated for the matrix descriptor.

Input

descrA

the matrix descriptor.

See cusparseStatus_t for the description of the return status

5.3.5. cusparseGetMatDiagType()

cusparseDiagType_t
cusparseGetMatDiagType(const cusparseMatDescr_t descrA)

This function returns the DiagType field of the matrix descriptor descrA.

Input

descrA

the matrix descriptor.

Returned

 

One of the enumerated diagType types.

5.3.6. cusparseGetMatFillMode()

cusparseFillMode_t
cusparseGetMatFillMode(const cusparseMatDescr_t descrA)

This function returns the FillMode field of the matrix descriptor descrA.

Input

descrA

the matrix descriptor.

Returned

 

One of the enumerated fillMode types.

5.3.7. cusparseGetMatIndexBase()

cusparseIndexBase_t
cusparseGetMatIndexBase(const cusparseMatDescr_t descrA)

This function returns the IndexBase field of the matrix descriptor descrA.

Input

descrA

the matrix descriptor.

Returned

 

One of the enumerated indexBase types.

5.3.8. cusparseGetMatType()

cusparseMatrixType_t
cusparseGetMatType(const cusparseMatDescr_t descrA)

This function returns the MatrixType field of the matrix descriptor descrA.

Input

descrA

the matrix descriptor.

Returned

 

One of the enumerated matrix types.

5.3.9. cusparseSetMatDiagType()

cusparseStatus_t
cusparseSetMatDiagType(cusparseMatDescr_t descrA,
                       cusparseDiagType_t diagType)

This function sets the DiagType field of the matrix descriptor descrA.

Input

diagType

One of the enumerated diagType types.

Output

descrA

the matrix descriptor.

See cusparseStatus_t for the description of the return status

5.3.10. cusparseSetMatFillMode()

cusparseStatus_t
cusparseSetMatFillMode(cusparseMatDescr_t descrA,
                       cusparseFillMode_t fillMode)

This function sets the FillMode field of the matrix descriptor descrA.

Input

fillMode

One of the enumerated fillMode types.

Output

descrA

the matrix descriptor.

See cusparseStatus_t for the description of the return status

5.3.11. cusparseSetMatIndexBase()

cusparseStatus_t
cusparseSetMatIndexBase(cusparseMatDescr_t descrA,
                        cusparseIndexBase_t base)

This function sets the IndexBase field of the matrix descriptor descrA.

Input

base

One of the enumerated indexBase types.

Output

descrA

the matrix descriptor.

See cusparseStatus_t for the description of the return status

5.3.12. cusparseSetMatType()

cusparseStatus_t
cusparseSetMatType(cusparseMatDescr_t descrA, cusparseMatrixType_t type)

This function sets the MatrixType field of the matrix descriptor descrA.

Input

type

One of the enumerated matrix types.

Output

descrA

the matrix descriptor.

See cusparseStatus_t for the description of the return status

5.3.13. cusparseCreateCsric02Info() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseCreateCsric02Info(csric02Info_t *info);

This function creates and initializes the solve and analysis structure of incomplete Cholesky to default values.

Input

info

the pointer to the solve and analysis structure of incomplete Cholesky.

See cusparseStatus_t for the description of the return status

5.3.14. cusparseDestroyCsric02Info() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseDestroyCsric02Info(csric02Info_t info);

This function destroys and releases any memory required by the structure.

Input

info

the solve (csric02_solve) and analysis (csric02_analysis) structure.

See cusparseStatus_t for the description of the return status

5.3.15. cusparseCreateCsrilu02Info() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseCreateCsrilu02Info(csrilu02Info_t *info);

This function creates and initializes the solve and analysis structure of incomplete LU to default values.

Input

info

the pointer to the solve and analysis structure of incomplete LU.

See cusparseStatus_t for the description of the return status

5.3.16. cusparseDestroyCsrilu02Info() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseDestroyCsrilu02Info(csrilu02Info_t info);

This function destroys and releases any memory required by the structure.

Input

info

the solve (csrilu02_solve) and analysis (csrilu02_analysis) structure.

See cusparseStatus_t for the description of the return status

5.3.17. cusparseCreateBsrsv2Info() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseCreateBsrsv2Info(bsrsv2Info_t *info);

This function creates and initializes the solve and analysis structure of bsrsv2 to default values.

Input

info

the pointer to the solve and analysis structure of bsrsv2.

See cusparseStatus_t for the description of the return status

5.3.18. cusparseDestroyBsrsv2Info() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseDestroyBsrsv2Info(bsrsv2Info_t info);

This function destroys and releases any memory required by the structure.

Input

info

the solve (bsrsv2_solve) and analysis (bsrsv2_analysis) structure.

See cusparseStatus_t for the description of the return status

5.3.19. cusparseCreateBsrsm2Info() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseCreateBsrsm2Info(bsrsm2Info_t *info);

This function creates and initializes the solve and analysis structure of bsrsm2 to default values.

Input

info

the pointer to the solve and analysis structure of bsrsm2.

See cusparseStatus_t for the description of the return status

5.3.20. cusparseDestroyBsrsm2Info() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseDestroyBsrsm2Info(bsrsm2Info_t info);

This function destroys and releases any memory required by the structure.

Input

info

the solve (bsrsm2_solve) and analysis (bsrsm2_analysis) structure.

See cusparseStatus_t for the description of the return status

5.3.21. cusparseCreateBsric02Info() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseCreateBsric02Info(bsric02Info_t *info);

This function creates and initializes the solve and analysis structure of block incomplete Cholesky to default values.

Input

info

the pointer to the solve and analysis structure of block incomplete Cholesky.

See cusparseStatus_t for the description of the return status

5.3.22. cusparseDestroyBsric02Info() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseDestroyBsric02Info(bsric02Info_t info);

This function destroys and releases any memory required by the structure.

Input

info

the solve (bsric02_solve) and analysis (bsric02_analysis) structure.

See cusparseStatus_t for the description of the return status

5.3.23. cusparseCreateBsrilu02Info() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseCreateBsrilu02Info(bsrilu02Info_t *info);

This function creates and initializes the solve and analysis structure of block incomplete LU to default values.

Input

info

the pointer to the solve and analysis structure of block incomplete LU.

See cusparseStatus_t for the description of the return status

5.3.24. cusparseDestroyBsrilu02Info() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseDestroyBsrilu02Info(bsrilu02Info_t info);

This function destroys and releases any memory required by the structure.

Input

info

the solve (bsrilu02_solve) and analysis (bsrilu02_analysis) structure.

See cusparseStatus_t for the description of the return status

5.3.25. cusparseCreatePruneInfo() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseCreatePruneInfo(pruneInfo_t *info);

This function creates and initializes structure of prune to default values.

Input

info

the pointer to the structure of prune.

See cusparseStatus_t for the description of the return status

5.3.26. cusparseDestroyPruneInfo() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseDestroyPruneInfo(pruneInfo_t info);

This function destroys and releases any memory required by the structure.

Input

info

the structure of prune.

See cusparseStatus_t for the description of the return status

5.4. cuSPARSE Level 2 Function Reference

This chapter describes the sparse linear algebra functions that perform operations between sparse matrices and dense vectors.

5.4.1. cusparse<t>bsrmv()

cusparseStatus_t
cusparseSbsrmv(cusparseHandle_t         handle,
               cusparseDirection_t      dir,
               cusparseOperation_t      trans,
               int                      mb,
               int                      nb,
               int                      nnzb,
               const float*             alpha,
               const cusparseMatDescr_t descr,
               const float*             bsrVal,
               const int*               bsrRowPtr,
               const int*               bsrColInd,
               int                      blockDim,
               const float*             x,
               const float*             beta,
               float*                   y)

cusparseStatus_t
cusparseDbsrmv(cusparseHandle_t         handle,
               cusparseDirection_t      dir,
               cusparseOperation_t      trans,
               int                      mb,
               int                      nb,
               int                      nnzb,
               const double*            alpha,
               const cusparseMatDescr_t descr,
               const double*            bsrVal,
               const int*               bsrRowPtr,
               const int*               bsrColInd,
               int                      blockDim,
               const double*            x,
               const double*            beta,
               double*                  y)

cusparseStatus_t
cusparseCbsrmv(cusparseHandle_t         handle,
               cusparseDirection_t      dir,
               cusparseOperation_t      trans,
               int                      mb,
               int                      nb,
               int                      nnzb,
               const cuComplex*         alpha,
               const cusparseMatDescr_t descr,
               const cuComplex*         bsrVal,
               const int*               bsrRowPtr,
               const int*               bsrColInd,
               int                      blockDim,
               const cuComplex*         x,
               const cuComplex*         beta,
               cuComplex*               y)

cusparseStatus_t
cusparseZbsrmv(cusparseHandle_t         handle,
               cusparseDirection_t      dir,
               cusparseOperation_t      trans,
               int                      mb,
               int                      nb,
               int                      nnzb,
               const cuDoubleComplex*   alpha,
               const cusparseMatDescr_t descr,
               const cuDoubleComplex*   bsrVal,
               const int*               bsrRowPtr,
               const int*               bsrColInd,
               int                      blockDim,
               const cuDoubleComplex*   x,
               const cuDoubleComplex*   beta,
               cuDoubleComplex*         y)

This function performs the matrix-vector operation

\(\text{y} = \alpha \ast \text{op}(A) \ast \text{x} + \beta \ast \text{y}\)

where \(A\text{ is an }(mb \ast blockDim) \times (nb \ast blockDim)\) sparse matrix that is defined in BSR storage format by the three arrays bsrVal, bsrRowPtr, and bsrColInd); x and y are vectors; \(\alpha\text{ and }\beta\) are scalars; and

\(\text{op}(A) = \begin{cases} A & \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if trans == CUSPARSE_OPERATION_TRANSPOSE} \\ A^{H} & \text{if trans == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)

bsrmv() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Several comments on bsrmv():

  • Only blockDim > 1 is supported

  • Only CUSPARSE_OPERATION_NON_TRANSPOSE is supported, that is

\(\text{y} = \alpha \ast A \ast \text{x} + \beta{} \ast \text{y}\)

  • Only CUSPARSE_MATRIX_TYPE_GENERAL is supported.

  • The size of vector x should be \((nb \ast blockDim)\) at least, and the size of vector y should be \((mb \ast blockDim)\) at least; otherwise, the kernel may return CUSPARSE_STATUS_EXECUTION_FAILED because of an out-of-bounds array.

For example, suppose the user has a CSR format and wants to try bsrmv(), the following code demonstrates how to use csr2bsr() conversion and bsrmv() multiplication in single precision.

// Suppose that A is m x n sparse matrix represented by CSR format,
// hx is a host vector of size n, and hy is also a host vector of size m.
// m and n are not multiple of blockDim.
// step 1: transform CSR to BSR with column-major order
int base, nnz;
int nnzb;
cusparseDirection_t dirA = CUSPARSE_DIRECTION_COLUMN;
int mb = (m + blockDim-1)/blockDim;
int nb = (n + blockDim-1)/blockDim;
cudaMalloc((void**)&bsrRowPtrC, sizeof(int) *(mb+1));
cusparseXcsr2bsrNnz(handle, dirA, m, n,
        descrA, csrRowPtrA, csrColIndA, blockDim,
        descrC, bsrRowPtrC, &nnzb);
cudaMalloc((void**)&bsrColIndC, sizeof(int)*nnzb);
cudaMalloc((void**)&bsrValC, sizeof(float)*(blockDim*blockDim)*nnzb);
cusparseScsr2bsr(handle, dirA, m, n,
        descrA, csrValA, csrRowPtrA, csrColIndA, blockDim,
        descrC, bsrValC, bsrRowPtrC, bsrColIndC);
// step 2: allocate vector x and vector y large enough for bsrmv
cudaMalloc((void**)&x, sizeof(float)*(nb*blockDim));
cudaMalloc((void**)&y, sizeof(float)*(mb*blockDim));
cudaMemcpy(x, hx, sizeof(float)*n, cudaMemcpyHostToDevice);
cudaMemcpy(y, hy, sizeof(float)*m, cudaMemcpyHostToDevice);
// step 3: perform bsrmv
cusparseSbsrmv(handle, dirA, transA, mb, nb, nnzb, &alpha,
   descrC, bsrValC, bsrRowPtrC, bsrColIndC, blockDim, x, &beta, y);

Input

handle

handle to the cuSPARSE library context.

dir

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

trans

the operation \(\text{op}(A)\) . Only CUSPARSE_OPERATION_NON_TRANSPOSE is supported.

mb

number of block rows of matrix \(A\).

nb

number of block columns of matrix \(A\).

nnzb

number of nonzero blocks of matrix \(A\).

alpha

<type> scalar used for multiplication.

descr

the descriptor of matrix \(A\). The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrVal

<type> array of nnz\(( =\)csrRowPtrA(mb)\(-\)csrRowPtrA(0)\()\) nonzero blocks of matrix \(A\).

bsrRowPtr

integer array of mb\(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColInd

integer array of nnz\(( =\)csrRowPtrA(mb)\(-\)csrRowPtrA(0)\()\) column indices of the nonzero blocks of matrix \(A\).

blockDim

block dimension of sparse matrix \(A\), larger than zero.

x

<type> vector of \(nb \ast blockDim\) elements.

beta

<type> scalar used for multiplication. If beta is zero, y does not have to be a valid input.

y

<type> vector of \(mb \ast blockDim\) elements.

Output

y

<type> updated vector.

See cusparseStatus_t for the description of the return status.

5.4.2. cusparse<t>bsrxmv() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrxmv(cusparseHandle_t         handle,
                cusparseDirection_t      dir,
                cusparseOperation_t      trans,
                int                      sizeOfMask,
                int                      mb,
                int                      nb,
                int                      nnzb,
                const float*             alpha,
                const cusparseMatDescr_t descr,
                const float*             bsrVal,
                const int*               bsrMaskPtr,
                const int*               bsrRowPtr,
                const int*               bsrEndPtr,
                const int*               bsrColInd,
                int                      blockDim,
                const float*             x,
                const float*             beta,
                float*                   y)

cusparseStatus_t
cusparseDbsrxmv(cusparseHandle_t         handle,
                cusparseDirection_t      dir,
                cusparseOperation_t      trans,
                int                      sizeOfMask,
                int                      mb,
                int                      nb,
                int                      nnzb,
                const double*            alpha,
                const cusparseMatDescr_t descr,
                const double*            bsrVal,
                const int*               bsrMaskPtr,
                const int*               bsrRowPtr,
                const int*               bsrEndPtr,
                const int*               bsrColInd,
                int                      blockDim,
                const double*            x,
                const double*            beta,
                double*                  y)

cusparseStatus_t
cusparseCbsrxmv(cusparseHandle_t         handle,
                cusparseDirection_t      dir,
                cusparseOperation_t      trans,
                int                      sizeOfMask,
                int                      mb,
                int                      nb,
                int                      nnzb,
                const cuComplex*         alpha,
                const cusparseMatDescr_t descr,
                const cuComplex*         bsrVal,
                const int*               bsrMaskPtr,
                const int*               bsrRowPtr,
                const int*               bsrEndPtr,
                const int*               bsrColInd,
                int                      blockDim,
                const cuComplex*         x,
                const cuComplex*         beta,
                cuComplex*               y)

cusparseStatus_t
cusparseZbsrxmv(cusparseHandle_t         handle,
                cusparseDirection_t      dir,
                cusparseOperation_t      trans,
                int                      sizeOfMask,
                int                      mb,
                int                      nb,
                int                      nnzb,
                const cuDoubleComplex*   alpha,
                const cusparseMatDescr_t descr,
                const cuDoubleComplex*   bsrVal,
                const int*               bsrMaskPtr,
                const int*               bsrRowPtr,
                const int*               bsrEndPtr,
                const int*               bsrColInd,
                int                      blockDim,
                const cuDoubleComplex*   x,
                const cuDoubleComplex*   beta,
                cuDoubleComplex*         y)

This function performs a bsrmv and a mask operation

\(\text{y(mask)} = (\alpha \ast \text{op}(A) \ast \text{x} + \beta \ast \text{y})\text{(mask)}\)

where \(A\text{ is an }(mb \ast blockDim) \times (nb \ast blockDim)\) sparse matrix that is defined in BSRX storage format by the four arrays bsrVal, bsrRowPtr, bsrEndPtr, and bsrColInd); x and y are vectors; \(\alpha\text{~and~}\beta\) are scalars; and

\(\text{op}(A) = \begin{cases} A & \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if trans == CUSPARSE_OPERATION_TRANSPOSE} \\ A^{H} & \text{if trans == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)

The mask operation is defined by array bsrMaskPtr which contains updated block row indices of \(y\) . If row \(i\) is not specified in bsrMaskPtr, then bsrxmv() does not touch row block \(i\) of \(A\) and \(y\) .

For example, consider the \(2 \times 3\) block matrix \(A\):

\(\begin{matrix} {A = \begin{bmatrix} A_{11} & A_{12} & O \\ A_{21} & A_{22} & A_{23} \\ \end{bmatrix}} \\ \end{matrix}\)

and its one-based BSR format (three vector form) is

\(\begin{matrix} \text{bsrVal} & = & \begin{bmatrix} A_{11} & A_{12} & A_{21} & A_{22} & A_{23} \\ \end{bmatrix} \\ \text{bsrRowPtr} & = & \begin{bmatrix} {1\phantom{.0}} & {3\phantom{.0}} & 6 \\ \end{bmatrix} \\ \text{bsrColInd} & = & \begin{bmatrix} {1\phantom{.0}} & {2\phantom{.0}} & {1\phantom{.0}} & {2\phantom{.0}} & 3 \\ \end{bmatrix} \\ \end{matrix}\)

Suppose we want to do the following bsrmv operation on a matrix \(\overset{¯}{A}\) which is slightly different from \(A\) .

\(\begin{bmatrix} y_{1} \\ y_{2} \\ \end{bmatrix}:=alpha \ast (\widetilde{A} = \begin{bmatrix} O & O & O \\ O & A_{22} & O \\ \end{bmatrix}) \ast \begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \\ \end{bmatrix} + \begin{bmatrix} y_{1} \\ {beta \ast y_{2}} \\ \end{bmatrix}\)

We don’t need to create another BSR format for the new matrix \(\overset{¯}{A}\) , all that we should do is to keep bsrVal and bsrColInd unchanged, but modify bsrRowPtr and add an additional array bsrEndPtr which points to the last nonzero elements per row of \(\overset{¯}{A}\) plus 1.

For example, the following bsrRowPtr and bsrEndPtr can represent matrix \(\overset{¯}{A}\) :

\(\begin{matrix} \text{bsrRowPtr} & = & \begin{bmatrix} {1\phantom{.0}} & 4 \\ \end{bmatrix} \\ \text{bsrEndPtr} & = & \begin{bmatrix} {1\phantom{.0}} & 5 \\ \end{bmatrix} \\ \end{matrix}\)

Further we can use a mask operator (specified by array bsrMaskPtr) to update particular block row indices of \(y\) only because \(y_{1}\) is never changed. In this case, bsrMaskPtr\(=\) [2] and sizeOfMask=1.

The mask operator is equivalent to the following operation:

\(\begin{bmatrix} ? \\ y_{2} \\ \end{bmatrix}:=alpha \ast \begin{bmatrix} ? & ? & ? \\ O & A_{22} & O \\ \end{bmatrix} \ast \begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \\ \end{bmatrix} + beta \ast \begin{bmatrix} ? \\ y_{2} \\ \end{bmatrix}\)

If a block row is not present in the bsrMaskPtr, then no calculation is performed on that row, and the corresponding value in y is unmodified. The question mark “?” is used to inidcate row blocks not in bsrMaskPtr.

In this case, first row block is not present in bsrMaskPtr, so bsrRowPtr[0] and bsrEndPtr[0] are not touched also.

\(\begin{matrix} \text{bsrRowPtr} & = & \begin{bmatrix} {?\phantom{.0}} & 4 \\ \end{bmatrix} \\ \text{bsrEndPtr} & = & \begin{bmatrix} {?\phantom{.0}} & 5 \\ \end{bmatrix} \\ \end{matrix}\)

bsrxmv() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

A couple of comments on bsrxmv():

  • Only blockDim > 1 is supported

  • Only CUSPARSE_OPERATION_NON_TRANSPOSE and CUSPARSE_MATRIX_TYPE_GENERAL are supported.

  • Parameters bsrMaskPtr, bsrRowPtr, bsrEndPtr and bsrColInd are consistent with base index, either one-based or zero-based. The above example is one-based.

Input

handle

handle to the cuSPARSE library context.

dir

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

trans

the operation \(\text{op}(A)\) . Only CUSPARSE_OPERATION_NON_TRANSPOSE is supported.

sizeOfMask

number of updated block rows of \(y\).

mb

number of block rows of matrix \(A\).

nb

number of block columns of matrix \(A\).

nnzb

number of nonzero blocks of matrix \(A\).

alpha

<type> scalar used for multiplication.

descr

the descriptor of matrix \(A\). The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrVal

<type> array of nnz nonzero blocks of matrix \(A\).

bsrMaskPtr

integer array of sizeOfMask elements that contains the indices corresponding to updated block rows.

bsrRowPtr

integer array of mb elements that contains the start of every block row.

bsrEndPtr

integer array of mb elements that contains the end of the every block row plus one.

bsrColInd

integer array of nnzb column indices of the nonzero blocks of matrix \(A\).

blockDim

block dimension of sparse matrix \(A\), larger than zero.

x

<type> vector of \(nb \ast blockDim\) elements.

beta

<type> scalar used for multiplication. If beta is zero, y does not have to be a valid input.

y

<type> vector of \(mb \ast blockDim\) elements.

See cusparseStatus_t for the description of the return status.

5.4.3. cusparse<t>bsrsv2_bufferSize() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrsv2_bufferSize(cusparseHandle_t         handle,
                           cusparseDirection_t      dirA,
                           cusparseOperation_t      transA,
                           int                      mb,
                           int                      nnzb,
                           const cusparseMatDescr_t descrA,
                           float*                   bsrValA,
                           const int*               bsrRowPtrA,
                           const int*               bsrColIndA,
                           int                      blockDim,
                           bsrsv2Info_t             info,
                           int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseDbsrsv2_bufferSize(cusparseHandle_t         handle,
                           cusparseDirection_t      dirA,
                           cusparseOperation_t      transA,
                           int                      mb,
                           int                      nnzb,
                           const cusparseMatDescr_t descrA,
                           double*                  bsrValA,
                           const int*               bsrRowPtrA,
                           const int*               bsrColIndA,
                           int                      blockDim,
                           bsrsv2Info_t             info,
                           int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseCbsrsv2_bufferSize(cusparseHandle_t         handle,
                           cusparseDirection_t      dirA,
                           cusparseOperation_t      transA,
                           int                      mb,
                           int                      nnzb,
                           const cusparseMatDescr_t descrA,
                           cuComplex*               bsrValA,
                           const int*               bsrRowPtrA,
                           const int*               bsrColIndA,
                           int                      blockDim,
                           bsrsv2Info_t             info,
                           int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseZbsrsv2_bufferSize(cusparseHandle_t         handle,
                           cusparseDirection_t      dirA,
                           cusparseOperation_t      transA,
                           int                      mb,
                           int                      nnzb,
                           const cusparseMatDescr_t descrA,
                           cuDoubleComplex*         bsrValA,
                           const int*               bsrRowPtrA,
                           const int*               bsrColIndA,
                           int                      blockDim,
                           bsrsv2Info_t             info,
                           int*                     pBufferSizeInBytes)

This function returns size of the buffer used in bsrsv2, a new sparse triangular linear system op(A)*y =\(\alpha\)x.

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA); x and y are the right-hand-side and the solution vectors; \(\alpha\) is a scalar; and

\(\text{op}(A) = \begin{cases} A & \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if trans == CUSPARSE_OPERATION_TRANSPOSE} \\ A^{H} & \text{if trans == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)

Although there are six combinations in terms of parameter trans and the upper (lower) triangular part of A, bsrsv2_bufferSize() returns the maximum size buffer among these combinations. The buffer size depends on the dimensions mb, blockDim, and the number of nonzero blocks of the matrix nnzb. If the user changes the matrix, it is necessary to call bsrsv2_bufferSize() again to have the correct buffer size; otherwise a segmentation fault may occur.

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

transA

the operation \(\text{op}(A)\) .

mb

number of block rows of matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT.

bsrValA

<type> array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb\(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A; must be larger than zero.

Output

info

record of internal states based on different algorithms.

pBufferSizeInBytes

number of bytes of the buffer used in the bsrsv2_analysis() and bsrsv2_solve().

See cusparseStatus_t for the description of the return status.

5.4.4. cusparse<t>bsrsv2_analysis() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrsv2_analysis(cusparseHandle_t         handle,
                         cusparseDirection_t      dirA,
                         cusparseOperation_t      transA,
                         int                      mb,
                         int                      nnzb,
                         const cusparseMatDescr_t descrA,
                         const float*             bsrValA,
                         const int*               bsrRowPtrA,
                         const int*               bsrColIndA,
                         int                      blockDim,
                         bsrsv2Info_t             info,
                         cusparseSolvePolicy_t    policy,
                         void*                    pBuffer)

cusparseStatus_t
cusparseDbsrsv2_analysis(cusparseHandle_t         handle,
                         cusparseDirection_t      dirA,
                         cusparseOperation_t      transA,
                         int                      mb,
                         int                      nnzb,
                         const cusparseMatDescr_t descrA,
                         const double*            bsrValA,
                         const int*               bsrRowPtrA,
                         const int*               bsrColIndA,
                         int                      blockDim,
                         bsrsv2Info_t             info,
                         cusparseSolvePolicy_t    policy,
                         void*                    pBuffer)

cusparseStatus_t
cusparseDbsrsv2_analysis(cusparseHandle_t         handle,
                         cusparseDirection_t      dirA,
                         cusparseOperation_t      transA,
                         int                      mb,
                         int                      nnzb,
                         const cusparseMatDescr_t descrA,
                         const cuComplex*         bsrValA,
                         const int*               bsrRowPtrA,
                         const int*               bsrColIndA,
                         int                      blockDim,
                         bsrsv2Info_t             info,
                         cusparseSolvePolicy_t    policy,
                         void*                    pBuffer)

cusparseStatus_t
cusparseZbsrsv2_analysis(cusparseHandle_t         handle,
                         cusparseDirection_t      dirA,
                         cusparseOperation_t      transA,
                         int                      mb,
                         int                      nnzb,
                         const cusparseMatDescr_t descrA,
                         const cuDoubleComplex*   bsrValA,
                         const int*               bsrRowPtrA,
                         const int*               bsrColIndA,
                         int                      blockDim,
                         bsrsv2Info_t             info,
                         cusparseSolvePolicy_t    policy,
                         void*                    pBuffer)

This function performs the analysis phase of bsrsv2, a new sparse triangular linear system op(A)*y =\(\alpha\)x.

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA); x and y are the right-hand side and the solution vectors; \(\alpha\) is a scalar; and

\(\text{op}(A) = \begin{cases} A & \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if trans == CUSPARSE_OPERATION_TRANSPOSE} \\ A^{H} & \text{if trans == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)

The block of BSR format is of size blockDim*blockDim, stored as column-major or row-major as determined by parameter dirA, which is either CUSPARSE_DIRECTION_COLUMN or CUSPARSE_DIRECTION_ROW. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored.

It is expected that this function will be executed only once for a given matrix and a particular operation type.

This function requires a buffer size returned by bsrsv2_bufferSize(). The address of pBuffer must be multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function bsrsv2_analysis() reports a structural zero and computes level information, which stored in the opaque structure info. The level information can extract more parallelism for a triangular solver. However bsrsv2_solve() can be done without level information. To disable level information, the user needs to specify the policy of the triangular solver as CUSPARSE_SOLVE_POLICY_NO_LEVEL.

Function bsrsv2_analysis() always reports the first structural zero, even when parameter policy is CUSPARSE_SOLVE_POLICY_NO_LEVEL. No structural zero is reported if CUSPARSE_DIAG_TYPE_UNIT is specified, even if block A(j,j) is missing for some j. The user needs to call cusparseXbsrsv2_zeroPivot() to know where the structural zero is.

It is the user’s choice whether to call bsrsv2_solve() if bsrsv2_analysis() reports a structural zero. In this case, the user can still call bsrsv2_solve(), which will return a numerical zero at the same position as a structural zero. However the result x is meaningless.

  • This function requires temporary extra storage that is allocated internally

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

transA

the operation \(\text{op}(A)\) .

mb

number of block rows of matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT.

bsrValA

<type> array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb\(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A, larger than zero.

info

structure initialized using cusparseCreateBsrsv2Info().

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user, the size is return by bsrsv2_bufferSize().

Output

info

structure filled with information collected during the analysis phase (that should be passed to the solve phase unchanged).

See cusparseStatus_t for the description of the return status.

5.4.5. cusparse<t>bsrsv2_solve() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrsv2_solve(cusparseHandle_t         handle,
                      cusparseDirection_t      dirA,
                      cusparseOperation_t      transA,
                      int                      mb,
                      int                      nnzb,
                      const float*             alpha,
                      const cusparseMatDescr_t descrA,
                      const float*             bsrValA,
                      const int*               bsrRowPtrA,
                      const int*               bsrColIndA,
                      int                      blockDim,
                      bsrsv2Info_t             info,
                      const float*             x,
                      float*                   y,
                      cusparseSolvePolicy_t    policy,
                      void*                    pBuffer)

cusparseStatus_t
cusparseDbsrsv2_solve(cusparseHandle_t         handle,
                      cusparseDirection_t      dirA,
                      cusparseOperation_t      transA,
                      int                      mb,
                      int                      nnzb,
                      const double*            alpha,
                      const cusparseMatDescr_t descrA,
                      const double*            bsrValA,
                      const int*               bsrRowPtrA,
                      const int*               bsrColIndA,
                      int                      blockDim,
                      bsrsv2Info_t             info,
                      const double*            x,
                      double*                  y,
                      cusparseSolvePolicy_t    policy,
                      void*                    pBuffer)

cusparseStatus_t
cusparseCbsrsv2_solve(cusparseHandle_t         handle,
                      cusparseDirection_t      dirA,
                      cusparseOperation_t      transA,
                      int                      mb,
                      int                      nnzb,
                      const cuComplex*         alpha,
                      const cusparseMatDescr_t descrA,
                      const cuComplex*         bsrValA,
                      const int*               bsrRowPtrA,
                      const int*               bsrColIndA,
                      int                      blockDim,
                      bsrsv2Info_t             info,
                      const cuComplex*         x,
                      cuComplex*               y,
                      cusparseSolvePolicy_t    policy,
                      void*                    pBuffer)

cusparseStatus_t
cusparseZbsrsv2_solve(cusparseHandle_t         handle,
                      cusparseDirection_t      dirA,
                      cusparseOperation_t      transA,
                      int                      mb,
                      int                      nnzb,
                      const cuDoubleComplex*   alpha,
                      const cusparseMatDescr_t descrA,
                      const cuDoubleComplex*   bsrValA,
                      const int*               bsrRowPtrA,
                      const int*               bsrColIndA,
                      int                      blockDim,
                      bsrsv2Info_t             info,
                      const cuDoubleComplex*   x,
                      cuDoubleComplex*         y,
                      cusparseSolvePolicy_t    policy,
                      void*                    pBuffer)

This function performs the solve phase of bsrsv2, a new sparse triangular linear system op(A)*y =\(\alpha\)x.

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA); x and y are the right-hand-side and the solution vectors; \(\alpha\) is a scalar; and

\(\text{op}(A) = \begin{cases} A & \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if trans == CUSPARSE_OPERATION_TRANSPOSE} \\ A^{H} & \text{if trans == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)

The block in BSR format is of size blockDim*blockDim, stored as column-major or row-major as determined by parameter dirA, which is either CUSPARSE_DIRECTION_COLUMN or CUSPARSE_DIRECTION_ROW. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored. Function bsrsv02_solve() can support an arbitrary blockDim.

This function may be executed multiple times for a given matrix and a particular operation type.

This function requires a buffer size returned by bsrsv2_bufferSize(). The address of pBuffer must be multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Although bsrsv2_solve() can be done without level information, the user still needs to be aware of consistency. If bsrsv2_analysis() is called with policy CUSPARSE_SOLVE_POLICY_USE_LEVEL, bsrsv2_solve() can be run with or without levels. On the other hand, if bsrsv2_analysis() is called with CUSPARSE_SOLVE_POLICY_NO_LEVEL, bsrsv2_solve() can only accept CUSPARSE_SOLVE_POLICY_NO_LEVEL; otherwise, CUSPARSE_STATUS_INVALID_VALUE is returned.

The level information may not improve the performance, but may spend extra time doing analysis. For example, a tridiagonal matrix has no parallelism. In this case, CUSPARSE_SOLVE_POLICY_NO_LEVEL performs better than CUSPARSE_SOLVE_POLICY_USE_LEVEL. If the user has an iterative solver, the best approach is to do bsrsv2_analysis() with CUSPARSE_SOLVE_POLICY_USE_LEVEL once. Then do bsrsv2_solve() with CUSPARSE_SOLVE_POLICY_NO_LEVEL in the first run, and with CUSPARSE_SOLVE_POLICY_USE_LEVEL in the second run, and pick the fastest one to perform the remaining iterations.

Function bsrsv02_solve() has the same behavior as csrsv02_solve(). That is, bsr2csr(bsrsv02(A)) = csrsv02(bsr2csr(A)). The numerical zero of csrsv02_solve() means there exists some zero A(j,j). The numerical zero of bsrsv02_solve() means there exists some block A(j,j) that is not invertible.

Function bsrsv2_solve() reports the first numerical zero, including a structural zero. No numerical zero is reported if CUSPARSE_DIAG_TYPE_UNIT is specified, even if A(j,j) is not invertible for some j. The user needs to call cusparseXbsrsv2_zeroPivot() to know where the numerical zero is.

The function supports the following properties if pBuffer != NULL

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

For example, suppose L is a lower triangular matrix with unit diagonal, then the following code solves L*y=x by level information.

// Suppose that L is m x m sparse matrix represented by BSR format,
// The number of block rows/columns is mb, and
// the number of nonzero blocks is nnzb.
// L is lower triangular with unit diagonal.
// Assumption:
// - dimension of matrix L is m(=mb*blockDim),
// - matrix L has nnz(=nnzb*blockDim*blockDim) nonzero elements,
// - handle is already created by cusparseCreate(),
// - (d_bsrRowPtr, d_bsrColInd, d_bsrVal) is BSR of L on device memory,
// - d_x is right hand side vector on device memory.
// - d_y is solution vector on device memory.
// - d_x and d_y are of size m.
cusparseMatDescr_t descr = 0;
bsrsv2Info_t info = 0;
int pBufferSize;
void *pBuffer = 0;
int structural_zero;
int numerical_zero;
const double alpha = 1.;
const cusparseSolvePolicy_t policy = CUSPARSE_SOLVE_POLICY_USE_LEVEL;
const cusparseOperation_t trans = CUSPARSE_OPERATION_NON_TRANSPOSE;
const cusparseDirection_t dir = CUSPARSE_DIRECTION_COLUMN;

// step 1: create a descriptor which contains
// - matrix L is base-1
// - matrix L is lower triangular
// - matrix L has unit diagonal, specified by parameter CUSPARSE_DIAG_TYPE_UNIT
//   (L may not have all diagonal elements.)
cusparseCreateMatDescr(&descr);
cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatFillMode(descr, CUSPARSE_FILL_MODE_LOWER);
cusparseSetMatDiagType(descr, CUSPARSE_DIAG_TYPE_UNIT);

// step 2: create a empty info structure
cusparseCreateBsrsv2Info(&info);

// step 3: query how much memory used in bsrsv2, and allocate the buffer
cusparseDbsrsv2_bufferSize(handle, dir, trans, mb, nnzb, descr,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, &pBufferSize);

// pBuffer returned by cudaMalloc is automatically aligned to 128 bytes.
cudaMalloc((void**)&pBuffer, pBufferSize);

// step 4: perform analysis
cusparseDbsrsv2_analysis(handle, dir, trans, mb, nnzb, descr,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim,
    info, policy, pBuffer);
// L has unit diagonal, so no structural zero is reported.
status = cusparseXbsrsv2_zeroPivot(handle, info, &structural_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
   printf("L(%d,%d) is missing\n", structural_zero, structural_zero);
}

// step 5: solve L*y = x
cusparseDbsrsv2_solve(handle, dir, trans, mb, nnzb, &alpha, descr,
   d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info,
   d_x, d_y, policy, pBuffer);
// L has unit diagonal, so no numerical zero is reported.
status = cusparseXbsrsv2_zeroPivot(handle, info, &numerical_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
   printf("L(%d,%d) is zero\n", numerical_zero, numerical_zero);
}

// step 6: free resources
cudaFree(pBuffer);
cusparseDestroyBsrsv2Info(info);
cusparseDestroyMatDescr(descr);
cusparseDestroy(handle);

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

transA

the operation \(\text{op}(A)\).

mb

number of block rows and block columns of matrix A.

alpha

<type> scalar used for multiplication.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT.

bsrValA

<type> array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb\(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A, larger than zero.

info

structure with information collected during the analysis phase (that should have been passed to the solve phase unchanged).

x

<type> right-hand-side vector of size m.

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user, the size is returned by bsrsv2_bufferSize().

Output

y

<type> solution vector of size m.

See cusparseStatus_t for the description of the return status.

5.4.6. cusparseXbsrsv2_zeroPivot() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseXbsrsv2_zeroPivot(cusparseHandle_t handle,
                          bsrsv2Info_t     info,
                          int*             position)

If the returned error code is CUSPARSE_STATUS_ZERO_PIVOT, position=j means A(j,j) is either structural zero or numerical zero (singular block). Otherwise position=-1.

The position can be 0-based or 1-based, the same as the matrix.

Function cusparseXbsrsv2_zeroPivot() is a blocking call. It calls cudaDeviceSynchronize() to make sure all previous kernels are done.

The position can be in the host memory or device memory. The user can set the proper mode with cusparseSetPointerMode().

  • The routine requires no extra storage

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

info

info contains a structural zero or numerical zero if the user already called bsrsv2_analysis() or bsrsv2_solve().

Output

position

if no structural or numerical zero, position is -1; otherwise if A(j,j) is missing or U(j,j) is zero, position=j.

See cusparseStatus_t for the description of the return status

5.4.7. cusparse<t>gemvi()

cusparseStatus_t
cusparseSgemvi_bufferSize(cusparseHandle_t    handle,
                          cusparseOperation_t transA,
                          int                 m,
                          int                 n,
                          int                 nnz,
                          int*                pBufferSize)

cusparseStatus_t
cusparseDgemvi_bufferSize(cusparseHandle_t    handle,
                          cusparseOperation_t transA,
                          int                 m,
                          int                 n,
                          int                 nnz,
                          int*                pBufferSize)

cusparseStatus_t
cusparseCgemvi_bufferSize(cusparseHandle_t    handle,
                          cusparseOperation_t transA,
                          int                 m,
                          int                 n,
                          int                 nnz,
                          int*                pBufferSize)

cusparseStatus_t
cusparseZgemvi_bufferSize(cusparseHandle_t    handle,
                          cusparseOperation_t transA,
                          int                 m,
                          int                 n,
                          int                 nnz,
                          int*                pBufferSize)
cusparseStatus_t
cusparseSgemvi(cusparseHandle_t     handle,
               cusparseOperation_t  transA,
               int                  m,
               int                  n,
               const float*         alpha,
               const float*         A,
               int                  lda,
               int                  nnz,
               const float*         x,
               const int*           xInd,
               const float*         beta,
               float*               y,
               cusparseIndexBase_t  idxBase,
               void*                pBuffer)

cusparseStatus_t
cusparseDgemvi(cusparseHandle_t     handle,
               cusparseOperation_t  transA,
               int                  m,
               int                  n,
               const double*        alpha,
               const double*        A,
               int                  lda,
               int                  nnz,
               const double*        x,
               const int*           xInd,
               const double*        beta,
               double*              y,
               cusparseIndexBase_t  idxBase,
               void*                pBuffer)

cusparseStatus_t
cusparseCgemvi(cusparseHandle_t     handle,
               cusparseOperation_t  transA,
               int                  m,
               int                  n,
               const cuComplex*     alpha,
               const cuComplex*     A,
               int                  lda,
               int                  nnz,
               const cuComplex*     x,
               const int*           xInd,
               const cuComplex*     beta,
               cuComplex*           y,
               cusparseIndexBase_t  idxBase,
               void*                pBuffer)

cusparseStatus_t
cusparseZgemvi(cusparseHandle_t       handle,
               cusparseOperation_t    transA,
               int                    m,
               int                    n,
               const cuDoubleComplex* alpha,
               const cuDoubleComplex* A,
               int                    lda,
               int                    nnz,
               const cuDoubleComplex* x,
               const int*             xInd,
               const cuDoubleComplex* beta,
               cuDoubleComplex*       y,
               cusparseIndexBase_t    idxBase,
               void*                  pBuffer)

This function performs the matrix-vector operation

\(\text{y} = \alpha \ast \text{op}(A) \ast \text{x} + \beta \ast \text{y}\)

A is an m×n dense matrix and a sparse vector x that is defined in a sparse storage format by the two arrays xVal, xInd of length nnz, and y is a dense vector; \(\alpha\)and \(\beta\)are scalars; and

\(\text{op}(A) = \begin{cases} A & \text{if trans == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if trans == CUSPARSE_OPERATION_TRANSPOSE} \\ \end{cases}\)

To simplify the implementation, we have not (yet) optimized the transpose multiple case. We recommend the following for users interested in this case.

  1. Convert the matrix from CSR to CSC format using one of the csr2csc() functions. Notice that by interchanging the rows and columns of the result you are implicitly transposing the matrix.

  2. Call the gemvi() function with the cusparseOperation_t parameter set to CUSPARSE_OPERATION_NON_TRANSPOSE and with the interchanged rows and columns of the matrix stored in CSC format. This (implicitly) multiplies the vector by the transpose of the matrix in the original CSR format.

    • The routine requires no extra storage

    • The routine supports asynchronous execution

    • The routine supports CUDA graph capture

The function cusparse<t>gemvi_bufferSize() returns the size of buffer used in cusparse<t>gemvi().

Input

handle

Handle to the cuSPARSE library context.

trans

The operation \(\text{op}(A)\).

m

Number of rows of matrix A.

n

Number of columns of matrix A.

alpha

<type> scalar used for multiplication.

A

The pointer to dense matrix A.

lda

Size of the leading dimension of A.

nnz

Number of nonzero elements of vector x.

x

<type> sparse vector of nnz elements of size n if \(\text{op}(A)=A\), and size m if \(\text{op}(A)=A^{T}\).

xInd

Indices of non-zero values in x.

beta

<type> scalar used for multiplication. If beta is zero, y does not have to be a valid input.

y

<type> dense vector of m elements if \(\text{op}(A)=A\), and n elements if \(\text{op}(A)=A^{T}\).

idxBase

0 or 1, for 0 based or 1 based indexing, respectively.

pBufferSize

Number of elements needed the buffer used in cusparse<t>gemvi().

pBuffer

Working space buffer.

Output

y

<type> updated dense vector.

See cusparseStatus_t for the description of the return status.

5.5. cuSPARSE Level 3 Function Reference

This chapter describes sparse linear algebra functions that perform operations between sparse and (usually tall) dense matrices.

5.5.1. cusparse<t>bsrmm()

cusparseStatus_t
cusparseSbsrmm(cusparseHandle_t         handle,
               cusparseDirection_t      dirA,
               cusparseOperation_t      transA,
               cusparseOperation_t      transB,
               int                      mb,
               int                      n,
               int                      kb,
               int                      nnzb,
               const float*             alpha,
               const cusparseMatDescr_t descrA,
               const float*             bsrValA,
               const int*               bsrRowPtrA,
               const int*               bsrColIndA,
               int                      blockDim,
               const float*             B,
               int                      ldb,
               const float*             beta,
               float*                   C,
               int                      ldc)

cusparseStatus_t
cusparseDbsrmm(cusparseHandle_t         handle,
               cusparseDirection_t      dirA,
               cusparseOperation_t      transA,
               cusparseOperation_t      transB,
               int                      mb,
               int                      n,
               int                      kb,
               int                      nnzb,
               const double*            alpha,
               const cusparseMatDescr_t descrA,
               const double*            bsrValA,
               const int*               bsrRowPtrA,
               const int*               bsrColIndA,
               int                      blockDim,
               const double*            B,
               int                      ldb,
               const double*            beta,
               double*                  C,
               int                      ldc)

cusparseStatus_t
cusparseCbsrmm(cusparseHandle_t         handle,
               cusparseDirection_t      dirA,
               cusparseOperation_t      transA,
               cusparseOperation_t      transB,
               int                      mb,
               int                      n,
               int                      kb,
               int                      nnzb,
               const cuComplex*         alpha,
               const cusparseMatDescr_t descrA,
               const cuComplex*         bsrValA,
               const int*               bsrRowPtrA,
               const int*               bsrColIndA,
               int                      blockDim,
               const cuComplex*         B,
               int                      ldb,
               const cuComplex*         beta,
               cuComplex*               C,
               int                      ldc)

cusparseStatus_t
cusparseZbsrmm(cusparseHandle_t         handle,
               cusparseDirection_t      dirA,
               cusparseOperation_t      transA,
               cusparseOperation_t      transB,
               int                      mb,
               int                      n,
               int                      kb,
               int                      nnzb,
               const cuDoubleComplex*   alpha,
               const cusparseMatDescr_t descrA,
               const cuDoubleComplex*   bsrValA,
               const int*               bsrRowPtrA,
               const int*               bsrColIndA,
               int                      blockDim,
               const cuDoubleComplex*   B,
               int                      ldb,
               const cuDoubleComplex*   beta,
               cuDoubleComplex*         C,
               int                      ldc)

This function performs one of the following matrix-matrix operations:

\(C = \alpha \ast \text{op}(A) \ast \text{op}(B) + \beta \ast C\)

A is an mb×kb sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA; B and C are dense matrices; \(\alpha\text{~and~}\beta\) are scalars; and

\(\text{op}(A) = \begin{cases} A & \text{if transA == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if transA == CUSPARSE_OPERATION_TRANSPOSE (not\ supported)} \\ A^{H} & \text{if transA == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE (not supported)} \\ \end{cases}\)

and

\(\text{op}(B) = \begin{cases} B & \text{if transB == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ B^{T} & \text{if transB == CUSPARSE_OPERATION_TRANSPOSE} \\ B^{H} & \text{if transB == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE (not supported)} \\ \end{cases}\)

The function has the following limitations:

  • Only CUSPARSE_MATRIX_TYPE_GENERAL matrix type is supported

  • Only blockDim > 1 is supported

  • if blockDim ≤ 4, then max(mb)/max(n) = 524,272

  • if 4 < blockDim ≤ 8, then max(mb) = 524,272, max(n) = 262,136

  • if blockDim > 8, then m < 65,535 and max(n) = 262,136

The motivation of transpose(B) is to improve memory access of matrix B. The computational pattern of A*transpose(B) with matrix B in column-major order is equivalent to A*B with matrix B in row-major order.

In practice, no operation in an iterative solver or eigenvalue solver uses A*transpose(B). However, we can perform A*transpose(transpose(B)) which is the same as A*B. For example, suppose A is mb*kb, B is k*n and C is m*n, the following code shows usage of cusparseDbsrmm().

// A is mb*kb, B is k*n and C is m*n
    const int m = mb*blockSize;
    const int k = kb*blockSize;
    const int ldb_B = k; // leading dimension of B
    const int ldc   = m; // leading dimension of C
// perform C:=alpha*A*B + beta*C
    cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL );
    cusparseDbsrmm(cusparse_handle,
               CUSPARSE_DIRECTION_COLUMN,
               CUSPARSE_OPERATION_NON_TRANSPOSE,
               CUSPARSE_OPERATION_NON_TRANSPOSE,
               mb, n, kb, nnzb, alpha,
               descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockSize,
               B, ldb_B,
               beta, C, ldc);

Instead of using A*B, our proposal is to transpose B to Bt by first calling cublas<t>geam(), and then to perform A*transpose(Bt).

// step 1: Bt := transpose(B)
    const int m = mb*blockSize;
    const int k = kb*blockSize;
    double *Bt;
    const int ldb_Bt = n; // leading dimension of Bt
    cudaMalloc((void**)&Bt, sizeof(double)*ldb_Bt*k);
    double one  = 1.0;
    double zero = 0.0;
    cublasSetPointerMode(cublas_handle, CUBLAS_POINTER_MODE_HOST);
    cublasDgeam(cublas_handle, CUBLAS_OP_T, CUBLAS_OP_T,
        n, k, &one, B, int ldb_B, &zero, B, int ldb_B, Bt, ldb_Bt);

// step 2: perform C:=alpha*A*transpose(Bt) + beta*C
    cusparseDbsrmm(cusparse_handle,
               CUSPARSE_DIRECTION_COLUMN,
               CUSPARSE_OPERATION_NON_TRANSPOSE,
               CUSPARSE_OPERATION_TRANSPOSE,
               mb, n, kb, nnzb, alpha,
               descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockSize,
               Bt, ldb_Bt,
               beta, C, ldc);

bsrmm() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

dir

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

transA

the operation op(A).

transB

the operation op(B).

mb

number of block rows of sparse matrix A.

n

number of columns of dense matrix op(B) and A.

kb

number of block columns of sparse matrix A.

nnzb

number of non-zero blocks of sparse matrix A.

alpha

<type> scalar used for multiplication.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrValA

<type> array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb\(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A, larger than zero.

B

array of dimensions (ldb, n) if op(B)=B and (ldb,                                           k) otherwise.

ldb

leading dimension of B. If op(B)=B, it must be at least \(\max\text{(1,\ k)}\) If op(B) != B, it must be at least max(1,                                           n).

beta

<type> scalar used for multiplication. If beta is zero, C does not have to be a valid input.

C

array of dimensions (ldc, n).

ldc

leading dimension of C. It must be at least \(\max\text{(1,\ m)}\) if op(A)=A and at least \(\max\text{(1,\ k)}\) otherwise.

Output

C

<type> updated array of dimensions (ldc, n).

See cusparseStatus_t for the description of the return status

5.5.2. cusparse<t>bsrsm2_bufferSize() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrsm2_bufferSize(cusparseHandle_t         handle,
                           cusparseDirection_t      dirA,
                           cusparseOperation_t      transA,
                           cusparseOperation_t      transX,
                           int                      mb,
                           int                      n,
                           int                      nnzb,
                           const cusparseMatDescr_t descrA,
                           float*                   bsrSortedValA,
                           const int*               bsrSortedRowPtrA,
                           const int*               bsrSortedColIndA,
                           int                      blockDim,
                           bsrsm2Info_t             info,
                           int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseDbsrsm2_bufferSize(cusparseHandle_t         handle,
                           cusparseDirection_t      dirA,
                           cusparseOperation_t      transA,
                           cusparseOperation_t      transX,
                           int                      mb,
                           int                      n,
                           int                      nnzb,
                           const cusparseMatDescr_t descrA,
                           double*                  bsrSortedValA,
                           const int*               bsrSortedRowPtrA,
                           const int*               bsrSortedColIndA,
                           int                      blockDim,
                           bsrsm2Info_t             info,
                           int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseCbsrsm2_bufferSize(cusparseHandle_t         handle,
                           cusparseDirection_t      dirA,
                           cusparseOperation_t      transA,
                           cusparseOperation_t      transX,
                           int                      mb,
                           int                      n,
                           int                      nnzb,
                           const cusparseMatDescr_t descrA,
                           cuComplex*               bsrSortedValA,
                           const int*               bsrSortedRowPtrA,
                           const int*               bsrSortedColIndA,
                           int                      blockDim,
                           bsrsm2Info_t             info,
                           int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseZbsrsm2_bufferSize(cusparseHandle_t         handle,
                           cusparseDirection_t      dirA,
                           cusparseOperation_t      transA,
                           cusparseOperation_t      transX,
                           int                      mb,
                           int                      n,
                           int                      nnzb,
                           const cusparseMatDescr_t descrA,
                           cuDoubleComplex*         bsrSortedValA,
                           const int*               bsrSortedRowPtrA,
                           const int*               bsrSortedColIndA,
                           int                      blockDim,
                           bsrsm2Info_t             info,
                           int*                     pBufferSizeInBytes)

This function returns size of buffer used in bsrsm2(), a new sparse triangular linear system op(A)*op(X)=\(\alpha\)op(B).

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA); B and X are the right-hand-side and the solution matrices; \(\alpha\) is a scalar; and

\(\text{op}(A) == \text{CUSPARSE_OPERATION_NON_TRANSPOSE}\)

Although there are six combinations in terms of parameter trans and the upper (and lower) triangular part of A, bsrsm2_bufferSize() returns the maximum size of the buffer among these combinations. The buffer size depends on dimension mb,blockDim and the number of nonzeros of the matrix, nnzb. If the user changes the matrix, it is necessary to call bsrsm2_bufferSize() again to get the correct buffer size, otherwise a segmentation fault may occur.

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

transA

the operation op(A).

transX

the operation op(X).

mb

number of block rows of matrix A.

n

number of columns of matrix op(B) and op(X).

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT.

bsrValA

<type> array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb\(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A; larger than zero.

Output

info

record internal states based on different algorithms.

pBufferSizeInBytes

number of bytes of the buffer used in bsrsm2_analysis() and bsrsm2_solve().

See cusparseStatus_t for the description of the return status

5.5.3. cusparse<t>bsrsm2_analysis() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrsm2_analysis(cusparseHandle_t         handle,
                         cusparseDirection_t      dirA,
                         cusparseOperation_t      transA,
                         cusparseOperation_t      transX,
                         int                      mb,
                         int                      n,
                         int                      nnzb,
                         const cusparseMatDescr_t descrA,
                         const float*             bsrSortedVal,
                         const int*               bsrSortedRowPtr,
                         const int*               bsrSortedColInd,
                         int                      blockDim,
                         bsrsm2Info_t             info,
                         cusparseSolvePolicy_t    policy,
                         void*                    pBuffer)

cusparseStatus_t
cusparseDbsrsm2_analysis(cusparseHandle_t         handle,
                         cusparseDirection_t      dirA,
                         cusparseOperation_t      transA,
                         cusparseOperation_t      transX,
                         int                      mb,
                         int                      n,
                         int                      nnzb,
                         const cusparseMatDescr_t descrA,
                         const double*            bsrSortedVal,
                         const int*               bsrSortedRowPtr,
                         const int*               bsrSortedColInd,
                         int                      blockDim,
                         bsrsm2Info_t             info,
                         cusparseSolvePolicy_t    policy,
                         void*                    pBuffer)

cusparseStatus_t
cusparseCbsrsm2_analysis(cusparseHandle_t         handle,
                         cusparseDirection_t      dirA,
                         cusparseOperation_t      transA,
                         cusparseOperation_t      transX,
                         int                      mb,
                         int                      n,
                         int                      nnzb,
                         const cusparseMatDescr_t descrA,
                         const cuComplex*         bsrSortedVal,
                         const int*               bsrSortedRowPtr,
                         const int*               bsrSortedColInd,
                         int                      blockDim,
                         bsrsm2Info_t             info,
                         cusparseSolvePolicy_t    policy,
                         void*                    pBuffer)

cusparseStatus_t
cusparseZbsrsm2_analysis(cusparseHandle_t         handle,
                         cusparseDirection_t      dirA,
                         cusparseOperation_t      transA,
                         cusparseOperation_t      transX,
                         int                      mb,
                         int                      n,
                         int                      nnzb,
                         const cusparseMatDescr_t descrA,
                         const cuDoubleComplex*   bsrSortedVal,
                         const int*               bsrSortedRowPtr,
                         const int*               bsrSortedColInd,
                         int                      blockDim,
                         bsrsm2Info_t             info,
                         cusparseSolvePolicy_t    policy,
                         void*                    pBuffer)

This function performs the analysis phase of bsrsm2(), a new sparse triangular linear system op(A)*op(X) =\(\alpha\)op(B).

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA); B and X are the right-hand-side and the solution matrices; \(\alpha\) is a scalar; and

\(\text{op}(A) == \text{CUSPARSE_OPERATION_NON_TRANSPOSE}\)

and

\(\text{op}(X) = \begin{cases} X & \text{if transX == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ X^{T} & \text{if transX == CUSPARSE_OPERATION_TRANSPOSE} \\ X^{H} & \text{if transX == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE (not supported)} \\ \end{cases}\)

and op(B) and op(X) are equal.

The block of BSR format is of size blockDim*blockDim, stored in column-major or row-major as determined by parameter dirA, which is either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored.

It is expected that this function will be executed only once for a given matrix and a particular operation type.

This function requires the buffer size returned by bsrsm2_bufferSize(). The address of pBuffer must be multiple of 128 bytes. If not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function bsrsm2_analysis() reports a structural zero and computes the level information stored in opaque structure info. The level information can extract more parallelism during a triangular solver. However bsrsm2_solve() can be done without level information. To disable level information, the user needs to specify the policy of the triangular solver as CUSPARSE_SOLVE_POLICY_NO_LEVEL.

Function bsrsm2_analysis() always reports the first structural zero, even if the parameter policy is CUSPARSE_SOLVE_POLICY_NO_LEVEL. Besides, no structural zero is reported if CUSPARSE_DIAG_TYPE_UNIT is specified, even if block A(j,j) is missing for some j. The user must call cusparseXbsrsm2_query_zero_pivot() to know where the structural zero is.

If bsrsm2_analysis() reports a structural zero, the solve will return a numerical zero in the same position as the structural zero but this result X is meaningless.

  • This function requires temporary extra storage that is allocated internally

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

transA

the operation op(A).

transX

the operation op(B) and op(X).

mb

number of block rows of matrix A.

n

number of columns of matrix op(B) and op(X).

nnzb

number of non-zero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT.

bsrValA

<type> array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb\(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A; larger than zero.

info

structure initialized using cusparseCreateBsrsm2Info.

policy

The supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is return by bsrsm2_bufferSize().

Output

info

structure filled with information collected during the analysis phase (that should be passed to the solve phase unchanged).

See cusparseStatus_t for the description of the return status

5.5.4. cusparse<t>bsrsm2_solve() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrsm2_solve(cusparseHandle_t         handle,
                      cusparseDirection_t      dirA,
                      cusparseOperation_t      transA,
                      cusparseOperation_t      transX,
                      int                      mb,
                      int                      n,
                      int                      nnzb,
                      const float*             alpha,
                      const cusparseMatDescr_t descrA,
                      const float*             bsrSortedVal,
                      const int*               bsrSortedRowPtr,
                      const int*               bsrSortedColInd,
                      int                      blockDim,
                      bsrsm2Info_t             info,
                      const float*             B,
                      int                      ldb,
                      float*                   X,
                      int                      ldx,
                      cusparseSolvePolicy_t    policy,
                      void*                    pBuffer)

cusparseStatus_t
cusparseDbsrsm2_solve(cusparseHandle_t         handle,
                      cusparseDirection_t      dirA,
                      cusparseOperation_t      transA,
                      cusparseOperation_t      transX,
                      int                      mb,
                      int                      n,
                      int                      nnzb,
                      const double*            alpha,
                      const cusparseMatDescr_t descrA,
                      const double*            bsrSortedVal,
                      const int*               bsrSortedRowPtr,
                      const int*               bsrSortedColInd,
                      int                      blockDim,
                      bsrsm2Info_t             info,
                      const double*            B,
                      int                      ldb,
                      double*                  X,
                      int                      ldx,
                      cusparseSolvePolicy_t    policy,
                      void*                    pBuffer)

cusparseStatus_t
cusparseCbsrsm2_solve(cusparseHandle_t         handle,
                      cusparseDirection_t      dirA,
                      cusparseOperation_t      transA,
                      cusparseOperation_t      transX,
                      int                      mb,
                      int                      n,
                      int                      nnzb,
                      const cuComplex*         alpha,
                      const cusparseMatDescr_t descrA,
                      const cuComplex*         bsrSortedVal,
                      const int*               bsrSortedRowPtr,
                      const int*               bsrSortedColInd,
                      int                      blockDim,
                      bsrsm2Info_t             info,
                      const cuComplex*         B,
                      int                      ldb,
                      cuComplex*               X,
                      int                      ldx,
                      cusparseSolvePolicy_t    policy,
                      void*                    pBuffer)

cusparseStatus_t
cusparseZbsrsm2_solve(cusparseHandle_t         handle,
                      cusparseDirection_t      dirA,
                      cusparseOperation_t      transA,
                      cusparseOperation_t      transX,
                      int                      mb,
                      int                      n,
                      int                      nnzb,
                      const cuDoubleComplex*   alpha,
                      const cusparseMatDescr_t descrA,
                      const cuDoubleComplex*   bsrSortedVal,
                      const int*               bsrSortedRowPtr,
                      const int*               bsrSortedColInd,
                      int                      blockDim,
                      bsrsm2Info_t             info,
                      const cuDoubleComplex*   B,
                      int                      ldb,
                      cuDoubleComplex*         X,
                      int                      ldx,
                      cusparseSolvePolicy_t    policy,
                      void*                    pBuffer)

This function performs the solve phase of the solution of a sparse triangular linear system:

\(\text{op}(A) \ast \text{op(X)} = \alpha \ast \text{op(B)}\)

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA); B and X are the right-hand-side and the solution matrices; \(\alpha\) is a scalar, and

\(\text{op}(A) == \text{CUSPARSE_OPERATION_NON_TRANSPOSE}\)

and

\(\text{op}(X) = \begin{cases} X & \text{if transX == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ X^{T} & \text{if transX == CUSPARSE_OPERATION_TRANSPOSE} \\ X^{H} & \text{not supported} \\ \end{cases}\)

Only op(A)=A is supported.

op(B) and op(X) must be performed in the same way. In other words, if op(B)=B, op(X)=X.

The block of BSR format is of size blockDim*blockDim, stored as column-major or row-major as determined by parameter dirA, which is either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored. Function bsrsm02_solve() can support an arbitrary blockDim.

This function may be executed multiple times for a given matrix and a particular operation type.

This function requires the buffer size returned by bsrsm2_bufferSize(). The address of pBuffer must be multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Although bsrsm2_solve() can be done without level information, the user still needs to be aware of consistency. If bsrsm2_analysis() is called with policy CUSPARSE_SOLVE_POLICY_USE_LEVEL, bsrsm2_solve() can be run with or without levels. On the other hand, if bsrsm2_analysis() is called with CUSPARSE_SOLVE_POLICY_NO_LEVEL, bsrsm2_solve() can only accept CUSPARSE_SOLVE_POLICY_NO_LEVEL; otherwise, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function bsrsm02_solve() has the same behavior as bsrsv02_solve(), reporting the first numerical zero, including a structural zero. The user must call cusparseXbsrsm2_query_zero_pivot() to know where the numerical zero is.

The motivation of transpose(X) is to improve the memory access of matrix X. The computational pattern of transpose(X) with matrix X in column-major order is equivalent to X with matrix X in row-major order.

In-place is supported and requires that B and X point to the same memory block, and ldb=ldx.

The function supports the following properties if pBuffer != NULL:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

transA

the operation op(A).

transX

the operation op(B) and op(X).

mb

number of block rows of matrix A.

n

number of columns of matrix op(B) and op(X).

nnzb

number of non-zero blocks of matrix A.

alpha

<type> scalar used for multiplication.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT.

bsrValA

<type> array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) non-zero blocks of matrix A.

bsrRowPtrA

integer array of mb\(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A; larger than zero.

info

structure initialized using cusparseCreateBsrsm2Info().

B

<type> right-hand-side array.

ldb

leading dimension of B. If op(B)=B, ldb >= (mb*blockDim); otherwise, ldb >= n.

ldx

leading dimension of X. If op(X)=X, then ldx >= (mb*blockDim). otherwise ldx >= n.

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by bsrsm2_bufferSize().

Output

X

<type> solution array with leading dimensions ldx.

See cusparseStatus_t for the description of the return status.

5.5.5. cusparseXbsrsm2_zeroPivot() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseXbsrsm2_zeroPivot(cusparseHandle_t handle,
                          bsrsm2Info_t     info,
                          int*             position)

If the returned error code is CUSPARSE_STATUS_ZERO_PIVOT, position=j means A(j,j) is either a structural zero or a numerical zero (singular block). Otherwise position=-1.

The position can be 0-base or 1-base, the same as the matrix.

Function cusparseXbsrsm2_zeroPivot() is a blocking call. It calls cudaDeviceSynchronize() to make sure all previous kernels are done.

The position can be in the host memory or device memory. The user can set the proper mode with cusparseSetPointerMode().

  • The routine requires no extra storage

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

info

info contains a structural zero or a numerical zero if the user already called bsrsm2_analysis() or bsrsm2_solve().

Output

position

if no structural or numerical zero, position is -1; otherwise, if A(j,j) is missing or U(j,j) is zero, position=j.

See cusparseStatus_t for the description of the return status.

5.6. cuSPARSE Extra Function Reference

This chapter describes the extra routines used to manipulate sparse matrices.

5.6.1. cusparse<t>csrgeam2()

cusparseStatus_t
cusparseScsrgeam2_bufferSizeExt(cusparseHandle_t         handle,
                                int                      m,
                                int                      n,
                                const float*             alpha,
                                const cusparseMatDescr_t descrA,
                                int                      nnzA,
                                const float*             csrSortedValA,
                                const int*               csrSortedRowPtrA,
                                const int*               csrSortedColIndA,
                                const float*             beta,
                                const cusparseMatDescr_t descrB,
                                int                      nnzB,
                                const float*             csrSortedValB,
                                const int*               csrSortedRowPtrB,
                                const int*               csrSortedColIndB,
                                const cusparseMatDescr_t descrC,
                                const float*             csrSortedValC,
                                const int*               csrSortedRowPtrC,
                                const int*               csrSortedColIndC,
                                size_t*                  pBufferSizeInBytes)

cusparseStatus_t
cusparseDcsrgeam2_bufferSizeExt(cusparseHandle_t         handle,
                                int                      m,
                                int                      n,
                                const double*            alpha,
                                const cusparseMatDescr_t descrA,
                                int                      nnzA,
                                const double*            csrSortedValA,
                                const int*               csrSortedRowPtrA,
                                const int*               csrSortedColIndA,
                                const double*            beta,
                                const cusparseMatDescr_t descrB,
                                int                      nnzB,
                                const double*            csrSortedValB,
                                const int*               csrSortedRowPtrB,
                                const int*               csrSortedColIndB,
                                const cusparseMatDescr_t descrC,
                                const double*            csrSortedValC,
                                const int*               csrSortedRowPtrC,
                                const int*               csrSortedColIndC,
                                size_t*                  pBufferSizeInBytes)

cusparseStatus_t
cusparseCcsrgeam2_bufferSizeExt(cusparseHandle_t         handle,
                                int                      m,
                                int                      n,
                                const cuComplex*         alpha,
                                const cusparseMatDescr_t descrA,
                                int                      nnzA,
                                const cuComplex*         csrSortedValA,
                                const int*               csrSortedRowPtrA,
                                const int*               csrSortedColIndA,
                                const cuComplex*         beta,
                                const cusparseMatDescr_t descrB,
                                int                      nnzB,
                                const cuComplex*         csrSortedValB,
                                const int*               csrSortedRowPtrB,
                                const int*               csrSortedColIndB,
                                const cusparseMatDescr_t descrC,
                                const cuComplex*         csrSortedValC,
                                const int*               csrSortedRowPtrC,
                                const int*               csrSortedColIndC,
                                size_t*                  pBufferSizeInBytes)

cusparseStatus_t
cusparseZcsrgeam2_bufferSizeExt(cusparseHandle_t         handle,
                                int                      m,
                                int                      n,
                                const cuDoubleComplex*   alpha,
                                const cusparseMatDescr_t descrA,
                                int                      nnzA,
                                const cuDoubleComplex*   csrSortedValA,
                                const int*               csrSortedRowPtrA,
                                const int*               csrSortedColIndA,
                                const cuDoubleComplex*   beta,
                                const cusparseMatDescr_t descrB,
                                int                      nnzB,
                                const cuDoubleComplex*   csrSortedValB,
                                const int*               csrSortedRowPtrB,
                                const int*               csrSortedColIndB,
                                const cusparseMatDescr_t descrC,
                                const cuDoubleComplex*   csrSortedValC,
                                const int*               csrSortedRowPtrC,
                                const int*               csrSortedColIndC,
                                size_t*                  pBufferSizeInBytes)

cusparseStatus_t
cusparseXcsrgeam2Nnz(cusparseHandle_t         handle,
                     int                      m,
                     int                      n,
                     const cusparseMatDescr_t descrA,
                     int                      nnzA,
                     const int*               csrSortedRowPtrA,
                     const int*               csrSortedColIndA,
                     const cusparseMatDescr_t descrB,
                     int                      nnzB,
                     const int*               csrSortedRowPtrB,
                     const int*               csrSortedColIndB,
                     const cusparseMatDescr_t descrC,
                     int*                     csrSortedRowPtrC,
                     int*                     nnzTotalDevHostPtr,
                     void*                    workspace)
cusparseStatus_t
cusparseScsrgeam2(cusparseHandle_t         handle,
                  int                      m,
                  int                      n,
                  const float*             alpha,
                  const cusparseMatDescr_t descrA,
                  int                      nnzA,
                  const float*             csrSortedValA,
                  const int*               csrSortedRowPtrA,
                  const int*               csrSortedColIndA,
                  const float*             beta,
                  const cusparseMatDescr_t descrB,
                  int                      nnzB,
                  const float*             csrSortedValB,
                  const int*               csrSortedRowPtrB,
                  const int*               csrSortedColIndB,
                  const cusparseMatDescr_t descrC,
                  float*                   csrSortedValC,
                  int*                     csrSortedRowPtrC,
                  int*                     csrSortedColIndC,
                  void*                    pBuffer)

cusparseStatus_t
cusparseDcsrgeam2(cusparseHandle_t         handle,
                  int                      m,
                  int                      n,
                  const double*            alpha,
                  const cusparseMatDescr_t descrA,
                  int                      nnzA,
                  const double*            csrSortedValA,
                  const int*               csrSortedRowPtrA,
                  const int*               csrSortedColIndA,
                  const double*            beta,
                  const cusparseMatDescr_t descrB,
                  int                      nnzB,
                  const double*            csrSortedValB,
                  const int*               csrSortedRowPtrB,
                  const int*               csrSortedColIndB,
                  const cusparseMatDescr_t descrC,
                  double*                  csrSortedValC,
                  int*                     csrSortedRowPtrC,
                  int*                     csrSortedColIndC,
                  void*                    pBuffer)

cusparseStatus_t
cusparseCcsrgeam2(cusparseHandle_t         handle,
                  int                      m,
                  int                      n,
                  const cuComplex*         alpha,
                  const cusparseMatDescr_t descrA,
                  int                      nnzA,
                  const cuComplex*         csrSortedValA,
                  const int*               csrSortedRowPtrA,
                  const int*               csrSortedColIndA,
                  const cuComplex*         beta,
                  const cusparseMatDescr_t descrB,
                  int                      nnzB,
                  const cuComplex*         csrSortedValB,
                  const int*               csrSortedRowPtrB,
                  const int*               csrSortedColIndB,
                  const cusparseMatDescr_t descrC,
                  cuComplex*               csrSortedValC,
                  int*                     csrSortedRowPtrC,
                  int*                     csrSortedColIndC,
                  void*                    pBuffer)

cusparseStatus_t
cusparseZcsrgeam2(cusparseHandle_t         handle,
                  int                      m,
                  int                      n,
                  const cuDoubleComplex*   alpha,
                  const cusparseMatDescr_t descrA,
                  int                      nnzA,
                  const cuDoubleComplex*   csrSortedValA,
                  const int*               csrSortedRowPtrA,
                  const int*               csrSortedColIndA,
                  const cuDoubleComplex*   beta,
                  const cusparseMatDescr_t descrB,
                  int                      nnzB,
                  const cuDoubleComplex*   csrSortedValB,
                  const int*               csrSortedRowPtrB,
                  const int*               csrSortedColIndB,
                  const cusparseMatDescr_t descrC,
                  cuDoubleComplex*         csrSortedValC,
                  int*                     csrSortedRowPtrC,
                  int*                     csrSortedColIndC,
                  void*                    pBuffer)

This function performs following matrix-matrix operation

\(C = \alpha \ast A + \beta \ast B\)

where A, B, and C are m×n sparse matrices (defined in CSR storage format by the three arrays csrValA|csrValB|csrValC, csrRowPtrA|csrRowPtrB|csrRowPtrC, and csrColIndA|csrColIndB|csrcolIndC respectively), and \(\alpha\text{~and~}\beta\) are scalars. Since A and B have different sparsity patterns, cuSPARSE adopts a two-step approach to complete sparse matrix C. In the first step, the user allocates csrRowPtrC of m+1elements and uses function cusparseXcsrgeam2Nnz() to determine csrRowPtrC and the total number of nonzero elements. In the second step, the user gathers nnzC (number of nonzero elements of matrix C) from either (nnzC=*nnzTotalDevHostPtr) or (nnzC=csrRowPtrC(m)-csrRowPtrC(0)) and allocates csrValC,                                             csrColIndC of nnzC elements respectively, then finally calls function cusparse[S|D|C|Z]csrgeam2() to complete matrix C.

The general procedure is as follows:

int baseC, nnzC;
/* alpha, nnzTotalDevHostPtr points to host memory */
size_t BufferSizeInBytes;
char *buffer = NULL;
int *nnzTotalDevHostPtr = &nnzC;
cusparseSetPointerMode(handle, CUSPARSE_POINTER_MODE_HOST);
cudaMalloc((void**)&csrRowPtrC, sizeof(int)*(m+1));
/* prepare buffer */
cusparseScsrgeam2_bufferSizeExt(handle, m, n,
    alpha,
    descrA, nnzA,
    csrValA, csrRowPtrA, csrColIndA,
    beta,
    descrB, nnzB,
    csrValB, csrRowPtrB, csrColIndB,
    descrC,
    csrValC, csrRowPtrC, csrColIndC
    &bufferSizeInBytes
    );
cudaMalloc((void**)&buffer, sizeof(char)*bufferSizeInBytes);
cusparseXcsrgeam2Nnz(handle, m, n,
        descrA, nnzA, csrRowPtrA, csrColIndA,
        descrB, nnzB, csrRowPtrB, csrColIndB,
        descrC, csrRowPtrC, nnzTotalDevHostPtr,
        buffer);
if (NULL != nnzTotalDevHostPtr){
    nnzC = *nnzTotalDevHostPtr;
}else{
    cudaMemcpy(&nnzC, csrRowPtrC+m, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&baseC, csrRowPtrC, sizeof(int), cudaMemcpyDeviceToHost);
    nnzC -= baseC;
}
cudaMalloc((void**)&csrColIndC, sizeof(int)*nnzC);
cudaMalloc((void**)&csrValC, sizeof(float)*nnzC);
cusparseScsrgeam2(handle, m, n,
        alpha,
        descrA, nnzA,
        csrValA, csrRowPtrA, csrColIndA,
        beta,
        descrB, nnzB,
        csrValB, csrRowPtrB, csrColIndB,
        descrC,
        csrValC, csrRowPtrC, csrColIndC
        buffer);

Several comments on csrgeam2():

  • The other three combinations, NT, TN, and TT, are not supported by cuSPARSE. In order to do any one of the three, the user should use the routine csr2csc() to convert \(A\) | \(B\) to \(A^{T}\) | \(B^{T}\) .

  • Only CUSPARSE_MATRIX_TYPE_GENERAL is supported. If either A or B is symmetric or Hermitian, then the user must extend the matrix to a full one and reconfigure the MatrixType field of the descriptor to CUSPARSE_MATRIX_TYPE_GENERAL.

  • If the sparsity pattern of matrix C is known, the user can skip the call to function cusparseXcsrgeam2Nnz(). For example, suppose that the user has an iterative algorithm which would update A and B iteratively but keep the sparsity patterns. The user can call function cusparseXcsrgeam2Nnz() once to set up the sparsity pattern of C, then call function cusparse[S|D|C|Z]geam() only for each iteration.

  • The pointers alpha and beta must be valid.

  • When alpha or beta is zero, it is not considered a special case by cuSPARSE. The sparsity pattern of C is independent of the value of alpha and beta. If the user wants \(C = 0 \times A + 1 \times B^{T}\) , then csr2csc() is better than csrgeam2().

  • csrgeam2() is the same as csrgeam() except csrgeam2() needs explicit buffer where csrgeam() allocates the buffer internally.

  • This function requires temporary extra storage that is allocated internally

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

m

number of rows of sparse matrix A,B,C.

n

number of columns of sparse matrix A,B,C.

alpha

<type> scalar used for multiplication.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL only.

nnzA

number of nonzero elements of sparse matrix A.

csrValA

<type> array of nnzA\(( =\)csrRowPtrA(m)\(-\)csrRowPtrA(0)\()\) nonzero elements of matrix A.

csrRowPtrA

integer array of m\(+ 1\) elements that contains the start of every row and the end of the last row plus one.

csrColIndA

integer array of nnzA\(( =\)csrRowPtrA(m)\(-\)csrRowPtrA(0)\()\) column indices of the nonzero elements of matrix A.

beta

<type> scalar used for multiplication. If beta is zero, y does not have to be a valid input.

descrB

the descriptor of matrix B. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL only.

nnzB

number of nonzero elements of sparse matrix B.

csrValB

<type> array of nnzB\(( =\)csrRowPtrB(m)\(-\)csrRowPtrB(0)\()\) nonzero elements of matrix B.

csrRowPtrB

integer array of m\(+ 1\) elements that contains the start of every row and the end of the last row plus one.

csrColIndB

integer array of nnzB\(( =\)csrRowPtrB(m)\(-\)csrRowPtrB(0)\()\) column indices of the nonzero elements of matrix B.

descrC

the descriptor of matrix C. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL only.

Output

csrValC

<type> array of nnzC\(( =\)csrRowPtrC(m)\(-\)csrRowPtrC(0)\()\) nonzero elements of matrix C.

csrRowPtrC

integer array of m\(+ 1\) elements that contains the start of every row and the end of the last row plus one.

csrColIndC

integer array of nnzC\(( =\)csrRowPtrC(m)\(-\)csrRowPtrC(0)\()\) column indices of the nonzero elements of matrixC.

nnzTotalDevHostPtr

total number of nonzero elements in device or host memory. It is equal to (csrRowPtrC(m)-csrRowPtrC(0)).

See cusparseStatus_t for the description of the return status

5.7. cuSPARSE Preconditioners Reference

This chapter describes the routines that implement different preconditioners.

5.7.1. Incomplete Cholesky Factorization: level 0 [DEPRECATED]

Different algorithms for ic0 are discussed in this section.

5.7.1.1. cusparse<t>csric02_bufferSize() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseScsric02_bufferSize(cusparseHandle_t         handle,
                            int                      m,
                            int                      nnz,
                            const cusparseMatDescr_t descrA,
                            float*                   csrValA,
                            const int*               csrRowPtrA,
                            const int*               csrColIndA,
                            csric02Info_t            info,
                            int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseDcsric02_bufferSize(cusparseHandle_t         handle,
                            int                      m,
                            int                      nnz,
                            const cusparseMatDescr_t descrA,
                            double*                  csrValA,
                            const int*               csrRowPtrA,
                            const int*               csrColIndA,
                            csric02Info_t            info,
                            int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseCcsric02_bufferSize(cusparseHandle_t         handle,
                            int                      m,
                            int                      nnz,
                            const cusparseMatDescr_t descrA,
                            cuComplex*               csrValA,
                            const int*               csrRowPtrA,
                            const int*               csrColIndA,
                            csric02Info_t            info,
                            int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseZcsric02_bufferSize(cusparseHandle_t         handle,
                            int                      m,
                            int                      nnz,
                            const cusparseMatDescr_t descrA,
                            cuDoubleComplex*         csrValA,
                            const int*               csrRowPtrA,
                            const int*               csrColIndA,
                            csric02Info_t            info,
                            int*                     pBufferSizeInBytes)

This function returns size of buffer used in computing the incomplete-Cholesky factorization with \(0\) fill-in and no pivoting:

\(A \approx LL^{H}\)

A is an m×m sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.

The buffer size depends on dimension m and nnz, the number of nonzeros of the matrix. If the user changes the matrix, it is necessary to call csric02_bufferSize() again to have the correct buffer size; otherwise, a segmentation fault may occur.

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

m

number of rows and columns of matrix A.

nnz

number of nonzeros of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

csrValA

<type> array of nnz\(( =\)csrRowPtrA(m)\(-\)csrRowPtrA(0)\()\) nonzero elements of matrix A.

csrRowPtrA

integer array of m\(+ 1\) elements that contains the start of every row and the end of the last row plus one.

csrColIndA

integer array of nnz\(( =\)csrRowPtrA(m)\(-\)csrRowPtrA(0)\()\) column indices of the nonzero elements of matrix A.

Output

info

record internal states based on different algorithms

pBufferSizeInBytes

number of bytes of the buffer used in csric02_analysis() and csric02()

See cusparseStatus_t for the description of the return status.

5.7.1.2. cusparse<t>csric02_analysis() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseScsric02_analysis(cusparseHandle_t         handle,
                          int                      m,
                          int                      nnz,
                          const cusparseMatDescr_t descrA,
                          const float*             csrValA,
                          const int*               csrRowPtrA,
                          const int*               csrColIndA,
                          csric02Info_t            info,
                          cusparseSolvePolicy_t    policy,
                          void*                    pBuffer)

cusparseStatus_t
cusparseDcsric02_analysis(cusparseHandle_t         handle,
                          int                      m,
                          int                      nnz,
                          const cusparseMatDescr_t descrA,
                          const double*            csrValA,
                          const int*               csrRowPtrA,
                          const int*               csrColIndA,
                          csric02Info_t            info,
                          cusparseSolvePolicy_t    policy,
                          void*                    pBuffer)

cusparseStatus_t
cusparseCcsric02_analysis(cusparseHandle_t         handle,
                          int                      m,
                          int                      nnz,
                          const cusparseMatDescr_t descrA,
                          const cuComplex*         csrValA,
                          const int*               csrRowPtrA,
                          const int*               csrColIndA,
                          csric02Info_t            info,
                          cusparseSolvePolicy_t    policy,
                          void*                    pBuffer)

cusparseStatus_t
cusparseZcsric02_analysis(cusparseHandle_t         handle,
                          int                      m,
                          int                      nnz,
                          const cusparseMatDescr_t descrA,
                          const cuDoubleComplex*   csrValA,
                          const int*               csrRowPtrA,
                          const int*               csrColIndA,
                          csric02Info_t            info,
                          cusparseSolvePolicy_t    policy,
                          void*                    pBuffer)

This function performs the analysis phase of the incomplete-Cholesky factorization with \(0\) fill-in and no pivoting:

\(A \approx LL^{H}\)

A is an m×m sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.

This function requires a buffer size returned by csric02_bufferSize(). The address of pBuffer must be multiple of 128 bytes. If not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function csric02_analysis() reports a structural zero and computes level information stored in the opaque structure info. The level information can extract more parallelism during incomplete Cholesky factorization. However csric02() can be done without level information. To disable level information, the user must specify the policy of csric02_analysis() and csric02() as CUSPARSE_SOLVE_POLICY_NO_LEVEL.

Function csric02_analysis() always reports the first structural zero, even if the policy is CUSPARSE_SOLVE_POLICY_NO_LEVEL. The user needs to call cusparseXcsric02_zeroPivot() to know where the structural zero is.

It is the user’s choice whether to call csric02() if csric02_analysis() reports a structural zero. In this case, the user can still call csric02(), which will return a numerical zero at the same position as the structural zero. However the result is meaningless.

  • This function requires temporary extra storage that is allocated internally

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

m

number of rows and columns of matrix A.

nnz

number of nonzeros of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

csrValA

<type> array of nnz\(( =\)csrRowPtrA(m)\(-\)csrRowPtrA(0)\()\) nonzero elements of matrix A.

csrRowPtrA

integer array of m\(+ 1\) elements that contains the start of every row and the end of the last row plus one.

csrColIndA

integer array of nnz\(( =\)csrRowPtrA(m)\(-\)csrRowPtrA(0)\()\) column indices of the nonzero elements of matrix A.

info

structure initialized using cusparseCreateCsric02Info().

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by csric02_bufferSize().

Output

info

number of bytes of the buffer used in csric02_analysis() and csric02()

See cusparseStatus_t for the description of the return status.

5.7.1.3. cusparse<t>csric02() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseScsric02(cusparseHandle_t         handle,
                 int                      m,
                 int                      nnz,
                 const cusparseMatDescr_t descrA,
                 float*                   csrValA_valM,
                 const int*               csrRowPtrA,
                 const int*               csrColIndA,
                 csric02Info_t            info,
                 cusparseSolvePolicy_t    policy,
                 void*                    pBuffer)

cusparseStatus_t
cusparseDcsric02(cusparseHandle_t         handle,
                 int                      m,
                 int                      nnz,
                 const cusparseMatDescr_t descrA,
                 double*                  csrValA_valM,
                 const int*               csrRowPtrA,
                 const int*               csrColIndA,
                 csric02Info_t            info,
                 cusparseSolvePolicy_t    policy,
                 void*                    pBuffer)

cusparseStatus_t
cusparseCcsric02(cusparseHandle_t         handle,
                 int                      m,
                 int                      nnz,
                 const cusparseMatDescr_t descrA,
                 cuComplex*               csrValA_valM,
                 const int*               csrRowPtrA,
                 const int*               csrColIndA,
                 csric02Info_t            info,
                 cusparseSolvePolicy_t    policy,
                 void*                    pBuffer)

cusparseStatus_t
cusparseZcsric02(cusparseHandle_t         handle,
                 int                      m,
                 int                      nnz,
                 const cusparseMatDescr_t descrA,
                 cuDoubleComplex*         csrValA_valM,
                 const int*               csrRowPtrA,
                 const int*               csrColIndA,
                 csric02Info_t            info,
                 cusparseSolvePolicy_t    policy,
                 void*                    pBuffer)

This function performs the solve phase of the computing the incomplete-Cholesky factorization with \(0\) fill-in and no pivoting:

\(A \approx LL^{H}\)

This function requires a buffer size returned by csric02_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Although csric02() can be done without level information, the user still needs to be aware of consistency. If csric02_analysis() is called with policy CUSPARSE_SOLVE_POLICY_USE_LEVEL, csric02() can be run with or without levels. On the other hand, if csric02_analysis() is called with CUSPARSE_SOLVE_POLICY_NO_LEVEL, csric02() can only accept CUSPARSE_SOLVE_POLICY_NO_LEVEL; otherwise, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function csric02() reports the first numerical zero, including a structural zero. The user must call cusparseXcsric02_zeroPivot() to know where the numerical zero is.

Function csric02() only takes the lower triangular part of matrix A to perform factorization. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, the fill mode and diagonal type are ignored, and the strictly upper triangular part is ignored and never touched. It does not matter if A is Hermitian or not. In other words, from the point of view of csric02()A is Hermitian and only the lower triangular part is provided.

Note

In practice, a positive definite matrix may not have incomplete cholesky factorization. To the best of our knowledge, only matrix M can guarantee the existence of incomplete cholesky factorization. If csric02() failed cholesky factorization and reported a numerical zero, it is possible that incomplete cholesky factorization does not exist.

For example, suppose A is a real m × m matrix, the following code solves the precondition system M*y = x where M is the product of Cholesky factorization L and its transpose.

\(M = LL^{H}\)

// Suppose that A is m x m sparse matrix represented by CSR format,
// Assumption:
// - handle is already created by cusparseCreate(),
// - (d_csrRowPtr, d_csrColInd, d_csrVal) is CSR of A on device memory,
// - d_x is right hand side vector on device memory,
// - d_y is solution vector on device memory.
// - d_z is intermediate result on device memory.

cusparseMatDescr_t descr_M = 0;
cusparseMatDescr_t descr_L = 0;
csric02Info_t info_M  = 0;
csrsv2Info_t  info_L  = 0;
csrsv2Info_t  info_Lt = 0;
int pBufferSize_M;
int pBufferSize_L;
int pBufferSize_Lt;
int pBufferSize;
void *pBuffer = 0;
int structural_zero;
int numerical_zero;
const double alpha = 1.;
const cusparseSolvePolicy_t policy_M  = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
const cusparseSolvePolicy_t policy_L  = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
const cusparseSolvePolicy_t policy_Lt = CUSPARSE_SOLVE_POLICY_USE_LEVEL;
const cusparseOperation_t trans_L  = CUSPARSE_OPERATION_NON_TRANSPOSE;
const cusparseOperation_t trans_Lt = CUSPARSE_OPERATION_TRANSPOSE;

// step 1: create a descriptor which contains
// - matrix M is base-1
// - matrix L is base-1
// - matrix L is lower triangular
// - matrix L has non-unit diagonal
cusparseCreateMatDescr(&descr_M);
cusparseSetMatIndexBase(descr_M, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_M, CUSPARSE_MATRIX_TYPE_GENERAL);

cusparseCreateMatDescr(&descr_L);
cusparseSetMatIndexBase(descr_L, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_L, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatFillMode(descr_L, CUSPARSE_FILL_MODE_LOWER);
cusparseSetMatDiagType(descr_L, CUSPARSE_DIAG_TYPE_NON_UNIT);

// step 2: create a empty info structure
// we need one info for csric02 and two info's for csrsv2
cusparseCreateCsric02Info(&info_M);
cusparseCreateCsrsv2Info(&info_L);
cusparseCreateCsrsv2Info(&info_Lt);

// step 3: query how much memory used in csric02 and csrsv2, and allocate the buffer
cusparseDcsric02_bufferSize(handle, m, nnz,
    descr_M, d_csrVal, d_csrRowPtr, d_csrColInd, info_M, &bufferSize_M);
cusparseDcsrsv2_bufferSize(handle, trans_L, m, nnz,
    descr_L, d_csrVal, d_csrRowPtr, d_csrColInd, info_L, &pBufferSize_L);
cusparseDcsrsv2_bufferSize(handle, trans_Lt, m, nnz,
    descr_L, d_csrVal, d_csrRowPtr, d_csrColInd, info_Lt,&pBufferSize_Lt);

pBufferSize = max(bufferSize_M, max(pBufferSize_L, pBufferSize_Lt));

// pBuffer returned by cudaMalloc is automatically aligned to 128 bytes.
cudaMalloc((void**)&pBuffer, pBufferSize);

// step 4: perform analysis of incomplete Cholesky on M
//         perform analysis of triangular solve on L
//         perform analysis of triangular solve on L'
// The lower triangular part of M has the same sparsity pattern as L, so
// we can do analysis of csric02 and csrsv2 simultaneously.

cusparseDcsric02_analysis(handle, m, nnz, descr_M,
    d_csrVal, d_csrRowPtr, d_csrColInd, info_M,
    policy_M, pBuffer);
status = cusparseXcsric02_zeroPivot(handle, info_M, &structural_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
   printf("A(%d,%d) is missing\n", structural_zero, structural_zero);
}

cusparseDcsrsv2_analysis(handle, trans_L, m, nnz, descr_L,
    d_csrVal, d_csrRowPtr, d_csrColInd,
    info_L, policy_L, pBuffer);

cusparseDcsrsv2_analysis(handle, trans_Lt, m, nnz, descr_L,
    d_csrVal, d_csrRowPtr, d_csrColInd,
    info_Lt, policy_Lt, pBuffer);

// step 5: M = L * L'
cusparseDcsric02(handle, m, nnz, descr_M,
    d_csrVal, d_csrRowPtr, d_csrColInd, info_M, policy_M, pBuffer);
status = cusparseXcsric02_zeroPivot(handle, info_M, &numerical_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
   printf("L(%d,%d) is zero\n", numerical_zero, numerical_zero);
}

// step 6: solve L*z = x
cusparseDcsrsv2_solve(handle, trans_L, m, nnz, &alpha, descr_L, // replace with cusparseSpSV
   d_csrVal, d_csrRowPtr, d_csrColInd, info_L,
   d_x, d_z, policy_L, pBuffer);

// step 7: solve L'*y = z
cusparseDcsrsv2_solve(handle, trans_Lt, m, nnz, &alpha, descr_L, // replace with cusparseSpSV
   d_csrVal, d_csrRowPtr, d_csrColInd, info_Lt,
   d_z, d_y, policy_Lt, pBuffer);

// step 6: free resources
cudaFree(pBuffer);
cusparseDestroyMatDescr(descr_M);
cusparseDestroyMatDescr(descr_L);
cusparseDestroyCsric02Info(info_M);
cusparseDestroyCsrsv2Info(info_L);
cusparseDestroyCsrsv2Info(info_Lt);
cusparseDestroy(handle);

The function supports the following properties if pBuffer != NULL

  • This function requires temporary extra storage that is allocated internally

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

m

number of rows and columns of matrix A.

nnz

number of nonzeros of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

csrValA_valM

<type> array of nnz\(( =\)csrRowPtrA(m)\(-\)csrRowPtrA(0)\()\) nonzero elements of matrix A.

csrRowPtrA

integer array of m\(+ 1\) elements that contains the start of every row and the end of the last row plus one.

csrColIndA

integer array of nnz\(( =\)csrRowPtrA(m)\(-\)csrRowPtrA(0)\()\) column indices of the nonzero elements of matrix A.

info

structure with information collected during the analysis phase (that should have been passed to the solve phase unchanged).

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by csric02_bufferSize().

Output

csrValA_valM

<type> matrix containing the incomplete-Cholesky lower triangular factor.

See cusparseStatus_t for the description of the return status.

5.7.1.4. cusparseXcsric02_zeroPivot() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseXcsric02_zeroPivot(cusparseHandle_t handle,
                           csric02Info_t    info,
                           int*             position)

If the returned error code is CUSPARSE_STATUS_ZERO_PIVOT, position=j means A(j,j) has either a structural zero or a numerical zero; otherwise, position=-1.

The position can be 0-based or 1-based, the same as the matrix.

Function cusparseXcsric02_zeroPivot() is a blocking call. It calls cudaDeviceSynchronize() to make sure all previous kernels are done.

The position can be in the host memory or device memory. The user can set proper mode with cusparseSetPointerMode().

  • The routine requires no extra storage

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

info

info contains structural zero or numerical zero if the user already called csric02_analysis() or csric02().

Output

position

if no structural or numerical zero, position is -1; otherwise, if A(j,j) is missing or L(j,j) is zero, position=j.

See cusparseStatus_t for the description of the return status.

5.7.1.5. cusparse<t>bsric02_bufferSize() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsric02_bufferSize(cusparseHandle_t         handle,
                            cusparseDirection_t      dirA,
                            int                      mb,
                            int                      nnzb,
                            const cusparseMatDescr_t descrA,
                            float*                   bsrValA,
                            const int*               bsrRowPtrA,
                            const int*               bsrColIndA,
                            int                      blockDim,
                            bsric02Info_t            info,
                            int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseDbsric02_bufferSize(cusparseHandle_t         handle,
                            cusparseDirection_t      dirA,
                            int                      mb,
                            int                      nnzb,
                            const cusparseMatDescr_t descrA,
                            double*                  bsrValA,
                            const int*               bsrRowPtrA,
                            const int*               bsrColIndA,
                            int                      blockDim,
                            bsric02Info_t            info,
                            int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseCbsric02_bufferSize(cusparseHandle_t         handle,
                            cusparseDirection_t      dirA,
                            int                      mb,
                            int                      nnzb,
                            const cusparseMatDescr_t descrA,
                            cuComplex*               bsrValA,
                            const int*               bsrRowPtrA,
                            const int*               bsrColIndA,
                            int                      blockDim,
                            bsric02Info_t            info,
                            int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseZbsric02_bufferSize(cusparseHandle_t         handle,
                            cusparseDirection_t      dirA,
                            int                      mb,
                            int                      nnzb,
                            const cusparseMatDescr_t descrA,
                            cuDoubleComplex*         bsrValA,
                            const int*               bsrRowPtrA,
                            const int*               bsrColIndA,
                            int                      blockDim,
                            bsric02Info_t            info,
                            int*                     pBufferSizeInBytes)

This function returns the size of a buffer used in computing the incomplete-Cholesky factorization with 0 fill-in and no pivoting

\(A \approx LL^{H}\)

A is an (mb*blockDim)*(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA.

The buffer size depends on the dimensions of mb, blockDim, and the number of nonzero blocks of the matrix nnzb. If the user changes the matrix, it is necessary to call bsric02_bufferSize() again to have the correct buffer size; otherwise, a segmentation fault may occur.

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

mb

number of block rows and block columns of matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrValA

<type> array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb\(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A, larger than zero.

Output

info

record internal states based on different algorithms.

pBufferSizeInBytes

number of bytes of the buffer used in bsric02_analysis() and bsric02().

See cusparseStatus_t for the description of the return status.

5.7.1.6. cusparse<t>bsric02_analysis() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsric02_analysis(cusparseHandle_t        handle,
                          cusparseDirection_t     dirA,
                          int                      mb,
                          int                      nnzb,
                          const cusparseMatDescr_t descrA,
                          const float*             bsrValA,
                          const int*               bsrRowPtrA,
                          const int*               bsrColIndA,
                          int                      blockDim,
                          bsric02Info_t            info,
                          cusparseSolvePolicy_t    policy,
                          void*                    pBuffer)

cusparseStatus_t
cusparseDbsric02_analysis(cusparseHandle_t        handle,
                          cusparseDirection_t     dirA,
                          int                      mb,
                          int                      nnzb,
                          const cusparseMatDescr_t descrA,
                          const double*            bsrValA,
                          const int*               bsrRowPtrA,
                          const int*               bsrColIndA,
                          int                      blockDim,
                          bsric02Info_t            info,
                          cusparseSolvePolicy_t    policy,
                          void*                    pBuffer)

cusparseStatus_t
cusparseCbsric02_analysis(cusparseHandle_t        handle,
                          cusparseDirection_t     dirA,
                          int                      mb,
                          int                      nnzb,
                          const cusparseMatDescr_t descrA,
                          const cuComplex*         bsrValA,
                          const int*               bsrRowPtrA,
                          const int*               bsrColIndA,
                          int                      blockDim,
                          bsric02Info_t            info,
                          cusparseSolvePolicy_t    policy,
                          void*                    pBuffer)

cusparseStatus_t
cusparseZbsric02_analysis(cusparseHandle_t        handle,
                          cusparseDirection_t     dirA,
                          int                      mb,
                          int                      nnzb,
                          const cusparseMatDescr_t descrA,
                          const cuDoubleComplex*   bsrValA,
                          const int*               bsrRowPtrA,
                          const int*               bsrColIndA,
                          int                      blockDim,
                          bsric02Info_t            info,
                          cusparseSolvePolicy_t    policy,
                          void*                    pBuffer)

This function performs the analysis phase of the incomplete-Cholesky factorization with 0 fill-in and no pivoting

\(A \approx LL^{H}\)

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA. The block in BSR format is of size blockDim*blockDim, stored as column-major or row-major as determined by parameter dirA, which is either CUSPARSE_DIRECTION_COLUMN or CUSPARSE_DIRECTION_ROW. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored.

This function requires a buffer size returned by bsric02_bufferSize90. The address of pBuffer must be a multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Functionbsric02_analysis() reports structural zero and computes level information stored in the opaque structure info. The level information can extract more parallelism during incomplete Cholesky factorization. However bsric02() can be done without level information. To disable level information, the user needs to specify the parameter policy of bsric02[_analysis| ] as CUSPARSE_SOLVE_POLICY_NO_LEVEL.

Function bsric02_analysis always reports the first structural zero, even when parameter policy is CUSPARSE_SOLVE_POLICY_NO_LEVEL. The user must call cusparseXbsric02_zeroPivot() to know where the structural zero is.

It is the user’s choice whether to call bsric02() if bsric02_analysis() reports a structural zero. In this case, the user can still call bsric02(), which returns a numerical zero in the same position as the structural zero. However the result is meaningless.

  • This function requires temporary extra storage that is allocated internally

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

mb

number of block rows and block columns of matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrValA

<type> array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb\(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A; must be larger than zero.

info

structure initialized using cusparseCreateBsric02Info().

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by bsric02_bufferSize().

Output

info

Structure filled with information collected during the analysis phase (that should be passed to the solve phase unchanged).

See cusparseStatus_t for the description of the return status.

5.7.1.7. cusparse<t>bsric02() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsric02(cusparseHandle_t         handle,
                 cusparseDirection_t      dirA,
                 int                      mb,
                 int                      nnzb,
                 const cusparseMatDescr_t descrA,
                 float*                   bsrValA,
                 const int*               bsrRowPtrA,
                 const int*               bsrColIndA,
                 int                      blockDim,
                 bsric02Info_t            info,
                 cusparseSolvePolicy_t    policy,
                 void*                    pBuffer)

cusparseStatus_t
cusparseDbsric02(cusparseHandle_t         handle,
                 cusparseDirection_t      dirA,
                 int                      mb,
                 int                      nnzb,
                 const cusparseMatDescr_t descrA,
                 double*                  bsrValA,
                 const int*               bsrRowPtrA,
                 const int*               bsrColIndA,
                 int                      blockDim,
                 bsric02Info_t            info,
                 cusparseSolvePolicy_t    policy,
                 void*                    pBuffer)

cusparseStatus_t
cusparseCbsric02(cusparseHandle_t         handle,
                 cusparseDirection_t      dirA,
                 int                      mb,
                 int                      nnzb,
                 const cusparseMatDescr_t descrA,
                 cuComplex*               bsrValA,
                 const int*               bsrRowPtrA,
                 const int*               bsrColIndA,
                 int                      blockDim,
                 bsric02Info_t            info,
                 cusparseSolvePolicy_t    policy,
                 void*                    pBuffer)

cusparseStatus_t
cusparseZbsric02(cusparseHandle_t         handle,
                 cusparseDirection_t      dirA,
                 int                      mb,
                 int                      nnzb,
                 const cusparseMatDescr_t descrA,
                 cuDoubleComplex*         bsrValA,
                 const int*               bsrRowPtrA,
                 const int*               bsrColIndA,
                 int                      blockDim,
                 bsric02Info_t            info,
                 cusparseSolvePolicy_t    policy,
                 void*                    pBuffer)

This function performs the solve phase of the incomplete-Cholesky factorization with 0 fill-in and no pivoting

\(A \approx LL^{H}\)

A is an (mb*blockDim)×(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA. The block in BSR format is of size blockDim*blockDim, stored as column-major or row-major as determined by parameter dirA, which is either CUSPARSE_DIRECTION_COLUMN or CUSPARSE_DIRECTION_ROW. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored.

This function requires a buffer size returned by bsric02_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Although bsric02() can be done without level information, the user must be aware of consistency. If bsric02_analysis() is called with policy CUSPARSE_SOLVE_POLICY_USE_LEVEL, bsric02() can be run with or without levels. On the other hand, if bsric02_analysis() is called with CUSPARSE_SOLVE_POLICY_NO_LEVEL, bsric02() can only accept CUSPARSE_SOLVE_POLICY_NO_LEVEL; otherwise, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function bsric02() has the same behavior as csric02(). That is, bsr2csr(bsric02(A)) = csric02(bsr2csr(A)). The numerical zero of csric02() means there exists some zero L(j,j). The numerical zero of bsric02() means there exists some block Lj,j) that is not invertible.

Function bsric02 reports the first numerical zero, including a structural zero. The user must call cusparseXbsric02_zeroPivot() to know where the numerical zero is.

The bsric02() function only takes the lower triangular part of matrix A to perform factorization. The strictly upper triangular part is ignored and never touched. It does not matter if A is Hermitian or not. In other words, from the point of view of bsric02(), A is Hermitian and only the lower triangular part is provided. Moreover, the imaginary part of diagonal elements of diagonal blocks is ignored.

For example, suppose A is a real m-by-m matrix, where m=mb*blockDim. The following code solves precondition system M*y =                                  x, where M is the product of Cholesky factorization L and its transpose.

\(M = LL^{H}\)

// Suppose that A is m x m sparse matrix represented by BSR format,
// The number of block rows/columns is mb, and
// the number of nonzero blocks is nnzb.
// Assumption:
// - handle is already created by cusparseCreate(),
// - (d_bsrRowPtr, d_bsrColInd, d_bsrVal) is BSR of A on device memory,
// - d_x is right hand side vector on device memory,
// - d_y is solution vector on device memory.
// - d_z is intermediate result on device memory.
// - d_x, d_y and d_z are of size m.
cusparseMatDescr_t descr_M = 0;
cusparseMatDescr_t descr_L = 0;
bsric02Info_t info_M  = 0;
bsrsv2Info_t  info_L  = 0;
bsrsv2Info_t  info_Lt = 0;
int pBufferSize_M;
int pBufferSize_L;
int pBufferSize_Lt;
int pBufferSize;
void *pBuffer = 0;
int structural_zero;
int numerical_zero;
const double alpha = 1.;
const cusparseSolvePolicy_t policy_M  = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
const cusparseSolvePolicy_t policy_L  = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
const cusparseSolvePolicy_t policy_Lt = CUSPARSE_SOLVE_POLICY_USE_LEVEL;
const cusparseOperation_t trans_L  = CUSPARSE_OPERATION_NON_TRANSPOSE;
const cusparseOperation_t trans_Lt = CUSPARSE_OPERATION_TRANSPOSE;
const cusparseDirection_t dir = CUSPARSE_DIRECTION_COLUMN;

// step 1: create a descriptor which contains
// - matrix M is base-1
// - matrix L is base-1
// - matrix L is lower triangular
// - matrix L has non-unit diagonal
cusparseCreateMatDescr(&descr_M);
cusparseSetMatIndexBase(descr_M, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_M, CUSPARSE_MATRIX_TYPE_GENERAL);

cusparseCreateMatDescr(&descr_L);
cusparseSetMatIndexBase(descr_L, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_L, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatFillMode(descr_L, CUSPARSE_FILL_MODE_LOWER);
cusparseSetMatDiagType(descr_L, CUSPARSE_DIAG_TYPE_NON_UNIT);

// step 2: create a empty info structure
// we need one info for bsric02 and two info's for bsrsv2
cusparseCreateBsric02Info(&info_M);
cusparseCreateBsrsv2Info(&info_L);
cusparseCreateBsrsv2Info(&info_Lt);

// step 3: query how much memory used in bsric02 and bsrsv2, and allocate the buffer
cusparseDbsric02_bufferSize(handle, dir, mb, nnzb,
    descr_M, d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_M, &bufferSize_M);
cusparseDbsrsv2_bufferSize(handle, dir, trans_L, mb, nnzb,
    descr_L, d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_L, &pBufferSize_L);
cusparseDbsrsv2_bufferSize(handle, dir, trans_Lt, mb, nnzb,
    descr_L, d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_Lt, &pBufferSize_Lt);

pBufferSize = max(bufferSize_M, max(pBufferSize_L, pBufferSize_Lt));

// pBuffer returned by cudaMalloc is automatically aligned to 128 bytes.
cudaMalloc((void**)&pBuffer, pBufferSize);

// step 4: perform analysis of incomplete Cholesky on M
//         perform analysis of triangular solve on L
//         perform analysis of triangular solve on L'
// The lower triangular part of M has the same sparsity pattern as L, so
// we can do analysis of bsric02 and bsrsv2 simultaneously.

cusparseDbsric02_analysis(handle, dir, mb, nnzb, descr_M,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_M,
    policy_M, pBuffer);
status = cusparseXbsric02_zeroPivot(handle, info_M, &structural_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
   printf("A(%d,%d) is missing\n", structural_zero, structural_zero);
}

cusparseDbsrsv2_analysis(handle, dir, trans_L, mb, nnzb, descr_L,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim,
    info_L, policy_L, pBuffer);

cusparseDbsrsv2_analysis(handle, dir, trans_Lt, mb, nnzb, descr_L,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim,
    info_Lt, policy_Lt, pBuffer);

// step 5: M = L * L'
cusparseDbsric02_solve(handle, dir, mb, nnzb, descr_M,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_M, policy_M, pBuffer);
status = cusparseXbsric02_zeroPivot(handle, info_M, &numerical_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
   printf("L(%d,%d) is not positive definite\n", numerical_zero, numerical_zero);
}

// step 6: solve L*z = x
cusparseDbsrsv2_solve(handle, dir, trans_L, mb, nnzb, &alpha, descr_L,
   d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_L,
   d_x, d_z, policy_L, pBuffer);

// step 7: solve L'*y = z
cusparseDbsrsv2_solve(handle, dir, trans_Lt, mb, nnzb, &alpha, descr_L,
   d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_Lt,
   d_z, d_y, policy_Lt, pBuffer);

// step 6: free resources
cudaFree(pBuffer);
cusparseDestroyMatDescr(descr_M);
cusparseDestroyMatDescr(descr_L);
cusparseDestroyBsric02Info(info_M);
cusparseDestroyBsrsv2Info(info_L);
cusparseDestroyBsrsv2Info(info_Lt);
cusparseDestroy(handle);

The function supports the following properties if pBuffer != NULL

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

mb

number of block rows and block columns of matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrValA

<type> array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb\(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A, larger than zero.

info

structure with information collected during the analysis phase (that should have been passed to the solve phase unchanged).

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user, the size is returned by bsric02_bufferSize().

Output

bsrValA

<type> matrix containing the incomplete-Cholesky lower triangular factor.

See cusparseStatus_t for the description of the return status.

5.7.1.8. cusparseXbsric02_zeroPivot() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseXbsric02_zeroPivot(cusparseHandle_t handle,
                           bsric02Info_t    info,
                           int*             position)

If the returned error code is CUSPARSE_STATUS_ZERO_PIVOT, position=j means A(j,j) has either a structural zero or a numerical zero (the block is not positive definite). Otherwise position=-1.

The position can be 0-based or 1-based, the same as the matrix.

Function cusparseXbsric02_zeroPivot() is a blocking call. It calls cudaDeviceSynchronize() to make sure all previous kernels are done.

The position can be in the host memory or device memory. The user can set the proper mode with cusparseSetPointerMode().

  • The routine requires no extra storage

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

info

info contains a structural zero or a numerical zero if the user already called bsric02_analysis() or bsric02().

Output

position

If no structural or numerical zero, position is -1, otherwise if A(j,j) is missing or L(j,j) is not positive definite, position=j.

See cusparseStatus_t for the description of the return status.

5.7.2. Incomplete LU Factorization: level 0 [DEPRECATED]

Different algorithms for ilu0 are discussed in this section.

5.7.2.1. cusparse<t>csrilu02_numericBoost() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseScsrilu02_numericBoost(cusparseHandle_t handle,
                               csrilu02Info_t   info,
                               int              enable_boost,
                               double*          tol,
                               float*           boost_val)

cusparseStatus_t
cusparseDcsrilu02_numericBoost(cusparseHandle_t handle,
                               csrilu02Info_t   info,
                               int              enable_boost,
                               double*          tol,
                               double*          boost_val)

cusparseStatus_t
cusparseCcsrilu02_numericBoost(cusparseHandle_t handle,
                               csrilu02Info_t   info,
                               int              enable_boost,
                               double*          tol,
                               cuComplex*       boost_val)

cusparseStatus_t
cusparseZcsrilu02_numericBoost(cusparseHandle_t handle,
                               csrilu02Info_t   info,
                               int              enable_boost,
                               double*          tol,
                               cuDoubleComplex* boost_val)

The user can use a boost value to replace a numerical value in incomplete LU factorization. The tol is used to determine a numerical zero, and the boost_val is used to replace a numerical zero. The behavior is

if tol >= fabs(A(j,j)), then A(j,j)=boost_val.

To enable a boost value, the user has to set parameter enable_boost to 1 before calling csrilu02(). To disable a boost value, the user can call csrilu02_numericBoost() again with parameter enable_boost=0.

If enable_boost=0, tol and boost_val are ignored.

Both tol and boost_val can be in the host memory or device memory. The user can set the proper mode with cusparseSetPointerMode().

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context

info

structure initialized using cusparseCreateCsrilu02Info()

enable_boost

disable boost by enable_boost=0; otherwise, boost is enabled

tol

tolerance to determine a numerical zero

boost_val

boost value to replace a numerical zero

See cusparseStatus_t for the description of the return status.

5.7.2.2. cusparse<t>csrilu02_bufferSize() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseScsrilu02_bufferSize(cusparseHandle_t         handle,
                             int                      m,
                             int                      nnz,
                             const cusparseMatDescr_t descrA,
                             float*                   csrValA,
                             const int*               csrRowPtrA,
                             const int*               csrColIndA,
                             csrilu02Info_t           info,
                             int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseDcsrilu02_bufferSize(cusparseHandle_t         handle,
                             int                      m,
                             int                      nnz,
                             const cusparseMatDescr_t descrA,
                             double*                  csrValA,
                             const int*               csrRowPtrA,
                             const int*               csrColIndA,
                             csrilu02Info_t           info,
                             int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseCcsrilu02_bufferSize(cusparseHandle_t         handle,
                             int                      m,
                             int                      nnz,
                             const cusparseMatDescr_t descrA,
                             cuComplex*               csrValA,
                             const int*               csrRowPtrA,
                             const int*               csrColIndA,
                             csrilu02Info_t           info,
                             int*                     pBufferSizeInBytes)

cusparseStatus_t
cusparseZcsrilu02_bufferSize(cusparseHandle_t         handle,
                             int                      m,
                             int                      nnz,
                             const cusparseMatDescr_t descrA,
                             cuDoubleComplex*         csrValA,
                             const int*               csrRowPtrA,
                             const int*               csrColIndA,
                             csrilu02Info_t           info,
                             int*                     pBufferSizeInBytes)

This function returns size of the buffer used in computing the incomplete-LU factorization with \(0\) fill-in and no pivoting:

\(A \approx LU\)

A is an m×m sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.

The buffer size depends on the dimension m and nnz, the number of nonzeros of the matrix. If the user changes the matrix, it is necessary to call csrilu02_bufferSize() again to have the correct buffer size; otherwise, a segmentation fault may occur.

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

m

number of rows and columns of matrix A.

nnz

number of nonzeros of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

csrValA

<type> array of nnz\(( =\)csrRowPtrA(m)\(-\)csrRowPtrA(0)\()\) nonzero elements of matrix A.

csrRowPtrA

integer array of m\(+ 1\) elements that contains the start of every row and the end of the last row plus one.

csrColIndA

integer array of nnz\(( =\)csrRowPtrA(m)\(-\)csrRowPtrA(0)\()\) column indices of the nonzero elements of matrix A.

Output

info

record internal states based on different algorithms

pBufferSizeInBytes

number of bytes of the buffer used in csrilu02_analysis() and csrilu02()

See cusparseStatus_t for the description of the return status.

5.7.2.3. cusparse<t>csrilu02_analysis() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseScsrilu02_analysis(cusparseHandle_t         handle,
                           int                      m,
                           int                      nnz,
                           const cusparseMatDescr_t descrA,
                           const float*             csrValA,
                           const int*               csrRowPtrA,
                           const int*               csrColIndA,
                           csrilu02Info_t           info,
                           cusparseSolvePolicy_t    policy,
                           void*                    pBuffer)

cusparseStatus_t
cusparseDcsrilu02_analysis(cusparseHandle_t         handle,
                           int                      m,
                           int                      nnz,
                           const cusparseMatDescr_t descrA,
                           const double*            csrValA,
                           const int*               csrRowPtrA,
                           const int*               csrColIndA,
                           csrilu02Info_t           info,
                           cusparseSolvePolicy_t    policy,
                           void*                    pBuffer)

cusparseStatus_t
cusparseCcsrilu02_analysis(cusparseHandle_t         handle,
                           int                      m,
                           int                      nnz,
                           const cusparseMatDescr_t descrA,
                           const cuComplex*         csrValA,
                           const int*               csrRowPtrA,
                           const int*               csrColIndA,
                           csrilu02Info_t           info,
                           cusparseSolvePolicy_t    policy,
                           void*                    pBuffer)

cusparseStatus_t
cusparseZcsrilu02_analysis(cusparseHandle_t         handle,
                           int                      m,
                           int                      nnz,
                           const cusparseMatDescr_t descrA,
                           const cuDoubleComplex*   csrValA,
                           const int*               csrRowPtrA,
                           const int*               csrColIndA,
                           csrilu02Info_t           info,
                           cusparseSolvePolicy_t    policy,
                           void*                    pBuffer)

This function performs the analysis phase of the incomplete-LU factorization with \(0\) fill-in and no pivoting:

\(A \approx LU\)

A is an m×m sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.

This function requires the buffer size returned by csrilu02_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function csrilu02_analysis() reports a structural zero and computes level information stored in the opaque structure info. The level information can extract more parallelism during incomplete LU factorization; however csrilu02() can be done without level information. To disable level information, the user must specify the policy of csrilu02() as CUSPARSE_SOLVE_POLICY_NO_LEVEL.

It is the user’s choice whether to call csrilu02() if csrilu02_analysis() reports a structural zero. In this case, the user can still call csrilu02(), which will return a numerical zero at the same position as the structural zero. However the result is meaningless.

  • This function requires temporary extra storage that is allocated internally

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

m

number of rows and columns of matrix A.

nnz

number of nonzeros of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

csrValA

<type> array of nnz\(( =\)csrRowPtrA(m)\(-\)csrRowPtrA(0)\()\) nonzero elements of matrix A.

csrRowPtrA

integer array of m\(+ 1\) elements that contains the start of every row and the end of the last row plus one.

csrColIndA

integer array of nnz\(( =\)csrRowPtrA(m)\(-\)csrRowPtrA(0)\()\) column indices of the nonzero elements of matrix A.

info

structure initialized using cusparseCreateCsrilu02Info().

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user, the size is returned by csrilu02_bufferSize().

Output

info

Structure filled with information collected during the analysis phase (that should be passed to the solve phase unchanged).

See cusparseStatus_t for the description of the return status.

5.7.2.4. cusparse<t>csrilu02() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseScsrilu02(cusparseHandle_t         handle,
                  int                      m,
                  int                      nnz,
                  const cusparseMatDescr_t descrA,
                  float*                   csrValA_valM,
                  const int*               csrRowPtrA,
                  const int*               csrColIndA,
                  csrilu02Info_t           info,
                  cusparseSolvePolicy_t    policy,
                  void*                    pBuffer)

cusparseStatus_t
cusparseDcsrilu02(cusparseHandle_t         handle,
                  int                      m,
                  int                      nnz,
                  const cusparseMatDescr_t descrA,
                  double*                  csrValA_valM,
                  const int*               csrRowPtrA,
                  const int*               csrColIndA,
                  csrilu02Info_t           info,
                  cusparseSolvePolicy_t    policy,
                  void*                    pBuffer)

cusparseStatus_t
cusparseCcsrilu02(cusparseHandle_t         handle,
                  int                      m,
                  int                      nnz,
                  const cusparseMatDescr_t descrA,
                  cuComplex*               csrValA_valM,
                  const int*               csrRowPtrA,
                  const int*               csrColIndA,
                  csrilu02Info_t           info,
                  cusparseSolvePolicy_t    policy,
                  void*                    pBuffer)

cusparseStatus_t
cusparseZcsrilu02(cusparseHandle_t         handle,
                  int                      m,
                  int                      nnz,
                  const cusparseMatDescr_t descrA,
                  cuDoubleComplex*         csrValA_valM,
                  const int*               csrRowPtrA,
                  const int*               csrColIndA,
                  csrilu02Info_t           info,
                  cusparseSolvePolicy_t    policy,
                  void*                    pBuffer)

This function performs the solve phase of the incomplete-LU factorization with \(0\) fill-in and no pivoting:

\(A \approx LU\)

A is an m×m sparse matrix that is defined in CSR storage format by the three arrays csrValA_valM, csrRowPtrA, and csrColIndA.

This function requires a buffer size returned by csrilu02_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If not, CUSPARSE_STATUS_INVALID_VALUE is returned.

The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL. The fill mode and diagonal type are ignored.

Although csrilu02() can be done without level information, the user still needs to be aware of consistency. If csrilu02_analysis() is called with policy CUSPARSE_SOLVE_POLICY_USE_LEVEL, csrilu02() can be run with or without levels. On the other hand, if csrilu02_analysis() is called with CUSPARSE_SOLVE_POLICY_NO_LEVEL, csrilu02() can only accept CUSPARSE_SOLVE_POLICY_NO_LEVEL; otherwise, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function csrilu02() reports the first numerical zero, including a structural zero. The user must call cusparseXcsrilu02_zeroPivot() to know where the numerical zero is.

For example, suppose A is a real m × m matrix, the following code solves precondition system M*y = x where M is the product of LU factors L and U.

// Suppose that A is m x m sparse matrix represented by CSR format,
// Assumption:
// - handle is already created by cusparseCreate(),
// - (d_csrRowPtr, d_csrColInd, d_csrVal) is CSR of A on device memory,
// - d_x is right hand side vector on device memory,
// - d_y is solution vector on device memory.
// - d_z is intermediate result on device memory.

cusparseMatDescr_t descr_M = 0;
cusparseMatDescr_t descr_L = 0;
cusparseMatDescr_t descr_U = 0;
csrilu02Info_t info_M  = 0;
csrsv2Info_t  info_L  = 0;
csrsv2Info_t  info_U  = 0;
int pBufferSize_M;
int pBufferSize_L;
int pBufferSize_U;
int pBufferSize;
void *pBuffer = 0;
int structural_zero;
int numerical_zero;
const double alpha = 1.;
const cusparseSolvePolicy_t policy_M = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
const cusparseSolvePolicy_t policy_L = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
const cusparseSolvePolicy_t policy_U = CUSPARSE_SOLVE_POLICY_USE_LEVEL;
const cusparseOperation_t trans_L  = CUSPARSE_OPERATION_NON_TRANSPOSE;
const cusparseOperation_t trans_U  = CUSPARSE_OPERATION_NON_TRANSPOSE;

// step 1: create a descriptor which contains
// - matrix M is base-1
// - matrix L is base-1
// - matrix L is lower triangular
// - matrix L has unit diagonal
// - matrix U is base-1
// - matrix U is upper triangular
// - matrix U has non-unit diagonal
cusparseCreateMatDescr(&descr_M);
cusparseSetMatIndexBase(descr_M, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_M, CUSPARSE_MATRIX_TYPE_GENERAL);

cusparseCreateMatDescr(&descr_L);
cusparseSetMatIndexBase(descr_L, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_L, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatFillMode(descr_L, CUSPARSE_FILL_MODE_LOWER);
cusparseSetMatDiagType(descr_L, CUSPARSE_DIAG_TYPE_UNIT);

cusparseCreateMatDescr(&descr_U);
cusparseSetMatIndexBase(descr_U, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_U, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatFillMode(descr_U, CUSPARSE_FILL_MODE_UPPER);
cusparseSetMatDiagType(descr_U, CUSPARSE_DIAG_TYPE_NON_UNIT);

// step 2: create a empty info structure
// we need one info for csrilu02 and two info's for csrsv2
cusparseCreateCsrilu02Info(&info_M);
cusparseCreateCsrsv2Info(&info_L);
cusparseCreateCsrsv2Info(&info_U);

// step 3: query how much memory used in csrilu02 and csrsv2, and allocate the buffer
cusparseDcsrilu02_bufferSize(handle, m, nnz,
    descr_M, d_csrVal, d_csrRowPtr, d_csrColInd, info_M, &pBufferSize_M);
cusparseDcsrsv2_bufferSize(handle, trans_L, m, nnz,
    descr_L, d_csrVal, d_csrRowPtr, d_csrColInd, info_L, &pBufferSize_L);
cusparseDcsrsv2_bufferSize(handle, trans_U, m, nnz,
    descr_U, d_csrVal, d_csrRowPtr, d_csrColInd, info_U, &pBufferSize_U);

pBufferSize = max(pBufferSize_M, max(pBufferSize_L, pBufferSize_U));

// pBuffer returned by cudaMalloc is automatically aligned to 128 bytes.
cudaMalloc((void**)&pBuffer, pBufferSize);

// step 4: perform analysis of incomplete Cholesky on M
//         perform analysis of triangular solve on L
//         perform analysis of triangular solve on U
// The lower(upper) triangular part of M has the same sparsity pattern as L(U),
// we can do analysis of csrilu0 and csrsv2 simultaneously.

cusparseDcsrilu02_analysis(handle, m, nnz, descr_M,
    d_csrVal, d_csrRowPtr, d_csrColInd, info_M,
    policy_M, pBuffer);
status = cusparseXcsrilu02_zeroPivot(handle, info_M, &structural_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
   printf("A(%d,%d) is missing\n", structural_zero, structural_zero);
}

cusparseDcsrsv2_analysis(handle, trans_L, m, nnz, descr_L,
    d_csrVal, d_csrRowPtr, d_csrColInd,
    info_L, policy_L, pBuffer);

cusparseDcsrsv2_analysis(handle, trans_U, m, nnz, descr_U,
    d_csrVal, d_csrRowPtr, d_csrColInd,
    info_U, policy_U, pBuffer);

// step 5: M = L * U
cusparseDcsrilu02(handle, m, nnz, descr_M,
    d_csrVal, d_csrRowPtr, d_csrColInd, info_M, policy_M, pBuffer);
status = cusparseXcsrilu02_zeroPivot(handle, info_M, &numerical_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
   printf("U(%d,%d) is zero\n", numerical_zero, numerical_zero);
}

// step 6: solve L*z = x
cusparseDcsrsv2_solve(handle, trans_L, m, nnz, &alpha, descr_L, // replace with cusparseSpSV
   d_csrVal, d_csrRowPtr, d_csrColInd, info_L,
   d_x, d_z, policy_L, pBuffer);

// step 7: solve U*y = z
cusparseDcsrsv2_solve(handle, trans_U, m, nnz, &alpha, descr_U, // replace with cusparseSpSV
   d_csrVal, d_csrRowPtr, d_csrColInd, info_U,
   d_z, d_y, policy_U, pBuffer);

// step 6: free resources
cudaFree(pBuffer);
cusparseDestroyMatDescr(descr_M);
cusparseDestroyMatDescr(descr_L);
cusparseDestroyMatDescr(descr_U);
cusparseDestroyCsrilu02Info(info_M);
cusparseDestroyCsrsv2Info(info_L);
cusparseDestroyCsrsv2Info(info_U);
cusparseDestroy(handle);

The function supports the following properties if pBuffer != NULL

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

m

number of rows and columns of matrix A.

nnz

number of nonzeros of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

csrValA_valM

<type> array of nnz\(( =\)csrRowPtrA(m)\(-\)csrRowPtrA(0)\()\) nonzero elements of matrix A.

csrRowPtrA

integer array of m\(+ 1\) elements that contains the start of every row and the end of the last row plus one.

csrColIndA

integer array of nnz\(( =\)csrRowPtrA(m)\(-\)csrRowPtrA(0)\()\) column indices of the nonzero elements of matrix A.

info

structure with information collected during the analysis phase (that should have been passed to the solve phase unchanged).

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by csrilu02_bufferSize().

Output

csrValA_valM

<type> matrix containing the incomplete-LU lower and upper triangular factors.

See cusparseStatus_t for the description of the return status.

5.7.2.5. cusparseXcsrilu02_zeroPivot() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseXcsrilu02_zeroPivot(cusparseHandle_t handle,
                            csrilu02Info_t   info,
                            int*             position)

If the returned error code is CUSPARSE_STATUS_ZERO_PIVOT, position=j means A(j,j) has either a structural zero or a numerical zero; otherwise, position=-1.

The position can be 0-based or 1-based, the same as the matrix.

Function cusparseXcsrilu02_zeroPivot() is a blocking call. It calls cudaDeviceSynchronize() to make sure all previous kernels are done.

The position can be in the host memory or device memory. The user can set proper mode with cusparseSetPointerMode().

  • The routine requires no extra storage

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

Handle to the cuSPARSE library context.

info

info contains structural zero or numerical zero if the user already called csrilu02_analysis() or csrilu02().

Output

position

If no structural or numerical zero, position is -1; otherwise if A(j,j) is missing or U(j,j) is zero, position=j.

See cusparseStatus_t for the description of the return status.

5.7.2.6. cusparse<t>bsrilu02_numericBoost() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrilu02_numericBoost(cusparseHandle_t handle,
                               bsrilu02Info_t   info,
                               int              enable_boost,
                               double*          tol,
                               float*           boost_val)

cusparseStatus_t
cusparseDbsrilu02_numericBoost(cusparseHandle_t handle,
                               bsrilu02Info_t   info,
                               int              enable_boost,
                               double*          tol,
                               double*          boost_val)

cusparseStatus_t
cusparseCbsrilu02_numericBoost(cusparseHandle_t handle,
                               bsrilu02Info_t   info,
                               int              enable_boost,
                               double*          tol,
                               cuComplex*       boost_val)

cusparseStatus_t
cusparseZbsrilu02_numericBoost(cusparseHandle_t handle,
                               bsrilu02Info_t   info,
                               int              enable_boost,
                               double*          tol,
                               cuDoubleComplex* boost_val)

The user can use a boost value to replace a numerical value in incomplete LU factorization. Parameter tol is used to determine a numerical zero, and boost_val is used to replace a numerical zero. The behavior is as follows:

if tol >= fabs(A(j,j)), then reset each diagonal element of block A(j,j) by boost_val.

To enable a boost value, the user sets parameter enable_boost to 1 before calling bsrilu02(). To disable the boost value, the user can call bsrilu02_numericBoost() with parameter enable_boost=0.

If enable_boost=0, tol and boost_val are ignored.

Both tol and boost_val can be in host memory or device memory. The user can set the proper mode with cusparseSetPointerMode().

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

info

structure initialized using cusparseCreateBsrilu02Info().

enable_boost

disable boost by setting enable_boost=0. Otherwise, boost is enabled.

tol

tolerance to determine a numerical zero.

boost_val

boost value to replace a numerical zero.

See cusparseStatus_t for the description of the return status.

5.7.2.7. cusparse<t>bsrilu02_bufferSize() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrilu02_bufferSize(cusparseHandle_t handle,
                             cusparseDirection_t dirA,
                             int mb,
                             int nnzb,
                             const cusparseMatDescr_t descrA,
                             float *bsrValA,
                             const int *bsrRowPtrA,
                             const int *bsrColIndA,
                             int blockDim,
                             bsrilu02Info_t info,
                             int *pBufferSizeInBytes);

cusparseStatus_t
cusparseDbsrilu02_bufferSize(cusparseHandle_t handle,
                             cusparseDirection_t dirA,
                             int mb,
                             int nnzb,
                             const cusparseMatDescr_t descrA,
                             double *bsrValA,
                             const int *bsrRowPtrA,
                             const int *bsrColIndA,
                             int blockDim,
                             bsrilu02Info_t info,
                             int *pBufferSizeInBytes);

cusparseStatus_t
cusparseCbsrilu02_bufferSize(cusparseHandle_t handle,
                             cusparseDirection_t dirA,
                             int mb,
                             int nnzb,
                             const cusparseMatDescr_t descrA,
                             cuComplex *bsrValA,
                             const int *bsrRowPtrA,
                             const int *bsrColIndA,
                             int blockDim,
                             bsrilu02Info_t info,
                             int *pBufferSizeInBytes);

cusparseStatus_t
cusparseZbsrilu02_bufferSize(cusparseHandle_t handle,
                             cusparseDirection_t dirA,
                             int mb,
                             int nnzb,
                             const cusparseMatDescr_t descrA,
                             cuDoubleComplex *bsrValA,
                             const int *bsrRowPtrA,
                             const int *bsrColIndA,
                             int blockDim,
                             bsrilu02Info_t info,
                             int *pBufferSizeInBytes);

This function returns the size of the buffer used in computing the incomplete-LU factorization with 0 fill-in and no pivoting.

\(A \approx LU\)

A is an (mb*blockDim)*(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA.

The buffer size depends on the dimensions of mb, blockDim, and the number of nonzero blocks of the matrix nnzb. If the user changes the matrix, it is necessary to call bsrilu02_bufferSize() again to have the correct buffer size; otherwise, a segmentation fault may occur.

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

mb

number of block rows and columns of matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrValA

<type> array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb\(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A, larger than zero.

Output

info

record internal states based on different algorithms.

pBufferSizeInBytes

number of bytes of the buffer used in bsrilu02_analysis() and bsrilu02().

Status Returned

CUSPARSE_STATUS_SUCCESS

the operation completed successfully.

CUSPARSE_STATUS_NOT_INITIALIZED

the library was not initialized.

CUSPARSE_STATUS_ALLOC_FAILED

the resources could not be allocated.

CUSPARSE_STATUS_INVALID_VALUE

invalid parameters were passed (mb,nnzb<=0), base index is not 0 or 1.

CUSPARSE_STATUS_ARCH_MISMATCH

the device only supports compute capability 2.0 and above.

CUSPARSE_STATUS_INTERNAL_ERROR

an internal operation failed.

CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED

the matrix type is not supported.

5.7.2.8. cusparse<t>bsrilu02_analysis() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrilu02_analysis(cusparseHandle_t         handle,
                           cusparseDirection_t      dirA,
                           int                      mb,
                           int                      nnzb,
                           const cusparseMatDescr_t descrA,
                           float*                   bsrValA,
                           const int*               bsrRowPtrA,
                           const int*               bsrColIndA,
                           int                      blockDim,
                           bsrilu02Info_t           info,
                           cusparseSolvePolicy_t    policy,
                           void*                    pBuffer)

cusparseStatus_t
cusparseDbsrilu02_analysis(cusparseHandle_t         handle,
                           cusparseDirection_t      dirA,
                           int                      mb,
                           int                      nnzb,
                           const cusparseMatDescr_t descrA,
                           double*                  bsrValA,
                           const int*               bsrRowPtrA,
                           const int*               bsrColIndA,
                           int                      blockDim,
                           bsrilu02Info_t           info,
                           cusparseSolvePolicy_t    policy,
                           void*                    pBuffer)

cusparseStatus_t
cusparseCbsrilu02_analysis(cusparseHandle_t         handle,
                           cusparseDirection_t      dirA,
                           int                      mb,
                           int                      nnzb,
                           const cusparseMatDescr_t descrA,
                           cuComplex*               bsrValA,
                           const int*               bsrRowPtrA,
                           const int*               bsrColIndA,
                           int                      blockDim,
                           bsrilu02Info_t           info,
                           cusparseSolvePolicy_t    policy,
                           void*                    pBuffer)

cusparseStatus_t
cusparseZbsrilu02_analysis(cusparseHandle_t         handle,
                           cusparseDirection_t      dirA,
                           int                      mb,
                           int                      nnzb,
                           const cusparseMatDescr_t descrA,
                           cuDoubleComplex*         bsrValA,
                           const int*               bsrRowPtrA,
                           const int*               bsrColIndA,
                           int                      blockDim,
                           bsrilu02Info_t           info,
                           cusparseSolvePolicy_t    policy,
                           void*                    pBuffer)

This function performs the analysis phase of the incomplete-LU factorization with 0 fill-in and no pivoting.

\(A \approx LU\)

A is an (mb*blockDim)×(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA. The block in BSR format is of size blockDim*blockDim, stored as column-major or row-major as determined by parameter dirA, which is either CUSPARSE_DIRECTION_COLUMN or CUSPARSE_DIRECTION_ROW. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored.

This function requires a buffer size returned by bsrilu02_bufferSize(). The address of pBuffer must be multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function bsrilu02_analysis() reports a structural zero and computes level information stored in the opaque structure info. The level information can extract more parallelism during incomplete LU factorization. However bsrilu02() can be done without level information. To disable level information, the user needs to specify the parameter policy of bsrilu02[_analysis| ] as CUSPARSE_SOLVE_POLICY_NO_LEVEL.

Function bsrilu02_analysis() always reports the first structural zero, even with parameter policy is CUSPARSE_SOLVE_POLICY_NO_LEVEL. The user must call cusparseXbsrilu02_zeroPivot() to know where the structural zero is.

It is the user’s choice whether to call bsrilu02() if bsrilu02_analysis() reports a structural zero. In this case, the user can still call bsrilu02(), which will return a numerical zero at the same position as the structural zero. However the result is meaningless.

  • This function requires temporary extra storage that is allocated internally

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

mb

number of block rows and block columns of matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrValA

<type> array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb\(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A, larger than zero.

info

structure initialized using cusparseCreateBsrilu02Info().

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user, the size is returned by bsrilu02_bufferSize().

Output

info

structure filled with information collected during the analysis phase (that should be passed to the solve phase unchanged)

See cusparseStatus_t for the description of the return status.

5.7.2.9. cusparse<t>bsrilu02() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrilu02(cusparseHandle_t         handle,
                  cusparseDirection_t      dirA,
                  int                      mb,
                  int                      nnzb,
                  const cusparseMatDescr_t descry,
                  float*                   bsrValA,
                  const int*               bsrRowPtrA,
                  const int*               bsrColIndA,
                  int                      blockDim,
                  bsrilu02Info_t           info,
                  cusparseSolvePolicy_t    policy,
                  void*                    pBuffer)

cusparseStatus_t
cusparseDbsrilu02(cusparseHandle_t         handle,
                  cusparseDirection_t      dirA,
                  int                      mb,
                  int                      nnzb,
                  const cusparseMatDescr_t descry,
                  double*                  bsrValA,
                  const int*               bsrRowPtrA,
                  const int*               bsrColIndA,
                  int                      blockDim,
                  bsrilu02Info_t           info,
                  cusparseSolvePolicy_t    policy,
                  void*                    pBuffer)

cusparseStatus_t
cusparseCbsrilu02(cusparseHandle_t         handle,
                  cusparseDirection_t      dirA,
                  int                      mb,
                  int                      nnzb,
                  const cusparseMatDescr_t descry,
                  cuComplex*               bsrValA,
                  const int*               bsrRowPtrA,
                  const int*               bsrColIndA,
                  int                      blockDim,
                  bsrilu02Info_t           info,
                  cusparseSolvePolicy_t    policy,
                  void*                    pBuffer)

cusparseStatus_t
cusparseZbsrilu02(cusparseHandle_t         handle,
                  cusparseDirection_t      dirA,
                  int                      mb,
                  int                      nnzb,
                  const cusparseMatDescr_t descry,
                  cuDoubleComplex*         bsrValA,
                  const int*               bsrRowPtrA,
                  const int*               bsrColIndA,
                  int                      blockDim,
                  bsrilu02Info_t           info,
                  cusparseSolvePolicy_t    policy,
                  void*                    pBuffer)

This function performs the solve phase of the incomplete-LU factorization with 0 fill-in and no pivoting.

\(A \approx LU\)

A is an (mb*blockDim)×(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA. The block in BSR format is of size blockDim*blockDim, stored as column-major or row-major determined by parameter dirA, which is either CUSPARSE_DIRECTION_COLUMN or CUSPARSE_DIRECTION_ROW. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored. Function bsrilu02() supports an arbitrary blockDim.

This function requires a buffer size returned by bsrilu02_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Although bsrilu02() can be used without level information, the user must be aware of consistency. If bsrilu02_analysis() is called with policy CUSPARSE_SOLVE_POLICY_USE_LEVEL, bsrilu02() can be run with or without levels. On the other hand, if bsrilu02_analysis() is called with CUSPARSE_SOLVE_POLICY_NO_LEVEL, bsrilu02() can only accept CUSPARSE_SOLVE_POLICY_NO_LEVEL; otherwise, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function bsrilu02() has the same behavior as csrilu02(). That is, bsr2csr(bsrilu02(A)) = csrilu02(bsr2csr(A)). The numerical zero of csrilu02() means there exists some zero U(j,j). The numerical zero of bsrilu02() means there exists some block U(j,j) that is not invertible.

Function bsrilu02 reports the first numerical zero, including a structural zero. The user must call cusparseXbsrilu02_zeroPivot() to know where the numerical zero is.

For example, suppose A is a real m-by-m matrix where m=mb*blockDim. The following code solves precondition system M*y =                                  x, where M is the product of LU factors L and U.

// Suppose that A is m x m sparse matrix represented by BSR format,
// The number of block rows/columns is mb, and
// the number of nonzero blocks is nnzb.
// Assumption:
// - handle is already created by cusparseCreate(),
// - (d_bsrRowPtr, d_bsrColInd, d_bsrVal) is BSR of A on device memory,
// - d_x is right hand side vector on device memory.
// - d_y is solution vector on device memory.
// - d_z is intermediate result on device memory.
// - d_x, d_y and d_z are of size m.
cusparseMatDescr_t descr_M = 0;
cusparseMatDescr_t descr_L = 0;
cusparseMatDescr_t descr_U = 0;
bsrilu02Info_t info_M = 0;
bsrsv2Info_t   info_L = 0;
bsrsv2Info_t   info_U = 0;
int pBufferSize_M;
int pBufferSize_L;
int pBufferSize_U;
int pBufferSize;
void *pBuffer = 0;
int structural_zero;
int numerical_zero;
const double alpha = 1.;
const cusparseSolvePolicy_t policy_M = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
const cusparseSolvePolicy_t policy_L = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
const cusparseSolvePolicy_t policy_U = CUSPARSE_SOLVE_POLICY_USE_LEVEL;
const cusparseOperation_t trans_L  = CUSPARSE_OPERATION_NON_TRANSPOSE;
const cusparseOperation_t trans_U  = CUSPARSE_OPERATION_NON_TRANSPOSE;
const cusparseDirection_t dir = CUSPARSE_DIRECTION_COLUMN;

// step 1: create a descriptor which contains
// - matrix M is base-1
// - matrix L is base-1
// - matrix L is lower triangular
// - matrix L has unit diagonal
// - matrix U is base-1
// - matrix U is upper triangular
// - matrix U has non-unit diagonal
cusparseCreateMatDescr(&descr_M);
cusparseSetMatIndexBase(descr_M, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_M, CUSPARSE_MATRIX_TYPE_GENERAL);

cusparseCreateMatDescr(&descr_L);
cusparseSetMatIndexBase(descr_L, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_L, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatFillMode(descr_L, CUSPARSE_FILL_MODE_LOWER);
cusparseSetMatDiagType(descr_L, CUSPARSE_DIAG_TYPE_UNIT);

cusparseCreateMatDescr(&descr_U);
cusparseSetMatIndexBase(descr_U, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_U, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatFillMode(descr_U, CUSPARSE_FILL_MODE_UPPER);
cusparseSetMatDiagType(descr_U, CUSPARSE_DIAG_TYPE_NON_UNIT);

// step 2: create a empty info structure
// we need one info for bsrilu02 and two info's for bsrsv2
cusparseCreateBsrilu02Info(&info_M);
cusparseCreateBsrsv2Info(&info_L);
cusparseCreateBsrsv2Info(&info_U);

// step 3: query how much memory used in bsrilu02 and bsrsv2, and allocate the buffer
cusparseDbsrilu02_bufferSize(handle, dir, mb, nnzb,
    descr_M, d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_M, &pBufferSize_M);
cusparseDbsrsv2_bufferSize(handle, dir, trans_L, mb, nnzb,
    descr_L, d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_L, &pBufferSize_L);
cusparseDbsrsv2_bufferSize(handle, dir, trans_U, mb, nnzb,
    descr_U, d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_U, &pBufferSize_U);

pBufferSize = max(pBufferSize_M, max(pBufferSize_L, pBufferSize_U));

// pBuffer returned by cudaMalloc is automatically aligned to 128 bytes.
cudaMalloc((void**)&pBuffer, pBufferSize);

// step 4: perform analysis of incomplete LU factorization on M
//         perform analysis of triangular solve on L
//         perform analysis of triangular solve on U
// The lower(upper) triangular part of M has the same sparsity pattern as L(U),
// we can do analysis of bsrilu0 and bsrsv2 simultaneously.

cusparseDbsrilu02_analysis(handle, dir, mb, nnzb, descr_M,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_M,
    policy_M, pBuffer);
status = cusparseXbsrilu02_zeroPivot(handle, info_M, &structural_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == statuss){
   printf("A(%d,%d) is missing\n", structural_zero, structural_zero);
}

cusparseDbsrsv2_analysis(handle, dir, trans_L, mb, nnzb, descr_L,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim,
    info_L, policy_L, pBuffer);

cusparseDbsrsv2_analysis(handle, dir, trans_U, mb, nnzb, descr_U,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim,
    info_U, policy_U, pBuffer);

// step 5: M = L * U
cusparseDbsrilu02(handle, dir, mb, nnzb, descr_M,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_M, policy_M, pBuffer);
status = cusparseXbsrilu02_zeroPivot(handle, info_M, &numerical_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == statuss){
   printf("block U(%d,%d) is not invertible\n", numerical_zero, numerical_zero);
}

// step 6: solve L*z = x
cusparseDbsrsv2_solve(handle, dir, trans_L, mb, nnzb, &alpha, descr_L,
   d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_L,
   d_x, d_z, policy_L, pBuffer);

// step 7: solve U*y = z
cusparseDbsrsv2_solve(handle, dir, trans_U, mb, nnzb, &alpha, descr_U,
   d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_U,
   d_z, d_y, policy_U, pBuffer);

// step 6: free resources
cudaFree(pBuffer);
cusparseDestroyMatDescr(descr_M);
cusparseDestroyMatDescr(descr_L);
cusparseDestroyMatDescr(descr_U);
cusparseDestroyBsrilu02Info(info_M);
cusparseDestroyBsrsv2Info(info_L);
cusparseDestroyBsrsv2Info(info_U);
cusparseDestroy(handle);

The function supports the following properties if pBuffer != NULL

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks: either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

mb

number of block rows and block columns of matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrValA

<type> array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb\(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb\(( =\)bsrRowPtrA(mb)\(-\)bsrRowPtrA(0)\()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A; must be larger than zero.

info

structure with information collected during the analysis phase (that should have been passed to the solve phase unchanged).

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by bsrilu02_bufferSize().

Output

bsrValA

<type> matrix containing the incomplete-LU lower and upper triangular factors

See cusparseStatus_t for the description of the return status.

5.7.2.10. cusparseXbsrilu02_zeroPivot() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseXbsrilu02_zeroPivot(cusparseHandle_t handle,
                            bsrilu02Info_t   info,
                            int*             position)

If the returned error code is CUSPARSE_STATUS_ZERO_PIVOT, position=j means A(j,j) has either a structural zero or a numerical zero (the block is not invertible). Otherwise position=-1.

The position can be 0-based or 1-based, the same as the matrix.

Function cusparseXbsrilu02_zeroPivot() is a blocking call. It calls cudaDeviceSynchronize() to make sure all previous kernels are done.

The position can be in the host memory or device memory. The user can set proper the mode with cusparseSetPointerMode().

  • The routine requires no extra storage

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

info

info contains structural zero or numerical zero if the user already called bsrilu02_analysis() or bsrilu02().

Output

position

if no structural or numerical zero, position is -1; otherwise if A(j,j) is missing or U(j,j) is not invertible, position=j.

See cusparseStatus_t for the description of the return status.

5.7.3. Tridiagonal Solve

Different algorithms for tridiagonal solve are discussed in this section.

5.7.3.1. cusparse<t>gtsv2_buffSizeExt()

cusparseStatus_t
cusparseSgtsv2_bufferSizeExt(cusparseHandle_t handle,
                             int              m,
                             int              n,
                             const float*     dl,
                             const float*     d,
                             const float*     du,
                             const float*     B,
                             int              ldb,
                             size_t*          bufferSizeInBytes)

cusparseStatus_t
cusparseDgtsv2_bufferSizeExt(cusparseHandle_t handle,
                             int              m,
                             int              n,
                             const double*    dl,
                             const double*    d,
                             const double*    du,
                             const double*    B,
                             int              ldb,
                             size_t*          bufferSizeInBytes)

cusparseStatus_t
cusparseCgtsv2_bufferSizeExt(cusparseHandle_t handle,
                             int              m,
                             int              n,
                             const cuComplex* dl,
                             const cuComplex* d,
                             const cuComplex* du,
                             const cuComplex* B,
                             int              ldb,
                             size_t*          bufferSizeInBytes)

cusparseStatus_t
cusparseZgtsv2_bufferSizeExt(cusparseHandle_t       handle,
                             int                    m,
                             int                    n,
                             const cuDoubleComplex* dl,
                             const cuDoubleComplex* d,
                             const cuDoubleComplex* du,
                             const cuDoubleComplex* B,
                             int                    ldb,
                             size_t*                bufferSizeInBytes)

This function returns the size of the buffer used in gtsv2 which computes the solution of a tridiagonal linear system with multiple right-hand sides.

\(A \ast X = B\)

The coefficient matrix A of each of these tri-diagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. Notice that solution X overwrites right-hand-side matrix B on exit.

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

m

the size of the linear system (must be ≥ 3).

n

number of right-hand sides, columns of matrix B.

dl

<type> dense array containing the lower diagonal of the tri-diagonal linear system. The first element of each lower diagonal must be zero.

d

<type> dense array containing the main diagonal of the tri-diagonal linear system.

du

<type> dense array containing the upper diagonal of the tri-diagonal linear system. The last element of each upper diagonal must be zero.

B

<type> dense right-hand-side array of dimensions (ldb, n).

ldb

leading dimension of B (that is ≥ \(\max\text{(1,\ m))}\) .

Output

pBufferSizeInBytes

number of bytes of the buffer used in the gtsv2.

See cusparseStatus_t for the description of the return status.

5.7.3.2. cusparse<t>gtsv2()

cusparseStatus_t
cusparseSgtsv2(cusparseHandle_t handle,
               int              m,
               int              n,
               const float*     dl,
               const float*     d,
               const float*     du,
               float*           B,
               int              ldb,
               void*            pBuffer)

cusparseStatus_t
cusparseDgtsv2(cusparseHandle_t handle,
               int              m,
               int              n,
               const double*    dl,
               const double*    d,
               const double*    du,
               double*          B,
               int              ldb,
               void*            pBuffer)

cusparseStatus_t
cusparseCgtsv2(cusparseHandle_t handle,
               int              m,
               int              n,
               const cuComplex* dl,
               const cuComplex* d,
               const cuComplex* du,
               cuComplex*       B,
               int              ldb,
               void*            pBuffer)

cusparseStatus_t
cusparseZgtsv2(cusparseHandle_t       handle,
               int                    m,
               int                    n,
               const cuDoubleComplex* dl,
               const cuDoubleComplex* d,
               const cuDoubleComplex* du,
               cuDoubleComplex*       B,
               int                    ldb,
               void*                  pBuffer)

This function computes the solution of a tridiagonal linear system with multiple right-hand sides:

\(A \ast X = B\)

The coefficient matrix A of each of these tri-diagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. Notice that solution X overwrites right-hand-side matrix B on exit.

Assuming A is of size m and base-1, dl, d and du are defined by the following formula:

dl(i) := A(i, i-1) for i=1,2,...,m

The first element of dl is out-of-bound (dl(1) := A(1,0)), so dl(1) = 0.

d(i) = A(i,i) for i=1,2,...,m

du(i) = A(i,i+1) for i=1,2,...,m

The last element of du is out-of-bound (du(m) := A(m,m+1)), so du(m) = 0.

The routine does perform pivoting, which usually results in more accurate and more stable results than cusparse<t>gtsv_nopivot() or cusparse<t>gtsv2_nopivot() at the expense of some execution time.

This function requires a buffer size returned by gtsv2_bufferSizeExt(). The address of pBuffer must be multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

m

the size of the linear system (must be ≥ 3).

n

number of right-hand sides, columns of matrix B.

dl

<type> dense array containing the lower diagonal of the tri-diagonal linear system. The first element of each lower diagonal must be zero.

d

<type> dense array containing the main diagonal of the tri-diagonal linear system.

du

<type> dense array containing the upper diagonal of the tri-diagonal linear system. The last element of each upper diagonal must be zero.

B

<type> dense right-hand-side array of dimensions (ldb, n).

ldb

leading dimension of B (that is ≥ \(\max\text{(1,\ m))}\) .

pBuffer

buffer allocated by the user, the size is return by gtsv2_bufferSizeExt.

Output

B

<type> dense solution array of dimensions (ldb, n).

See cusparseStatus_t for the description of the return status.

5.7.3.3. cusparse<t>gtsv2_nopivot_bufferSizeExt()

cusparseStatus_t
cusparseSgtsv2_nopivot_bufferSizeExt(cusparseHandle_t handle,
                                     int              m,
                                     int              n,
                                     const float*     dl,
                                     const float*     d,
                                     const float*     du,
                                     const float*     B,
                                     int              ldb,
                                     size_t*          bufferSizeInBytes)

cusparseStatus_t
cusparseDgtsv2_nopivot_bufferSizeExt(cusparseHandle_t handle,
                                     int              m,
                                     int              n,
                                     const double*    dl,
                                     const double*    d,
                                     const double*    du,
                                     const double*    B,
                                     int              ldb,
                                     size_t*          bufferSizeInBytes)

cusparseStatus_t
cusparseCgtsv2_nopivot_bufferSizeExt(cusparseHandle_t handle,
                                     int              m,
                                     int              n,
                                     const cuComplex* dl,
                                     const cuComplex* d,
                                     const cuComplex* du,
                                     const cuComplex* B,
                                     int              ldb,
                                     size_t*          bufferSizeInBytes)

cusparseStatus_t
cusparseZgtsv2_nopivot_bufferSizeExt(cusparseHandle_t       handle,
                                     int                    m,
                                     int                    n,
                                     const cuDoubleComplex* dl,
                                     const cuDoubleComplex* d,
                                     const cuDoubleComplex* du,
                                     const cuDoubleComplex* B,
                                     int                    ldb,
                                     size_t*                bufferSizeInBytes)

This function returns the size of the buffer used in gtsv2_nopivot which computes the solution of a tridiagonal linear system with multiple right-hand sides.

\(A \ast X = B\)

The coefficient matrix A of each of these tri-diagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. Notice that solution X overwrites right-hand-side matrix B on exit.

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

m

the size of the linear system (must be ≥ 3).

n

number of right-hand sides, columns of matrix B.

dl

<type> dense array containing the lower diagonal of the tri-diagonal linear system. The first element of each lower diagonal must be zero.

d

<type> dense array containing the main diagonal of the tri-diagonal linear system.

du

<type> dense array containing the upper diagonal of the tri-diagonal linear system. The last element of each upper diagonal must be zero.

B

<type> dense right-hand-side array of dimensions (ldb, n).

ldb

leading dimension of B. (that is ≥ \(\max\text{(1,\ m))}\) .

Output

pBufferSizeInBytes

number of bytes of the buffer used in the gtsv2_nopivot.

See cusparseStatus_t for the description of the return status.

5.7.3.4. cusparse<t>gtsv2_nopivot()

cusparseStatus_t
cusparseSgtsv2_nopivot(cusparseHandle_t handle,
                       int              m,
                       int              n,
                       const float*     dl,
                       const float*     d,
                       const float*     du,
                       float*           B,
                       int              ldb,
                       void*            pBuffer)

cusparseStatus_t
cusparseDgtsv2_nopivot(cusparseHandle_t handle,
                       int              m,
                       int              n,
                       const double*    dl,
                       const double*    d,
                       const double*    du,
                       double*          B,
                       int              ldb,
                       void*            pBuffer)

cusparseStatus_t
cusparseCgtsv2_nopivot(cusparseHandle_t handle,
                       int              m,
                       int              n,
                       const cuComplex* dl,
                       const cuComplex* d,
                       const cuComplex* du,
                       cuComplex*       B,
                       int              ldb,
                       void*            pBuffer)

cusparseStatus_t
cusparseZgtsv2_nopivot(cusparseHandle_t       handle,
                       int                    m,
                       int                    n,
                       const cuDoubleComplex* dl,
                       const cuDoubleComplex* d,
                       const cuDoubleComplex* du,
                       cuDoubleComplex*       B,
                       int                    ldb,
                       void*                  pBuffer)

This function computes the solution of a tridiagonal linear system with multiple right-hand sides:

\(A \ast X = B\)

The coefficient matrix A of each of these tri-diagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. Notice that solution X overwrites right-hand-side matrix B on exit.

The routine does not perform any pivoting and uses a combination of the Cyclic Reduction (CR) and the Parallel Cyclic Reduction (PCR) algorithms to find the solution. It achieves better performance when m is a power of 2.

This function requires a buffer size returned by gtsv2_nopivot_bufferSizeExt(). The address of pBuffer must be multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

m

the size of the linear system (must be ≥ 3).

n

number of right-hand sides, columns of matrix B.

dl

<type> dense array containing the lower diagonal of the tri-diagonal linear system. The first element of each lower diagonal must be zero.

d

<type> dense array containing the main diagonal of the tri-diagonal linear system.

du

<type> dense array containing the upper diagonal of the tri-diagonal linear system. The last element of each upper diagonal must be zero.

B

<type> dense right-hand-side array of dimensions (ldb, n).

ldb

leading dimension of B. (that is ≥ \(\max\text{(1,\ m))}\) .

pBuffer

buffer allocated by the user, the size is return by gtsv2_nopivot_bufferSizeExt.

Output

B

<type> dense solution array of dimensions (ldb, n).

See cusparseStatus_t for the description of the return status.

5.7.4. Batched Tridiagonal Solve

Different algorithms for batched tridiagonal solve are discussed in this section.

5.7.4.1. cusparse<t>gtsv2StridedBatch_bufferSizeExt()

cusparseStatus_t
cusparseSgtsv2StridedBatch_bufferSizeExt(cusparseHandle_t handle,
                                         int              m,
                                         const float*     dl,
                                         const float*     d,
                                         const float*     du,
                                         const float*     x,
                                         int              batchCount,
                                         int              batchStride,
                                         size_t*          bufferSizeInBytes)

cusparseStatus_t
cusparseDgtsv2StridedBatch_bufferSizeExt(cusparseHandle_t handle,
                                         int              m,
                                         const double*    dl,
                                         const double*    d,
                                         const double*    du,
                                         const double*    x,
                                         int              batchCount,
                                         int              batchStride,
                                         size_t*          bufferSizeInBytes)

cusparseStatus_t
cusparseCgtsv2StridedBatch_bufferSizeExt(cusparseHandle_t handle,
                                         int              m,
                                         const cuComplex* dl,
                                         const cuComplex* d,
                                         const cuComplex* du,
                                         const cuComplex* x,
                                         int              batchCount,
                                         int              batchStride,
                                         size_t*          bufferSizeInBytes)

cusparseStatus_t
cusparseZgtsv2StridedBatch_bufferSizeExt(cusparseHandle_t       handle,
                                         int                    m,
                                         const cuDoubleComplex* dl,
                                         const cuDoubleComplex* d,
                                         const cuDoubleComplex* du,
                                         const cuDoubleComplex* x,
                                         int                    batchCount,
                                         int                    batchStride,
                                         size_t*                bufferSizeInBytes)

This function returns the size of the buffer used in gtsv2StridedBatch which computes the solution of multiple tridiagonal linear systems for i=0,…,batchCount:

\(A^{(i)} \ast \text{y}^{(i)} = \text{x}^{(i)}\)

The coefficient matrix A of each of these tri-diagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix X. Notice that solution Y overwrites right-hand-side matrix X on exit. The different matrices are assumed to be of the same size and are stored with a fixed batchStride in memory.

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

n

the size of the linear system (must be ≥ 3).

dl

<type> dense array containing the lower diagonal of the tri-diagonal linear system. The lower diagonal \(dl^{(i)}\) that corresponds to the ith linear system starts at location dl+batchStride×i in memory. Also, the first element of each lower diagonal must be zero.

d

<type> dense array containing the main diagonal of the tri-diagonal linear system. The main diagonal \(d^{(i)}\) that corresponds to the ith linear system starts at location d+batchStride×i in memory.

du

<type> dense array containing the upper diagonal of the tri-diagonal linear system. The upper diagonal \(du^{(i)}\) that corresponds to the ith linear system starts at location du+batchStride×i in memory. Also, the last element of each upper diagonal must be zero.

x

<type> dense array that contains the right-hand-side of the tri-diagonal linear system. The right-hand-side \(x^{(i)}\) that corresponds to the ith linear system starts at location x+batchStride×iin memory.

batchCount

number of systems to solve.

batchStride

stride (number of elements) that separates the vectors of every system (must be at least m).

Output

pBufferSizeInBytes

number of bytes of the buffer used in the gtsv2StridedBatch.

See cusparseStatus_t for the description of the return status.

5.7.4.2. cusparse<t>gtsv2StridedBatch()

cusparseStatus_t
cusparseSgtsv2StridedBatch(cusparseHandle_t handle,
                           int              m,
                           const float*     dl,
                           const float*     d,
                           const float*     du,
                           float*           x,
                           int              batchCount,
                           int              batchStride,
                           void*            pBuffer)

cusparseStatus_t
cusparseDgtsv2StridedBatch(cusparseHandle_t handle,
                           int              m,
                           const double*    dl,
                           const double*    d,
                           const double*    du,
                           double*          x,
                           int              batchCount,
                           int              batchStride,
                           void*            pBuffer)

cusparseStatus_t
cusparseCgtsv2StridedBatch(cusparseHandle_t handle,
                           int              m,
                           const cuComplex* dl,
                           const cuComplex* d,
                           const cuComplex* du,
                           cuComplex*       x,
                           int              batchCount,
                           int              batchStride,
                           void*            pBuffer)

cusparseStatus_t
cusparseZgtsv2StridedBatch(cusparseHandle_t       handle,
                           int                    m,
                           const cuDoubleComplex* dl,
                           const cuDoubleComplex* d,
                           const cuDoubleComplex* du,
                           cuDoubleComplex*       x,
                           int                    batchCount,
                           int                    batchStride,
                           void*                  pBuffer)

This function computes the solution of multiple tridiagonal linear systems for i=0,…,batchCount:

\(A^{(i)} \ast \text{y}^{(i)} = \text{x}^{(i)}\)

The coefficient matrix A of each of these tri-diagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix X. Notice that solution Y overwrites right-hand-side matrix X on exit. The different matrices are assumed to be of the same size and are stored with a fixed batchStride in memory.

The routine does not perform any pivoting and uses a combination of the Cyclic Reduction (CR) and the Parallel Cyclic Reduction (PCR) algorithms to find the solution. It achieves better performance when m is a power of 2.

This function requires a buffer size returned by gtsv2StridedBatch_bufferSizeExt(). The address of pBuffer must be multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

n

the size of the linear system (must be ≥ 3).

dl

<type> dense array containing the lower diagonal of the tri-diagonal linear system. The lower diagonal \(dl^{(i)}\) that corresponds to the ith linear system starts at location dl+batchStride×i in memory. Also, the first element of each lower diagonal must be zero.

d

<type> dense array containing the main diagonal of the tri-diagonal linear system. The main diagonal \(d^{(i)}\) that corresponds to the ith linear system starts at location d+batchStride×i in memory.

du

<type> dense array containing the upper diagonal of the tri-diagonal linear system. The upper diagonal \(du^{(i)}\) that corresponds to the ith linear system starts at location du+batchStride×i in memory. Also, the last element of each upper diagonal must be zero.

x

<type> dense array that contains the right-hand-side of the tri-diagonal linear system. The right-hand-side \(x^{(i)}\) that corresponds to the ith linear system starts at location x+batchStride×iin memory.

batchCount

number of systems to solve.

batchStride

stride (number of elements) that separates the vectors of every system (must be at least n).

pBuffer

buffer allocated by the user, the size is return by gtsv2StridedBatch_bufferSizeExt.

Output

x

<type> dense array that contains the solution of the tri-diagonal linear system. The solution \(x^{(i)}\) that corresponds to the ith linear system starts at location x+batchStride×iin memory.

See cusparseStatus_t for the description of the return status.

5.7.4.3. cusparse<t>gtsvInterleavedBatch()

cusparseStatus_t
cusparseSgtsvInterleavedBatch_bufferSizeExt(cusparseHandle_t handle,
                                            int              algo,
                                            int              m,
                                            const float*     dl,
                                            const float*     d,
                                            const float*     du,
                                            const float*     x,
                                            int              batchCount,
                                            size_t*          pBufferSizeInBytes)

cusparseStatus_t
cusparseDgtsvInterleavedBatch_bufferSizeExt(cusparseHandle_t handle,
                                            int              algo,
                                            int              m,
                                            const double*    dl,
                                            const double*    d,
                                            const double*    du,
                                            const double*    x,
                                            int              batchCount,
                                            size_t*          pBufferSizeInBytes)

cusparseStatus_t
cusparseCgtsvInterleavedBatch_bufferSizeExt(cusparseHandle_t handle,
                                            int              algo,
                                            int              m,
                                            const cuComplex* dl,
                                            const cuComplex* d,
                                            const cuComplex* du,
                                            const cuComplex* x,
                                            int              batchCount,
                                            size_t*          pBufferSizeInBytes)

cusparseStatus_t
cusparseZgtsvInterleavedBatch_bufferSizeExt(cusparseHandle_t       handle,
                                            int                    algo,
                                            int                    m,
                                            const cuDoubleComplex* dl,
                                            const cuDoubleComplex* d,
                                            const cuDoubleComplex* du,
                                            const cuDoubleComplex* x,
                                            int                    batchCount,
                                            size_t*                pBufferSizeInBytes)
cusparseStatus_t
cusparseSgtsvInterleavedBatch(cusparseHandle_t handle,
                              int              algo,
                              int              m,
                              float*           dl,
                              float*           d,
                              float*           du,
                              float*           x,
                              int              batchCount,
                              void*            pBuffer)

cusparseStatus_t
cusparseDgtsvInterleavedBatch(cusparseHandle_t handle,
                              int              algo,
                              int              m,
                              double*          dl,
                              double*          d,
                              double*          du,
                              double*          x,
                              int              batchCount,
                              void*            pBuffer)

cusparseStatus_t
cusparseCgtsvInterleavedBatch(cusparseHandle_t handle,
                              int              algo,
                              int              m,
                              cuComplex*       dl,
                              cuComplex*       d,
                              cuComplex*       du,
                              cuComplex*       x,
                              int              batchCount,
                              void*            pBuffer)

cusparseStatus_t
cusparseZgtsvInterleavedBatch(cusparseHandle_t handle,
                              int              algo,
                              int              m,
                              cuDoubleComplex* dl,
                              cuDoubleComplex* d,
                              cuDoubleComplex* du,
                              cuDoubleComplex* x,
                              int              batchCount,
                              void*            pBuffer)

This function computes the solution of multiple tridiagonal linear systems for i=0,…,batchCount:

\(A^{(i)} \ast \text{x}^{(i)} = \text{b}^{(i)}\)

The coefficient matrix A of each of these tri-diagonal linear system is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. Notice that solution X overwrites right-hand-side matrix B on exit.

Assuming A is of size m and base-1, dl, d and du are defined by the following formula:

dl(i) := A(i, i-1) for i=1,2,...,m

The first element of dl is out-of-bound (dl(1) := A(1,0)), so dl(1) = 0.

d(i) = A(i,i) for i=1,2,...,m

du(i) = A(i,i+1) for i=1,2,...,m

The last element of du is out-of-bound (du(m) := A(m,m+1)), so du(m) = 0.

The data layout is different from gtsvStridedBatch which aggregates all matrices one after another. Instead, gtsvInterleavedBatch gathers different matrices of the same element in a continous manner. If dl is regarded as a 2-D array of size m-by-batchCount, dl(:,j) to store j-th matrix. gtsvStridedBatch uses column-major while gtsvInterleavedBatch uses row-major.

The routine provides three different algorithms, selected by parameter algo. The first algorithm is cuThomas provided by Barcelona Supercomputing Center. The second algorithm is LU with partial pivoting and last algorithm is QR. From stability perspective, cuThomas is not numerically stable because it does not have pivoting. LU with partial pivoting and QR are stable. From performance perspective, LU with partial pivoting and QR is about 10% to 20% slower than cuThomas.

This function requires a buffer size returned by gtsvInterleavedBatch_bufferSizeExt(). The address of pBuffer must be multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

If the user prepares aggregate format, one can use cublasXgeam to get interleaved format. However such transformation takes time comparable to solver itself. To reach best performance, the user must prepare interleaved format explicitly.

  • This function requires temporary extra storage that is allocated internally

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

algo

algo = 0: cuThomas (unstable algorithm); algo = 1: LU with pivoting (stable algorithm); algo = 2: QR (stable algorithm)

m

the size of the linear system.

dl

<type> dense array containing the lower diagonal of the tri-diagonal linear system. The first element of each lower diagonal must be zero.

d

<type> dense array containing the main diagonal of the tri-diagonal linear system.

du

<type> dense array containing the upper diagonal of the tri-diagonal linear system. The last element of each upper diagonal must be zero.

x

<type> dense right-hand-side array of dimensions (batchCount, n).

pBuffer

buffer allocated by the user, the size is return by gtsvInterleavedBatch_bufferSizeExt.

Output

x

<type> dense solution array of dimensions (batchCount, n).

See cusparseStatus_t for the description of the return status.

5.7.5. Batched Pentadiagonal Solve

Different algorithms for batched pentadiagonal solve are discussed in this section.

5.7.5.1. cusparse<t>gpsvInterleavedBatch()

cusparseStatus_t
cusparseSgpsvInterleavedBatch_bufferSizeExt(cusparseHandle_t handle,
                                            int              algo,
                                            int              m,
                                            const float*     ds,
                                            const float*     dl,
                                            const float*     d,
                                            const float*     du,
                                            const float*     dw,
                                            const float*     x,
                                            int              batchCount,
                                            size_t*          pBufferSizeInBytes)

cusparseStatus_t
cusparseDgpsvInterleavedBatch_bufferSizeExt(cusparseHandle_t handle,
                                            int              algo,
                                            int              m,
                                            const double*    ds,
                                            const double*    dl,
                                            const double*    d,
                                            const double*    du,
                                            const double*    dw,
<