Generic API Functions

cusparseAxpby()

cusparseStatus_t
cusparseAxpby(cusparseHandle_t          handle,
              const void*               alpha,
              cusparseConstSpVecDescr_t vecX,  // non-const descriptor supported
              const void*               beta,
              cusparseDnVecDescr_t      vecY)

The function computes the scaled sum of a sparse vector vecX and a dense vector vecY.

\(\mathbf{Y} = \alpha\mathbf{X} + \beta\mathbf{Y}\)

In other words,

for i=0 to n-1
    Y[i] = beta * Y[i]
for i=0 to nnz-1
    Y[X_indices[i]] += alpha * X_values[i]

| Param. | Memory | In/out | Meaning |
|--------|--------|--------|---------|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication, of compute type |
| vecX | HOST | IN | Sparse vector X |
| beta | HOST or DEVICE | IN | \(\beta\) scalar used for multiplication, of compute type |
| vecY | HOST | IN/OUT | Dense vector Y |

cusparseAxpby supports the following index type for representing the sparse vector vecX:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseAxpby supports the following data types:

Uniform-precision computation:

X/Y/compute: CUDA_R_32F, CUDA_R_64F, CUDA_C_32F, CUDA_C_64F

Mixed-precision computation:

| X/Y | compute |
|-----|---------|
| CUDA_R_16F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_32F |
| CUDA_C_16F [DEPRECATED] | CUDA_C_32F |
| CUDA_C_16BF [DEPRECATED] | CUDA_C_32F |

cusparseAxpby() has the following constraints:

  • The arrays representing the sparse vector vecX must be aligned to 16 bytes

cusparseAxpby() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • Provides deterministic (bit-wise) results for each run if the sparse vector vecX indices are distinct

  • The routine allows indices of vecX to be unsorted

cusparseAxpby() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseAxpby for a code example.
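A minimal sketch of the call sequence (hypothetical sizes; it assumes dX_indices, dX_values, and dY are already-populated device arrays and omits error checking):

#include <cusparse.h>            // cuSPARSE generic API
#include <cuda_runtime.h>

int64_t size = 1024, nnz = 128;  // placeholder dimensions
float   alpha = 3.0f, beta = 2.0f;

cusparseHandle_t     handle;
cusparseSpVecDescr_t vecX;
cusparseDnVecDescr_t vecY;
cusparseCreate(&handle);
// sparse vector X: nnz values with 32-bit indices; dense vector Y: size values
cusparseCreateSpVec(&vecX, size, nnz, dX_indices, dX_values,
                    CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
cusparseCreateDnVec(&vecY, size, dY, CUDA_R_32F);

cusparseAxpby(handle, &alpha, vecX, &beta, vecY);  // Y = alpha*X + beta*Y, no workspace

cusparseDestroySpVec(vecX);
cusparseDestroyDnVec(vecY);
cusparseDestroy(handle);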


cusparseGather()

cusparseStatus_t
cusparseGather(cusparseHandle_t          handle,
               cusparseConstDnVecDescr_t vecY,  // non-const descriptor supported
               cusparseSpVecDescr_t      vecX)

The function gathers the elements of the dense vector vecY into the sparse vector vecX

In other words,

for i=0 to nnz-1
    X_values[i] = Y[X_indices[i]]

| Param. | Memory | In/out | Meaning |
|--------|--------|--------|---------|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| vecX | HOST | OUT | Sparse vector X |
| vecY | HOST | IN | Dense vector Y |

cusparseGather supports the following index type for representing the sparse vector vecX:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseGather supports the following data types:

X/Y: CUDA_R_16F, CUDA_R_16BF, CUDA_R_32F, CUDA_R_64F, CUDA_C_16F [DEPRECATED], CUDA_C_16BF [DEPRECATED], CUDA_C_32F, CUDA_C_64F

cusparseGather() has the following constraints:

  • The arrays representing the sparse vector vecX must be aligned to 16 bytes

cusparseGather() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • Provides deterministic (bit-wise) results for each run if the sparse vector vecX indices are distinct

  • The routine allows indices of vecX to be unsorted

cusparseGather() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseGather for a code example.
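A minimal sketch, reusing the descriptor setup from the cusparseAxpby example above (handle, vecX, vecY assumed created):

// X_values[i] = Y[X_indices[i]] for i = 0..nnz-1; no extra workspace needed
cusparseGather(handle, vecY, vecX);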


cusparseScatter()

cusparseStatus_t
cusparseScatter(cusparseHandle_t          handle,
                cusparseConstSpVecDescr_t vecX,  // non-const descriptor supported
                cusparseDnVecDescr_t      vecY)

The function scatters the elements of the sparse vector vecX into the dense vector vecY

In other words,

for i=0 to nnz-1
    Y[X_indices[i]] = X_values[i]

| Param. | Memory | In/out | Meaning |
|--------|--------|--------|---------|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| vecX | HOST | IN | Sparse vector X |
| vecY | HOST | OUT | Dense vector Y |

cusparseScatter supports the following index type for representing the sparse vector vecX:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseScatter supports the following data types:

X/Y: CUDA_R_8I, CUDA_R_16F, CUDA_R_16BF, CUDA_R_32F, CUDA_R_64F, CUDA_C_16F [DEPRECATED], CUDA_C_16BF [DEPRECATED], CUDA_C_32F, CUDA_C_64F

cusparseScatter() has the following constraints:

  • The arrays representing the sparse vector vecX must be aligned to 16 bytes

cusparseScatter() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • Provides deterministic (bit-wise) results for each run if the sparse vector vecX indices are distinct

  • The routine allows indices of vecX to be unsorted

cusparseScatter() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseScatter for a code example.
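A minimal sketch with the same descriptor setup as the cusparseAxpby example above:

// Y[X_indices[i]] = X_values[i] for i = 0..nnz-1; no extra workspace needed
cusparseScatter(handle, vecX, vecY);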


cusparseRot() [DEPRECATED]

> The routine will be removed in the next major release

cusparseStatus_t
cusparseRot(cusparseHandle_t     handle,
            const void*          c_coeff,
            const void*          s_coeff,
            cusparseSpVecDescr_t vecX,
            cusparseDnVecDescr_t vecY)

The function applies the Givens rotation matrix

\(G = \begin{bmatrix} c & s \\ {- s} & c \\ \end{bmatrix}\)

to a sparse vector vecX and a dense vector vecY

In other words,

for i=0 to nnz-1
    tmp             = Y[X_indices[i]]
    Y[X_indices[i]] = c * tmp         - s * X_values[i]
    X_values[i]     = c * X_values[i] + s * tmp

| Param. | Memory | In/out | Meaning |
|--------|--------|--------|---------|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| c_coeff | HOST or DEVICE | IN | cosine element of the rotation matrix |
| vecX | HOST | IN/OUT | Sparse vector X |
| s_coeff | HOST or DEVICE | IN | sine element of the rotation matrix |
| vecY | HOST | IN/OUT | Dense vector Y |

cusparseRot supports the following index type for representing the sparse vector vecX:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseRot supports the following data types:

Uniform-precision computation:

X/Y/compute: CUDA_R_32F, CUDA_R_64F, CUDA_C_32F, CUDA_C_64F

Mixed-precision computation:

| X/Y | compute |
|-----|---------|
| CUDA_R_16F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_32F |
| CUDA_C_16F [DEPRECATED] | CUDA_C_32F |
| CUDA_C_16BF [DEPRECATED] | CUDA_C_32F |

cusparseRot() has the following constraints:

  • The arrays representing the sparse vector vecX must be aligned to 16 bytes

cusparseRot() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • Provides deterministic (bit-wise) results for each run if the sparse vector vecX indices are distinct

cusparseRot() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseRot for a code example.


cusparseSpVV()

cusparseStatus_t
cusparseSpVV_bufferSize(cusparseHandle_t          handle,
                        cusparseOperation_t       opX,
                        cusparseConstSpVecDescr_t vecX,  // non-const descriptor supported
                        cusparseConstDnVecDescr_t vecY,  // non-const descriptor supported
                        void*                     result,
                        cudaDataType              computeType,
                        size_t*                   bufferSize)
cusparseStatus_t
cusparseSpVV(cusparseHandle_t          handle,
             cusparseOperation_t       opX,
             cusparseConstSpVecDescr_t vecX,  // non-const descriptor supported
             cusparseConstDnVecDescr_t vecY,  // non-const descriptor supported
             void*                     result,
             cudaDataType              computeType,
             void*                     externalBuffer)

The function computes the inner product of a sparse vector vecX and a dense vector vecY

\(result = op\left(\mathbf{X}\right) \cdot \mathbf{Y}\)

In other words,

result = 0;
for i=0 to nnz-1
    result += op(X_values[i]) * Y[X_indices[i]]

\(\text{op}(X) = \begin{cases} X & \text{if op(X) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ \overline{X} & \text{if op(X) == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)

The function cusparseSpVV_bufferSize() returns the size of the workspace needed by cusparseSpVV()

| Param. | Memory | In/out | Meaning |
|--------|--------|--------|---------|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opX | HOST | IN | Operation op(X): non-transpose or conjugate transpose |
| vecX | HOST | IN | Sparse vector X |
| vecY | HOST | IN | Dense vector Y |
| result | HOST or DEVICE | OUT | The resulting dot product |
| computeType | HOST | IN | Datatype in which the computation is executed |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpVV |
| externalBuffer | DEVICE | IN | Pointer to a workspace buffer of at least bufferSize bytes |

cusparseSpVV supports the following index type for representing the sparse vector vecX:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

The data type combinations currently supported for cusparseSpVV are listed below:

Uniform-precision computation:

X/Y/computeType: CUDA_R_32F, CUDA_R_64F, CUDA_C_32F, CUDA_C_64F

Mixed-precision computation:

| X/Y | computeType/result | Notes |
|-----|--------------------|-------|
| CUDA_R_8I | CUDA_R_32I | |
| CUDA_R_8I | CUDA_R_32F | |
| CUDA_R_16F | CUDA_R_32F | |
| CUDA_R_16BF | CUDA_R_32F | |
| CUDA_C_16F | CUDA_C_32F | [DEPRECATED] |
| CUDA_C_16BF | CUDA_C_32F | [DEPRECATED] |

cusparseSpVV() has the following constraints:

  • The arrays representing the sparse vector vecX must be aligned to 16 bytes

cusparseSpVV() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • Provides deterministic (bit-wise) results for each run if the sparse vector vecX indices are distinct

  • The routine allows indices of vecX to be unsorted

cusparseSpVV() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSpVV for a code example.
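A minimal sketch of the two-phase pattern (descriptors as in the cusparseAxpby example above; with the default pointer mode, result is written to host memory):

float  result;
size_t bufferSize = 0;
void*  dBuffer    = NULL;

// query the workspace size, then run the actual computation
cusparseSpVV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                        vecX, vecY, &result, CUDA_R_32F, &bufferSize);
cudaMalloc(&dBuffer, bufferSize);
cusparseSpVV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
             vecX, vecY, &result, CUDA_R_32F, dBuffer);
cudaFree(dBuffer);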


cusparseSpMV()

cusparseStatus_t
cusparseSpMV_bufferSize(cusparseHandle_t          handle,
                        cusparseOperation_t       opA,
                        const void*               alpha,
                        cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
                        cusparseConstDnVecDescr_t vecX,  // non-const descriptor supported
                        const void*               beta,
                        cusparseDnVecDescr_t      vecY,
                        cudaDataType              computeType,
                        cusparseSpMVAlg_t         alg,
                        size_t*                   bufferSize)
cusparseStatus_t
cusparseSpMV_preprocess(cusparseHandle_t          handle,
                        cusparseOperation_t       opA,
                        const void*               alpha,
                        cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
                        cusparseConstDnVecDescr_t vecX,  // non-const descriptor supported
                        const void*               beta,
                        cusparseDnVecDescr_t      vecY,
                        cudaDataType              computeType,
                        cusparseSpMVAlg_t         alg,
                        void*                     externalBuffer)
cusparseStatus_t
cusparseSpMV(cusparseHandle_t          handle,
             cusparseOperation_t       opA,
             const void*               alpha,
             cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
             cusparseConstDnVecDescr_t vecX,  // non-const descriptor supported
             const void*               beta,
             cusparseDnVecDescr_t      vecY,
             cudaDataType              computeType,
             cusparseSpMVAlg_t         alg,
             void*                     externalBuffer)

This function performs the multiplication of a sparse matrix matA and a dense vector vecX

\(\mathbf{Y} = \alpha op\left( \mathbf{A} \right) \cdot \mathbf{X} + \beta\mathbf{Y}\)

where

  • op(A) is a sparse matrix of size \(m \times k\)

  • X is a dense vector of size \(k\)

  • Y is a dense vector of size \(m\)

  • \(\alpha\) and \(\beta\) are scalars

Also, for matrix A

\(\text{op}(A) = \begin{cases} A & \text{if op(A) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if op(A) == CUSPARSE_OPERATION_TRANSPOSE} \\ A^{H} & \text{if op(A) == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)

The function cusparseSpMV_bufferSize() returns the size of the workspace needed by cusparseSpMV_preprocess() and cusparseSpMV()

| Param. | Memory | In/out | Meaning |
|--------|--------|--------|---------|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication of type computeType |
| matA | HOST | IN | Sparse matrix A |
| vecX | HOST | IN | Dense vector X |
| beta | HOST or DEVICE | IN | \(\beta\) scalar used for multiplication of type computeType |
| vecY | HOST | IN/OUT | Dense vector Y |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpMV |
| externalBuffer | DEVICE | IN | Pointer to a workspace buffer of at least bufferSize bytes |

The sparse matrix formats currently supported are listed below:

  • CUSPARSE_FORMAT_COO

  • CUSPARSE_FORMAT_CSR

  • CUSPARSE_FORMAT_CSC

  • CUSPARSE_FORMAT_SLICED_ELL

cusparseSpMV supports the following index type for representing the sparse matrix matA:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseSpMV supports the following data types:

Uniform-precision computation:

A/X/Y/computeType: CUDA_R_32F, CUDA_R_64F, CUDA_C_32F, CUDA_C_64F

Mixed-precision computation:

| A/X | Y | computeType | Notes |
|-----|---|-------------|-------|
| CUDA_R_8I | CUDA_R_32I | CUDA_R_32I | |
| CUDA_R_8I | CUDA_R_32F | CUDA_R_32F | |
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F | |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F | |
| CUDA_C_16F | CUDA_C_16F | CUDA_C_32F | [DEPRECATED] |
| CUDA_C_16BF | CUDA_C_16BF | CUDA_C_32F | [DEPRECATED] |

| A | X/Y/computeType |
|---|-----------------|
| CUDA_R_32F | CUDA_R_64F |

Mixed Regular/Complex computation:

| A | X/Y/computeType |
|---|-----------------|
| CUDA_R_32F | CUDA_C_32F |
| CUDA_R_64F | CUDA_C_64F |

NOTE: CUDA_R_16F, CUDA_R_16BF, CUDA_C_16F, and CUDA_C_16BF data types always imply mixed-precision computation.

cusparseSpMV() supports the following algorithms:

| Algorithm | Notes |
|-----------|-------|
| CUSPARSE_SPMV_ALG_DEFAULT | Default algorithm for any sparse matrix format. |
| CUSPARSE_SPMV_COO_ALG1 | Default algorithm for the COO sparse matrix format. May produce slightly different results during different runs with the same input parameters. |
| CUSPARSE_SPMV_COO_ALG2 | Provides deterministic (bit-wise) results for each run. If opA != CUSPARSE_OPERATION_NON_TRANSPOSE, it is identical to CUSPARSE_SPMV_COO_ALG1. |
| CUSPARSE_SPMV_CSR_ALG1 | Default algorithm for the CSR/CSC sparse matrix formats. May produce slightly different results during different runs with the same input parameters. |
| CUSPARSE_SPMV_CSR_ALG2 | Provides deterministic (bit-wise) results for each run. If opA != CUSPARSE_OPERATION_NON_TRANSPOSE, it is identical to CUSPARSE_SPMV_CSR_ALG1. |
| CUSPARSE_SPMV_SELL_ALG1 | Default algorithm for the Sliced Ellpack sparse matrix format. Provides deterministic (bit-wise) results for each run. |

Performance notes:

  • CUSPARSE_SPMV_COO_ALG1 and CUSPARSE_SPMV_CSR_ALG1 provide higher performance than CUSPARSE_SPMV_COO_ALG2 and CUSPARSE_SPMV_CSR_ALG2.

  • In general, opA == CUSPARSE_OPERATION_NON_TRANSPOSE is about 3x faster than opA != CUSPARSE_OPERATION_NON_TRANSPOSE.

  • Using cusparseSpMV_preprocess() helps improve the performance of cusparseSpMV() for the CSR format. It is beneficial when cusparseSpMV() is run multiple times with the same matrix (cusparseSpMV_preprocess() is executed only once).

cusparseSpMV() has the following properties:

  • The routine requires extra storage for CSR/CSC format (all algorithms) and for COO format with CUSPARSE_SPMV_COO_ALG2 algorithm.

  • Provides deterministic (bit-wise) results for each run only for CUSPARSE_SPMV_COO_ALG2 and CUSPARSE_SPMV_CSR_ALG2 algorithms, and opA == CUSPARSE_OPERATION_NON_TRANSPOSE.

  • The routine supports asynchronous execution.

  • compute-sanitizer could report false race conditions for this routine when beta == 0. This is for optimization purposes and does not affect the correctness of the computation.

  • The routine allows the indices of matA to be unsorted.

cusparseSpMV() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSpMV CSR and cusparseSpMV COO for a code example.
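A minimal CSR sketch (it assumes dA_csrOffsets, dA_columns, dA_values, dX, dY are populated device arrays, num_rows/num_cols/nnz describe matA, handle is created, and error checking is omitted):

cusparseSpMatDescr_t matA;
cusparseDnVecDescr_t vecX, vecY;
float  alpha = 1.0f, beta = 0.0f;
size_t bufferSize = 0;
void*  dBuffer    = NULL;

cusparseCreateCsr(&matA, num_rows, num_cols, nnz,
                  dA_csrOffsets, dA_columns, dA_values,
                  CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                  CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
cusparseCreateDnVec(&vecX, num_cols, dX, CUDA_R_32F);
cusparseCreateDnVec(&vecY, num_rows, dY, CUDA_R_32F);

cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA,
                        vecX, &beta, vecY, CUDA_R_32F,
                        CUSPARSE_SPMV_ALG_DEFAULT, &bufferSize);
cudaMalloc(&dBuffer, bufferSize);
// optional: amortized over repeated cusparseSpMV() calls with the same matrix
cusparseSpMV_preprocess(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA,
                        vecX, &beta, vecY, CUDA_R_32F,
                        CUSPARSE_SPMV_ALG_DEFAULT, dBuffer);
cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA,
             vecX, &beta, vecY, CUDA_R_32F,
             CUSPARSE_SPMV_ALG_DEFAULT, dBuffer);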


cusparseSpSV()

cusparseStatus_t
cusparseSpSV_createDescr(cusparseSpSVDescr_t* spsvDescr);

cusparseStatus_t
cusparseSpSV_destroyDescr(cusparseSpSVDescr_t spsvDescr);
cusparseStatus_t
cusparseSpSV_bufferSize(cusparseHandle_t          handle,
                        cusparseOperation_t       opA,
                        const void*               alpha,
                        cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
                        cusparseConstDnVecDescr_t vecX,  // non-const descriptor supported
                        cusparseDnVecDescr_t      vecY,
                        cudaDataType              computeType,
                        cusparseSpSVAlg_t         alg,
                        cusparseSpSVDescr_t       spsvDescr,
                        size_t*                   bufferSize)
cusparseStatus_t
cusparseSpSV_analysis(cusparseHandle_t          handle,
                      cusparseOperation_t       opA,
                      const void*               alpha,
                      cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
                      cusparseConstDnVecDescr_t vecX,  // non-const descriptor supported
                      cusparseDnVecDescr_t      vecY,
                      cudaDataType              computeType,
                      cusparseSpSVAlg_t         alg,
                      cusparseSpSVDescr_t       spsvDescr,
                      void*                     externalBuffer)
cusparseStatus_t
cusparseSpSV_solve(cusparseHandle_t          handle,
                   cusparseOperation_t       opA,
                   const void*               alpha,
                   cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
                   cusparseConstDnVecDescr_t vecX,  // non-const descriptor supported
                   cusparseDnVecDescr_t      vecY,
                   cudaDataType              computeType,
                   cusparseSpSVAlg_t         alg,
                   cusparseSpSVDescr_t       spsvDescr)
cusparseStatus_t
cusparseSpSV_updateMatrix(cusparseHandle_t     handle,
                          cusparseSpSVDescr_t  spsvDescr,
                          void*                newValues,
                          cusparseSpSVUpdate_t updatePart)

The function solves a system of linear equations whose coefficients are represented in a sparse triangular matrix:

\(op\left( \mathbf{A} \right) \cdot \mathbf{Y} = \alpha\mathbf{X}\)

where

  • op(A) is a sparse square matrix of size \(m \times m\)

  • X is a dense vector of size \(m\)

  • Y is a dense vector of size \(m\)

  • \(\alpha\) is a scalar

Also, for matrix A

\(\text{op}(A) = \begin{cases} A & \text{if op(A) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if op(A) == CUSPARSE_OPERATION_TRANSPOSE} \\ A^{H} & \text{if op(A) == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)

The function cusparseSpSV_bufferSize() returns the size of the workspace needed by cusparseSpSV_analysis() and cusparseSpSV_solve(). The function cusparseSpSV_analysis() performs the analysis phase, while cusparseSpSV_solve() executes the solve phase for a sparse triangular linear system. The opaque data structure spsvDescr is used to share information among all functions. The function cusparseSpSV_updateMatrix() updates spsvDescr with new matrix values.

The routine supports arbitrary sparsity for the input matrix, but only the upper or lower triangular part is taken into account in the computation.

NOTE: all parameters must be consistent across cusparseSpSV API calls, and the matrix descriptors and externalBuffer must not be modified between cusparseSpSV_analysis() and cusparseSpSV_solve(). The function cusparseSpSV_updateMatrix() can be used to update the values of the sparse matrix stored inside the opaque data structure spsvDescr.

| Param. | Memory | In/out | Meaning |
|--------|--------|--------|---------|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication of type computeType |
| matA | HOST | IN | Sparse matrix A |
| vecX | HOST | IN | Dense vector X |
| vecY | HOST | IN/OUT | Dense vector Y |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpSV_analysis() and cusparseSpSV_solve() |
| externalBuffer | DEVICE | IN/OUT | Pointer to a workspace buffer of at least bufferSize bytes. It is used by cusparseSpSV_analysis() and cusparseSpSV_solve() |
| spsvDescr | HOST | IN/OUT | Opaque descriptor for storing internal data used across the three steps |

The sparse matrix formats currently supported are listed below:

  • CUSPARSE_FORMAT_CSR

  • CUSPARSE_FORMAT_COO

  • CUSPARSE_FORMAT_SLICED_ELL

cusparseSpSV() supports the following shapes and properties:

  • CUSPARSE_FILL_MODE_LOWER and CUSPARSE_FILL_MODE_UPPER fill modes

  • CUSPARSE_DIAG_TYPE_NON_UNIT and CUSPARSE_DIAG_TYPE_UNIT diagonal types

The fill mode and diagonal type can be set by cusparseSpMatSetAttribute()

cusparseSpSV() supports the following index type for representing the sparse matrix matA:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseSpSV() supports the following data types:

Uniform-precision computation:

A/X/Y/computeType: CUDA_R_32F, CUDA_R_64F, CUDA_C_32F, CUDA_C_64F

cusparseSpSV() supports the following algorithms:

| Algorithm | Notes |
|-----------|-------|
| CUSPARSE_SPSV_ALG_DEFAULT | Default algorithm |

cusparseSpSV() has the following properties:

  • The routine requires extra storage for the analysis phase, which is proportional to the number of non-zero entries of the sparse matrix

  • Provides deterministic (bit-wise) results for each run for the solving phase cusparseSpSV_solve()

  • The routine supports in-place operation

  • The cusparseSpSV_solve() routine supports asynchronous execution

  • cusparseSpSV_bufferSize() and cusparseSpSV_analysis() routines accept NULL for vecX and vecY

  • The routine allows the indices of matA to be unsorted

cusparseSpSV() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

cusparseSpSV_updateMatrix() updates the sparse matrix after the analysis phase has been called. This function supports the following update strategies (updatePart):

| Strategy | Notes |
|----------|-------|
| CUSPARSE_SPSV_UPDATE_GENERAL | Updates the sparse matrix values with the values of the newValues array |
| CUSPARSE_SPSV_UPDATE_DIAGONAL | Updates the diagonal part of the matrix with the diagonal values stored in the newValues array. That is, newValues holds the new diagonal values only |

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSpSV CSR and cuSPARSE Library Samples - cusparseSpSV COO for code examples.
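A minimal sketch for a lower-triangular CSR matrix (matA, vecX, vecY created as in the cusparseSpMV example above; error checking omitted):

cusparseSpSVDescr_t spsvDescr;
cusparseFillMode_t  fillmode = CUSPARSE_FILL_MODE_LOWER;
cusparseDiagType_t  diagtype = CUSPARSE_DIAG_TYPE_NON_UNIT;
float  alpha = 1.0f;
size_t bufferSize = 0;
void*  dBuffer    = NULL;

// declare which triangle of matA is used and its diagonal type
cusparseSpMatSetAttribute(matA, CUSPARSE_SPMAT_FILL_MODE, &fillmode, sizeof(fillmode));
cusparseSpMatSetAttribute(matA, CUSPARSE_SPMAT_DIAG_TYPE, &diagtype, sizeof(diagtype));

cusparseSpSV_createDescr(&spsvDescr);
cusparseSpSV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA,
                        vecX, vecY, CUDA_R_32F, CUSPARSE_SPSV_ALG_DEFAULT,
                        spsvDescr, &bufferSize);
cudaMalloc(&dBuffer, bufferSize);
cusparseSpSV_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA,
                      vecX, vecY, CUDA_R_32F, CUSPARSE_SPSV_ALG_DEFAULT,
                      spsvDescr, dBuffer);
cusparseSpSV_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA,
                   vecX, vecY, CUDA_R_32F, CUSPARSE_SPSV_ALG_DEFAULT, spsvDescr);
cusparseSpSV_destroyDescr(spsvDescr);
cudaFree(dBuffer);  // keep the buffer alive until the solve has completed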


cusparseSpMM()

cusparseStatus_t
cusparseSpMM_bufferSize(cusparseHandle_t          handle,
                        cusparseOperation_t       opA,
                        cusparseOperation_t       opB,
                        const void*               alpha,
                        cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
                        cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
                        const void*               beta,
                        cusparseDnMatDescr_t      matC,
                        cudaDataType              computeType,
                        cusparseSpMMAlg_t         alg,
                        size_t*                   bufferSize)
cusparseStatus_t
cusparseSpMM_preprocess(cusparseHandle_t          handle,
                        cusparseOperation_t       opA,
                        cusparseOperation_t       opB,
                        const void*               alpha,
                        cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
                        cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
                        const void*               beta,
                        cusparseDnMatDescr_t      matC,
                        cudaDataType              computeType,
                        cusparseSpMMAlg_t         alg,
                        void*                     externalBuffer)
cusparseStatus_t
cusparseSpMM(cusparseHandle_t          handle,
             cusparseOperation_t       opA,
             cusparseOperation_t       opB,
             const void*               alpha,
             cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
             cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
             const void*               beta,
             cusparseDnMatDescr_t      matC,
             cudaDataType              computeType,
             cusparseSpMMAlg_t         alg,
             void*                     externalBuffer)

The function performs the multiplication of a sparse matrix matA and a dense matrix matB.

\(\mathbf{C} = \alpha op\left( \mathbf{A} \right) \cdot op\left( \mathbf{B} \right) + \beta\mathbf{C}\)

where

  • op(A) is a sparse matrix of size \(m \times k\)

  • op(B) is a dense matrix of size \(k \times n\)

  • C is a dense matrix of size \(m \times n\)

  • \(\alpha\) and \(\beta\) are scalars

The routine can be also used to perform the multiplication of a dense matrix and a sparse matrix by switching the dense matrices layout:

\(\mathbf{C}_{C} = \mathbf{B}_{C} \cdot \mathbf{A} + \beta\mathbf{C}_{C}\;\rightarrow\;\mathbf{C}_{R} = \mathbf{A}^{T} \cdot \mathbf{B}_{R} + \beta\mathbf{C}_{R}\)

where \(\mathbf{B}_{C}\) , \(\mathbf{C}_{C}\) indicate column-major layout, while \(\mathbf{B}_{R}\) , \(\mathbf{C}_{R}\) refer to row-major layout

Also, for matrix A and B

\(\text{op}(A) = \begin{cases} A & \text{if op(A) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if op(A) == CUSPARSE_OPERATION_TRANSPOSE} \\ A^{H} & \text{if op(A) == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)

\(\text{op}(B) = \begin{cases} B & \text{if op(B) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ B^{T} & \text{if op(B) == CUSPARSE_OPERATION_TRANSPOSE} \\ B^{H} & \text{if op(B) == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)

When using the (conjugate) transpose of the sparse matrix A, this routine may produce slightly different results during different runs with the same input parameters.

The function cusparseSpMM_bufferSize() returns the size of the workspace needed by cusparseSpMM()

The function cusparseSpMM_preprocess() can be called before cusparseSpMM() to speed up the actual computation. It is useful when cusparseSpMM() is called multiple times with the same sparsity pattern (matA). The values of the matrices (matA, matB, matC) can change arbitrarily. It provides performance advantages when used with CUSPARSE_SPMM_CSR_ALG1 or CUSPARSE_SPMM_CSR_ALG3; for all other formats and algorithms it has no effect.

| Param. | Memory | In/out | Meaning |
|--------|--------|--------|---------|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| opB | HOST | IN | Operation op(B) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication of type computeType |
| matA | HOST | IN | Sparse matrix A |
| matB | HOST | IN | Dense matrix B |
| beta | HOST or DEVICE | IN | \(\beta\) scalar used for multiplication of type computeType |
| matC | HOST | IN/OUT | Dense matrix C |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpMM |
| externalBuffer | DEVICE | IN | Pointer to a workspace buffer of at least bufferSize bytes |

cusparseSpMM supports the following sparse matrix formats:

  • CUSPARSE_FORMAT_COO

  • CUSPARSE_FORMAT_CSR

  • CUSPARSE_FORMAT_CSC

  • CUSPARSE_FORMAT_BLOCKED_ELL

(1) COO/CSR/CSC FORMATS

cusparseSpMM supports the following index type for representing the sparse matrix matA:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseSpMM supports the following data types:

Uniform-precision computation:

A/B/C/computeType: CUDA_R_32F, CUDA_R_64F, CUDA_C_32F, CUDA_C_64F

Mixed-precision computation:

| A/B | C | computeType | Notes |
|-----|---|-------------|-------|
| CUDA_R_8I | CUDA_R_32I | CUDA_R_32I | |
| CUDA_R_8I | CUDA_R_32F | CUDA_R_32F | |
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F | |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F | |
| CUDA_C_16F | CUDA_C_16F | CUDA_C_32F | [DEPRECATED] |
| CUDA_C_16BF | CUDA_C_16BF | CUDA_C_32F | [DEPRECATED] |

NOTE: CUDA_R_16F, CUDA_R_16BF, CUDA_C_16F, and CUDA_C_16BF data types always imply mixed-precision computation.

cusparseSpMM supports the following algorithms:

Algorithm

Notes

CUSPARSE_SPMM_ALG_DEFAULT

Default algorithm for any sparse matrix format

CUSPARSE_SPMM_COO_ALG1

Algorithm 1 for COO sparse matrix format

  • May provide better performance for a small number of non-zero entries

  • Provides the best performance with column-major layout

  • It supports batched computation

  • May produce slightly different results during different runs with the same input parameters

CUSPARSE_SPMM_COO_ALG2

Algorithm 2 for COO sparse matrix format

  • It provides deterministic results

  • Provides the best performance with column-major layout

  • In general, slower than Algorithm 1

  • It supports batched computation

  • It requires additional memory

  • If opA != CUSPARSE_OPERATION_NON_TRANSPOSE, it is identical to CUSPARSE_SPMM_COO_ALG1

CUSPARSE_SPMM_COO_ALG3

Algorithm 3 for COO sparse matrix format

  • May provide better performance for a large number of non-zero entries

  • May produce slightly different results during different runs with the same input parameters

CUSPARSE_SPMM_COO_ALG4

Algorithm 4 for COO sparse matrix format

  • Provides better performance with row-major layout

  • It supports batched computation

  • May produce slightly different results during different runs with the same input parameters

CUSPARSE_SPMM_CSR_ALG1

Algorithm 1 for CSR/CSC sparse matrix format

  • Provides the best performance with column-major layout

  • It supports batched computation

  • It requires additional memory

  • May produce slightly different results during different runs with the same input parameters

CUSPARSE_SPMM_CSR_ALG2

Algorithm 2 for CSR/CSC sparse matrix format

  • Provides the best performance with row-major layout

  • It supports batched computation

  • It requires additional memory

  • May produce slightly different results during different runs with the same input parameters

CUSPARSE_SPMM_CSR_ALG3

Algorithm 3 for CSR/CSC sparse matrix format

  • It provides deterministic results

  • It requires additional memory

  • It supports only opA == CUSPARSE_OPERATION_NON_TRANSPOSE

  • It does not support opB == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE

  • It does not support CUDA_C_16F and CUDA_C_16BF data types

Performance notes:

  • Row-major layout provides higher performance than column-major

  • CUSPARSE_SPMM_COO_ALG4 and CUSPARSE_SPMM_CSR_ALG2 should be used with row-major layout, while CUSPARSE_SPMM_COO_ALG1, CUSPARSE_SPMM_COO_ALG2, CUSPARSE_SPMM_COO_ALG3, and CUSPARSE_SPMM_CSR_ALG1 with column-major layout

  • For beta != 1, the output matrix is scaled before the actual computation

  • For n == 1, the routine uses cusparseSpMV() as a fallback

All cusparseSpMM() algorithms except CUSPARSE_SPMM_CSR_ALG3 support the following batch modes:

  • \(C_{i} = A \cdot B_{i}\)

  • \(C_{i} = A_{i} \cdot B\)

  • \(C_{i} = A_{i} \cdot B_{i}\)

The number of batches and their strides can be set by using cusparseCooSetStridedBatch, cusparseCsrSetStridedBatch, and cusparseDnMatSetStridedBatch. The maximum number of batches for cusparseSpMM() is 65,535.

cusparseSpMM() has the following properties:

  • The routine requires no extra storage for CUSPARSE_SPMM_COO_ALG1, CUSPARSE_SPMM_COO_ALG3, CUSPARSE_SPMM_COO_ALG4

  • The routine supports asynchronous execution

  • Provides deterministic (bit-wise) results for each run only for CUSPARSE_SPMM_COO_ALG2 and CUSPARSE_SPMM_CSR_ALG3 algorithms

  • compute-sanitizer could report false race conditions for this routine. This is for optimization purposes and does not affect the correctness of the computation

  • The routine allows the indices of matA to be unsorted

cusparseSpMM() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

Please visit cuSPARSE Library Samples - cusparseSpMM CSR and cusparseSpMM COO for a code example. For batched computation please visit cusparseSpMM CSR Batched and cusparseSpMM COO Batched.
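A minimal CSR sketch with row-major dense operands (matA as in the cusparseSpMV example above; dB, dC device arrays and m, k, n, ldb, ldc placeholders; error checking omitted):

cusparseDnMatDescr_t matB, matC;
float  alpha = 1.0f, beta = 0.0f;
size_t bufferSize = 0;
void*  dBuffer    = NULL;

// B is k x n, C is m x n; row-major order pairs well with CUSPARSE_SPMM_CSR_ALG2
cusparseCreateDnMat(&matB, k, n, ldb, dB, CUDA_R_32F, CUSPARSE_ORDER_ROW);
cusparseCreateDnMat(&matC, m, n, ldc, dC, CUDA_R_32F, CUSPARSE_ORDER_ROW);

cusparseSpMM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                        CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB,
                        &beta, matC, CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT,
                        &bufferSize);
cudaMalloc(&dBuffer, bufferSize);
cusparseSpMM(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
             CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB,
             &beta, matC, CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, dBuffer);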

(2) BLOCKED-ELLPACK FORMAT

cusparseSpMM supports the following data types for CUSPARSE_FORMAT_BLOCKED_ELL format and the following GPU architectures for exploiting NVIDIA Tensor Cores:

| A/B | C | computeType | opB | Compute Capability |
|-----|---|-------------|-----|--------------------|
| CUDA_R_16F | CUDA_R_16F | CUDA_R_16F | N, T | 70 |
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F | N, T | 70 |
| CUDA_R_16F | CUDA_R_32F | CUDA_R_32F | N, T | 70 |
| CUDA_R_8I | CUDA_R_32I | CUDA_R_32I | N column-major, T row-major | 75 |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F | N, T | 80 |
| CUDA_R_16BF | CUDA_R_32F | CUDA_R_32F | N, T | 80 |
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | N, T | 80 |
| CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | N, T | 80 |

cusparseSpMM supports the following algorithms with CUSPARSE_FORMAT_BLOCKED_ELL format:

Algorithm

Notes

CUSPARSE_SPMM_ALG_DEFAULT

Default algorithm for any sparse matrix format

CUSPARSE_SPMM_BLOCKED_ELL_ALG1

Default algorithm for Blocked-ELL format

Performance notes:

  • Blocked-ELL SpMM provides the best performance with Power-of-2 Block-Sizes.

  • Large Block-Sizes (e.g. ≥ 64) provide the best performance.

The function has the following limitations:

  • The pointer mode must be equal to CUSPARSE_POINTER_MODE_HOST

  • Only opA == CUSPARSE_OPERATION_NON_TRANSPOSE is supported.

  • opB == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE is not supported.

  • Only CUSPARSE_INDEX_32I is supported.

Please visit cuSPARSE Library Samples - cusparseSpMM Blocked-ELL for a code example.

See cusparseStatus_t for the description of the return status.


cusparseSpMMOp()

cusparseStatus_t CUSPARSEAPI
cusparseSpMMOp_createPlan(cusparseHandle_t          handle,
                          cusparseSpMMOpPlan_t*     plan,
                          cusparseOperation_t       opA,
                          cusparseOperation_t       opB,
                          cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
                          cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
                          cusparseDnMatDescr_t      matC,
                          cudaDataType              computeType,
                          cusparseSpMMOpAlg_t       alg,
                          const void*               addOperationNvvmBuffer,
                          size_t                    addOperationBufferSize,
                          const void*               mulOperationNvvmBuffer,
                          size_t                    mulOperationBufferSize,
                          const void*               epilogueNvvmBuffer,
                          size_t                    epilogueBufferSize,
                          size_t*                   SpMMWorkspaceSize)

cusparseStatus_t
cusparseSpMMOp_destroyPlan(cusparseSpMMOpPlan_t plan)
cusparseStatus_t
cusparseSpMMOp(cusparseSpMMOpPlan_t plan,
               void*                externalBuffer)

NOTE 1: NVRTC and nvJitLink are not currently available on Arm64 Android platforms.

NOTE 2: The routine does not support Android and Tegra platforms except Judy (sm87).

Experimental: The function performs the multiplication of a sparse matrix matA and a dense matrix matB with custom operators.

\(C^{\prime}_{ij} = \text{epilogue}\left( \bigoplus_{k} \left( \text{op}(A)_{ik} \otimes \text{op}(B)_{kj} \right),\; C_{ij} \right)\)

where

  • op(A) is a sparse matrix of size \(m \times k\)

  • op(B) is a dense matrix of size \(k \times n\)

  • C is a dense matrix of size \(m \times n\)

  • \(\oplus\) , \(\otimes\) , and \(\text{epilogue}\) are custom add, mul, and epilogue operators respectively.

Also, for matrix A and B

\(\text{op}(A) = \begin{cases} A & \text{if op(A) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if op(A) == CUSPARSE_OPERATION_TRANSPOSE} \\ \end{cases}\)

\(\text{op}(B) = \begin{cases} B & \text{if op(B) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ B^{T} & \text{if op(B) == CUSPARSE_OPERATION_TRANSPOSE} \\ \end{cases}\)

Only opA == CUSPARSE_OPERATION_NON_TRANSPOSE is currently supported

The function cusparseSpMMOp_createPlan() returns the size of the workspace and the compiled kernel needed by cusparseSpMMOp()

| Param. | Memory | In/out | Meaning |
|--------|--------|--------|---------|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| opB | HOST | IN | Operation op(B) |
| matA | HOST | IN | Sparse matrix A |
| matB | HOST | IN | Dense matrix B |
| matC | HOST | IN/OUT | Dense matrix C |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| addOperationNvvmBuffer | HOST | IN | Pointer to the NVVM buffer containing the custom add operator |
| addOperationBufferSize | HOST | IN | Size in bytes of addOperationNvvmBuffer |
| mulOperationNvvmBuffer | HOST | IN | Pointer to the NVVM buffer containing the custom mul operator |
| mulOperationBufferSize | HOST | IN | Size in bytes of mulOperationNvvmBuffer |
| epilogueNvvmBuffer | HOST | IN | Pointer to the NVVM buffer containing the custom epilogue operator |
| epilogueBufferSize | HOST | IN | Size in bytes of epilogueNvvmBuffer |
| SpMMWorkspaceSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpMMOp |

The operators must have the following signature and return type

__device__ <computetype> add_op(<computetype> value1, <computetype> value2);

__device__ <computetype> mul_op(<computetype> value1, <computetype> value2);

__device__ <computetype> epilogue(<computetype> value1, <computetype> value2);

<computetype> is one of float, double, cuComplex, cuDoubleComplex, or int.
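For instance, with computeType CUDA_R_32F the three operators might look as follows (a sketch; these device functions are compiled separately, e.g. with NVRTC, into the NVVM buffers passed to cusparseSpMMOp_createPlan(), as shown in the linked sample):

__device__ float add_op(float value1, float value2) { return value1 + value2; }

__device__ float mul_op(float value1, float value2) { return value1 * value2; }

// value1 is the accumulated sum, value2 the existing C entry; this epilogue discards C
__device__ float epilogue(float value1, float value2) { return value1; }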

cusparseSpMMOp supports the following sparse matrix formats:

  • CUSPARSE_FORMAT_CSR

cusparseSpMMOp supports the following index type for representing the sparse matrix matA:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseSpMMOp supports the following data types:

Uniform-precision computation:

A/B/C/computeType: CUDA_R_32F, CUDA_R_64F, CUDA_C_32F, CUDA_C_64F

Mixed-precision computation:

| A/B | C | computeType |
|-----|---|-------------|
| CUDA_R_8I | CUDA_R_32I | CUDA_R_32I |
| CUDA_R_8I | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F |

cusparseSpMMOp supports the following algorithms:

| Algorithm | Notes |
|-----------|-------|
| CUSPARSE_SPMM_OP_ALG_DEFAULT | Default algorithm for any sparse matrix format |

Performance notes:

  • Row-major layout provides higher performance than column-major.

cusparseSpMMOp() has the following properties:

  • The routine requires extra storage

  • The routine supports asynchronous execution

  • Provides deterministic (bit-wise) results for each run

  • The routine allows the indices of matA to be unsorted

cusparseSpMMOp() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

Please visit cuSPARSE Library Samples - cusparseSpMMOp for a code example.

See cusparseStatus_t for the description of the return status.


cusparseSpSM()

cusparseStatus_t
cusparseSpSM_createDescr(cusparseSpSMDescr_t* spsmDescr);
cusparseStatus_t
cusparseSpSM_destroyDescr(cusparseSpSMDescr_t spsmDescr);
cusparseStatus_t
cusparseSpSM_bufferSize(cusparseHandle_t     handle,
                        cusparseOperation_t  opA,
                        cusparseOperation_t  opB,
                        const void*          alpha,
                        cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
                        cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
                        cusparseDnMatDescr_t matC,
                        cudaDataType         computeType,
                        cusparseSpSMAlg_t    alg,
                        cusparseSpSMDescr_t  spsmDescr,
                        size_t*              bufferSize)
cusparseStatus_t
cusparseSpSM_analysis(cusparseHandle_t          handle,
                      cusparseOperation_t       opA,
                      cusparseOperation_t       opB,
                      const void*               alpha,
                      cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
                      cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
                      cusparseDnMatDescr_t      matC,
                      cudaDataType              computeType,
                      cusparseSpSMAlg_t         alg,
                      cusparseSpSMDescr_t       spsmDescr,
                      void*                     externalBuffer)
cusparseStatus_t
cusparseSpSM_solve(cusparseHandle_t          handle,
                   cusparseOperation_t       opA,
                   cusparseOperation_t       opB,
                   const void*               alpha,
                   cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
                   cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
                   cusparseDnMatDescr_t      matC,
                   cudaDataType              computeType,
                   cusparseSpSMAlg_t         alg,
                   cusparseSpSMDescr_t       spsmDescr)
cusparseStatus_t
cusparseSpSM_updateMatrix(cusparseHandle_t     handle,
                          cusparseSpSMDescr_t  spsmDescr,
                          void*                newValues,
                          cusparseSpSMUpdate_t updatePart)

The function solves a system of linear equations whose coefficients are represented in a sparse triangular matrix:

\(op\left( \mathbf{A} \right) \cdot \mathbf{C} = \mathbf{\alpha}op\left( \mathbf{B} \right)\)

where

  • op(A) is a sparse square matrix of size \(m \times m\)

  • op(B) is a dense matrix of size \(m \times n\)

  • C is a dense matrix of size \(m \times n\)

  • \(\alpha\) is a scalar

Also, for matrix A

\(\text{op}(A) = \begin{cases} A & \text{if op(A) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if op(A) == CUSPARSE_OPERATION_TRANSPOSE} \\ A^{H} & \text{if op(A) == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)

\(\text{op}(B) = \begin{cases} B & \text{if op(B) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ B^{T} & \text{if op(B) == CUSPARSE_OPERATION_TRANSPOSE} \\ \end{cases}\)

The function cusparseSpSM_bufferSize() returns the size of the workspace needed by cusparseSpSM_analysis() and cusparseSpSM_solve(). The function cusparseSpSM_analysis() performs the analysis phase, while cusparseSpSM_solve() executes the solve phase for a sparse triangular linear system. The opaque data structure spsmDescr is used to share information among all functions. The function cusparseSpSM_updateMatrix() updates spsmDescr with new matrix values.

The routine supports arbitrary sparsity for the input matrix, but only the upper or lower triangular part is taken into account in the computation.

cusparseSpSM_bufferSize() requires a buffer size for the analysis phase that is proportional to the number of non-zero entries of the sparse matrix

The externalBuffer is stored into spsmDescr and used by cusparseSpSM_solve(). For this reason, the device memory buffer must be deallocated only after cusparseSpSM_solve()

NOTE: all parameters must be consistent across cusparseSpSM API calls and the matrix descriptions and externalBuffer must not be modified between cusparseSpSM_analysis() and cusparseSpSM_solve()

| Param. | Memory | In/out | Meaning |
|--------|--------|--------|---------|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| opB | HOST | IN | Operation op(B) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication of type computeType |
| matA | HOST | IN | Sparse matrix A |
| matB | HOST | IN | Dense matrix B |
| matC | HOST | IN/OUT | Dense matrix C |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpSM_analysis() and cusparseSpSM_solve() |
| externalBuffer | DEVICE | IN/OUT | Pointer to a workspace buffer of at least bufferSize bytes. It is used by cusparseSpSM_analysis() and cusparseSpSM_solve() |
| spsmDescr | HOST | IN/OUT | Opaque descriptor for storing internal data used across the three steps |

The sparse matrix formats currently supported are listed below:

  • CUSPARSE_FORMAT_CSR

  • CUSPARSE_FORMAT_COO

cusparseSpSM() supports the following shapes and properties:

  • CUSPARSE_FILL_MODE_LOWER and CUSPARSE_FILL_MODE_UPPER fill modes

  • CUSPARSE_DIAG_TYPE_NON_UNIT and CUSPARSE_DIAG_TYPE_UNIT diagonal types

The fill mode and diagonal type can be set by cusparseSpMatSetAttribute()

cusparseSpSM() supports the following index type for representing the sparse matrix matA:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseSpSM() supports the following data types:

Uniform-precision computation:

A/B/C/computeType: CUDA_R_32F, CUDA_R_64F, CUDA_C_32F, CUDA_C_64F

cusparseSpSM() supports the following algorithms:

| Algorithm | Notes |
|-----------|-------|
| CUSPARSE_SPSM_ALG_DEFAULT | Default algorithm |

cusparseSpSM() has the following properties:

  • The routine requires no extra storage

  • Provides deterministic (bit-wise) results for each run for the solving phase cusparseSpSM_solve()

  • The cusparseSpSM_solve() routine supports asynchronous execution

  • The routine supports in-place operation. The same device pointer must be provided to the values parameter of the dense matrices matB and matC. All other dense matrix descriptor parameters (e.g., order) can be set independently

  • cusparseSpSM_bufferSize() and cusparseSpSM_analysis() accept matB and matC descriptors whose values arrays are NULL; the descriptors themselves must not be NULL

  • The routine allows the indices of matA to be unsorted

cusparseSpSM() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

cusparseSpSM_updateMatrix() updates the sparse matrix after the analysis phase has been called. This function supports the following update strategies (updatePart):

| Strategy | Notes |
|----------|-------|
| CUSPARSE_SPSM_UPDATE_GENERAL | Updates the sparse matrix values with the values of the newValues array |
| CUSPARSE_SPSM_UPDATE_DIAGONAL | Updates the diagonal part of the matrix with the diagonal values stored in the newValues array. That is, newValues holds the new diagonal values only |

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSpSM CSR and cuSPARSE Library Samples - cusparseSpSM COO for code examples.
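A minimal sketch (matA with fill mode and diagonal type set via cusparseSpMatSetAttribute() as in the cusparseSpSV example above; matB, matC dense matrix descriptors already created; error checking omitted):

cusparseSpSMDescr_t spsmDescr;
float  alpha = 1.0f;
size_t bufferSize = 0;
void*  dBuffer    = NULL;

cusparseSpSM_createDescr(&spsmDescr);
cusparseSpSM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                        CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB,
                        matC, CUDA_R_32F, CUSPARSE_SPSM_ALG_DEFAULT,
                        spsmDescr, &bufferSize);
cudaMalloc(&dBuffer, bufferSize);
cusparseSpSM_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                      CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB,
                      matC, CUDA_R_32F, CUSPARSE_SPSM_ALG_DEFAULT,
                      spsmDescr, dBuffer);
cusparseSpSM_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB,
                   matC, CUDA_R_32F, CUSPARSE_SPSM_ALG_DEFAULT, spsmDescr);
cusparseSpSM_destroyDescr(spsmDescr);
cudaFree(dBuffer);  // the buffer is used by the solve phase, free it only afterwards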


cusparseSDDMM()

cusparseStatus_t
cusparseSDDMM_bufferSize(cusparseHandle_t          handle,
                         cusparseOperation_t       opA,
                         cusparseOperation_t       opB,
                         const void*               alpha,
                         cusparseConstDnMatDescr_t matA,  // non-const descriptor supported
                         cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
                         const void*               beta,
                         cusparseSpMatDescr_t      matC,
                         cudaDataType              computeType,
                         cusparseSDDMMAlg_t        alg,
                         size_t*                   bufferSize)
cusparseStatus_t
cusparseSDDMM_preprocess(cusparseHandle_t          handle,
                         cusparseOperation_t       opA,
                         cusparseOperation_t       opB,
                         const void*               alpha,
                         cusparseConstDnMatDescr_t matA,  // non-const descriptor supported
                         cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
                         const void*               beta,
                         cusparseSpMatDescr_t      matC,
                         cudaDataType              computeType,
                         cusparseSDDMMAlg_t        alg,
                         void*                     externalBuffer)
cusparseStatus_t
cusparseSDDMM(cusparseHandle_t          handle,
              cusparseOperation_t       opA,
              cusparseOperation_t       opB,
              const void*               alpha,
              cusparseConstDnMatDescr_t matA,  // non-const descriptor supported
              cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
              const void*               beta,
              cusparseSpMatDescr_t      matC,
              cudaDataType              computeType,
              cusparseSDDMMAlg_t        alg,
              void*                     externalBuffer)

This function performs the multiplication of matA and matB, followed by an element-wise multiplication with the sparsity pattern of matC. Formally, it performs the following operation:

\(\mathbf{C} = \alpha({op}(\mathbf{A}) \cdot {op}(\mathbf{B})) \circ {spy}(\mathbf{C}) + \beta\mathbf{C}\)

where

  • op(A) is a dense matrix of size \(m \times k\)

  • op(B) is a dense matrix of size \(k \times n\)

  • C is a sparse matrix of size \(m \times n\)

  • \(\alpha\) and \(\beta\) are scalars

  • \(\circ\) denotes the Hadamard (entry-wise) matrix product, and \({spy}\left( \mathbf{C} \right)\) is the structural sparsity pattern matrix of C defined as:

    \({spy}(\mathbf{C})_{ij} = \begin{cases} 1 & {\text{if}\,\mathbf{C}_{ij}\,\text{is an entry stored in}\,\texttt{matC}} \\ 0 & \text{otherwise} \\ \end{cases}.\)

Also, for matrix A and B

\(\text{op}(A) = \begin{cases} A & \text{if op(A) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if op(A) == CUSPARSE_OPERATION_TRANSPOSE} \\ \end{cases}\)

\(\text{op}(B) = \begin{cases} B & \text{if op(B) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ B^{T} & \text{if op(B) == CUSPARSE_OPERATION_TRANSPOSE} \\ \end{cases}\)

The function cusparseSDDMM_bufferSize() returns the size of the workspace needed by cusparseSDDMM or cusparseSDDMM_preprocess.

The function cusparseSDDMM_preprocess() can be called before cusparseSDDMM() to speed up the actual computation. It is useful when cusparseSDDMM() is called multiple times with the same sparsity pattern (matC). The values of the dense matrices (matA, matB) can change arbitrarily.

| Param. | Memory | In/out | Meaning |
|--------|--------|--------|---------|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| opB | HOST | IN | Operation op(B) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication of type computeType |
| matA | HOST | IN | Dense matrix matA |
| matB | HOST | IN | Dense matrix matB |
| beta | HOST or DEVICE | IN | \(\beta\) scalar used for multiplication of type computeType |
| matC | HOST | IN/OUT | Sparse matrix matC |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSDDMM |
| externalBuffer | DEVICE | IN | Pointer to a workspace buffer of at least bufferSize bytes |

Currently supported sparse matrix formats:

  • CUSPARSE_FORMAT_CSR

  • CUSPARSE_FORMAT_BSR

cusparseSDDMM() supports the following index types for representing the sparse matrix matC:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

The data type combinations currently supported for cusparseSDDMM are listed below:

Uniform-precision computation:

A/B/C/computeType: CUDA_R_32F, CUDA_R_64F, CUDA_C_32F, CUDA_C_64F

Mixed-precision computation:

| A/B | C | computeType |
|-----|---|-------------|
| CUDA_R_16F | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F |

cusparseSDDMM for CUSPARSE_FORMAT_BSR also supports the following mixed-precision computation:

| A/B | C | computeType |
|-----|---|-------------|
| CUDA_R_16BF | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F |

NOTE: CUDA_R_16F and CUDA_R_16BF data types always imply mixed-precision computation.

cusparseSDDMM() for CUSPARSE_FORMAT_BSR supports block sizes of 2, 4, 8, 16, 32, 64, and 128.

cusparseSDDMM() supports the following algorithms:

| Algorithm | Notes |
|-----------|-------|
| CUSPARSE_SDDMM_ALG_DEFAULT | Default algorithm. It supports batched computation. |

Performance notes: cusparseSDDMM() for CUSPARSE_FORMAT_CSR provides the best performance when matA and matB satisfy:

  • matA:

    • matA is in row-major order and opA is CUSPARSE_OPERATION_NON_TRANSPOSE, or

    • matA is in col-major order and opA is not CUSPARSE_OPERATION_NON_TRANSPOSE

  • matB:

    • matB is in col-major order and opB is CUSPARSE_OPERATION_NON_TRANSPOSE, or

    • matB is in row-major order and opB is not CUSPARSE_OPERATION_NON_TRANSPOSE

cusparseSDDMM() for CUSPARSE_FORMAT_BSR provides the best performance when matA and matB satisfy:

  • matA:

    • matA is in row-major order and opA is CUSPARSE_OPERATION_NON_TRANSPOSE, or

    • matA is in col-major order and opA is not CUSPARSE_OPERATION_NON_TRANSPOSE

  • matB:

    • matB is in row-major order and opB is CUSPARSE_OPERATION_NON_TRANSPOSE, or

    • matB is in col-major order and opB is not CUSPARSE_OPERATION_NON_TRANSPOSE

cusparseSDDMM() supports the following batch modes:

  • \(C_{i} = (A \cdot B) \circ C_{i}\)

  • \(C_{i} = \left( A_{i} \cdot B \right) \circ C_{i}\)

  • \(C_{i} = \left( A \cdot B_{i} \right) \circ C_{i}\)

  • \(C_{i} = \left( A_{i} \cdot B_{i} \right) \circ C_{i}\)

The number of batches and their strides can be set by using cusparseCsrSetStridedBatch and cusparseDnMatSetStridedBatch. The maximum number of batches for cusparseSDDMM() is 65,535.

cusparseSDDMM() has the following properties:

  • The routine requires no extra storage

  • Provides deterministic (bit-wise) results for each run

  • The routine supports asynchronous execution

  • The routine allows the indices of matC to be unsorted

cusparseSDDMM() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSDDMM for a code example. For batched computation please visit cusparseSDDMM CSR Batched.
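A minimal CSR sketch (dA, dB are dense device arrays; dC_offsets, dC_columns, dC_values hold the pattern and values of matC; m, k, n, lda, ldb, nnz are placeholders; error checking omitted). The dense orders below follow the performance notes above:

cusparseDnMatDescr_t matA, matB;  // dense inputs
cusparseSpMatDescr_t matC;        // sparse output with a fixed pattern
float  alpha = 1.0f, beta = 0.0f;
size_t bufferSize = 0;
void*  dBuffer    = NULL;

cusparseCreateDnMat(&matA, m, k, lda, dA, CUDA_R_32F, CUSPARSE_ORDER_ROW);
cusparseCreateDnMat(&matB, k, n, ldb, dB, CUDA_R_32F, CUSPARSE_ORDER_COL);
cusparseCreateCsr(&matC, m, n, nnz, dC_offsets, dC_columns, dC_values,
                  CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                  CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);

cusparseSDDMM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                         CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB,
                         &beta, matC, CUDA_R_32F, CUSPARSE_SDDMM_ALG_DEFAULT,
                         &bufferSize);
cudaMalloc(&dBuffer, bufferSize);
// optional: pays off when the sparsity pattern of matC is reused
cusparseSDDMM_preprocess(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                         CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB,
                         &beta, matC, CUDA_R_32F, CUSPARSE_SDDMM_ALG_DEFAULT,
                         dBuffer);
cusparseSDDMM(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
              CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB,
              &beta, matC, CUDA_R_32F, CUSPARSE_SDDMM_ALG_DEFAULT, dBuffer);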


cusparseSpGEMM()

cusparseStatus_t
cusparseSpGEMM_createDescr(cusparseSpGEMMDescr_t* descr)

cusparseStatus_t
cusparseSpGEMM_destroyDescr(cusparseSpGEMMDescr_t descr)
cusparseStatus_t
cusparseSpGEMM_workEstimation(cusparseHandle_t          handle,
                              cusparseOperation_t       opA,
                              cusparseOperation_t       opB,
                              const void*               alpha,
                              cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
                              cusparseConstSpMatDescr_t matB,  // non-const descriptor supported
                              const void*               beta,
                              cusparseSpMatDescr_t      matC,
                              cudaDataType              computeType,
                              cusparseSpGEMMAlg_t       alg,
                              cusparseSpGEMMDescr_t     spgemmDescr,
                              size_t*                   bufferSize1,
                              void*                     externalBuffer1)

cusparseStatus_t
cusparseSpGEMM_getNumProducts(cusparseSpGEMMDescr_t spgemmDescr,
                              int64_t*              num_prods)

cusparseStatus_t
cusparseSpGEMM_estimateMemory(cusparseHandle_t          handle,
                              cusparseOperation_t       opA,
                              cusparseOperation_t       opB,
                              const void*               alpha,
                              cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
                              cusparseConstSpMatDescr_t matB,  // non-const descriptor supported
                              const void*               beta,
                              cusparseSpMatDescr_t      matC,
                              cudaDataType              computeType,
                              cusparseSpGEMMAlg_t       alg,
                              cusparseSpGEMMDescr_t     spgemmDescr,
                              float                     chunk_fraction,
                              size_t*                   bufferSize3,
                              void*                     externalBuffer3,
                              size_t*                   bufferSize2)

cusparseStatus_t
cusparseSpGEMM_compute(cusparseHandle_t          handle,
                       cusparseOperation_t       opA,
                       cusparseOperation_t       opB,
                       const void*               alpha,
                       cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
                       cusparseConstSpMatDescr_t matB,  // non-const descriptor supported
                       const void*               beta,
                       cusparseSpMatDescr_t      matC,
                       cudaDataType              computeType,
                       cusparseSpGEMMAlg_t       alg,
                       cusparseSpGEMMDescr_t     spgemmDescr,
                       size_t*                   bufferSize2,
                       void*                     externalBuffer2)

cusparseStatus_t
cusparseSpGEMM_copy(cusparseHandle_t          handle,
                    cusparseOperation_t       opA,
                    cusparseOperation_t       opB,
                    const void*               alpha,
                    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
                    cusparseConstSpMatDescr_t matB,  // non-const descriptor supported
                    const void*               beta,
                    cusparseSpMatDescr_t      matC,
                    cudaDataType              computeType,
                    cusparseSpGEMMAlg_t       alg,
                    cusparseSpGEMMDescr_t     spgemmDescr)

This function performs the multiplication of two sparse matrices matA and matB.

\(\mathbf{C^{\prime}} = \alpha op\left( \mathbf{A} \right) \cdot op\left( \mathbf{B} \right) + \beta\mathbf{C}\)

where \(\alpha,\) \(\beta\) are scalars, and \(\mathbf{C},\) \(\mathbf{C^{\prime}}\) have the same sparsity pattern.

The functions cusparseSpGEMM_workEstimation(), cusparseSpGEMM_estimateMemory(), and cusparseSpGEMM_compute() are used for both determining the buffer size and performing the actual computation.
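A minimal sketch of the default call sequence is shown below. It assumes that handle, opA, opB, alpha, beta, computeType, and the CSR descriptors matA, matB, matC (the latter created with zero non-zeros and NULL array pointers) already exist, and that values use CUDA_R_32F; error checking is omitted.

// Minimal sketch of the default (ALG1) SpGEMM flow; error checks omitted
cusparseSpGEMMDescr_t spgemmDesc;
cusparseSpGEMM_createDescr(&spgemmDesc);

// Step 1: work estimation (query the workspace size, allocate, run)
size_t bufferSize1 = 0;
void*  dBuffer1    = NULL;
cusparseSpGEMM_workEstimation(handle, opA, opB, &alpha, matA, matB, &beta,
                              matC, computeType, CUSPARSE_SPGEMM_DEFAULT,
                              spgemmDesc, &bufferSize1, NULL);
cudaMalloc(&dBuffer1, bufferSize1);
cusparseSpGEMM_workEstimation(handle, opA, opB, &alpha, matA, matB, &beta,
                              matC, computeType, CUSPARSE_SPGEMM_DEFAULT,
                              spgemmDesc, &bufferSize1, dBuffer1);

// Step 2: compute (query the second workspace size, allocate, run)
size_t bufferSize2 = 0;
void*  dBuffer2    = NULL;
cusparseSpGEMM_compute(handle, opA, opB, &alpha, matA, matB, &beta, matC,
                       computeType, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
                       &bufferSize2, NULL);
cudaMalloc(&dBuffer2, bufferSize2);
cusparseSpGEMM_compute(handle, opA, opB, &alpha, matA, matB, &beta, matC,
                       computeType, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
                       &bufferSize2, dBuffer2);

// Step 3: allocate the output arrays from the computed size and copy out
int64_t rowsC, colsC, nnzC;
cusparseSpMatGetSize(matC, &rowsC, &colsC, &nnzC);
int   *dC_offsets, *dC_columns;
float *dC_values;                                  // assumes CUDA_R_32F data
cudaMalloc((void**)&dC_offsets, (rowsC + 1) * sizeof(int));
cudaMalloc((void**)&dC_columns, nnzC * sizeof(int));
cudaMalloc((void**)&dC_values,  nnzC * sizeof(float));
cusparseCsrSetPointers(matC, dC_offsets, dC_columns, dC_values);
cusparseSpGEMM_copy(handle, opA, opB, &alpha, matA, matB, &beta, matC,
                    computeType, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc);

cusparseSpGEMM_destroyDescr(spgemmDesc);
cudaFree(dBuffer1);
cudaFree(dBuffer2);

The remaining parameters (bufferSize3, externalBuffer3, chunk_fraction) are only exercised by CUSPARSE_SPGEMM_ALG2 and CUSPARSE_SPGEMM_ALG3, described in the algorithm table below.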

Param.

Memory

In/out

Meaning

handle

HOST

IN

Handle to the cuSPARSE library context

opA

HOST

IN

Operation op(A)

opB

HOST

IN

Operation op(B)

alpha

HOST or DEVICE

IN

\(\alpha\) scalar used for multiplication

matA

HOST

IN

Sparse matrix A

matB

HOST

IN

Sparse matrix B

beta

HOST or DEVICE

IN

\(\beta\) scalar used for multiplication

matC

HOST

IN/OUT

Sparse matrix C

computeType

HOST

IN

Enumerator specifying the datatype in which the computation is executed

alg

HOST

IN

Enumerator specifying the algorithm for the computation

spgemmDescr

HOST

IN/OUT

Opaque descriptor for storing internal data used across the three steps

num_prods

HOST

OUT

Pointer to a 64-bit integer that stores the number of intermediate products calculated by cusparseSpGEMM_workEstimation

chunk_fraction

HOST

IN

The fraction of total intermediate products being computed in a chunk. Used by CUSPARSE_SPGEMM_ALG3 only. Value is in range (0,1].

bufferSize1

HOST

IN/OUT

Number of bytes of workspace requested by cusparseSpGEMM_workEstimation

bufferSize2

HOST

IN/OUT

Number of bytes of workspace requested by cusparseSpGEMM_compute

bufferSize3

HOST

IN/OUT

Number of bytes of workspace requested by cusparseSpGEMM_estimateMemory

externalBuffer1

DEVICE

IN

Pointer to workspace buffer needed by cusparseSpGEMM_workEstimation and cusparseSpGEMM_compute

externalBuffer2

DEVICE

IN

Pointer to workspace buffer needed by cusparseSpGEMM_compute and cusparseSpGEMM_copy

externalBuffer3

DEVICE

IN

Pointer to workspace buffer needed by cusparseSpGEMM_estimateMemory

Currently, the function has the following limitations:

  • Only 32-bit indices (CUSPARSE_INDEX_32I) are supported

  • Only the CSR format (CUSPARSE_FORMAT_CSR) is supported

  • Only opA, opB equal to CUSPARSE_OPERATION_NON_TRANSPOSE are supported

The data type combinations currently supported for cusparseSpGEMM are listed below:

Uniform-precision computation:

A/B/C/computeType

CUDA_R_16F [DEPRECATED]

CUDA_R_16BF [DEPRECATED]

CUDA_R_32F

CUDA_R_64F

CUDA_C_16F [DEPRECATED]

CUDA_C_16BF [DEPRECATED]

CUDA_C_32F

CUDA_C_64F

The cusparseSpGEMM() routine supports the following algorithms:

Algorithm

Notes

CUSPARSE_SPGEMM_DEFAULT

Default algorithm. Currently, it is CUSPARSE_SPGEMM_ALG1.

CUSPARSE_SPGEMM_ALG1

Algorithm 1

  • Invokes cusparseSpGEMM_compute twice. The first invocation provides an upper bound on the memory required for the computation.

  • The required memory is generally several times larger than the memory actually used.

  • The user can provide an arbitrary buffer size bufferSize2 in the second invocation. If it is not sufficient, the routine returns the CUSPARSE_STATUS_INSUFFICIENT_RESOURCES status.

  • Provides better performance than other algorithms.

  • Provides deterministic (bit-wise) results for each run.

CUSPARSE_SPGEMM_ALG2

Algorithm 2

  • Invokes cusparseSpGEMM_estimateMemory to get the amount of memory required for the computation.

  • Requires less memory for the computation than Algorithm 1.

  • Performance is lower than Algorithm 1 and higher than Algorithm 3.

  • Provides deterministic (bit-wise) results for each run.

CUSPARSE_SPGEMM_ALG3

Algorithm 3

  • Computes the intermediate products in chunks, one chunk at a time.

  • Invokes cusparseSpGEMM_estimateMemory to get the amount of memory required for the computation.

  • The user can control the amount of required memory by changing the chunk size via chunk_fraction (see the sketch after this table).

  • The chunk size is a fraction of total intermediate products: chunk_fraction * (*num_prods).

  • Provides deterministic (bit-wise) results for each run.
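A minimal sketch of the extra steps that CUSPARSE_SPGEMM_ALG3 inserts between cusparseSpGEMM_workEstimation() and cusparseSpGEMM_compute() follows. It reuses the assumptions of the previous sketch; the NULL-query convention shown for cusparseSpGEMM_estimateMemory() is an assumption modeled on the two-call pattern of the other steps.

// Sketch of the ALG3-specific steps; assumes cusparseSpGEMM_workEstimation()
// has already completed with a valid externalBuffer1. Error checks omitted.
int64_t numProds = 0;
cusparseSpGEMM_getNumProducts(spgemmDesc, &numProds);

float  chunk_fraction = 0.2f;   // each chunk covers 20% of the numProds products
size_t bufferSize3 = 0, bufferSize2 = 0;
void*  dBuffer3 = NULL;

// First call: query the size of the temporary estimation buffer
cusparseSpGEMM_estimateMemory(handle, opA, opB, &alpha, matA, matB, &beta,
                              matC, computeType, CUSPARSE_SPGEMM_ALG3,
                              spgemmDesc, chunk_fraction, &bufferSize3,
                              NULL, NULL);
cudaMalloc(&dBuffer3, bufferSize3);
// Second call: fills bufferSize2, the workspace for cusparseSpGEMM_compute
cusparseSpGEMM_estimateMemory(handle, opA, opB, &alpha, matA, matB, &beta,
                              matC, computeType, CUSPARSE_SPGEMM_ALG3,
                              spgemmDesc, chunk_fraction, &bufferSize3,
                              dBuffer3, &bufferSize2);
cudaFree(dBuffer3);             // no longer needed after the estimation
// ... continue with cusparseSpGEMM_compute(), cusparseSpMatGetSize(), and
//     cusparseSpGEMM_copy() as in the default flow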

cusparseSpGEMM() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine allows the indices of matA and matB to be unsorted

  • The routine guarantees the indices of matC to be sorted

cusparseSpGEMM() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSpGEMM for a code example for CUSPARSE_SPGEMM_DEFAULT and CUSPARSE_SPGEMM_ALG1, and cuSPARSE Library Samples - memory-optimized cusparseSpGEMM for a code example for CUSPARSE_SPGEMM_ALG2 and CUSPARSE_SPGEMM_ALG3.


cusparseSpGEMMreuse()

cusparseStatus_t
cusparseSpGEMM_createDescr(cusparseSpGEMMDescr_t* descr)

cusparseStatus_t
cusparseSpGEMM_destroyDescr(cusparseSpGEMMDescr_t descr)

cusparseStatus_t
cusparseSpGEMMreuse_workEstimation(cusparseHandle_t      handle,
                                   cusparseOperation_t   opA,
                                   cusparseOperation_t   opB,
                                   cusparseSpMatDescr_t  matA,  // non-const descriptor supported
                                   cusparseSpMatDescr_t  matB,  // non-const descriptor supported
                                   cusparseSpMatDescr_t  matC,
                                   cusparseSpGEMMAlg_t   alg,
                                   cusparseSpGEMMDescr_t spgemmDescr,
                                   size_t*               bufferSize1,
                                   void*                 externalBuffer1)

cusparseStatus_t
cusparseSpGEMMreuse_nnz(cusparseHandle_t      handle,
                        cusparseOperation_t   opA,
                        cusparseOperation_t   opB,
                        cusparseSpMatDescr_t  matA,  // non-const descriptor supported
                        cusparseSpMatDescr_t  matB,  // non-const descriptor supported
                        cusparseSpMatDescr_t  matC,
                        cusparseSpGEMMAlg_t   alg,
                        cusparseSpGEMMDescr_t spgemmDescr,
                        size_t*               bufferSize2,
                        void*                 externalBuffer2,
                        size_t*               bufferSize3,
                        void*                 externalBuffer3,
                        size_t*               bufferSize4,
                        void*                 externalBuffer4)

cusparseStatus_t
cusparseSpGEMMreuse_copy(cusparseHandle_t      handle,
                         cusparseOperation_t   opA,
                         cusparseOperation_t   opB,
                         cusparseSpMatDescr_t  matA,  // non-const descriptor supported
                         cusparseSpMatDescr_t  matB,  // non-const descriptor supported
                         cusparseSpMatDescr_t  matC,
                         cusparseSpGEMMAlg_t   alg,
                         cusparseSpGEMMDescr_t spgemmDescr,
                         size_t*               bufferSize5,
                         void*                 externalBuffer5)

cusparseStatus_t
cusparseSpGEMMreuse_compute(cusparseHandle_t      handle,
                            cusparseOperation_t   opA,
                            cusparseOperation_t   opB,
                            const void*           alpha,
                            cusparseSpMatDescr_t  matA,  // non-const descriptor supported
                            cusparseSpMatDescr_t  matB,  // non-const descriptor supported
                            const void*           beta,
                            cusparseSpMatDescr_t  matC,
                            cudaDataType          computeType,
                            cusparseSpGEMMAlg_t   alg,
                            cusparseSpGEMMDescr_t spgemmDescr)

This function performs the multiplication of two sparse matrices matA and matB where the structure of the output matrix matC can be reused for multiple computations with different values.

\(\mathbf{C^{\prime}} = \alpha op\left( \mathbf{A} \right) \cdot op\left( \mathbf{B} \right) + \beta\mathbf{C}\)

where \(\alpha\) and \(\beta\) are scalars.

The functions cusparseSpGEMMreuse_workEstimation(), cusparseSpGEMMreuse_nnz(), and cusparseSpGEMMreuse_copy() are used for determining the buffer size and performing the actual computation.

Note: the output CSR matrix matC of cusparseSpGEMMreuse() is sorted by column indices.

MEMORY REQUIREMENT: cusparseSpGEMMreuse requires keeping all intermediate products in memory in order to reuse the structure of the output matrix. In general, however, the number of intermediate products is orders of magnitude higher than the number of non-zero entries. To minimize the memory requirement, the routine uses multiple buffers that can be deallocated once they are no longer needed. If the number of intermediate products exceeds 2^31 - 1, the routine returns the CUSPARSE_STATUS_INSUFFICIENT_RESOURCES status.
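A condensed sketch of the whole sequence is shown below, under the same assumptions as the cusparseSpGEMM() sketch above (existing handle, operations, scalars, computeType, and CSR descriptors with CUDA_R_32F values). The points at which buffers are released are indicative only; see the linked sample for the exact buffer lifetimes.

// Sketch of the SpGEMMreuse flow; error checks omitted. Each step is issued
// twice: first with a NULL buffer to query its size, then to execute.
cusparseSpGEMMDescr_t spgemmDesc;
cusparseSpGEMM_createDescr(&spgemmDesc);

size_t bs1 = 0, bs2 = 0, bs3 = 0, bs4 = 0, bs5 = 0;
void *dB1 = NULL, *dB2 = NULL, *dB3 = NULL, *dB4 = NULL, *dB5 = NULL;

// Structure phase (run once)
cusparseSpGEMMreuse_workEstimation(handle, opA, opB, matA, matB, matC,
                                   CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
                                   &bs1, NULL);
cudaMalloc(&dB1, bs1);
cusparseSpGEMMreuse_workEstimation(handle, opA, opB, matA, matB, matC,
                                   CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
                                   &bs1, dB1);

cusparseSpGEMMreuse_nnz(handle, opA, opB, matA, matB, matC,
                        CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
                        &bs2, NULL, &bs3, NULL, &bs4, NULL);
cudaMalloc(&dB2, bs2); cudaMalloc(&dB3, bs3); cudaMalloc(&dB4, bs4);
cusparseSpGEMMreuse_nnz(handle, opA, opB, matA, matB, matC,
                        CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
                        &bs2, dB2, &bs3, dB3, &bs4, dB4);
cudaFree(dB1); cudaFree(dB2);   // indicative release point

// Allocate the output arrays once the structure of matC is known
int64_t rowsC, colsC, nnzC;
cusparseSpMatGetSize(matC, &rowsC, &colsC, &nnzC);
int *dC_offsets, *dC_columns;
float *dC_values;
cudaMalloc((void**)&dC_offsets, (rowsC + 1) * sizeof(int));
cudaMalloc((void**)&dC_columns, nnzC * sizeof(int));
cudaMalloc((void**)&dC_values,  nnzC * sizeof(float));
cusparseCsrSetPointers(matC, dC_offsets, dC_columns, dC_values);

cusparseSpGEMMreuse_copy(handle, opA, opB, matA, matB, matC,
                         CUSPARSE_SPGEMM_DEFAULT, spgemmDesc, &bs5, NULL);
cudaMalloc(&dB5, bs5);
cusparseSpGEMMreuse_copy(handle, opA, opB, matA, matB, matC,
                         CUSPARSE_SPGEMM_DEFAULT, spgemmDesc, &bs5, dB5);
cudaFree(dB3);                  // indicative release point

// Numeric phase: repeat for each new set of A/B values that keeps the
// same sparsity pattern
cusparseSpGEMMreuse_compute(handle, opA, opB, &alpha, matA, matB, &beta,
                            matC, computeType, CUSPARSE_SPGEMM_DEFAULT,
                            spgemmDesc);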

Currently, the function has the following limitations:

  • Only 32-bit indices (CUSPARSE_INDEX_32I) are supported

  • Only the CSR format (CUSPARSE_FORMAT_CSR) is supported

  • Only opA, opB equal to CUSPARSE_OPERATION_NON_TRANSPOSE are supported

The data type combinations currently supported for cusparseSpGEMMreuse are listed below.

Uniform-precision computation:

A/B/C/computeType

CUDA_R_32F

CUDA_R_64F

CUDA_C_16F [DEPRECATED]

CUDA_C_16BF [DEPRECATED]

CUDA_C_32F

CUDA_C_64F

Mixed-precision computation: [DEPRECATED]

A/B

C

computeType

CUDA_R_16F

CUDA_R_16F

CUDA_R_32F

CUDA_R_16BF

CUDA_R_16BF

CUDA_R_32F

The cusparseSpGEMMreuse() routine supports the following algorithms:

Algorithm

Notes

CUSPARSE_SPGEMM_DEFAULT / CUSPARSE_SPGEMM_CSR_ALG_NONDETERMINITIC

Default algorithm. Provides deterministic (bit-wise) structure for the output matrix for each run, while value computation is not deterministic.

CUSPARSE_SPGEMM_CSR_ALG_DETERMINITIC

Provides deterministic (bit-wise) structure for the output matrix and value computation for each run.

cusparseSpGEMMreuse() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine allows the indices of matA and matB to be unsorted

  • The routine guarantees the indices of matC to be sorted

cusparseSpGEMMreuse() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

Refer to cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSpGEMMreuse for a code example.


cusparseSparseToDense()

cusparseStatus_t
cusparseSparseToDense_bufferSize(cusparseHandle_t           handle,
                                 cusparseConstSpMatDescr_t  matA,  // non-const descriptor supported
                                 cusparseDnMatDescr_t       matB,
                                 cusparseSparseToDenseAlg_t alg,
                                 size_t*                    bufferSize)

cusparseStatus_t
cusparseSparseToDense(cusparseHandle_t           handle,
                      cusparseConstSpMatDescr_t  matA,  // non-const descriptor supported
                      cusparseDnMatDescr_t       matB,
                      cusparseSparseToDenseAlg_t alg,
                      void*                      buffer)

The function converts the sparse matrix matA in CSR, CSC, or COO format into its dense representation matB. Blocked-ELL is not currently supported.

The function cusparseSparseToDense_bufferSize() returns the size of the workspace needed by cusparseSparseToDense().
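A minimal sketch of the two calls is shown below, assuming handle, matA (sparse), and matB (dense, with its values buffer already allocated) exist; error checking is omitted.

// Sketch: convert an existing sparse matrix to dense; error checks omitted
size_t bufferSize = 0;
void*  dBuffer    = NULL;
cusparseSparseToDense_bufferSize(handle, matA, matB,
                                 CUSPARSE_SPARSETODENSE_ALG_DEFAULT,
                                 &bufferSize);
cudaMalloc(&dBuffer, bufferSize);
cusparseSparseToDense(handle, matA, matB,
                      CUSPARSE_SPARSETODENSE_ALG_DEFAULT, dBuffer);
cudaFree(dBuffer);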

Param.

Memory

In/out

Meaning

handle

HOST

IN

Handle to the cuSPARSE library context

matA

HOST

IN

Sparse matrix A

matB

HOST

OUT

Dense matrix B

alg

HOST

IN

Algorithm for the computation

bufferSize

HOST

OUT

Number of bytes of workspace needed by cusparseSparseToDense()

buffer

DEVICE

IN

Pointer to workspace buffer

cusparseSparseToDense() supports the following index type for representing the sparse matrix matA:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseSparseToDense() supports the following data types:

A/B

CUDA_R_8I

CUDA_R_16F

CUDA_R_16BF

CUDA_R_32F

CUDA_R_64F

CUDA_C_16F [DEPRECATED]

CUDA_C_16BF [DEPRECATED]

CUDA_C_32F

CUDA_C_64F

cusparseSparseToDense() supports the following algorithm:

Algorithm

Notes

CUSPARSE_SPARSETODENSE_ALG_DEFAULT

Default algorithm

cusparseSparseToDense() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • Provides deterministic (bit-wise) results for each run

  • The routine allows the indices of matA to be unsorted

cusparseSparseToDense() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSparseToDense for a code example.


cusparseDenseToSparse()

cusparseStatus_t
cusparseDenseToSparse_bufferSize(cusparseHandle_t           handle,
                                 cusparseConstDnMatDescr_t  matA,  // non-const descriptor supported
                                 cusparseSpMatDescr_t       matB,
                                 cusparseDenseToSparseAlg_t alg,
                                 size_t*                    bufferSize)

cusparseStatus_t
cusparseDenseToSparse_analysis(cusparseHandle_t           handle,
                               cusparseConstDnMatDescr_t  matA,  // non-const descriptor supported
                               cusparseSpMatDescr_t       matB,
                               cusparseDenseToSparseAlg_t alg,
                               void*                      buffer)

cusparseStatus_t
cusparseDenseToSparse_convert(cusparseHandle_t           handle,
                              cusparseConstDnMatDescr_t  matA,  // non-const descriptor supported
                              cusparseSpMatDescr_t       matB,
                              cusparseDenseToSparseAlg_t alg,
                              void*                      buffer)

The function converts the dense matrix matA into a sparse matrix matB in CSR, CSC, COO, or Blocked-ELL format.

The function cusparseDenseToSparse_bufferSize() returns the size of the workspace needed by cusparseDenseToSparse_analysis().

The function cusparseDenseToSparse_analysis() updates the number of non-zero elements in the sparse matrix descriptor matB. The user is responsible for allocating the memory required by the sparse matrix:

  • Row indices and values arrays for CSC; column indices and values arrays for CSR

  • Row, column, value arrays for COO

  • Column (ellColInd), value (ellValue) arrays for Blocked-ELL

Finally, cusparseDenseToSparse_convert() is called to fill the arrays allocated in the previous step, as sketched below.
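A minimal sketch of the three steps for a CSR output, assuming handle, matA (dense), and matB (a CSR descriptor created with zero non-zeros) already exist and that values use CUDA_R_32F; error checking is omitted.

// Sketch of the three-step dense-to-CSR conversion; error checks omitted
size_t bufferSize = 0;
void*  dBuffer    = NULL;
cusparseDenseToSparse_bufferSize(handle, matA, matB,
                                 CUSPARSE_DENSETOSPARSE_ALG_DEFAULT,
                                 &bufferSize);
cudaMalloc(&dBuffer, bufferSize);

// Step 1: compute the number of non-zero elements of matB
cusparseDenseToSparse_analysis(handle, matA, matB,
                               CUSPARSE_DENSETOSPARSE_ALG_DEFAULT, dBuffer);
int64_t rowsB, colsB, nnzB;
cusparseSpMatGetSize(matB, &rowsB, &colsB, &nnzB);

// Step 2: allocate the CSR arrays and attach them to the descriptor
int *dB_offsets, *dB_columns;
float *dB_values;
cudaMalloc((void**)&dB_offsets, (rowsB + 1) * sizeof(int));
cudaMalloc((void**)&dB_columns, nnzB * sizeof(int));
cudaMalloc((void**)&dB_values,  nnzB * sizeof(float));
cusparseCsrSetPointers(matB, dB_offsets, dB_columns, dB_values);

// Step 3: fill the arrays
cusparseDenseToSparse_convert(handle, matA, matB,
                              CUSPARSE_DENSETOSPARSE_ALG_DEFAULT, dBuffer);
cudaFree(dBuffer);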

Param.

Memory

In/out

Meaning

handle

HOST

IN

Handle to the cuSPARSE library context

matA

HOST

IN

Dense matrix A

matB

HOST

OUT

Sparse matrix B

alg

HOST

IN

Algorithm for the computation

bufferSize

HOST

OUT

Number of bytes of workspace needed by cusparseDenseToSparse_analysis()

buffer

DEVICE

IN

Pointer to workspace buffer

cusparseDenseToSparse() supports the following index type for representing the sparse matrix matB:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseDenseToSparse() supports the following data types:

A/B

CUDA_R_8I

CUDA_R_16F

CUDA_R_16BF

CUDA_R_32F

CUDA_R_64F

CUDA_C_16F [DEPRECATED]

CUDA_C_16BF [DEPRECATED]

CUDA_C_32F

CUDA_C_64F

cusparseDenseToSparse() supports the following algorithm:

Algorithm

Notes

CUSPARSE_DENSETOSPARSE_ALG_DEFAULT

Default algorithm

cusparseDenseToSparse() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • Provides deterministic (bit-wise) results for each run

  • The routine does not guarantee the indices of matB to be sorted

cusparseDenseToSparse() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseDenseToSparse (CSR) and cuSPARSE Library Samples - cusparseDenseToSparse (Blocked-ELL) for code examples.