Generic API Functions
cusparseAxpby()
cusparseStatus_t
cusparseAxpby(cusparseHandle_t handle,
const void* alpha,
cusparseConstSpVecDescr_t vecX, // non-const descriptor supported
const void* beta,
cusparseDnVecDescr_t vecY)
The function computes the sum of a sparse vector vecX and a dense vector vecY:

\(\mathbf{Y} = \alpha\mathbf{X} + \beta\mathbf{Y}\)
In other words,
for i=0 to n-1
Y[i] = beta * Y[i]
for i=0 to nnz-1
Y[X_indices[i]] += alpha * X_values[i]
| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication, of the compute type |
| vecX | HOST | IN | Sparse vector X |
| beta | HOST or DEVICE | IN | \(\beta\) scalar used for multiplication, of the compute type |
| vecY | HOST | IN/OUT | Dense vector Y |
cusparseAxpby supports the following index types for representing the sparse vector vecX:

32-bit indices (CUSPARSE_INDEX_32I)
64-bit indices (CUSPARSE_INDEX_64I)
cusparseAxpby
supports the following data types:
Uniform-precision computation:
| vecX / vecY / computeType |
|---|
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_32F |
| CUDA_C_64F |
Mixed-precision computation:

| vecX | vecY | computeType |
|---|---|---|
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F |
| CUDA_C_16F | CUDA_C_16F | CUDA_C_32F [DEPRECATED] |
| CUDA_C_16BF | CUDA_C_16BF | CUDA_C_32F [DEPRECATED] |
cusparseAxpby()
has the following constraints:
The arrays representing the sparse vector
vecX
must be aligned to 16 bytes
cusparseAxpby()
has the following properties:
The routine requires no extra storage
The routine supports asynchronous execution
Provides deterministic (bit-wise) results for each run if the sparse vector vecX indices are distinct
The routine allows the indices of vecX to be unsorted
cusparseAxpby()
supports the following optimizations:
CUDA graph capture
Hardware Memory Compression
See cusparseStatus_t for the description of the return status.
Please visit cuSPARSE Library Samples - cusparseAxpby for a code example.
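Beyond the official sample, the following minimal sketch illustrates the call sequence. It assumes CUDA_R_32F values, 32-bit zero-based indices, device arrays dX_indices, dX_values, and dY that are already allocated and initialized, and omits error checking; the function name axpby_sketch is illustrative only.

#include <cusparse.h>

// Compute Y = alpha * X + beta * Y for a sparse X and a dense Y.
// dX_indices, dX_values, dY are device pointers (assumed initialized).
void axpby_sketch(int64_t size, int64_t nnz,
                  int* dX_indices, float* dX_values, float* dY)
{
    float alpha = 1.0f, beta = 1.0f;
    cusparseHandle_t     handle;
    cusparseSpVecDescr_t vecX;
    cusparseDnVecDescr_t vecY;
    cusparseCreate(&handle);
    cusparseCreateSpVec(&vecX, size, nnz, dX_indices, dX_values,
                        CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, size, dY, CUDA_R_32F);
    cusparseAxpby(handle, &alpha, vecX, &beta, vecY);
    cusparseDestroySpVec(vecX);
    cusparseDestroyDnVec(vecY);
    cusparseDestroy(handle);
}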
cusparseGather()
cusparseStatus_t
cusparseGather(cusparseHandle_t handle,
cusparseConstDnVecDescr_t vecY, // non-const descriptor supported
cusparseSpVecDescr_t vecX)
The function gathers the elements of the dense vector vecY into the sparse vector vecX.
In other words,
for i=0 to nnz-1
X_values[i] = Y[X_indices[i]]
| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| vecX | HOST | OUT | Sparse vector X |
| vecY | HOST | IN | Dense vector Y |
cusparseGather supports the following index types for representing the sparse vector vecX:

32-bit indices (CUSPARSE_INDEX_32I)
64-bit indices (CUSPARSE_INDEX_64I)
cusparseGather
supports the following data types:
| vecX / vecY |
|---|
| CUDA_R_8I |
| CUDA_R_16F |
| CUDA_R_16BF |
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_16F |
| CUDA_C_16BF |
| CUDA_C_32F |
| CUDA_C_64F |
cusparseGather()
has the following constraints:
The arrays representing the sparse vector
vecX
must be aligned to 16 bytes
cusparseGather()
has the following properties:
The routine requires no extra storage
The routine supports asynchronous execution
Provides deterministic (bit-wise) results for each run if the sparse vector vecX indices are distinct
The routine allows the indices of vecX to be unsorted
cusparseGather()
supports the following optimizations:
CUDA graph capture
Hardware Memory Compression
See cusparseStatus_t for the description of the return status.
Please visit cuSPARSE Library Samples - cusparseGather for a code example.
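Beyond the official sample, a minimal illustrative sketch of the gather call sequence follows (CUDA_R_32F values, 32-bit indices, pre-populated device arrays, no error checking; gather_sketch is a hypothetical name):

#include <cusparse.h>

// X_values[i] = Y[X_indices[i]] for i in [0, nnz).
void gather_sketch(int64_t size, int64_t nnz,
                   int* dX_indices, float* dX_values, float* dY)
{
    cusparseHandle_t     handle;
    cusparseSpVecDescr_t vecX;
    cusparseDnVecDescr_t vecY;
    cusparseCreate(&handle);
    cusparseCreateSpVec(&vecX, size, nnz, dX_indices, dX_values,
                        CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, size, dY, CUDA_R_32F);
    cusparseGather(handle, vecY, vecX);   // note the order: vecY first, vecX second
    cusparseDestroySpVec(vecX);
    cusparseDestroyDnVec(vecY);
    cusparseDestroy(handle);
}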
cusparseScatter()
cusparseStatus_t
cusparseScatter(cusparseHandle_t handle,
cusparseConstSpVecDescr_t vecX, // non-const descriptor supported
cusparseDnVecDescr_t vecY)
The function scatters the elements of the sparse vector vecX into the dense vector vecY.
In other words,
for i=0 to nnz-1
Y[X_indices[i]] = X_values[i]
| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| vecX | HOST | IN | Sparse vector X |
| vecY | HOST | OUT | Dense vector Y |
cusparseScatter supports the following index types for representing the sparse vector vecX:

32-bit indices (CUSPARSE_INDEX_32I)
64-bit indices (CUSPARSE_INDEX_64I)
cusparseScatter
supports the following data types:
| vecX / vecY |
|---|
| CUDA_R_8I |
| CUDA_R_16F |
| CUDA_R_16BF |
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_16F |
| CUDA_C_16BF |
| CUDA_C_32F |
| CUDA_C_64F |
cusparseScatter()
has the following constraints:
The arrays representing the sparse vector
vecX
must be aligned to 16 bytes
cusparseScatter()
has the following properties:
The routine requires no extra storage
The routine supports asynchronous execution
Provides deterministic (bit-wise) results for each run if the sparse vector vecX indices are distinct
The routine allows the indices of vecX to be unsorted
cusparseScatter()
supports the following optimizations:
CUDA graph capture
Hardware Memory Compression
See cusparseStatus_t for the description of the return status.
Please visit cuSPARSE Library Samples - cusparseScatter for a code example.
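Beyond the official sample, a minimal illustrative sketch of the scatter call sequence follows (CUDA_R_32F values, 32-bit indices, pre-populated device arrays, no error checking; scatter_sketch is a hypothetical name):

#include <cusparse.h>

// Y[X_indices[i]] = X_values[i] for i in [0, nnz).
void scatter_sketch(int64_t size, int64_t nnz,
                    int* dX_indices, float* dX_values, float* dY)
{
    cusparseHandle_t     handle;
    cusparseSpVecDescr_t vecX;
    cusparseDnVecDescr_t vecY;
    cusparseCreate(&handle);
    cusparseCreateSpVec(&vecX, size, nnz, dX_indices, dX_values,
                        CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, size, dY, CUDA_R_32F);
    cusparseScatter(handle, vecX, vecY);
    cusparseDestroySpVec(vecX);
    cusparseDestroyDnVec(vecY);
    cusparseDestroy(handle);
}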
cusparseRot() [DEPRECATED]
> The routine will be removed in the next major release
cusparseStatus_t
cusparseRot(cusparseHandle_t handle,
const void* c_coeff,
const void* s_coeff,
cusparseSpVecDescr_t vecX,
cusparseDnVecDescr_t vecY)
The function applies the Givens rotation matrix

\(G = \begin{bmatrix} c & s \\ {- s} & c \\ \end{bmatrix}\)

to a sparse vector vecX and a dense vector vecY.
In other words,
for i=0 to nnz-1
Y[X_indices[i]] = c * Y[X_indices[i]] - s * X_values[i]
X_values[i] = c * X_values[i] + s * Y[X_indices[i]]
| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| c_coeff | HOST or DEVICE | IN | Cosine element of the rotation matrix |
| vecX | HOST | IN/OUT | Sparse vector X |
| s_coeff | HOST or DEVICE | IN | Sine element of the rotation matrix |
| vecY | HOST | IN/OUT | Dense vector Y |
cusparseRot supports the following index types for representing the sparse vector vecX:

32-bit indices (CUSPARSE_INDEX_32I)
64-bit indices (CUSPARSE_INDEX_64I)
cusparseRot
supports the following data types:
Uniform-precision computation:
| vecX / vecY / computeType |
|---|
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_32F |
| CUDA_C_64F |
Mixed-precision computation:

| vecX | vecY | computeType |
|---|---|---|
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F |
| CUDA_C_16F | CUDA_C_16F | CUDA_C_32F [DEPRECATED] |
| CUDA_C_16BF | CUDA_C_16BF | CUDA_C_32F [DEPRECATED] |
cusparseRot()
has the following constraints:
The arrays representing the sparse vector
vecX
must be aligned to 16 bytes
cusparseRot()
has the following properties:
The routine requires no extra storage
The routine supports asynchronous execution
Provides deterministic (bit-wise) results for each run if the sparse vector vecX indices are distinct
cusparseRot()
supports the following optimizations:
CUDA graph capture
Hardware Memory Compression
See cusparseStatus_t for the description of the return status.
Please visit cuSPARSE Library Samples - cusparseRot for a code example.
cusparseSpVV()
cusparseStatus_t
cusparseSpVV_bufferSize(cusparseHandle_t handle,
cusparseOperation_t opX,
cusparseConstSpVecDescr_t vecX, // non-const descriptor supported
cusparseConstDnVecDescr_t vecY, // non-const descriptor supported
void* result,
cudaDataType computeType,
size_t* bufferSize)
cusparseStatus_t
cusparseSpVV(cusparseHandle_t handle,
cusparseOperation_t opX,
cusparseConstSpVecDescr_t vecX, // non-const descriptor supported
cusparseConstDnVecDescr_t vecY, // non-const descriptor supported
void* result,
cudaDataType computeType,
void* externalBuffer)
The function computes the inner dot product of a sparse vector vecX and a dense vector vecY:

\(result = op\left(\mathbf{X}\right) \cdot \mathbf{Y}\)
In other words,
result = 0;
for i=0 to nnz-1
result += op(X_values[i]) * Y[X_indices[i]]
\(\text{op}(X) = \begin{cases} X & \text{if op(X) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ \overline{X} & \text{if op(X) == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)
The function cusparseSpVV_bufferSize() returns the size of the workspace needed by cusparseSpVV().
| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opX | HOST | IN | Operation op(X) |
| vecX | HOST | IN | Sparse vector X |
| vecY | HOST | IN | Dense vector Y |
| result | HOST or DEVICE | OUT | The resulting dot product |
| computeType | HOST | IN | Datatype in which the computation is executed |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpVV() |
| externalBuffer | DEVICE | IN | Pointer to a workspace buffer of at least bufferSize bytes |
cusparseSpVV supports the following index types for representing the sparse vector vecX:

32-bit indices (CUSPARSE_INDEX_32I)
64-bit indices (CUSPARSE_INDEX_64I)
The data type combinations currently supported for cusparseSpVV are listed below:
Uniform-precision computation:
| vecX / vecY / result / computeType |
|---|
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_32F |
| CUDA_C_64F |
Mixed-precision computation:

| vecX | vecY / result / computeType | Notes |
|---|---|---|
| CUDA_R_8I | CUDA_R_32I | |
| CUDA_R_8I | CUDA_R_32F | |
| CUDA_R_16F | CUDA_R_32F | |
| CUDA_R_16BF | CUDA_R_32F | |
| CUDA_C_16F | CUDA_C_32F | [DEPRECATED] |
| CUDA_C_16BF | CUDA_C_32F | [DEPRECATED] |
cusparseSpVV()
has the following constraints:
The arrays representing the sparse vector
vecX
must be aligned to 16 bytes
cusparseSpVV()
has the following properties:
The routine requires no extra storage
The routine supports asynchronous execution
Provides deterministic (bit-wise) results for each run if the sparse vector vecX indices are distinct
The routine allows the indices of vecX to be unsorted
cusparseSpVV()
supports the following optimizations:
CUDA graph capture
Hardware Memory Compression
See cusparseStatus_t for the description of the return status.
Please visit cuSPARSE Library Samples - cusparseSpVV for a code example.
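Beyond the official sample, a minimal illustrative sketch of the call sequence follows (CUDA_R_32F values, 32-bit indices, pre-populated device arrays, host result pointer, no error checking; spvv_sketch is a hypothetical name):

#include <cusparse.h>
#include <cuda_runtime.h>

// result = X . Y (dot product of sparse X with dense Y).
void spvv_sketch(int64_t size, int64_t nnz,
                 int* dX_indices, float* dX_values, float* dY)
{
    float result;
    cusparseHandle_t     handle;
    cusparseSpVecDescr_t vecX;
    cusparseDnVecDescr_t vecY;
    void*  dBuffer    = NULL;
    size_t bufferSize = 0;
    cusparseCreate(&handle);
    cusparseCreateSpVec(&vecX, size, nnz, dX_indices, dX_values,
                        CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, size, dY, CUDA_R_32F);
    cusparseSpVV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            vecX, vecY, &result, CUDA_R_32F, &bufferSize);
    cudaMalloc(&dBuffer, bufferSize);
    cusparseSpVV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 vecX, vecY, &result, CUDA_R_32F, dBuffer);
    cudaDeviceSynchronize();   // make sure 'result' is ready on the host
    cudaFree(dBuffer);
    cusparseDestroySpVec(vecX);
    cusparseDestroyDnVec(vecY);
    cusparseDestroy(handle);
}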
cusparseSpMV()
cusparseStatus_t
cusparseSpMV_bufferSize(cusparseHandle_t handle,
cusparseOperation_t opA,
const void* alpha,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstDnVecDescr_t vecX, // non-const descriptor supported
const void* beta,
cusparseDnVecDescr_t vecY,
cudaDataType computeType,
cusparseSpMVAlg_t alg,
size_t* bufferSize)
cusparseStatus_t
cusparseSpMV_preprocess(cusparseHandle_t handle,
cusparseOperation_t opA,
const void* alpha,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstDnVecDescr_t vecX, // non-const descriptor supported
const void* beta,
cusparseDnVecDescr_t vecY,
cudaDataType computeType,
cusparseSpMVAlg_t alg,
void* externalBuffer)
cusparseStatus_t
cusparseSpMV(cusparseHandle_t handle,
cusparseOperation_t opA,
const void* alpha,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstDnVecDescr_t vecX, // non-const descriptor supported
const void* beta,
cusparseDnVecDescr_t vecY,
cudaDataType computeType,
cusparseSpMVAlg_t alg,
void* externalBuffer)
This function performs the multiplication of a sparse matrix matA and a dense vector vecX:

\(\mathbf{Y} = \alpha op\left( \mathbf{A} \right) \cdot \mathbf{X} + \beta\mathbf{Y}\)
where

op(A) is a sparse matrix of size \(m \times k\)
X is a dense vector of size \(k\)
Y is a dense vector of size \(m\)
\(\alpha\) and \(\beta\) are scalars
Also, for matrix A
\(\text{op}(A) = \begin{cases} A & \text{if op(A) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if op(A) == CUSPARSE_OPERATION_TRANSPOSE} \\ A^{H} & \text{if op(A) == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)
The function cusparseSpMV_bufferSize() returns the size of the workspace needed by cusparseSpMV_preprocess() and cusparseSpMV().
| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication, of type computeType |
| matA | HOST | IN | Sparse matrix A |
| vecX | HOST | IN | Dense vector X |
| beta | HOST or DEVICE | IN | \(\beta\) scalar used for multiplication, of type computeType |
| vecY | HOST | IN/OUT | Dense vector Y |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpMV() |
| externalBuffer | DEVICE | IN | Pointer to a workspace buffer of at least bufferSize bytes |
The sparse matrix formats currently supported are listed below:
CUSPARSE_FORMAT_COO
CUSPARSE_FORMAT_CSR
CUSPARSE_FORMAT_CSC
CUSPARSE_FORMAT_SLICED_ELL
cusparseSpMV supports the following index types for representing the sparse matrix matA:

32-bit indices (CUSPARSE_INDEX_32I)
64-bit indices (CUSPARSE_INDEX_64I)
cusparseSpMV
supports the following data types:
Uniform-precision computation:
| matA / vecX / vecY / computeType |
|---|
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_32F |
| CUDA_C_64F |
Mixed-precision computation:

| matA / vecX | vecY | computeType | Notes |
|---|---|---|---|
| CUDA_R_8I | CUDA_R_32I | CUDA_R_32I | |
| CUDA_R_8I | CUDA_R_32F | CUDA_R_32F | |
| CUDA_R_16F | CUDA_R_32F | CUDA_R_32F | |
| CUDA_R_16BF | CUDA_R_32F | CUDA_R_32F | |
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F | |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F | |
| CUDA_C_16F | CUDA_C_16F | CUDA_C_32F | [DEPRECATED] |
| CUDA_C_16BF | CUDA_C_16BF | CUDA_C_32F | [DEPRECATED] |

Mixed Regular/Complex computation:

| matA | vecX / vecY / computeType |
|---|---|
| CUDA_R_32F | CUDA_C_32F |
| CUDA_R_64F | CUDA_C_64F |
NOTE: CUDA_R_16F, CUDA_R_16BF, CUDA_C_16F, and CUDA_C_16BF data types always imply mixed-precision computation.
cusparseSpMV()
supports the following algorithms:
| Algorithm | Notes |
|---|---|
| CUSPARSE_SPMV_ALG_DEFAULT | Default algorithm for any sparse matrix format. |
| CUSPARSE_SPMV_COO_ALG1 | Default algorithm for COO sparse matrix format. May produce slightly different results during different runs with the same input parameters. |
| CUSPARSE_SPMV_COO_ALG2 | Provides deterministic (bit-wise) results for each run if opA == CUSPARSE_OPERATION_NON_TRANSPOSE. |
| CUSPARSE_SPMV_CSR_ALG1 | Default algorithm for CSR/CSC sparse matrix format. May produce slightly different results during different runs with the same input parameters. |
| CUSPARSE_SPMV_CSR_ALG2 | Provides deterministic (bit-wise) results for each run if opA == CUSPARSE_OPERATION_NON_TRANSPOSE. |
| CUSPARSE_SPMV_SELL_ALG1 | Default algorithm for Sliced Ellpack sparse matrix format. Provides deterministic (bit-wise) results for each run. |
Performance notes:
CUSPARSE_SPMV_COO_ALG1 and CUSPARSE_SPMV_CSR_ALG1 provide higher performance than CUSPARSE_SPMV_COO_ALG2 and CUSPARSE_SPMV_CSR_ALG2.
In general, opA == CUSPARSE_OPERATION_NON_TRANSPOSE is 3x faster than opA != CUSPARSE_OPERATION_NON_TRANSPOSE.
Using cusparseSpMV_preprocess() helps improve the performance of cusparseSpMV() for the CSR format. It is beneficial when cusparseSpMV() must be run multiple times with the same matrix (cusparseSpMV_preprocess() is executed only once).
cusparseSpMV()
has the following properties:
The routine requires extra storage for the CSR/CSC formats (all algorithms) and for the COO format with the CUSPARSE_SPMV_COO_ALG2 algorithm.
Provides deterministic (bit-wise) results for each run only for the CUSPARSE_SPMV_COO_ALG2 and CUSPARSE_SPMV_CSR_ALG2 algorithms, and opA == CUSPARSE_OPERATION_NON_TRANSPOSE.
The routine supports asynchronous execution.
compute-sanitizer could report false race conditions for this routine when beta == 0. This is for optimization purposes and does not affect the correctness of the computation.
The routine allows the indices of matA to be unsorted.
cusparseSpMV()
supports the following optimizations:
CUDA graph capture
Hardware Memory Compression
See cusparseStatus_t for the description of the return status.
Please visit cuSPARSE Library Samples - cusparseSpMV CSR and cusparseSpMV COO for a code example.
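Beyond the official samples, a minimal illustrative sketch of the SpMV call sequence for a CSR matrix follows (CUDA_R_32F values, 32-bit indices, pre-populated device arrays, no error checking; spmv_sketch is a hypothetical name):

#include <cusparse.h>
#include <cuda_runtime.h>

// y = alpha * A * x + beta * y with A in CSR format.
// dCsrOffsets (num_rows+1 entries), dColumns, dValues, dX, dY are device pointers.
void spmv_sketch(int64_t num_rows, int64_t num_cols, int64_t nnz,
                 int* dCsrOffsets, int* dColumns, float* dValues,
                 float* dX, float* dY)
{
    float alpha = 1.0f, beta = 0.0f;
    cusparseHandle_t     handle;
    cusparseSpMatDescr_t matA;
    cusparseDnVecDescr_t vecX, vecY;
    void*  dBuffer    = NULL;
    size_t bufferSize = 0;
    cusparseCreate(&handle);
    cusparseCreateCsr(&matA, num_rows, num_cols, nnz,
                      dCsrOffsets, dColumns, dValues,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&vecX, num_cols, dX, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, num_rows, dY, CUDA_R_32F);
    cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
                            CUSPARSE_SPMV_ALG_DEFAULT, &bufferSize);
    cudaMalloc(&dBuffer, bufferSize);
    cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
                 CUSPARSE_SPMV_ALG_DEFAULT, dBuffer);
    cudaFree(dBuffer);
    cusparseDestroySpMat(matA);
    cusparseDestroyDnVec(vecX);
    cusparseDestroyDnVec(vecY);
    cusparseDestroy(handle);
}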
cusparseSpSV()
cusparseStatus_t
cusparseSpSV_createDescr(cusparseSpSVDescr_t* spsvDescr);
cusparseStatus_t
cusparseSpSV_destroyDescr(cusparseSpSVDescr_t spsvDescr);
cusparseStatus_t
cusparseSpSV_bufferSize(cusparseHandle_t handle,
cusparseOperation_t opA,
const void* alpha,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstDnVecDescr_t vecX, // non-const descriptor supported
cusparseDnVecDescr_t vecY,
cudaDataType computeType,
cusparseSpSVAlg_t alg,
cusparseSpSVDescr_t spsvDescr,
size_t* bufferSize)
cusparseStatus_t
cusparseSpSV_analysis(cusparseHandle_t handle,
cusparseOperation_t opA,
const void* alpha,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstDnVecDescr_t vecX, // non-const descriptor supported
cusparseDnVecDescr_t vecY,
cudaDataType computeType,
cusparseSpSVAlg_t alg,
cusparseSpSVDescr_t spsvDescr,
void* externalBuffer)
cusparseStatus_t
cusparseSpSV_solve(cusparseHandle_t handle,
cusparseOperation_t opA,
const void* alpha,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstDnVecDescr_t vecX, // non-const descriptor supported
cusparseDnVecDescr_t vecY,
cudaDataType computeType,
cusparseSpSVAlg_t alg,
cusparseSpSVDescr_t spsvDescr)
cusparseStatus_t
cusparseSpSV_updateMatrix(cusparseHandle_t handle,
cusparseSpSVDescr_t spsvDescr,
void* newValues,
cusparseSpSVUpdate_t updatePart)
The function solves a system of linear equations whose coefficients are represented in a sparse triangular matrix:

\(op\left( \mathbf{A} \right) \cdot \mathbf{Y} = \alpha\mathbf{X}\)
where

op(A) is a sparse square matrix of size \(m \times m\)
X is a dense vector of size \(m\)
Y is a dense vector of size \(m\)
\(\alpha\) is a scalar
Also, for matrix A
\(\text{op}(A) = \begin{cases} A & \text{if op(A) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if op(A) == CUSPARSE_OPERATION_TRANSPOSE} \\ A^{H} & \text{if op(A) == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)
The function cusparseSpSV_bufferSize() returns the size of the workspace needed by cusparseSpSV_analysis() and cusparseSpSV_solve().
The function cusparseSpSV_analysis()
performs the analysis phase, while cusparseSpSV_solve()
executes the solve phase for a sparse triangular linear system.
The opaque data structure spsvDescr
is used to share information among all functions.
The function cusparseSpSV_updateMatrix()
updates spsvDescr
with new matrix values.
The routine supports arbitrary sparsity for the input matrix, but only the upper or lower triangular part is taken into account in the computation.
NOTE: all parameters must be consistent across cusparseSpSV API calls, and the matrix descriptors and externalBuffer must not be modified between cusparseSpSV_analysis() and cusparseSpSV_solve(). The function cusparseSpSV_updateMatrix() can be used to update the values of the sparse matrix stored inside the opaque data structure spsvDescr.
| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication, of type computeType |
| matA | HOST | IN | Sparse matrix A |
| vecX | HOST | IN | Dense vector X |
| vecY | HOST | IN/OUT | Dense vector Y |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpSV_analysis() and cusparseSpSV_solve() |
| externalBuffer | DEVICE | IN/OUT | Pointer to a workspace buffer of at least bufferSize bytes |
| spsvDescr | HOST | IN/OUT | Opaque descriptor for storing internal data used across the three steps |
The sparse matrix formats currently supported are listed below:
CUSPARSE_FORMAT_CSR
CUSPARSE_FORMAT_COO
CUSPARSE_FORMAT_SLICED_ELL
cusparseSpSV() supports the following shapes and properties:

CUSPARSE_FILL_MODE_LOWER and CUSPARSE_FILL_MODE_UPPER fill modes
CUSPARSE_DIAG_TYPE_NON_UNIT and CUSPARSE_DIAG_TYPE_UNIT diagonal types

The fill mode and diagonal type can be set by cusparseSpMatSetAttribute().
cusparseSpSV() supports the following index types for representing the sparse matrix matA:

32-bit indices (CUSPARSE_INDEX_32I)
64-bit indices (CUSPARSE_INDEX_64I)
cusparseSpSV()
supports the following data types:
Uniform-precision computation:
| matA / vecX / vecY / computeType |
|---|
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_32F |
| CUDA_C_64F |
cusparseSpSV()
supports the following algorithms:
| Algorithm | Notes |
|---|---|
| CUSPARSE_SPSV_ALG_DEFAULT | Default algorithm |
cusparseSpSV()
has the following properties:
The routine requires extra storage for the analysis phase, which is proportional to the number of non-zero entries of the sparse matrix
Provides deterministic (bit-wise) results for each run for the solve phase cusparseSpSV_solve()
The routine supports in-place operation
The cusparseSpSV_solve() routine supports asynchronous execution
The cusparseSpSV_bufferSize() and cusparseSpSV_analysis() routines accept NULL for vecX and vecY
The routine allows the indices of matA to be unsorted
cusparseSpSV()
supports the following optimizations:
CUDA graph capture
Hardware Memory Compression
cusparseSpSV_updateMatrix() updates the sparse matrix after calling the analysis phase. This function supports the following update strategies (updatePart):
| Strategy | Notes |
|---|---|
| CUSPARSE_SPSV_UPDATE_GENERAL | Updates the sparse matrix values with the values of newValues |
| CUSPARSE_SPSV_UPDATE_DIAGONAL | Updates the diagonal part of the matrix with the diagonal values stored in newValues |
See cusparseStatus_t for the description of the return status.
Please visit cuSPARSE Library Samples - cusparseSpSV CSR and cuSPARSE Library Samples - cusparseSpSV COO for code examples.
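Beyond the official samples, a minimal illustrative sketch of the three-phase SpSV flow follows (lower triangular CSR matrix, CUDA_R_32F values, 32-bit indices, pre-populated device arrays, no error checking; spsv_sketch is a hypothetical name):

#include <cusparse.h>
#include <cuda_runtime.h>

// Solve op(A) * y = alpha * x for a lower triangular CSR matrix A.
void spsv_sketch(int64_t m, int64_t nnz,
                 int* dCsrOffsets, int* dColumns, float* dValues,
                 float* dX, float* dY)
{
    float alpha = 1.0f;
    cusparseHandle_t     handle;
    cusparseSpMatDescr_t matA;
    cusparseDnVecDescr_t vecX, vecY;
    cusparseSpSVDescr_t  spsvDescr;
    void*  dBuffer    = NULL;
    size_t bufferSize = 0;
    cusparseCreate(&handle);
    cusparseCreateCsr(&matA, m, m, nnz, dCsrOffsets, dColumns, dValues,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&vecX, m, dX, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, m, dY, CUDA_R_32F);
    // Declare the triangular structure of A.
    cusparseFillMode_t fillmode = CUSPARSE_FILL_MODE_LOWER;
    cusparseDiagType_t diagtype = CUSPARSE_DIAG_TYPE_NON_UNIT;
    cusparseSpMatSetAttribute(matA, CUSPARSE_SPMAT_FILL_MODE,
                              &fillmode, sizeof(fillmode));
    cusparseSpMatSetAttribute(matA, CUSPARSE_SPMAT_DIAG_TYPE,
                              &diagtype, sizeof(diagtype));
    cusparseSpSV_createDescr(&spsvDescr);
    cusparseSpSV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, matA, vecX, vecY, CUDA_R_32F,
                            CUSPARSE_SPSV_ALG_DEFAULT, spsvDescr, &bufferSize);
    cudaMalloc(&dBuffer, bufferSize);
    cusparseSpSV_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          &alpha, matA, vecX, vecY, CUDA_R_32F,
                          CUSPARSE_SPSV_ALG_DEFAULT, spsvDescr, dBuffer);
    cusparseSpSV_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                       &alpha, matA, vecX, vecY, CUDA_R_32F,
                       CUSPARSE_SPSV_ALG_DEFAULT, spsvDescr);
    cusparseSpSV_destroyDescr(spsvDescr);
    cudaFree(dBuffer);
    cusparseDestroySpMat(matA);
    cusparseDestroyDnVec(vecX);
    cusparseDestroyDnVec(vecY);
    cusparseDestroy(handle);
}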
cusparseSpMM()
cusparseStatus_t
cusparseSpMM_bufferSize(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
const void* alpha,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstDnMatDescr_t matB, // non-const descriptor supported
const void* beta,
cusparseDnMatDescr_t matC,
cudaDataType computeType,
cusparseSpMMAlg_t alg,
size_t* bufferSize)
cusparseStatus_t
cusparseSpMM_preprocess(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
const void* alpha,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstDnMatDescr_t matB, // non-const descriptor supported
const void* beta,
cusparseDnMatDescr_t matC,
cudaDataType computeType,
cusparseSpMMAlg_t alg,
void* externalBuffer)
cusparseStatus_t
cusparseSpMM(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
const void* alpha,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstDnMatDescr_t matB, // non-const descriptor supported
const void* beta,
cusparseDnMatDescr_t matC,
cudaDataType computeType,
cusparseSpMMAlg_t alg,
void* externalBuffer)
The function performs the multiplication of a sparse matrix matA and a dense matrix matB:

\(\mathbf{C} = \alpha op\left( \mathbf{A} \right) \cdot op\left( \mathbf{B} \right) + \beta\mathbf{C}\)
where

op(A) is a sparse matrix of size \(m \times k\)
op(B) is a dense matrix of size \(k \times n\)
C is a dense matrix of size \(m \times n\)
\(\alpha\) and \(\beta\) are scalars
The routine can also be used to perform the multiplication of a dense matrix and a sparse matrix by switching the layout of the dense matrices:

\(\mathbf{C}_{C} = \mathbf{B}_{C} \cdot \mathbf{A} + \beta\mathbf{C}_{C}\;\rightarrow\;\mathbf{C}_{R} = \mathbf{A}^{T} \cdot \mathbf{B}_{R} + \beta\mathbf{C}_{R}\)
where \(\mathbf{B}_{C}\) , \(\mathbf{C}_{C}\) indicate column-major layout, while \(\mathbf{B}_{R}\) , \(\mathbf{C}_{R}\) refer to row-major layout
Also, for matrix A
and B
\(\text{op}(A) = \begin{cases} A & \text{if op(A) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if op(A) == CUSPARSE_OPERATION_TRANSPOSE} \\ A^{H} & \text{if op(A) == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)
\(\text{op}(B) = \begin{cases} B & \text{if op(B) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ B^{T} & \text{if op(B) == CUSPARSE_OPERATION_TRANSPOSE} \\ B^{H} & \text{if op(B) == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)
When using the (conjugate) transpose of the sparse matrix A
, this routine may produce slightly different results during different runs with the same input parameters.
The function cusparseSpMM_bufferSize() returns the size of the workspace needed by cusparseSpMM().
The function cusparseSpMM_preprocess() can be called before cusparseSpMM() to speed up the actual computation. It is useful when cusparseSpMM() is called multiple times with the same sparsity pattern (matA). The values of the matrices (matA, matB, matC) can change arbitrarily. It provides performance advantages if used with CUSPARSE_SPMM_CSR_ALG1 or CUSPARSE_SPMM_CSR_ALG3; for all other formats and algorithms it has no effect.
| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| opB | HOST | IN | Operation op(B) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication, of type computeType |
| matA | HOST | IN | Sparse matrix A |
| matB | HOST | IN | Dense matrix B |
| beta | HOST or DEVICE | IN | \(\beta\) scalar used for multiplication, of type computeType |
| matC | HOST | IN/OUT | Dense matrix C |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpMM() |
| externalBuffer | DEVICE | IN | Pointer to a workspace buffer of at least bufferSize bytes |
cusparseSpMM
supports the following sparse matrix formats:
CUSPARSE_FORMAT_COO
CUSPARSE_FORMAT_CSR
CUSPARSE_FORMAT_CSC
CUSPARSE_FORMAT_BLOCKED_ELL
(1) COO/CSR/CSC FORMATS
cusparseSpMM supports the following index types for representing the sparse matrix matA:

32-bit indices (CUSPARSE_INDEX_32I)
64-bit indices (CUSPARSE_INDEX_64I)
cusparseSpMM
supports the following data types:
Uniform-precision computation:
| matA / matB / matC / computeType |
|---|
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_32F |
| CUDA_C_64F |
Mixed-precision computation:
| matA | matB | matC | computeType |
|---|---|---|---|
| CUDA_R_8I | CUDA_R_8I | CUDA_R_32I | CUDA_R_32I |
| CUDA_R_8I | CUDA_R_8I | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_16F | CUDA_R_16F | CUDA_R_16F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F |
| CUDA_C_16F | CUDA_C_16F | CUDA_C_16F | CUDA_C_32F [DEPRECATED] |
| CUDA_C_16BF | CUDA_C_16BF | CUDA_C_16BF | CUDA_C_32F [DEPRECATED] |
NOTE: CUDA_R_16F, CUDA_R_16BF, CUDA_C_16F, and CUDA_C_16BF data types always imply mixed-precision computation.
cusparseSpMM
supports the following algorithms:
| Algorithm | Notes |
|---|---|
| CUSPARSE_SPMM_ALG_DEFAULT | Default algorithm for any sparse matrix format |
| CUSPARSE_SPMM_COO_ALG1 | Algorithm 1 for COO sparse matrix format |
| CUSPARSE_SPMM_COO_ALG2 | Algorithm 2 for COO sparse matrix format |
| CUSPARSE_SPMM_COO_ALG3 | Algorithm 3 for COO sparse matrix format |
| CUSPARSE_SPMM_COO_ALG4 | Algorithm 4 for COO sparse matrix format |
| CUSPARSE_SPMM_CSR_ALG1 | Algorithm 1 for CSR/CSC sparse matrix format |
| CUSPARSE_SPMM_CSR_ALG2 | Algorithm 2 for CSR/CSC sparse matrix format |
| CUSPARSE_SPMM_CSR_ALG3 | Algorithm 3 for CSR/CSC sparse matrix format |
Performance notes:
Row-major layout provides higher performance than column-major.
CUSPARSE_SPMM_COO_ALG4 and CUSPARSE_SPMM_CSR_ALG2 should be used with row-major layout, while CUSPARSE_SPMM_COO_ALG1, CUSPARSE_SPMM_COO_ALG2, CUSPARSE_SPMM_COO_ALG3, and CUSPARSE_SPMM_CSR_ALG1 with column-major layout.
For beta != 1, the output matrix is scaled before the actual computation.
For n == 1, the routine uses cusparseSpMV() as a fallback.
cusparseSpMM() with all algorithms except CUSPARSE_SPMM_CSR_ALG3 supports the following batch modes:
\(C_{i} = A \cdot B_{i}\)
\(C_{i} = A_{i} \cdot B\)
\(C_{i} = A_{i} \cdot B_{i}\)
The number of batches and their strides can be set by using cusparseCooSetStridedBatch
, cusparseCsrSetStridedBatch
, and cusparseDnMatSetStridedBatch
. The maximum number of batches for cusparseSpMM()
is 65,535.
cusparseSpMM()
has the following properties:
The routine requires no extra storage for CUSPARSE_SPMM_COO_ALG1, CUSPARSE_SPMM_COO_ALG3, CUSPARSE_SPMM_COO_ALG4
The routine supports asynchronous execution
Provides deterministic (bit-wise) results for each run only for the CUSPARSE_SPMM_COO_ALG2 and CUSPARSE_SPMM_CSR_ALG3 algorithms
compute-sanitizer could report false race conditions for this routine. This is for optimization purposes and does not affect the correctness of the computation
The routine allows the indices of matA to be unsorted
cusparseSpMM()
supports the following optimizations:
CUDA graph capture
Hardware Memory Compression
Please visit cuSPARSE Library Samples - cusparseSpMM CSR and cusparseSpMM COO for a code example. For batched computation please visit cusparseSpMM CSR Batched and cusparseSpMM COO Batched.
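Beyond the official samples, a minimal illustrative sketch of the SpMM call sequence for a CSR matrix and column-major dense matrices follows (CUDA_R_32F values, 32-bit indices, pre-populated device arrays, no error checking; spmm_sketch is a hypothetical name):

#include <cusparse.h>
#include <cuda_runtime.h>

// C = alpha * A * B + beta * C with A (m x k) in CSR, B (k x n) and C (m x n)
// in column-major order.
void spmm_sketch(int64_t m, int64_t k, int64_t n, int64_t nnz,
                 int* dCsrOffsets, int* dColumns, float* dValues,
                 float* dB, float* dC)
{
    float alpha = 1.0f, beta = 0.0f;
    cusparseHandle_t     handle;
    cusparseSpMatDescr_t matA;
    cusparseDnMatDescr_t matB, matC;
    void*  dBuffer    = NULL;
    size_t bufferSize = 0;
    cusparseCreate(&handle);
    cusparseCreateCsr(&matA, m, k, nnz, dCsrOffsets, dColumns, dValues,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnMat(&matB, k, n, k, dB, CUDA_R_32F, CUSPARSE_ORDER_COL);
    cusparseCreateDnMat(&matC, m, n, m, dC, CUDA_R_32F, CUSPARSE_ORDER_COL);
    cusparseSpMM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, matA, matB, &beta, matC, CUDA_R_32F,
                            CUSPARSE_SPMM_ALG_DEFAULT, &bufferSize);
    cudaMalloc(&dBuffer, bufferSize);
    cusparseSpMM(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, matA, matB, &beta, matC, CUDA_R_32F,
                 CUSPARSE_SPMM_ALG_DEFAULT, dBuffer);
    cudaFree(dBuffer);
    cusparseDestroySpMat(matA);
    cusparseDestroyDnMat(matB);
    cusparseDestroyDnMat(matC);
    cusparseDestroy(handle);
}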
(2) BLOCKED-ELLPACK FORMAT
cusparseSpMM
supports the following data types for CUSPARSE_FORMAT_BLOCKED_ELL
format and the following GPU architectures for exploiting NVIDIA Tensor Cores:
| matA / matB | matC | computeType | GPU architecture |
|---|---|---|---|
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F | Volta (sm70) and newer |
| CUDA_R_16F | CUDA_R_32F | CUDA_R_32F | Volta (sm70) and newer |
| CUDA_R_8I | CUDA_R_32I | CUDA_R_32I | Turing (sm75) and newer |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F | Ampere (sm80) and newer |
| CUDA_R_16BF | CUDA_R_32F | CUDA_R_32F | Ampere (sm80) and newer |
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | Ampere (sm80) and newer |
| CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | Ampere (sm80) and newer |
cusparseSpMM
supports the following algorithms with CUSPARSE_FORMAT_BLOCKED_ELL
format:
| Algorithm | Notes |
|---|---|
| CUSPARSE_SPMM_ALG_DEFAULT | Default algorithm for any sparse matrix format |
| CUSPARSE_SPMM_BLOCKED_ELL_ALG1 | Default algorithm for Blocked-ELL format |
Performance notes:
Blocked-ELL SpMM provides the best performance with power-of-2 block sizes.
Large block sizes (e.g., ≥ 64) provide the best performance.
The function has the following limitations:
The pointer mode must be equal to CUSPARSE_POINTER_MODE_HOST
Only opA == CUSPARSE_OPERATION_NON_TRANSPOSE is supported
opB == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE is not supported
Only CUSPARSE_INDEX_32I is supported
Please visit cuSPARSE Library Samples - cusparseSpMM Blocked-ELL for a code example.
See cusparseStatus_t for the description of the return status.
cusparseSpMMOp()
cusparseStatus_t CUSPARSEAPI
cusparseSpMMOp_createPlan(cusparseHandle_t handle,
cusparseSpMMOpPlan_t* plan,
cusparseOperation_t opA,
cusparseOperation_t opB,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstDnMatDescr_t matB, // non-const descriptor supported
cusparseDnMatDescr_t matC,
cudaDataType computeType,
cusparseSpMMOpAlg_t alg,
const void* addOperationNvvmBuffer,
size_t addOperationBufferSize,
const void* mulOperationNvvmBuffer,
size_t mulOperationBufferSize,
const void* epilogueNvvmBuffer,
size_t epilogueBufferSize,
size_t* SpMMWorkspaceSize)
cusparseStatus_t
cusparseSpMMOp_destroyPlan(cusparseSpMMOpPlan_t plan)
cusparseStatus_t
cusparseSpMMOp(cusparseSpMMOpPlan_t plan,
void* externalBuffer)
NOTE 1: NVRTC and nvJitLink are not currently available on Arm64 Android platforms.
NOTE 2: The routine does not support Android and Tegra platforms except Orin (sm87).
Experimental: The function performs the multiplication of a sparse matrix matA
and a dense matrix matB
with custom operators.
\({C^{\prime}}_{ij} = \text{epilogue}\left( {\sum_{k}^{\oplus}{op\left( A_{ik} \right) \otimes op\left( B_{kj} \right)},\; C_{ij}} \right)\)
where

op(A) is a sparse matrix of size \(m \times k\)
op(B) is a dense matrix of size \(k \times n\)
C is a dense matrix of size \(m \times n\)
\(\oplus\), \(\otimes\), and \(\text{epilogue}\) are custom add, mul, and epilogue operators, respectively
Also, for matrix A
and B
\(\text{op}(A) = \begin{cases} A & \text{if op(A) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if op(A) == CUSPARSE_OPERATION_TRANSPOSE} \\ \end{cases}\)
\(\text{op}(B) = \begin{cases} B & {\text{if op(}B\text{) == CUSPARSE_OPERATION_NON_TRANSPOSE}} \\ B^{T} & {\text{if op(}B\text{) == CUSPARSE_OPERATION_TRANSPOSE}} \\ \end{cases}\)
Only opA == CUSPARSE_OPERATION_NON_TRANSPOSE is currently supported.

The function cusparseSpMMOp_createPlan() returns the size of the workspace and the compiled kernel needed by cusparseSpMMOp().
| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| opB | HOST | IN | Operation op(B) |
| matA | HOST | IN | Sparse matrix A |
| matB | HOST | IN | Dense matrix B |
| matC | HOST | IN/OUT | Dense matrix C |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| addOperationNvvmBuffer | HOST | IN | Pointer to the NVVM buffer containing the custom add operator |
| addOperationBufferSize | HOST | IN | Size in bytes of addOperationNvvmBuffer |
| mulOperationNvvmBuffer | HOST | IN | Pointer to the NVVM buffer containing the custom mul operator |
| mulOperationBufferSize | HOST | IN | Size in bytes of mulOperationNvvmBuffer |
| epilogueNvvmBuffer | HOST | IN | Pointer to the NVVM buffer containing the custom epilogue operator |
| epilogueBufferSize | HOST | IN | Size in bytes of epilogueNvvmBuffer |
| SpMMWorkspaceSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpMMOp() |
The operators must have the following signature and return type
__device__ <computetype> add_op(<computetype> value1, <computetype> value2);
__device__ <computetype> mul_op(<computetype> value1, <computetype> value2);
__device__ <computetype> epilogue(<computetype> value1, <computetype> value2);
<computetype> is one of float, double, cuComplex, cuDoubleComplex, or int.
cusparseSpMMOp
supports the following sparse matrix formats:
CUSPARSE_FORMAT_CSR
cusparseSpMMOp supports the following index types for representing the sparse matrix matA:

32-bit indices (CUSPARSE_INDEX_32I)
64-bit indices (CUSPARSE_INDEX_64I)
cusparseSpMMOp
supports the following data types:
Uniform-precision computation:
| matA / matB / matC / computeType |
|---|
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_32F |
| CUDA_C_64F |
Mixed-precision computation:

| matA / matB | matC | computeType |
|---|---|---|
| CUDA_R_8I | CUDA_R_32I | CUDA_R_32I |
| CUDA_R_8I | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F |
| CUDA_R_16F | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_32F | CUDA_R_32F |
cusparseSpMMOp
supports the following algorithms:
| Algorithm | Notes |
|---|---|
| CUSPARSE_SPMM_OP_ALG_DEFAULT | Default algorithm for any sparse matrix format |
Performance notes:
Row-major layout provides higher performance than column-major.
cusparseSpMMOp()
has the following properties:
The routine requires extra storage
The routine supports asynchronous execution
Provides deterministic (bit-wise) results for each run
The routine allows the indices of
matA
to be unsorted
cusparseSpMMOp()
supports the following optimizations:
CUDA graph capture
Hardware Memory Compression
Please visit cuSPARSE Library Samples - cusparseSpMMOp for a code example.
See cusparseStatus_t for the description of the return status.
cusparseSpSM()
cusparseStatus_t
cusparseSpSM_createDescr(cusparseSpSMDescr_t* spsmDescr);
cusparseStatus_t
cusparseSpSM_destroyDescr(cusparseSpSMDescr_t spsmDescr);
cusparseStatus_t
cusparseSpSM_bufferSize(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
const void* alpha,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstDnMatDescr_t matB, // non-const descriptor supported
cusparseDnMatDescr_t matC,
cudaDataType computeType,
cusparseSpSMAlg_t alg,
cusparseSpSMDescr_t spsmDescr,
size_t* bufferSize)
cusparseStatus_t
cusparseSpSM_analysis(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
const void* alpha,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstDnMatDescr_t matB, // non-const descriptor supported
cusparseDnMatDescr_t matC,
cudaDataType computeType,
cusparseSpSMAlg_t alg,
cusparseSpSMDescr_t spsmDescr,
void* externalBuffer)
cusparseStatus_t
cusparseSpSM_solve(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
const void* alpha,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstDnMatDescr_t matB, // non-const descriptor supported
cusparseDnMatDescr_t matC,
cudaDataType computeType,
cusparseSpSMAlg_t alg,
cusparseSpSMDescr_t spsmDescr)
cusparseStatus_t
cusparseSpSM_updateMatrix(cusparseHandle_t handle,
cusparseSpSMDescr_t spsmDescr,
void* newValues,
cusparseSpSMUpdate_t updatePart)
The function solves a system of linear equations whose coefficients are represented in a sparse triangular matrix:

\(op\left( \mathbf{A} \right) \cdot \mathbf{C} = \alpha\, op\left( \mathbf{B} \right)\)
where

op(A) is a sparse square matrix of size \(m \times m\)
op(B) is a dense matrix of size \(m \times n\)
C is a dense matrix of size \(m \times n\)
\(\alpha\) is a scalar
Also, for matrix A
\(\text{op}(A) = \begin{cases} A & \text{if op(A) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if op(A) == CUSPARSE_OPERATION_TRANSPOSE} \\ A^{H} & \text{if op(A) == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE} \\ \end{cases}\)
\(\text{op}(B) = \begin{cases} B & \text{if op(B) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ B^{T} & \text{if op(B) == CUSPARSE_OPERATION_TRANSPOSE} \\ \end{cases}\)
The function cusparseSpSM_bufferSize() returns the size of the workspace needed by cusparseSpSM_analysis() and cusparseSpSM_solve().
The function cusparseSpSM_analysis()
performs the analysis phase, while cusparseSpSM_solve()
executes the solve phase for a sparse triangular linear system.
The opaque data structure spsmDescr
is used to share information among all functions.
The function cusparseSpSM_updateMatrix()
updates spsmDescr
with new matrix values.
The routine supports arbitrary sparsity for the input matrix, but only the upper or lower triangular part is taken into account in the computation.
cusparseSpSM_bufferSize() requires a buffer size for the analysis phase that is proportional to the number of non-zero entries of the sparse matrix.

The externalBuffer is stored into spsmDescr and used by cusparseSpSM_solve(). For this reason, the device memory buffer must be deallocated only after cusparseSpSM_solve().

NOTE: all parameters must be consistent across cusparseSpSM API calls, and the matrix descriptors and externalBuffer must not be modified between cusparseSpSM_analysis() and cusparseSpSM_solve().
| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| opB | HOST | IN | Operation op(B) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication, of type computeType |
| matA | HOST | IN | Sparse matrix A |
| matB | HOST | IN | Dense matrix B |
| matC | HOST | IN/OUT | Dense matrix C |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpSM_analysis() and cusparseSpSM_solve() |
| externalBuffer | DEVICE | IN/OUT | Pointer to a workspace buffer of at least bufferSize bytes |
| spsmDescr | HOST | IN/OUT | Opaque descriptor for storing internal data used across the three steps |
The sparse matrix formats currently supported are listed below:
CUSPARSE_FORMAT_CSR
CUSPARSE_FORMAT_COO
cusparseSpSM() supports the following shapes and properties:

CUSPARSE_FILL_MODE_LOWER and CUSPARSE_FILL_MODE_UPPER fill modes
CUSPARSE_DIAG_TYPE_NON_UNIT and CUSPARSE_DIAG_TYPE_UNIT diagonal types

The fill mode and diagonal type can be set by cusparseSpMatSetAttribute().
cusparseSpSM() supports the following index types for representing the sparse matrix matA:

32-bit indices (CUSPARSE_INDEX_32I)
64-bit indices (CUSPARSE_INDEX_64I)
cusparseSpSM()
supports the following data types:
Uniform-precision computation:
| matA / matB / matC / computeType |
|---|
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_32F |
| CUDA_C_64F |
cusparseSpSM()
supports the following algorithms:
| Algorithm | Notes |
|---|---|
| CUSPARSE_SPSM_ALG_DEFAULT | Default algorithm |
cusparseSpSM()
has the following properties:
The routine requires no extra storage
Provides deterministic (bit-wise) results for each run for the solve phase cusparseSpSM_solve()
The cusparseSpSM_solve() routine supports asynchronous execution
The routine supports in-place operation. In that case, the same device pointer must be provided to the values parameter of the dense matrices matB and matC; all other dense matrix descriptor parameters (e.g., order) can be set independently
The cusparseSpSM_bufferSize() and cusparseSpSM_analysis() routines accept descriptors whose values arrays are NULL for matB and matC; these two routines do not accept NULL descriptors
The routine allows the indices of matA to be unsorted
cusparseSpSM()
supports the following optimizations:
CUDA graph capture
Hardware Memory Compression
cusparseSpSM_updateMatrix() updates the sparse matrix after calling the analysis phase. This function supports the following update strategies (updatePart):
| Strategy | Notes |
|---|---|
| CUSPARSE_SPSM_UPDATE_GENERAL | Updates the sparse matrix values with the values of newValues |
| CUSPARSE_SPSM_UPDATE_DIAGONAL | Updates the diagonal part of the matrix with the diagonal values stored in newValues |
See cusparseStatus_t for the description of the return status.
Please visit cuSPARSE Library Samples - cusparseSpSM CSR and cuSPARSE Library Samples - cusparseSpSM COO for code examples.
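Beyond the official samples, a minimal illustrative sketch of the three-phase SpSM flow follows (lower triangular CSR matrix, column-major dense matrices, CUDA_R_32F values, 32-bit indices, pre-populated device arrays, no error checking; spsm_sketch is a hypothetical name):

#include <cusparse.h>
#include <cuda_runtime.h>

// Solve op(A) * C = alpha * op(B) for a lower triangular CSR matrix A.
void spsm_sketch(int64_t m, int64_t n, int64_t nnz,
                 int* dCsrOffsets, int* dColumns, float* dValues,
                 float* dB, float* dC)
{
    float alpha = 1.0f;
    cusparseHandle_t     handle;
    cusparseSpMatDescr_t matA;
    cusparseDnMatDescr_t matB, matC;
    cusparseSpSMDescr_t  spsmDescr;
    void*  dBuffer    = NULL;
    size_t bufferSize = 0;
    cusparseCreate(&handle);
    cusparseCreateCsr(&matA, m, m, nnz, dCsrOffsets, dColumns, dValues,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnMat(&matB, m, n, m, dB, CUDA_R_32F, CUSPARSE_ORDER_COL);
    cusparseCreateDnMat(&matC, m, n, m, dC, CUDA_R_32F, CUSPARSE_ORDER_COL);
    cusparseFillMode_t fillmode = CUSPARSE_FILL_MODE_LOWER;
    cusparseDiagType_t diagtype = CUSPARSE_DIAG_TYPE_NON_UNIT;
    cusparseSpMatSetAttribute(matA, CUSPARSE_SPMAT_FILL_MODE,
                              &fillmode, sizeof(fillmode));
    cusparseSpMatSetAttribute(matA, CUSPARSE_SPMAT_DIAG_TYPE,
                              &diagtype, sizeof(diagtype));
    cusparseSpSM_createDescr(&spsmDescr);
    cusparseSpSM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                            matA, matB, matC, CUDA_R_32F,
                            CUSPARSE_SPSM_ALG_DEFAULT, spsmDescr, &bufferSize);
    cudaMalloc(&dBuffer, bufferSize);
    cusparseSpSM_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                          matA, matB, matC, CUDA_R_32F,
                          CUSPARSE_SPSM_ALG_DEFAULT, spsmDescr, dBuffer);
    cusparseSpSM_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                       CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha,
                       matA, matB, matC, CUDA_R_32F,
                       CUSPARSE_SPSM_ALG_DEFAULT, spsmDescr);
    cusparseSpSM_destroyDescr(spsmDescr);
    cudaFree(dBuffer);   // only after cusparseSpSM_solve() has been called
    cusparseDestroySpMat(matA);
    cusparseDestroyDnMat(matB);
    cusparseDestroyDnMat(matC);
    cusparseDestroy(handle);
}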
cusparseSDDMM()
cusparseStatus_t
cusparseSDDMM_bufferSize(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
const void* alpha,
cusparseConstDnMatDescr_t matA, // non-const descriptor supported
cusparseConstDnMatDescr_t matB, // non-const descriptor supported
const void* beta,
cusparseSpMatDescr_t matC,
cudaDataType computeType,
cusparseSDDMMAlg_t alg,
size_t* bufferSize)
cusparseStatus_t
cusparseSDDMM_preprocess(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
const void* alpha,
cusparseConstDnMatDescr_t matA, // non-const descriptor supported
cusparseConstDnMatDescr_t matB, // non-const descriptor supported
const void* beta,
cusparseSpMatDescr_t matC,
cudaDataType computeType,
cusparseSDDMMAlg_t alg,
void* externalBuffer)
cusparseStatus_t
cusparseSDDMM(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
const void* alpha,
cusparseConstDnMatDescr_t matA, // non-const descriptor supported
cusparseConstDnMatDescr_t matB, // non-const descriptor supported
const void* beta,
cusparseSpMatDescr_t matC,
cudaDataType computeType,
cusparseSDDMMAlg_t alg,
void* externalBuffer)
This function performs the multiplication of matA
and matB
, followed by an element-wise multiplication with the sparsity pattern of matC
. Formally, it performs the following operation:
\(\mathbf{C} = \alpha({op}(\mathbf{A}) \cdot {op}(\mathbf{B})) \circ {spy}(\mathbf{C}) + \beta\mathbf{C}\)
where

op(A) is a dense matrix of size \(m \times k\)
op(B) is a dense matrix of size \(k \times n\)
C is a sparse matrix of size \(m \times n\)
\(\alpha\) and \(\beta\) are scalars
\(\circ\) denotes the Hadamard (entry-wise) matrix product, and \({spy}\left( \mathbf{C} \right)\) is the structural sparsity pattern matrix of C defined as:

\({spy}(\mathbf{C})_{ij} = \begin{cases} 1 & {\text{if}\,\mathbf{C}_{ij}\,\text{is an entry stored in}\,\texttt{matC}} \\ 0 & \text{otherwise} \\ \end{cases}\)
Also, for matrix A
and B
\(\text{op}(A) = \begin{cases} A & \text{if op(A) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ A^{T} & \text{if op(A) == CUSPARSE_OPERATION_TRANSPOSE} \\ \end{cases}\)
\(\text{op}(B) = \begin{cases} B & \text{if op(B) == CUSPARSE_OPERATION_NON_TRANSPOSE} \\ B^{T} & \text{if op(B) == CUSPARSE_OPERATION_TRANSPOSE} \\ \end{cases}\)
The function cusparseSDDMM_bufferSize() returns the size of the workspace needed by cusparseSDDMM() or cusparseSDDMM_preprocess().
The function cusparseSDDMM_preprocess() can be called before cusparseSDDMM() to speed up the actual computation. It is useful when cusparseSDDMM() is called multiple times with the same sparsity pattern (matC). The values of the dense matrices (matA, matB) can change arbitrarily.
| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| opB | HOST | IN | Operation op(B) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication, of type computeType |
| matA | HOST | IN | Dense matrix A |
| matB | HOST | IN | Dense matrix B |
| beta | HOST or DEVICE | IN | \(\beta\) scalar used for multiplication, of type computeType |
| matC | HOST | IN/OUT | Sparse matrix C |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSDDMM() |
| externalBuffer | DEVICE | IN | Pointer to a workspace buffer of at least bufferSize bytes |
Currently supported sparse matrix formats:
CUSPARSE_FORMAT_CSR
CUSPARSE_FORMAT_BSR
cusparseSDDMM() supports the following index types for representing the sparse matrix matC:

32-bit indices (CUSPARSE_INDEX_32I)
64-bit indices (CUSPARSE_INDEX_64I)
The data type combinations currently supported for cusparseSDDMM are listed below:
Uniform-precision computation:
| matA / matB / matC / computeType |
|---|
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_32F |
| CUDA_C_64F |
Mixed-precision computation:
| matA / matB | matC | computeType |
|---|---|---|
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F |
cusparseSDDMM
for CUSPARSE_FORMAT_BSR
also supports the following mixed-precision computation:
| matA / matB | matC | computeType |
|---|---|---|
| CUDA_R_16F | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_32F | CUDA_R_32F |
NOTE: CUDA_R_16F and CUDA_R_16BF data types always imply mixed-precision computation.
cusparseSDDMM() for CUSPARSE_FORMAT_BSR supports block sizes of 2, 4, 8, 16, 32, 64, and 128.
cusparseSDDMM()
supports the following algorithms:
| Algorithm | Notes |
|---|---|
| CUSPARSE_SDDMM_ALG_DEFAULT | Default algorithm. It supports batched computation. |
Performance notes: cusparseSDDMM() for CUSPARSE_FORMAT_CSR provides the best performance when matA and matB satisfy:

matA: matA is in row-major order and opA is CUSPARSE_OPERATION_NON_TRANSPOSE, or matA is in col-major order and opA is not CUSPARSE_OPERATION_NON_TRANSPOSE
matB: matB is in col-major order and opB is CUSPARSE_OPERATION_NON_TRANSPOSE, or matB is in row-major order and opB is not CUSPARSE_OPERATION_NON_TRANSPOSE
cusparseSDDMM() for CUSPARSE_FORMAT_BSR provides the best performance when matA and matB satisfy:

matA: matA is in row-major order and opA is CUSPARSE_OPERATION_NON_TRANSPOSE, or matA is in col-major order and opA is not CUSPARSE_OPERATION_NON_TRANSPOSE
matB: matB is in row-major order and opB is CUSPARSE_OPERATION_NON_TRANSPOSE, or matB is in col-major order and opB is not CUSPARSE_OPERATION_NON_TRANSPOSE
cusparseSDDMM()
supports the following batch modes:
\(C_{i} = (A \cdot B) \circ C_{i}\)
\(C_{i} = \left( A_{i} \cdot B \right) \circ C_{i}\)
\(C_{i} = \left( A \cdot B_{i} \right) \circ C_{i}\)
\(C_{i} = \left( A_{i} \cdot B_{i} \right) \circ C_{i}\)
The number of batches and their strides can be set by using cusparseCsrSetStridedBatch
and cusparseDnMatSetStridedBatch
. The maximum number of batches for cusparseSDDMM()
is 65,535.
cusparseSDDMM()
has the following properties:
The routine requires no extra storage
Provides deterministic (bit-wise) results for each run
The routine supports asynchronous execution
The routine allows the indices of
matC
to be unsorted
cusparseSDDMM()
supports the following optimizations:
CUDA graph capture
Hardware Memory Compression
See cusparseStatus_t for the description of the return status.
Please visit cuSPARSE Library Samples - cusparseSDDMM for a code example. For batched computation please visit cusparseSDDMM CSR Batched.
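Beyond the official samples, a minimal illustrative sketch of the SDDMM call sequence follows (CSR output, CUDA_R_32F values, 32-bit indices, pre-populated device arrays including the sparsity pattern of C, no error checking; sddmm_sketch is a hypothetical name):

#include <cusparse.h>
#include <cuda_runtime.h>

// C = alpha * (A * B) o spy(C) + beta * C, with dense A (m x k, row-major),
// dense B (k x n, col-major), and sparse C (m x n) in CSR format.
void sddmm_sketch(int64_t m, int64_t k, int64_t n, int64_t nnzC,
                  float* dA, float* dB,
                  int* dC_offsets, int* dC_columns, float* dC_values)
{
    float alpha = 1.0f, beta = 0.0f;
    cusparseHandle_t     handle;
    cusparseDnMatDescr_t matA, matB;
    cusparseSpMatDescr_t matC;
    void*  dBuffer    = NULL;
    size_t bufferSize = 0;
    cusparseCreate(&handle);
    cusparseCreateDnMat(&matA, m, k, k, dA, CUDA_R_32F, CUSPARSE_ORDER_ROW);
    cusparseCreateDnMat(&matB, k, n, k, dB, CUDA_R_32F, CUSPARSE_ORDER_COL);
    cusparseCreateCsr(&matC, m, n, nnzC, dC_offsets, dC_columns, dC_values,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseSDDMM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                             CUSPARSE_OPERATION_NON_TRANSPOSE,
                             &alpha, matA, matB, &beta, matC, CUDA_R_32F,
                             CUSPARSE_SDDMM_ALG_DEFAULT, &bufferSize);
    cudaMalloc(&dBuffer, bufferSize);
    cusparseSDDMM(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                  CUSPARSE_OPERATION_NON_TRANSPOSE,
                  &alpha, matA, matB, &beta, matC, CUDA_R_32F,
                  CUSPARSE_SDDMM_ALG_DEFAULT, dBuffer);
    cudaFree(dBuffer);
    cusparseDestroyDnMat(matA);
    cusparseDestroyDnMat(matB);
    cusparseDestroySpMat(matC);
    cusparseDestroy(handle);
}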
cusparseSpGEMM()
cusparseStatus_t
cusparseSpGEMM_createDescr(cusparseSpGEMMDescr_t* descr)
cusparseStatus_t
cusparseSpGEMM_destroyDescr(cusparseSpGEMMDescr_t descr)
cusparseStatus_t
cusparseSpGEMM_workEstimation(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
const void* alpha,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstSpMatDescr_t matB, // non-const descriptor supported
const void* beta,
cusparseSpMatDescr_t matC,
cudaDataType computeType,
cusparseSpGEMMAlg_t alg,
cusparseSpGEMMDescr_t spgemmDescr,
size_t* bufferSize1,
void* externalBuffer1)
cusparseStatus_t
cusparseSpGEMM_getNumProducts(cusparseSpGEMMDescr_t spgemmDescr,
int64_t* num_prods)
cusparseStatus_t
cusparseSpGEMM_estimateMemory(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
const void* alpha,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstSpMatDescr_t matB, // non-const descriptor supported
const void* beta,
cusparseSpMatDescr_t matC,
cudaDataType computeType,
cusparseSpGEMMAlg_t alg,
cusparseSpGEMMDescr_t spgemmDescr,
float chunk_fraction,
size_t* bufferSize3,
void* externalBuffer3,
size_t* bufferSize2)
cusparseStatus_t
cusparseSpGEMM_compute(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
const void* alpha,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstSpMatDescr_t matB, // non-const descriptor supported
const void* beta,
cusparseSpMatDescr_t matC,
cudaDataType computeType,
cusparseSpGEMMAlg_t alg,
cusparseSpGEMMDescr_t spgemmDescr,
size_t* bufferSize2,
void* externalBuffer2)
cusparseStatus_t
cusparseSpGEMM_copy(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
const void* alpha,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseConstSpMatDescr_t matB, // non-const descriptor supported
const void* beta,
cusparseSpMatDescr_t matC,
cudaDataType computeType,
cusparseSpGEMMAlg_t alg,
cusparseSpGEMMDescr_t spgemmDescr)
This function performs the multiplication of two sparse matrices matA
and matB
.
\(\mathbf{C^{\prime}} = \alpha\, op\left( \mathbf{A} \right) \cdot op\left( \mathbf{B} \right) + \beta\mathbf{C}\)

where \(\alpha\) and \(\beta\) are scalars, and \(\mathbf{C}\) and \(\mathbf{C^{\prime}}\) have the same sparsity pattern.
The functions cusparseSpGEMM_workEstimation()
, cusparseSpGEMM_estimateMemory()
, and cusparseSpGEMM_compute()
are used for both determining the buffer size and performing the actual computation.
| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| opB | HOST | IN | Operation op(B) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication |
| matA | HOST | IN | Sparse matrix A |
| matB | HOST | IN | Sparse matrix B |
| beta | HOST or DEVICE | IN | \(\beta\) scalar used for multiplication |
| matC | HOST | IN/OUT | Sparse matrix C |
| computeType | HOST | IN | Enumerator specifying the datatype in which the computation is executed |
| alg | HOST | IN | Enumerator specifying the algorithm for the computation |
| spgemmDescr | HOST | IN/OUT | Opaque descriptor for storing internal data used across the three steps |
| num_prods | HOST | OUT | Pointer to a 64-bit integer that stores the number of intermediate products |
| chunk_fraction | HOST | IN | The fraction of total intermediate products being computed in a chunk. Used by cusparseSpGEMM_estimateMemory() |
| bufferSize1 | HOST | IN/OUT | Number of bytes of workspace requested by cusparseSpGEMM_workEstimation() |
| bufferSize2 | HOST | IN/OUT | Number of bytes of workspace requested by cusparseSpGEMM_compute() |
| bufferSize3 | HOST | IN/OUT | Number of bytes of workspace requested by cusparseSpGEMM_estimateMemory() |
| externalBuffer1 | DEVICE | IN | Pointer to workspace buffer needed by cusparseSpGEMM_workEstimation() |
| externalBuffer2 | DEVICE | IN | Pointer to workspace buffer needed by cusparseSpGEMM_compute() |
| externalBuffer3 | DEVICE | IN | Pointer to workspace buffer needed by cusparseSpGEMM_estimateMemory() |
Currently, the function has the following limitations:

Only 32-bit indices (CUSPARSE_INDEX_32I) are supported
Only the CSR format (CUSPARSE_FORMAT_CSR) is supported
Only opA, opB equal to CUSPARSE_OPERATION_NON_TRANSPOSE are supported
The data type combinations currently supported for cusparseSpGEMM are listed below:
Uniform-precision computation:
| matA / matB / matC / computeType |
|---|
| CUDA_R_16F |
| CUDA_R_16BF |
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_16F |
| CUDA_C_16BF |
| CUDA_C_32F |
| CUDA_C_64F |
cusparseSpGEMM
routine runs for the following algorithms:
| Algorithm | Notes |
|---|---|
| CUSPARSE_SPGEMM_DEFAULT | Default algorithm. Currently, it is CUSPARSE_SPGEMM_ALG1 |
| CUSPARSE_SPGEMM_ALG1 | Algorithm 1 |
| CUSPARSE_SPGEMM_ALG2 | Algorithm 2 |
| CUSPARSE_SPGEMM_ALG3 | Algorithm 3 |
cusparseSpGEMM()
has the following properties:
The routine requires no extra storage
The routine supports asynchronous execution
The routine allows the indices of matA and matB to be unsorted
The routine guarantees that the indices of matC are sorted
cusparseSpGEMM()
supports the following optimizations:
CUDA graph capture
Hardware Memory Compression
See cusparseStatus_t for the description of the return status.
Please visit cuSPARSE Library Samples - cusparseSpGEMM for a code example for CUSPARSE_SPGEMM_DEFAULT
and CUSPARSE_SPGEMM_ALG1
, and cuSPARSE Library Samples - memory-optimized cusparseSpGEMM for a code example for CUSPARSE_SPGEMM_ALG2
and CUSPARSE_SPGEMM_ALG3
.
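Beyond the official samples, a minimal illustrative sketch of the SpGEMM flow follows (CUDA_R_32F values, CSR format, alpha = 1, beta = 0, no error checking; spgemm_sketch is a hypothetical name, and matC is assumed to have been created with cusparseCreateCsr(&matC, m, n, 0, NULL, NULL, NULL, ...)):

#include <cusparse.h>
#include <cuda_runtime.h>

// C = A * B for CSR matrices A (m x k) and B (k x n).
void spgemm_sketch(cusparseHandle_t handle,
                   cusparseSpMatDescr_t matA, cusparseSpMatDescr_t matB,
                   cusparseSpMatDescr_t matC)
{
    float alpha = 1.0f, beta = 0.0f;
    cusparseOperation_t op = CUSPARSE_OPERATION_NON_TRANSPOSE;
    cusparseSpGEMMDescr_t spgemmDesc;
    cusparseSpGEMM_createDescr(&spgemmDesc);
    size_t bufferSize1 = 0, bufferSize2 = 0;
    void *dBuffer1 = NULL, *dBuffer2 = NULL;
    // 1) Work estimation: query the size, allocate, then run it.
    cusparseSpGEMM_workEstimation(handle, op, op, &alpha, matA, matB, &beta,
                                  matC, CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT,
                                  spgemmDesc, &bufferSize1, NULL);
    cudaMalloc(&dBuffer1, bufferSize1);
    cusparseSpGEMM_workEstimation(handle, op, op, &alpha, matA, matB, &beta,
                                  matC, CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT,
                                  spgemmDesc, &bufferSize1, dBuffer1);
    // 2) Compute the structure and values of C into internal storage.
    cusparseSpGEMM_compute(handle, op, op, &alpha, matA, matB, &beta,
                           matC, CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT,
                           spgemmDesc, &bufferSize2, NULL);
    cudaMalloc(&dBuffer2, bufferSize2);
    cusparseSpGEMM_compute(handle, op, op, &alpha, matA, matB, &beta,
                           matC, CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT,
                           spgemmDesc, &bufferSize2, dBuffer2);
    // 3) Allocate the output CSR arrays and copy the result into matC.
    int64_t C_rows, C_cols, C_nnz;
    cusparseSpMatGetSize(matC, &C_rows, &C_cols, &C_nnz);
    int   *dC_offsets, *dC_columns;
    float *dC_values;
    cudaMalloc((void**)&dC_offsets, (C_rows + 1) * sizeof(int));
    cudaMalloc((void**)&dC_columns, C_nnz * sizeof(int));
    cudaMalloc((void**)&dC_values,  C_nnz * sizeof(float));
    cusparseCsrSetPointers(matC, dC_offsets, dC_columns, dC_values);
    cusparseSpGEMM_copy(handle, op, op, &alpha, matA, matB, &beta,
                        matC, CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc);
    cusparseSpGEMM_destroyDescr(spgemmDesc);
    cudaFree(dBuffer1);
    cudaFree(dBuffer2);
}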
cusparseSpGEMMreuse()
cusparseStatus_t
cusparseSpGEMM_createDescr(cusparseSpGEMMDescr_t* descr)
cusparseStatus_t
cusparseSpGEMM_destroyDescr(cusparseSpGEMMDescr_t descr)
cusparseStatus_t
cusparseSpGEMMreuse_workEstimation(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
cusparseSpMatDescr_t matA, // non-const descriptor supported
cusparseSpMatDescr_t matB, // non-const descriptor supported
cusparseSpMatDescr_t matC,
cusparseSpGEMMAlg_t alg,
cusparseSpGEMMDescr_t spgemmDescr,
size_t* bufferSize1,
void* externalBuffer1)
cusparseStatus_t
cusparseSpGEMMreuse_nnz(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
cusparseSpMatDescr_t matA, // non-const descriptor supported
cusparseSpMatDescr_t matB, // non-const descriptor supported
cusparseSpMatDescr_t matC,
cusparseSpGEMMAlg_t alg,
cusparseSpGEMMDescr_t spgemmDescr,
size_t* bufferSize2,
void* externalBuffer2,
size_t* bufferSize3,
void* externalBuffer3,
size_t* bufferSize4,
void* externalBuffer4)
cusparseStatus_t CUSPARSEAPI
cusparseSpGEMMreuse_copy(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
cusparseSpMatDescr_t matA, // non-const descriptor supported
cusparseSpMatDescr_t matB, // non-const descriptor supported
cusparseSpMatDescr_t matC,
cusparseSpGEMMAlg_t alg,
cusparseSpGEMMDescr_t spgemmDescr,
size_t* bufferSize5,
void* externalBuffer5)
cusparseStatus_t CUSPARSEAPI
cusparseSpGEMMreuse_compute(cusparseHandle_t handle,
cusparseOperation_t opA,
cusparseOperation_t opB,
const void* alpha,
cusparseSpMatDescr_t matA, // non-const descriptor supported
cusparseSpMatDescr_t matB, // non-const descriptor supported
const void* beta,
cusparseSpMatDescr_t matC,
cudaDataType computeType,
cusparseSpGEMMAlg_t alg,
cusparseSpGEMMDescr_t spgemmDescr)
This function performs the multiplication of two sparse matrices matA
and matB
where the structure of the output matrix matC
can be reused for multiple computations with different values.
\(\mathbf{C^{\prime}} = \alpha\, op\left( \mathbf{A} \right) \cdot op\left( \mathbf{B} \right) + \beta\mathbf{C}\)

where \(\alpha\) and \(\beta\) are scalars.
The functions cusparseSpGEMMreuse_workEstimation()
, cusparseSpGEMMreuse_nnz()
, and cusparseSpGEMMreuse_copy()
are used for determining the buffer size and performing the actual computation.
Note: cusparseSpGEMMreuse()
output CSR matrix (matC
) is sorted by column indices.
MEMORY REQUIREMENT: cusparseSpGEMMreuse requires keeping all intermediate products in memory in order to reuse the structure of the output matrix. On the other hand, the number of intermediate products is, in general, orders of magnitude higher than the number of non-zero entries. To minimize the memory requirements, the routine uses multiple buffers that can be deallocated once they are no longer needed. If the number of intermediate products exceeds 2^31-1, the routine returns the CUSPARSE_STATUS_INSUFFICIENT_RESOURCES status.
Currently, the function has the following limitations:

Only 32-bit indices (CUSPARSE_INDEX_32I) are supported
Only the CSR format (CUSPARSE_FORMAT_CSR) is supported
Only opA, opB equal to CUSPARSE_OPERATION_NON_TRANSPOSE are supported
The data type combinations currently supported for cusparseSpGEMMreuse are listed below.
Uniform-precision computation:
| matA / matB / matC / computeType |
|---|
| CUDA_R_16F |
| CUDA_R_16BF |
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_32F |
| CUDA_C_64F |
Mixed-precision computation: [DEPRECATED]
| matA / matB | matC | computeType |
|---|---|---|
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F |
cusparseSpGEMMreuse
routine supports the following algorithms:

| Algorithm | Notes |
|---|---|
| CUSPARSE_SPGEMM_DEFAULT | Default algorithm. Provides deterministic (bit-wise) structure for the output matrix for each run, while value computation is not deterministic. |
| CUSPARSE_SPGEMM_CSR_ALG_DETERMINITIC | Provides deterministic (bit-wise) structure for the output matrix and value computation for each run. |
cusparseSpGEMMreuse()
has the following properties:
The routine requires no extra storage
The routine supports asynchronous execution
The routine allows the indices of matA and matB to be unsorted
The routine guarantees that the indices of matC are sorted
cusparseSpGEMMreuse()
supports the following optimizations:
CUDA graph capture
Hardware Memory Compression
Refer to cusparseStatus_t for the description of the return status.
Please visit cuSPARSE Library Samples - cusparseSpGEMMreuse for a code example.
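Beyond the official sample, a minimal illustrative sketch of the SpGEMMreuse flow follows (CUDA_R_32F values, CSR format, no error checking; spgemm_reuse_sketch is a hypothetical name, and matC is assumed to have been created with nnz = 0):

#include <cusparse.h>
#include <cuda_runtime.h>

// Reusable C = A * B: the structure of C is computed once; the final
// cusparseSpGEMMreuse_compute() step can be repeated with new values.
void spgemm_reuse_sketch(cusparseHandle_t handle,
                         cusparseSpMatDescr_t matA, cusparseSpMatDescr_t matB,
                         cusparseSpMatDescr_t matC)
{
    float alpha = 1.0f, beta = 0.0f;
    cusparseOperation_t op  = CUSPARSE_OPERATION_NON_TRANSPOSE;
    cusparseSpGEMMAlg_t alg = CUSPARSE_SPGEMM_DEFAULT;
    cusparseSpGEMMDescr_t spgemmDesc;
    cusparseSpGEMM_createDescr(&spgemmDesc);
    size_t s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0;
    void *b1 = NULL, *b2 = NULL, *b3 = NULL, *b4 = NULL, *b5 = NULL;
    // 1) Work estimation: query the size, allocate, then run it.
    cusparseSpGEMMreuse_workEstimation(handle, op, op, matA, matB, matC, alg,
                                       spgemmDesc, &s1, NULL);
    cudaMalloc(&b1, s1);
    cusparseSpGEMMreuse_workEstimation(handle, op, op, matA, matB, matC, alg,
                                       spgemmDesc, &s1, b1);
    // 2) Compute the number of non-zero entries of C.
    cusparseSpGEMMreuse_nnz(handle, op, op, matA, matB, matC, alg, spgemmDesc,
                            &s2, NULL, &s3, NULL, &s4, NULL);
    cudaMalloc(&b2, s2); cudaMalloc(&b3, s3); cudaMalloc(&b4, s4);
    cusparseSpGEMMreuse_nnz(handle, op, op, matA, matB, matC, alg, spgemmDesc,
                            &s2, b2, &s3, b3, &s4, b4);
    cudaFree(b1); cudaFree(b2);   // no longer needed after the nnz step
    // 3) Allocate the CSR arrays of C and copy the structure.
    int64_t rows, cols, nnz;
    cusparseSpMatGetSize(matC, &rows, &cols, &nnz);
    int *dC_offsets, *dC_columns; float *dC_values;
    cudaMalloc((void**)&dC_offsets, (rows + 1) * sizeof(int));
    cudaMalloc((void**)&dC_columns, nnz * sizeof(int));
    cudaMalloc((void**)&dC_values,  nnz * sizeof(float));
    cusparseCsrSetPointers(matC, dC_offsets, dC_columns, dC_values);
    cusparseSpGEMMreuse_copy(handle, op, op, matA, matB, matC, alg,
                             spgemmDesc, &s5, NULL);
    cudaMalloc(&b5, s5);
    cusparseSpGEMMreuse_copy(handle, op, op, matA, matB, matC, alg,
                             spgemmDesc, &s5, b5);
    cudaFree(b3);   // no longer needed after the copy step
    // 4) Compute the values; repeatable after updating the values of A/B.
    cusparseSpGEMMreuse_compute(handle, op, op, &alpha, matA, matB, &beta,
                                matC, CUDA_R_32F, alg, spgemmDesc);
    // b4 and b5 must stay allocated while further compute calls are issued.
    cusparseSpGEMM_destroyDescr(spgemmDesc);
}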
cusparseSparseToDense()
cusparseStatus_t
cusparseSparseToDense_bufferSize(cusparseHandle_t handle,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseDnMatDescr_t matB,
cusparseSparseToDenseAlg_t alg,
size_t* bufferSize)
cusparseStatus_t
cusparseSparseToDense(cusparseHandle_t handle,
cusparseConstSpMatDescr_t matA, // non-const descriptor supported
cusparseDnMatDescr_t matB,
cusparseSparseToDenseAlg_t alg,
void* buffer)
The function converts the sparse matrix matA
in CSR, CSC, or COO format into its dense representation matB
. Blocked-ELL is not currently supported.
The function cusparseSparseToDense_bufferSize() returns the size of the workspace needed by cusparseSparseToDense().
| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| matA | HOST | IN | Sparse matrix A |
| matB | HOST | OUT | Dense matrix B |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSparseToDense() |
| buffer | DEVICE | IN | Pointer to workspace buffer |
cusparseSparseToDense() supports the following index types for representing the sparse matrix matA:

32-bit indices (CUSPARSE_INDEX_32I)
64-bit indices (CUSPARSE_INDEX_64I)
cusparseSparseToDense()
supports the following data types:
| matA / matB |
|---|
| CUDA_R_8I |
| CUDA_R_16F |
| CUDA_R_16BF |
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_16F |
| CUDA_C_16BF |
| CUDA_C_32F |
| CUDA_C_64F |
cusparseSparseToDense() supports the following algorithm:

| Algorithm | Notes |
|---|---|
| CUSPARSE_SPARSETODENSE_ALG_DEFAULT | Default algorithm |
cusparseSparseToDense()
has the following properties:
The routine requires no extra storage
The routine supports asynchronous execution
Provides deterministic (bit-wise) results for each run
The routine allows the indices of
matA
to be unsorted
cusparseSparseToDense()
supports the following optimizations:
CUDA graph capture
Hardware Memory Compression
See cusparseStatus_t for the description of the return status.
Please visit cuSPARSE Library Samples - cusparseSparseToDense for a code example.
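Beyond the official sample, a minimal illustrative sketch of the conversion follows (CSR input, column-major dense output, CUDA_R_32F values, 32-bit indices, pre-populated device arrays, no error checking; sparse_to_dense_sketch is a hypothetical name):

#include <cusparse.h>
#include <cuda_runtime.h>

// Convert a CSR matrix A (m x n) into a column-major dense matrix B.
void sparse_to_dense_sketch(int64_t m, int64_t n, int64_t nnz,
                            int* dCsrOffsets, int* dColumns, float* dValues,
                            float* dB)
{
    cusparseHandle_t     handle;
    cusparseSpMatDescr_t matA;
    cusparseDnMatDescr_t matB;
    void*  dBuffer    = NULL;
    size_t bufferSize = 0;
    cusparseCreate(&handle);
    cusparseCreateCsr(&matA, m, n, nnz, dCsrOffsets, dColumns, dValues,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnMat(&matB, m, n, m, dB, CUDA_R_32F, CUSPARSE_ORDER_COL);
    cusparseSparseToDense_bufferSize(handle, matA, matB,
                                     CUSPARSE_SPARSETODENSE_ALG_DEFAULT,
                                     &bufferSize);
    cudaMalloc(&dBuffer, bufferSize);
    cusparseSparseToDense(handle, matA, matB,
                          CUSPARSE_SPARSETODENSE_ALG_DEFAULT, dBuffer);
    cudaFree(dBuffer);
    cusparseDestroySpMat(matA);
    cusparseDestroyDnMat(matB);
    cusparseDestroy(handle);
}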
cusparseDenseToSparse()
cusparseStatus_t
cusparseDenseToSparse_bufferSize(cusparseHandle_t handle,
cusparseConstDnMatDescr_t matA, // non-const descriptor supported
cusparseSpMatDescr_t matB,
cusparseDenseToSparseAlg_t alg,
size_t* bufferSize)
cusparseStatus_t
cusparseDenseToSparse_analysis(cusparseHandle_t handle,
cusparseConstDnMatDescr_t matA, // non-const descriptor supported
cusparseSpMatDescr_t matB,
cusparseDenseToSparseAlg_t alg,
void* buffer)
cusparseStatus_t
cusparseDenseToSparse_convert(cusparseHandle_t handle,
cusparseConstDnMatDescr_t matA, // non-const descriptor supported
cusparseSpMatDescr_t matB,
cusparseDenseToSparseAlg_t alg,
void* buffer)
The function converts the dense matrix matA
into a sparse matrix matB
in CSR, CSC, COO, or Blocked-ELL format.
The function cusparseDenseToSparse_bufferSize() returns the size of the workspace needed by cusparseDenseToSparse_analysis().
The function cusparseDenseToSparse_analysis() updates the number of non-zero elements in the sparse matrix descriptor matB. The user is responsible for allocating the memory required by the sparse matrix:

Row/column indices and value arrays for CSC and CSR, respectively
Row, column, and value arrays for COO
Column (ellColInd) and value (ellValue) arrays for Blocked-ELL
Finally, cusparseDenseToSparse_convert() is called to fill the arrays allocated in the previous step.
| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| matA | HOST | IN | Dense matrix A |
| matB | HOST | OUT | Sparse matrix B |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseDenseToSparse_analysis() |
| buffer | DEVICE | IN | Pointer to workspace buffer |
cusparseDenseToSparse() supports the following index types for representing the sparse matrix matB:

32-bit indices (CUSPARSE_INDEX_32I)
64-bit indices (CUSPARSE_INDEX_64I)
cusparseDenseToSparse()
supports the following data types:
| matA / matB |
|---|
| CUDA_R_8I |
| CUDA_R_16F |
| CUDA_R_16BF |
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_16F |
| CUDA_C_16BF |
| CUDA_C_32F |
| CUDA_C_64F |
cusparseDenseToSparse() supports the following algorithm:

| Algorithm | Notes |
|---|---|
| CUSPARSE_DENSETOSPARSE_ALG_DEFAULT | Default algorithm |
cusparseDenseToSparse()
has the following properties:
The routine requires no extra storage
The routine supports asynchronous execution
Provides deterministic (bit-wise) results for each run
The routine does not guarantee the indices of
matB
to be sorted
cusparseDenseToSparse()
supports the following optimizations:
CUDA graph capture
Hardware Memory Compression
See cusparseStatus_t for the description of the return status.
Please visit cuSPARSE Library Samples - cusparseDenseToSparse (CSR) and cuSPARSE Library Samples - cusparseDenseToSparse (Blocked-ELL) for code examples.
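Beyond the official samples, a minimal illustrative sketch of the three-step conversion follows (column-major dense input, CSR output, CUDA_R_32F values, 32-bit indices, no error checking; dense_to_sparse_sketch is a hypothetical name):

#include <cusparse.h>
#include <cuda_runtime.h>

// Convert a column-major dense matrix A (m x n) into a CSR matrix B.
void dense_to_sparse_sketch(int64_t m, int64_t n, float* dA)
{
    cusparseHandle_t     handle;
    cusparseDnMatDescr_t matA;
    cusparseSpMatDescr_t matB;
    void*  dBuffer    = NULL;
    size_t bufferSize = 0;
    cusparseCreate(&handle);
    cusparseCreateDnMat(&matA, m, n, m, dA, CUDA_R_32F, CUSPARSE_ORDER_COL);
    int* dCsrOffsets;
    cudaMalloc((void**)&dCsrOffsets, (m + 1) * sizeof(int));
    // Create the output descriptor with nnz = 0; columns/values are set later.
    cusparseCreateCsr(&matB, m, n, 0, dCsrOffsets, NULL, NULL,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseDenseToSparse_bufferSize(handle, matA, matB,
                                     CUSPARSE_DENSETOSPARSE_ALG_DEFAULT,
                                     &bufferSize);
    cudaMalloc(&dBuffer, bufferSize);
    cusparseDenseToSparse_analysis(handle, matA, matB,
                                   CUSPARSE_DENSETOSPARSE_ALG_DEFAULT, dBuffer);
    int64_t rows, cols, nnz;
    cusparseSpMatGetSize(matB, &rows, &cols, &nnz);
    int*   dColumns;
    float* dValues;
    cudaMalloc((void**)&dColumns, nnz * sizeof(int));
    cudaMalloc((void**)&dValues,  nnz * sizeof(float));
    cusparseCsrSetPointers(matB, dCsrOffsets, dColumns, dValues);
    cusparseDenseToSparse_convert(handle, matA, matB,
                                  CUSPARSE_DENSETOSPARSE_ALG_DEFAULT, dBuffer);
    cudaFree(dBuffer);
    cusparseDestroyDnMat(matA);
    cusparseDestroySpMat(matB);
    cusparseDestroy(handle);
}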