Release Notes#
cuSPARSELt v0.8.1#
Resolved issues
The performance regressions on H100 and L4 GPUs introduced in v0.8.0 are fixed.
#include <cstddef> is replaced with #include <stddef.h> in the header file cusparseLt.h to ensure compatibility with C compilers.
cuSPARSELt v0.8.0#
New Features:
- Introduced support for Thor (SM 10.1 in CUDA 12.9, SM 11.0 in CUDA 13.0) and DGX Spark (SM 12.1).
- Introduced E4M3 and E5M2 kernels for SM 8.9.
- Added forward compatibility for non-block-scaled data type combinations to ensure support for future architectures.
- Better performance for SM 10.0 kernels (up to 18%) and FP16 kernels on SM 9.0 (up to 10%).
API Changes:
- Introduced cusparseLtGetErrorString() and cusparseLtGetErrorName() to remove the dependency on the cuSPARSE library.
- Introduced cusparseLtAlgSelectionDestroy(). This function must be invoked to release the resources used by an instance of an algorithm selection.
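
A minimal usage sketch of the two additions above. The exact signatures are assumptions: cusparseLtGetErrorString()/cusparseLtGetErrorName() are assumed to take a cusparseStatus_t and return const char* (mirroring the cuSPARSE-style helpers), and cusparseLtAlgSelectionDestroy() is assumed to take a pointer to the selection object; verify against the API reference.

```c++
// Hedged sketch (v0.8.0): error-string helpers and explicit algorithm-selection cleanup.
// Signatures of the two new error helpers and of cusparseLtAlgSelectionDestroy() are assumed.
#include <cstdio>
#include <cusparseLt.h>

void report(cusparseStatus_t status) {
    // No cuSPARSE dependency is needed anymore to translate status codes.
    if (status != CUSPARSE_STATUS_SUCCESS)
        std::printf("%s: %s\n", cusparseLtGetErrorName(status),
                    cusparseLtGetErrorString(status));
}

void select_algorithm(const cusparseLtHandle_t*           handle,
                      const cusparseLtMatmulDescriptor_t* matmulDesc) {
    cusparseLtMatmulAlgSelection_t algSel;
    report(cusparseLtMatmulAlgSelectionInit(handle, &algSel, matmulDesc,
                                            CUSPARSELT_MATMUL_ALG_DEFAULT));
    // ... build a plan and run the matmul ...
    // New in v0.8.0: release the resources held by the algorithm selection.
    report(cusparseLtAlgSelectionDestroy(&algSel));
}
```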
Compatibility notes:
This release is compatible with CUDA 12.9 and CUDA 13.0.
Resolved issues
- Fixed host memory alignment of cusparseLtMatmulDescAttribute_t and cusparseLtMatmulAlgSelection_t that could cause crashes.
- cusparseLtSpMMACompressedSize() and cusparseLtSpMMACompressedSize2() returned more memory than needed for E2M1 kernels.
Known Issues
- If used in pure C code, #include <cstddef> has to be replaced with #include <stddef.h> in the header file cusparseLt.h to ensure compatibility with C compilers.
- Benchmarks show an average 8.5% performance regression on L4 GPUs. Optimization is planned for the next release.
cuSPARSELt v0.7.1#
Resolved issues
- The binary size has been significantly reduced.
- cusparseLtSpMMACompress2() could crash for E4M3 and generate incorrect results for FP32 on SM 9.0.
cuSPARSELt v0.7.0#
New Features:
- Introduced Blackwell support (SM 10.0 and SM 12.0).
- Added new block-scaled kernels for the following data type combinations for the SM 10.0 and SM 12.0 architectures:
  - E4M3 input/output, FP32 Tensor Core compute; the data type of matrix C can be either FP16 or BF16
  - E4M3 input, FP16 output, FP32 Tensor Core compute
  - E4M3 input, BF16 output, FP32 Tensor Core compute
  - E4M3 input, FP32 output, FP32 Tensor Core compute
  - E2M1 input/output, FP32 Tensor Core compute; the data type of matrix C can be either FP16 or BF16
  - E2M1 input, FP16 output, FP32 Tensor Core compute
  - E2M1 input, BF16 output, FP32 Tensor Core compute
  - E2M1 input, FP32 output, FP32 Tensor Core compute
API Changes:
- Introduced the following API changes to set up the scaling factors for the block-scaled kernels (see the sketch after this list):
  - Added the new enumerator type cusparseLtMatmulMatrixScale_t to specify the scaling mode that defines how scaling-factor pointers are interpreted.
  - Added the new cusparseLtMatmulAlgSelection_t enumerator values:
    - CUSPARSELT_MATMUL_{A,B,C,D,D_OUT}_SCALE_MODE to specify the scaling mode of type cusparseLtMatmulMatrixScale_t for the corresponding matrix.
    - CUSPARSELT_MATMUL_{A,B,C,D,D_OUT}_SCALE_POINTER to set a device pointer to a scalar or a tensor of scaling factors for the corresponding matrix, depending on the matrix's scaling mode.
- Added new values to the enumerator type cusparseLtSplitKMode_t: CUSPARSELT_HEURISTIC, CUSPARSELT_DATAPARALLEL, CUSPARSELT_SPLITK, CUSPARSELT_STREAMK
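
A hedged sketch of how these knobs might be wired up. The enumerator names come from the list above; the setter functions used below (cusparseLtMatmulDescSetAttribute() for the scale pointer and cusparseLtMatmulAlgSetAttribute() with a CUSPARSELT_MATMUL_SPLIT_K_MODE attribute for the split-K mode) are assumptions not stated in these notes and should be checked against the API reference.

```c++
// Hedged sketch (v0.7.0): block-scaling factor pointer and the new split-K modes.
// Which setter applies to each attribute is an assumption; consult the API reference.
#include <cusparseLt.h>

void configure(cusparseLtHandle_t*             handle,
               cusparseLtMatmulDescriptor_t*   matmulDesc,
               cusparseLtMatmulAlgSelection_t* algSel,
               const float*                    dScaleA)  // device pointer to A's scaling factor(s)
{
    // Scaling-factor pointer for matrix A; depending on A's scale mode this is interpreted
    // as a scalar or a tensor of scaling factors (CUSPARSELT_MATMUL_A_SCALE_MODE would be
    // set analogously with a cusparseLtMatmulMatrixScale_t value).
    cusparseLtMatmulDescSetAttribute(handle, matmulDesc,
                                     CUSPARSELT_MATMUL_A_SCALE_POINTER,
                                     &dScaleA, sizeof(dScaleA));

    // Pick one of the new split-K modes, e.g. stream-K.
    // CUSPARSELT_MATMUL_SPLIT_K_MODE is an assumed attribute name.
    cusparseLtSplitKMode_t splitKMode = CUSPARSELT_STREAMK;
    cusparseLtMatmulAlgSetAttribute(handle, algSel,
                                    CUSPARSELT_MATMUL_SPLIT_K_MODE,
                                    &splitKMode, sizeof(splitKMode));
}
```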
Resolved issues
Fixed crashes and hangs in multi-GPU configurations.
Compatibility notes:
- cuSPARSELt requires CUDA 12.8 or above, and compatible drivers (see CUDA Driver Release Notes).
- Support for Ubuntu 18.04, RHEL 7, and CentOS 7 has been removed.
- The size of the cuSPARSELt data types cusparseLtHandle_t, cusparseLtMatDescriptor_t, cusparseLtMatmulDescAttribute_t, cusparseLtMatmulAlgSelection_t, and cusparseLtMatmulPlan_t has been reduced.
- The static library for Windows x86-64 is no longer provided. Please use the dynamic library instead.
Known Issues
cusparseLtSpMMACompressedSize2() allocates slightly more memory than needed.
cuSPARSELt v0.6.3#
Resolved issues
Sparse GEMM could produce incorrect results on Arm64 if cusparseLtSpMMACompressedSize2() and cusparseLtSpMMACompress2() are used.
Compatibility notes:
Added support for Ubuntu 24.04.
cuSPARSELt v0.6.2#
New Features:
- Introduced Orin support (SM 8.7).
- Improved performance of the following kernels for SM 8.0:
  - FP16 input/output, FP32 Tensor Core accumulate
  - BF16 input/output, FP32 Tensor Core accumulate
  - INT8 input, FP16 output, INT32 Tensor Core compute
  - INT8 input, BF16 output, INT32 Tensor Core compute
  - INT8 input, INT32 output, INT32 Tensor Core compute
API Changes:
Added a new enumerator value cusparseLtMatmulDescAttribute_t::CUSPARSELT_MATMUL_SPARSE_MAT_POINTER to set the pointer to the pruned sparse matrix.
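
A brief sketch of updating the sparse-operand pointer through the new attribute. The attribute name is quoted from the note above; the surrounding function, the d_prunedA pointer, and the pass-by-address convention for pointer attributes are illustrative assumptions.

```c++
// Hedged sketch (v0.6.2): point an existing matmul descriptor at a (new) pruned sparse
// matrix without rebuilding the descriptor.
#include <cusparseLt.h>

void set_sparse_operand(const cusparseLtHandle_t*     handle,
                        cusparseLtMatmulDescriptor_t* matmulDesc,
                        const void*                   d_prunedA)  // device pointer, already 2:4 pruned
{
    cusparseLtMatmulDescSetAttribute(handle, matmulDesc,
                                     CUSPARSELT_MATMUL_SPARSE_MAT_POINTER,
                                     &d_prunedA, sizeof(d_prunedA));
}
```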
cuSPARSELt v0.6.1#
Dependencies:
The static linking to the CUDA driver library (libcuda.so on Linux and cuda.lib on Windows) is removed.
Compatibility notes:
The constraints on matrix sizes (cusparseLtStructuredDescriptorInit and cusparseLtDenseDescriptorInit) have been relaxed. The maximum number of elements for each dimension (rows and columns) of matrices C and D is limited to 2097120.
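
For illustration, a hedged sketch of creating a dense descriptor for matrix C at the documented per-dimension limit; the cusparseLtDenseDescriptorInit() argument order (rows, cols, ld, alignment, valueType, order) follows the standard cuSPARSELt workflow and should be verified against the API reference.

```c++
// Hedged sketch (v0.6.1): each dimension of C and D may now have up to 2097120 elements.
#include <cstdint>
#include <cusparseLt.h>

cusparseStatus_t make_c_descriptor(const cusparseLtHandle_t*  handle,
                                   cusparseLtMatDescriptor_t* matC)
{
    const int64_t  rows      = 2097120;  // at the documented per-dimension limit
    const int64_t  cols      = 1024;
    const int64_t  ld        = rows;     // column-major leading dimension
    const uint32_t alignment = 16;
    return cusparseLtDenseDescriptorInit(handle, matC, rows, cols, ld, alignment,
                                         CUDA_R_16F, CUSPARSE_ORDER_COL);
}
```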
Resolved issues
- cusparseLtSpMMACompressedSize() and cusparseLtSpMMACompressedSize2() require slightly less memory.
- Sparse GEMM could produce incorrect results on SM 8.0.
cuSPARSELt v0.6.0#
New Features:
- Introduced Hopper support (SM 9.0).
- Added new kernels for the following data type combinations for the SM 9.0 architecture:
  - FP16 input/output, FP16 Tensor Core compute
  - E4M3 input/output, FP32 Tensor Core compute; the data type of matrix C can be either FP16 or BF16
  - E4M3 input, FP16 output, FP32 Tensor Core compute
  - E4M3 input, BF16 output, FP32 Tensor Core compute
  - E4M3 input, FP32 output, FP32 Tensor Core compute
  - E5M2 input/output, FP32 Tensor Core compute; the data type of matrix C can be either FP16 or BF16
  - E5M2 input, FP16 output, FP32 Tensor Core compute
  - E5M2 input, BF16 output, FP32 Tensor Core compute
  - E5M2 input, FP32 output, FP32 Tensor Core compute
Driver Requirements:
cuSPARSELt requires CUDA Driver r535 TRD7, r550 TRD1, or higher.
API Changes:
The following APIs are deprecated: cusparseLtSpMMAPrune2(), cusparseLtSpMMAPruneCheck2(), cusparseLtSpMMACompressedSize2(), and cusparseLtSpMMACompress2().
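
Since the descriptor-based *2 entry points are deprecated, here is a hedged migration sketch using the plan-based counterparts. The signatures below (two sizes from cusparseLtSpMMACompressedSize() and a user-provided compression buffer for cusparseLtSpMMACompress(), consistent with the v0.4.0 note later in this document) reflect our understanding of the current headers and should be verified against the API reference.

```c++
// Hedged sketch (v0.6.0): prefer the plan-based compression path over the deprecated
// cusparseLtSpMMACompressedSize2()/cusparseLtSpMMACompress2() entry points.
#include <cuda_runtime_api.h>
#include <cusparseLt.h>

void compress_sparse_operand(const cusparseLtHandle_t*     handle,
                             const cusparseLtMatmulPlan_t* plan,
                             const void*                   d_dense,       // pruned dense A on device
                             void**                        d_compressed,  // out: compressed A
                             cudaStream_t                  stream)
{
    size_t compressedSize = 0, compressBufferSize = 0;
    cusparseLtSpMMACompressedSize(handle, plan, &compressedSize, &compressBufferSize);

    void* d_compressBuffer = nullptr;
    cudaMalloc(d_compressed, compressedSize);
    cudaMalloc(&d_compressBuffer, compressBufferSize);

    cusparseLtSpMMACompress(handle, plan, d_dense, *d_compressed, d_compressBuffer, stream);

    cudaStreamSynchronize(stream);   // the temporary buffer is only needed during compression
    cudaFree(d_compressBuffer);
}
```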
Dependencies:
cuSPARSELt now requires linking to the CUDA driver library (libcuda.so on Linux and cuda.lib on Windows).
Known Issues
cusparseLtSpMMACompressedSize() and cusparseLtSpMMACompressedSize2() allocate slightly more memory than needed. This issue will be addressed in the next release.
cuSPARSELt v0.5.2#
New Features:
- Added a new kernel for the following data type combination: INT8 inputs, BF16 output, INT32 Tensor Core accumulate
- Symbols are obfuscated in the static library.
Compatibility notes:
- Added support for RHEL 7 and CentOS 7.
- Split-K is enabled in cusparseLtMatmulSearch() across a broader range of problem dimensions.
- The CUSPARSE_COMPUTE_16F, CUSPARSE_COMPUTE_TF32, and CUSPARSE_COMPUTE_TF32_FAST enumerators have been removed from the cusparseComputeType enumerator and replaced with CUSPARSE_COMPUTE_32F to better express the accuracy of the computation at the Tensor Core level.
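
As an illustration of the enumerator change, a hedged sketch of a matmul descriptor that now passes CUSPARSE_COMPUTE_32F where the removed enumerators would have been used before; the cusparseLtMatmulDescriptorInit() argument order is assumed from the standard cuSPARSELt workflow.

```c++
// Hedged sketch (v0.5.2): FP16/TF32 pipelines now pass CUSPARSE_COMPUTE_32F instead of the
// removed CUSPARSE_COMPUTE_16F / CUSPARSE_COMPUTE_TF32 / CUSPARSE_COMPUTE_TF32_FAST values.
#include <cusparseLt.h>

cusparseStatus_t init_matmul_descriptor(const cusparseLtHandle_t*        handle,
                                        cusparseLtMatmulDescriptor_t*    matmulDesc,
                                        const cusparseLtMatDescriptor_t* matA,
                                        const cusparseLtMatDescriptor_t* matB,
                                        const cusparseLtMatDescriptor_t* matC)
{
    return cusparseLtMatmulDescriptorInit(handle, matmulDesc,
                                          CUSPARSE_OPERATION_NON_TRANSPOSE,
                                          CUSPARSE_OPERATION_NON_TRANSPOSE,
                                          matA, matB, matC, matC,   // D shares C's descriptor here
                                          CUSPARSE_COMPUTE_32F);    // Tensor Core accumulation in FP32
}
```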
cuSPARSELt v0.5.0#
New Features:
Added a new kernel for the following data type combination: INT8 inputs, INT32 output, INT32 Tensor Core accumulate
Compatibility notes:
- cuSPARSELt requires CUDA 12.0 or above, and compatible driver (see CUDA Driver Release Notes).
- cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of algorithm id alg as in v0.4.0.
cuSPARSELt v0.4.0#
New Features:
- Introduced SM 8.9 compatibility
- The initialization time of cuSPARSELt descriptors has been significantly improved
- cusparseLtMatmulSearch() efficiency has been improved
- Removed all internal memory allocations
- Added a new kernel supporting the following data type combination: INT8 input, INT32 Tensor Core compute, FP16 output
- Added the cusparseLtGetVersion() and cusparseLtGetProperty() functions to retrieve the library version
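
A small hedged sketch of the version queries; cusparseLtGetVersion() is assumed to take an initialized handle and an int*, and cusparseLtGetProperty() a libraryPropertyType selector, mirroring the cuSPARSE conventions. The printed version encoding is also an assumption.

```c++
// Hedged sketch (v0.4.0): query the library version at run time.
#include <cstdio>
#include <cusparseLt.h>

int main() {
    cusparseLtHandle_t handle;
    cusparseLtInit(&handle);

    int version = 0, major = 0, minor = 0, patch = 0;
    cusparseLtGetVersion(&handle, &version);        // combined version number (encoding assumed)
    cusparseLtGetProperty(MAJOR_VERSION, &major);
    cusparseLtGetProperty(MINOR_VERSION, &minor);
    cusparseLtGetProperty(PATCH_LEVEL,   &patch);
    std::printf("cuSPARSELt %d (%d.%d.%d)\n", version, major, minor, patch);

    cusparseLtDestroy(&handle);
    return 0;
}
```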
API Changes:
cusparseLtSpMMACompressedSize(), cusparseLtSpMMACompress(), cusparseLtSpMMACompressedSize2(), and cusparseLtSpMMACompress2() have a new parameter to avoid internal memory allocations and to support a user-provided device memory buffer for the compression
Compatibility notes:
- cuSPARSELt requires CUDA Driver 470.xx (CUDA 11.4) or above
- cuSPARSELt now uses the static version of the cudart library
- The support for Ubuntu 16.04 (gcc-5) has been removed
cuSPARSELt v0.3.0#
New Features:
Added support for vectors of alpha and beta scalars (per-channel scaling)
Added support for GeLU scaling
Added support for Split-K Mode
Full support for logging functionalities and NVTX ranges
API Changes:
Added the cusparseLtMatmulGetWorkspace() API to get the workspace size needed by cusparseLtMatmul()
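
A hedged sketch of the intended flow: query the workspace requirement for a prepared plan, allocate it, and hand it to cusparseLtMatmul(). The cusparseLtMatmul() argument list shown (alpha/beta scalars, compressed A, dense B/C/D, workspace, streams) follows the standard cuSPARSELt workflow and should be checked against the API reference.

```c++
// Hedged sketch (v0.3.0): size the device workspace before calling cusparseLtMatmul().
#include <cuda_runtime_api.h>
#include <cusparseLt.h>

void run_matmul(const cusparseLtHandle_t*     handle,
                const cusparseLtMatmulPlan_t* plan,
                const void* dA_compressed, const void* dB,
                const void* dC, void* dD, cudaStream_t stream)
{
    size_t workspaceSize = 0;
    cusparseLtMatmulGetWorkspace(handle, plan, &workspaceSize);

    void* d_workspace = nullptr;
    cudaMalloc(&d_workspace, workspaceSize);

    const float alpha = 1.0f, beta = 0.0f;
    cusparseLtMatmul(handle, plan, &alpha, dA_compressed, dB, &beta, dC, dD,
                     d_workspace, &stream, 1);

    cudaStreamSynchronize(stream);
    cudaFree(d_workspace);
}
```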
Resolved issues:
Fixed documentation issue regarding structured matrix size constraints
cuSPARSELt v0.2.0#
New Features:
- Added support for activation functions and bias vector (see the sketch after this list):
  - ReLU + upper bound and threshold setting for all kernels
  - GeLU for INT8 input/output, INT32 Tensor Core compute kernels
- Added support for Batched Sparse GEMM:
  - Single sparse matrix / Multiple dense matrices (Broadcast)
  - Multiple sparse and dense matrices
  - Batched bias vector
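
A hedged sketch of enabling a ReLU epilogue with a bias vector. The attribute names used below (CUSPARSELT_MATMUL_ACTIVATION_RELU, CUSPARSELT_MATMUL_ACTIVATION_RELU_UPPERBOUND, CUSPARSELT_MATMUL_BIAS_POINTER) are not quoted from these notes; they are assumptions based on the cuSPARSELt attribute naming scheme and should be verified against cusparseLtMatmulDescAttribute_t.

```c++
// Hedged sketch (v0.2.0): ReLU (with an upper bound) plus a bias vector on the matmul epilogue.
// The attribute names below are assumptions; verify them against the API reference.
#include <cusparseLt.h>

void enable_relu_with_bias(const cusparseLtHandle_t*     handle,
                           cusparseLtMatmulDescriptor_t* matmulDesc,
                           const float*                  d_bias)   // device bias vector
{
    int   relu       = 1;       // turn the ReLU epilogue on
    float upperBound = 6.0f;    // clamp outputs, ReLU6-style
    cusparseLtMatmulDescSetAttribute(handle, matmulDesc,
                                     CUSPARSELT_MATMUL_ACTIVATION_RELU,
                                     &relu, sizeof(relu));
    cusparseLtMatmulDescSetAttribute(handle, matmulDesc,
                                     CUSPARSELT_MATMUL_ACTIVATION_RELU_UPPERBOUND,
                                     &upperBound, sizeof(upperBound));
    cusparseLtMatmulDescSetAttribute(handle, matmulDesc,
                                     CUSPARSELT_MATMUL_BIAS_POINTER,
                                     &d_bias, sizeof(d_bias));
}
```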
Compatibility notes:
- cuSPARSELt does not require the nvrtc library anymore
- Support for Ubuntu 16.04 (gcc-5) is now deprecated and it will be removed in future releases
cuSPARSELt v0.1.0#
New Features:
- Added support for Windows x86-64 and Linux Arm64 platforms
- Introduced SM 8.6 compatibility
- Added new kernels:
  - FP32 inputs/output, TF32 Tensor Core compute
  - TF32 inputs/output, TF32 Tensor Core compute
- Better performance for SM 8.0 kernels (up to 90% SOL)
- New APIs for compression and pruning decoupled from cusparseLtMatmulPlan_t
Compatibility notes:
- cuSPARSELt requires CUDA 11.2 or above
- cusparseLtMatDescriptor_t must be destroyed with the cusparseLtMatDescriptorDestroy function
- Both static and shared libraries must be linked with the nvrtc library
- On Linux systems, both static and shared libraries must be linked with the dl library
Resolved issues:
CUSPARSELT_MATMUL_SEARCH_ITERATIONS is now handled correctly
cuSPARSELt v0.0.1#
New Features:
- Initial release
- Support for Linux x86_64 and SM 8.0
- Provide the following mixed-precision computation kernels:
  - FP16 inputs/output, FP32 Tensor Core accumulate
  - BF16 inputs/output, FP32 Tensor Core accumulate
  - INT8 inputs/output, INT32 Tensor Core compute
Compatibility notes:
cuSPARSELt requires CUDA 11.0 or above