Release Notes#

cuSPARSELt v0.6.3#

Resolved issues:

  • Sparse GEMM could produce incorrect results on Arm64 when cusparseLtSpMMACompressedSize2() and cusparseLtSpMMACompress() are used.

Compatibility notes:

  • Added support for Ubuntu 24.04.


cuSPARSELt v0.6.2#

New Features:

  • Introduced Orin support (SM 8.7).

  • Improved performance of the following kernels for SM 8.0:

    • FP16 input/output, FP32 Tensor Core accumulate

    • BF16 input/output, FP32 Tensor Core accumulate

    • INT8 input, FP16 output, INT32 Tensor Core compute

    • INT8 input, BF16 output, INT32 Tensor Core compute

    • INT8 input, INT32 output, INT32 Tensor Core compute

API Changes:

  • Added a new enumerator value cusparseLtMatmulDescAttribute_t::CUSPARSELT_MATMUL_SPARSE_MAT_POINTER to set the pointer to the pruned sparse matrix.
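
The new attribute follows the library's existing descriptor-attribute pattern. A hedged sketch of how it might be set (illustrative fragment only, not a complete program; handle, matmul, and the device pointer d_pruned are assumed to exist, and error checking is omitted):

```c
/* Illustrative fragment: handle and matmul are an initialized
 * cusparseLtHandle_t and cusparseLtMatmulDescriptor_t, and d_pruned is a
 * device pointer to the already-pruned sparse matrix. */
cusparseLtMatmulDescSetAttribute(&handle, &matmul,
                                 CUSPARSELT_MATMUL_SPARSE_MAT_POINTER,
                                 &d_pruned, sizeof(d_pruned));
```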


cuSPARSELt v0.6.1#

Dependencies:

  • Static linking to the CUDA driver library (libcuda.so on Linux, cuda.lib on Windows) has been removed.

Compatibility notes:

  • The constraints on matrix sizes in cusparseLtStructuredDescriptorInit() and cusparseLtDenseDescriptorInit() have been relaxed. The maximum number of elements in each dimension (rows and columns) of matrices C and D is now 2097120.

Resolved issues:

  • cusparseLtSpMMACompressedSize() and cusparseLtSpMMACompressedSize2() require slightly less memory.

  • Sparse GEMM could produce incorrect results on SM 8.0.


cuSPARSELt v0.6.0#

New Features:

  • Introduced Hopper support (SM 9.0).

  • Added new kernels for the following data type combinations for the SM 9.0 architecture:

    • FP16 input/output, FP16 Tensor Core compute

    • E4M3 input/output, FP32 Tensor Core compute; the data type of matrix C can be either FP16 or BF16

    • E4M3 input, FP16 output, FP32 Tensor Core compute

    • E4M3 input, BF16 output, FP32 Tensor Core compute

    • E4M3 input, FP32 output, FP32 Tensor Core compute

    • E5M2 input/output, FP32 Tensor Core compute; the data type of matrix C can be either FP16 or BF16

    • E5M2 input, FP16 output, FP32 Tensor Core compute

    • E5M2 input, BF16 output, FP32 Tensor Core compute

    • E5M2 input, FP32 output, FP32 Tensor Core compute

Driver Requirements:

  • cuSPARSELt requires CUDA Driver r535 TRD7, r550 TRD1, or higher.

API Changes:

  • The following APIs are deprecated: cusparseLtSpMMAPrune2(), cusparseLtSpMMAPruneCheck2(), cusparseLtSpMMACompressedSize2(), cusparseLtSpMMACompress2().

Dependencies:

  • cuSPARSELt now requires linking to the CUDA driver library (libcuda.so on Linux and cuda.lib on Windows).

Known Issues:

  • cusparseLtSpMMACompressedSize() and cusparseLtSpMMACompressedSize2() allocate slightly more memory than needed. This issue will be addressed in the next release.


cuSPARSELt v0.5.2#

New Features:

  • Added a new kernel for the following data type combination: INT8 inputs, BF16 output, INT32 Tensor Core accumulate

  • Symbols are obfuscated in the static library.

Compatibility notes:

  • Added support for RHEL 7 and CentOS 7.

  • Split-K is enabled in cusparseLtMatmulSearch() across a broader range of problem dimensions.

  • The CUSPARSE_COMPUTE_16F, CUSPARSE_COMPUTE_TF32, and CUSPARSE_COMPUTE_TF32_FAST enumerators have been removed from the cusparseComputeType enumeration and replaced with CUSPARSE_COMPUTE_32F to better express the accuracy of the computation at the Tensor Core level.


cuSPARSELt v0.5.0#

New Features:

  • Added a new kernel for the following data type combination: INT8 inputs, INT32 output, INT32 Tensor Core accumulate

Compatibility notes:

  • cuSPARSELt requires CUDA 12.0 or above and a compatible driver (see CUDA Driver Release Notes).

  • cusparseLtMatmulAlgSelectionInit() does not guarantee the same ordering of algorithm IDs (alg) as in v0.4.0.


cuSPARSELt v0.4.0#

New Features:

  • Introduced SM 8.9 compatibility

  • The initialization time of cuSPARSELt descriptors has been significantly reduced

  • cusparseLtMatmulSearch() efficiency has been improved

  • Removed all internal memory allocations

  • Added a new kernel for supporting the following data type combination: INT8 input, INT32 Tensor Core compute, FP16 output

  • Added cusparseLtGetVersion() and cusparseLtGetProperty() functions to retrieve the library version

API Changes:

  • cusparseLtSpMMACompressedSize(), cusparseLtSpMMACompress(), cusparseLtSpMMACompressedSize2(), and cusparseLtSpMMACompress2() have a new parameter that avoids internal memory allocations and supports a user-provided device memory buffer for the compression
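
These compression APIs operate on 2:4 structured sparsity, where each group of four values keeps at most two nonzeros plus metadata recording their positions. The idea can be sketched on the CPU in plain C (a conceptual illustration only; the library's actual on-device format and selection logic are implementation details):

```c
#include <stddef.h>
#include <math.h>

/* Illustrative 2:4 compression: for each group of 4 input values, keep the
 * two with the largest magnitude and record their positions as metadata.
 * This mimics the concept behind cusparseLtSpMMACompress(); it is not the
 * library's real layout. n must be a multiple of 4; vals and idx each
 * receive n/2 entries. */
void compress_2_4(const float *in, size_t n, float *vals, unsigned char *idx)
{
    for (size_t g = 0; g < n / 4; ++g) {
        const float *grp = in + 4 * g;
        /* position of the largest-magnitude element */
        int a = 0;
        for (int i = 1; i < 4; ++i)
            if (fabsf(grp[i]) > fabsf(grp[a])) a = i;
        /* position of the second-largest, excluding a */
        int b = (a == 0) ? 1 : 0;
        for (int i = 0; i < 4; ++i)
            if (i != a && fabsf(grp[i]) > fabsf(grp[b])) b = i;
        /* store the kept positions in ascending order */
        if (a > b) { int t = a; a = b; b = t; }
        vals[2 * g]     = grp[a];
        vals[2 * g + 1] = grp[b];
        idx[2 * g]      = (unsigned char)a;
        idx[2 * g + 1]  = (unsigned char)b;
    }
}
```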

Compatibility notes:

  • cuSPARSELt requires CUDA Driver 470.xx (CUDA 11.4) or above

  • cuSPARSELt now uses the static version of the cudart library

  • The support for Ubuntu 16.04 (gcc-5) has been removed


cuSPARSELt v0.3.0#

New Features:

  • Added support for vectors of alpha and beta scalars (per-channel scaling)

  • Added support for GeLU scaling

  • Added support for Split-K Mode

  • Full support for logging functionalities and NVTX ranges

API Changes:

  • Added the cusparseLtMatmulGetWorkspace() API to query the workspace size needed by cusparseLtMatmul()

Resolved issues:

  • Fixed documentation issue regarding structured matrix size constraints


cuSPARSELt v0.2.0#

New Features:

  • Added support for activation functions and bias vector:

    • ReLU + upper bound and threshold setting for all kernels

    • GeLU for INT8 input/output, INT32 Tensor Core compute kernels

  • Added support for Batched Sparse GEMM:

    • Single sparse matrix / Multiple dense matrices (Broadcast)

    • Multiple sparse and dense matrices

    • Batched bias vector
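
The ReLU epilogue above takes both an upper bound and a threshold. One common reading of those semantics, sketched in plain C (this reading is an assumption; consult the cuSPARSELt documentation for the exact definition):

```c
/* Conceptual sketch of a ReLU epilogue with threshold and upper bound:
 * values at or below the threshold become 0, and values above it are
 * clipped to the upper bound. Not the library's kernel code. */
float relu_clamped(float x, float threshold, float upper_bound)
{
    if (x <= threshold)
        return 0.0f;
    return (x < upper_bound) ? x : upper_bound;
}
```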

Compatibility notes:

  • cuSPARSELt no longer requires the nvrtc library

  • Support for Ubuntu 16.04 (gcc-5) is deprecated and will be removed in a future release


cuSPARSELt v0.1.0#

New Features:

  • Added support for Windows x86-64 and Linux Arm64 platforms

  • Introduced SM 8.6 compatibility

  • Added new kernels:

    • FP32 inputs/output, TF32 Tensor Core compute

    • TF32 inputs/output, TF32 Tensor Core compute

  • Better performance for SM 8.0 kernels (up to 90% of speed-of-light, SOL)

  • New APIs for compression and pruning decoupled from cusparseLtMatmulPlan_t

Compatibility notes:

  • cuSPARSELt requires CUDA 11.2 or above

  • cusparseLtMatDescriptor_t must be destroyed with the cusparseLtMatDescriptorDestroy() function

  • Both static and shared libraries must be linked with the nvrtc library

  • On Linux systems, both static and shared libraries must be linked with the dl library

Resolved issues:

  • CUSPARSELT_MATMUL_SEARCH_ITERATIONS is now handled correctly


cuSPARSELt v0.0.1#

New Features:

  • Initial release

  • Supports Linux x86_64 and SM 8.0

  • Provides the following mixed-precision computation kernels:

    • FP16 inputs/output, FP32 Tensor Core accumulate

    • BF16 inputs/output, FP32 Tensor Core accumulate

    • INT8 inputs/output, INT32 Tensor Core compute

Compatibility notes:

  • cuSPARSELt requires CUDA 11.0 or above