################################################################################
Release Notes
################################################################################

================================================================================
cuSPARSELt v0.7.1
================================================================================

*Resolved issues*:

* The binary size has been significantly reduced.
* `cusparseLtSpMMACompress2()` could crash for `E4M3` and generate incorrect
  results for `FP32` on `SM 9.0`.

================================================================================
cuSPARSELt v0.7.0
================================================================================

*New Features*:

* Introduced Blackwell support (`SM 10.0` and `SM 12.0`).
* Added new *block-scaled* kernels for the following data type combinations on
  the `SM 10.0` and `SM 12.0` architectures:

  * `E4M3` input/output, `FP32` Tensor Core compute; the data type of matrix C
    can be either `FP16` or `BF16`
  * `E4M3` input, `FP16` output, `FP32` Tensor Core compute
  * `E4M3` input, `BF16` output, `FP32` Tensor Core compute
  * `E4M3` input, `FP32` output, `FP32` Tensor Core compute
  * `E2M1` input/output, `FP32` Tensor Core compute; the data type of matrix C
    can be either `FP16` or `BF16`
  * `E2M1` input, `FP16` output, `FP32` Tensor Core compute
  * `E2M1` input, `BF16` output, `FP32` Tensor Core compute
  * `E2M1` input, `FP32` output, `FP32` Tensor Core compute

*API Changes*:

* Introduced the following API changes to set up the scaling factors for the
  block-scaled kernels:

  * Added the new enumerator type `cusparseLtMatmulMatrixScale_t` to specify
    the scaling mode, which defines how the scaling factor pointers are
    interpreted.
  * Added new `cusparseLtMatmulAlgSelection_t` enumerator values:

    * `CUSPARSELT_MATMUL_{A,B,C,D,D_OUT}_SCALE_MODE` to specify the scaling
      mode of type `cusparseLtMatmulMatrixScale_t` for the corresponding
      matrix.
    * `CUSPARSELT_MATMUL_{A,B,C,D,D_OUT}_SCALE_POINTER` to set a device
      pointer to a scalar or a tensor of scaling factors for the corresponding
      matrix, depending on the matrix's scaling mode.

* Added new values to the enumerator type `cusparseLtSplitKMode_t`:

  * `CUSPARSELT_HEURISTIC`
  * `CUSPARSELT_DATAPARALLEL`
  * `CUSPARSELT_SPLITK`
  * `CUSPARSELT_STREAMK`

*Resolved issues*:

* Fixed crashes and hangs in multi-GPU configurations.

*Compatibility notes*:

* *cuSPARSELt* requires CUDA 12.8 or above and a compatible driver (see
  `CUDA Driver Release Notes `_).
* Support for *Ubuntu 18.04*, *RHEL 7*, and *CentOS 7* has been removed.
* The size of the *cuSPARSELt* data types `cusparseLtHandle_t`,
  `cusparseLtMatDescriptor_t`, `cusparseLtMatmulDescAttribute_t`,
  `cusparseLtMatmulAlgSelection_t`, and `cusparseLtMatmulPlan_t` has been
  reduced.
* The static library for `Windows x86-64` is no longer provided. Please use
  the dynamic library instead.

*Known Issues*:

* `cusparseLtSpMMACompressedSize2()` allocates slightly more memory than
  needed.

----

================================================================================
cuSPARSELt v0.6.3
================================================================================

*Resolved issues*:

* Sparse GEMM could produce incorrect results on `Arm64` if
  `cusparseLtSpMMACompressedSize2()` and `cusparseLtSpMMACompress2()` are
  used.

*Compatibility notes*:

* Added support for *Ubuntu 24.04*.

----

================================================================================
cuSPARSELt v0.6.2
================================================================================

*New Features*:

* Introduced Orin support (`SM 8.7`).
* Improved performance of the following kernels for `SM 8.0`:

  * `FP16` input/output, `FP32` Tensor Core accumulate
  * `BF16` input/output, `FP32` Tensor Core accumulate
  * `INT8` input, `FP16` output, `INT32` Tensor Core compute
  * `INT8` input, `BF16` output, `INT32` Tensor Core compute
  * `INT8` input, `INT32` output, `INT32` Tensor Core compute

*API Changes*:

* Added a new enumerator value
  `cusparseLtMatmulDescAttribute_t::CUSPARSELT_MATMUL_SPARSE_MAT_POINTER` to
  set the pointer to the pruned sparse matrix.

----

================================================================================
cuSPARSELt v0.6.1
================================================================================

*Dependencies*:

* The static linking to the CUDA driver library (`libcuda.so` on Linux and
  `cuda.lib` on Windows) has been removed.

*Compatibility notes*:

* The constraints on matrix sizes (`cusparseLtStructuredDescriptorInit` and
  `cusparseLtDenseDescriptorInit`) have been relaxed. The maximum number of
  elements for each dimension (rows and columns) of matrices `C` and `D` is
  limited to 2097120.

*Resolved issues*:

* `cusparseLtSpMMACompressedSize()` and `cusparseLtSpMMACompressedSize2()`
  require slightly less memory.
* Sparse GEMM could produce incorrect results on `SM 8.0`.

----

================================================================================
cuSPARSELt v0.6.0
================================================================================

*New Features*:

* Introduced Hopper support (`SM 9.0`).
* Added new kernels for the following data type combinations for the `SM 9.0`
  architecture:

  * `FP16` input/output, `FP16` Tensor Core compute
  * `E4M3` input/output, `FP32` Tensor Core compute; the data type of matrix C
    can be either `FP16` or `BF16`
  * `E4M3` input, `FP16` output, `FP32` Tensor Core compute
  * `E4M3` input, `BF16` output, `FP32` Tensor Core compute
  * `E4M3` input, `FP32` output, `FP32` Tensor Core compute
  * `E5M2` input/output, `FP32` Tensor Core compute; the data type of matrix C
    can be either `FP16` or `BF16`
  * `E5M2` input, `FP16` output, `FP32` Tensor Core compute
  * `E5M2` input, `BF16` output, `FP32` Tensor Core compute
  * `E5M2` input, `FP32` output, `FP32` Tensor Core compute

*Driver Requirements*:

* *cuSPARSELt* requires CUDA Driver `r535 TRD7`, `r550 TRD1`, or higher.

*API Changes*:

* The following APIs are deprecated: `cusparseLtSpMMAPrune2()`,
  `cusparseLtSpMMAPruneCheck2()`, `cusparseLtSpMMACompressedSize2()`,
  `cusparseLtSpMMACompress2()`.

*Dependencies*:

* *cuSPARSELt* now requires linking against the CUDA driver library
  (`libcuda.so` on Linux and `cuda.lib` on Windows).

*Known Issues*:

* `cusparseLtSpMMACompressedSize()` and `cusparseLtSpMMACompressedSize2()`
  allocate slightly more memory than needed. This issue will be addressed in
  the next release.

----

================================================================================
cuSPARSELt v0.5.2
================================================================================

*New Features*:

* Added a new kernel for the following data type combination: `INT8` input,
  `BF16` output, `INT32` Tensor Core accumulate.
* Symbols are obfuscated in the static library.

*Compatibility notes*:

* Added support for *RHEL 7* and *CentOS 7*.
* Split-K is enabled in `cusparseLtMatmulSearch()` across a broader range of
  problem dimensions.
* The `CUSPARSE_COMPUTE_16F`, `CUSPARSE_COMPUTE_TF32`, and
  `CUSPARSE_COMPUTE_TF32_FAST` enumerators have been removed from the
  `cusparseComputeType` enumerator and replaced with `CUSPARSE_COMPUTE_32F`
  to better express the accuracy of the computation at the Tensor Core level.

----

================================================================================
cuSPARSELt v0.5.0
================================================================================

*New Features*:

* Added a new kernel for the following data type combination: `INT8` input,
  `INT32` output, `INT32` Tensor Core accumulate.

*Compatibility notes*:

* *cuSPARSELt* requires CUDA 12.0 or above and a compatible driver (see
  `CUDA Driver Release Notes `_).
* `cusparseLtMatmulAlgSelectionInit()` does not ensure the same ordering of
  algorithm id `alg` as in v0.4.0.

----

================================================================================
cuSPARSELt v0.4.0
================================================================================

*New Features*:

* Introduced `SM 8.9` compatibility.
* The initialization time of cuSPARSELt descriptors has been significantly
  improved.
* `cusparseLtMatmulSearch()` efficiency has been improved.
* Removed all internal memory allocations.
* Added a new kernel supporting the following data type combination: `INT8`
  input, `INT32` Tensor Core compute, `FP16` output.
* Added the `cusparseLtGetVersion()` and `cusparseLtGetProperty()` functions
  to retrieve the library version.

*API Changes*:

* `cusparseLtSpMMACompressedSize()`, `cusparseLtSpMMACompress()`,
  `cusparseLtSpMMACompressedSize2()`, and `cusparseLtSpMMACompress2()` have a
  new parameter to avoid internal memory allocations and to support a
  user-provided device memory buffer for the compression.

*Compatibility notes*:

* *cuSPARSELt* requires CUDA Driver 470.xx (CUDA 11.4) or above.
* cuSPARSELt now uses the static version of the `cudart` library.
* Support for *Ubuntu 16.04* (gcc-5) has been removed.

----
================================================================================
cuSPARSELt v0.3.0
================================================================================

*New Features*:

* Added support for *vectors of alpha and beta scalars* (per-channel scaling).
* Added support for *GeLU scaling*.
* Added support for *Split-K Mode*.
* Full support for logging functionalities and NVTX ranges.

*API Changes*:

* Added the `cusparseLtMatmulGetWorkspace()` API to get the workspace size
  needed by `cusparseLtMatmul()`.

*Resolved issues*:

* Fixed a documentation issue regarding structured matrix size constraints.

----

================================================================================
cuSPARSELt v0.2.0
================================================================================

*New Features*:

* Added support for *activation functions* and a *bias vector*:

  - ReLU, with upper bound and threshold settings, for all kernels
  - GeLU for `INT8` input/output, `INT32` Tensor Core compute kernels

* Added support for *Batched Sparse GEMM*:

  - Single sparse matrix / multiple dense matrices (*Broadcast*)
  - Multiple sparse and dense matrices
  - Batched bias vector

*Compatibility notes*:

* *cuSPARSELt* no longer requires the `nvrtc` library.
* Support for *Ubuntu 16.04* (gcc-5) is now deprecated and will be removed in
  a future release.

----

================================================================================
cuSPARSELt v0.1.0
================================================================================

*New Features*:

* Added support for the `Windows x86-64` and `Linux Arm64` platforms.
* Introduced `SM 8.6` compatibility.
* Added new kernels:

  - `FP32` input/output, `TF32` Tensor Core compute
  - `TF32` input/output, `TF32` Tensor Core compute

* Better performance for `SM 8.0` kernels (up to 90% SOL).
* New APIs for compression and pruning decoupled from
  `cusparseLtMatmulPlan_t`.

*Compatibility notes*:

* *cuSPARSELt* requires CUDA 11.2 or above.
* `cusparseLtMatDescriptor_t` must
  be destroyed with the `cusparseLtMatDescriptorDestroy()` function.
* Both the *static* and *shared* libraries must be linked with the `nvrtc`
  library.
* On Linux systems, both the *static* and *shared* libraries must be linked
  with the `dl` library.

*Resolved issues*:

* `CUSPARSELT_MATMUL_SEARCH_ITERATIONS` is now handled correctly.

----

================================================================================
cuSPARSELt v0.0.1
================================================================================

*New Features*:

* Initial release.
* Supports `Linux x86_64` and `SM 8.0`.
* Provides the following mixed-precision computation kernels:

  * `FP16` input/output, `FP32` Tensor Core accumulate
  * `BF16` input/output, `FP32` Tensor Core accumulate
  * `INT8` input/output, `INT32` Tensor Core compute

*Compatibility notes*:

* *cuSPARSELt* requires CUDA 11.0 or above.
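All of the kernels listed throughout these releases rely on the 2:4
structured-sparsity pattern that the pruning entry points
(`cusparseLtSpMMAPrune()` and the later `*2()` variants) enforce. As a
conceptual host-side illustration only — not the library's actual
device-side algorithm, which operates on GPU memory and can use tile-based
strategies — applying the basic rule (keep the two largest-magnitude values
in each group of four) might look like:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative 2:4 pruning: in every group of four consecutive elements,
// zero out the two with the smallest magnitude, so exactly two non-zeros
// remain per group. This mimics the effect of the library's pruning step
// on the host; it is not the library's implementation.
void prune24(std::vector<float>& data) {
    for (size_t g = 0; g + 4 <= data.size(); g += 4) {
        size_t idx[4] = {g, g + 1, g + 2, g + 3};
        // Order the group's indices by descending magnitude.
        std::sort(idx, idx + 4, [&](size_t a, size_t b) {
            return std::fabs(data[a]) > std::fabs(data[b]);
        });
        data[idx[2]] = 0.0f; // drop the two smallest-magnitude entries
        data[idx[3]] = 0.0f;
    }
}
```

After pruning, each group of four holds exactly two non-zeros, which is the
precondition for the compression and sparse GEMM paths described above.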