################################################################################
Release Notes
################################################################################

================================================================================
cuSPARSELt v0.7.1
================================================================================

*Resolved issues*:

* The binary size has been significantly reduced.
* `cusparseLtSpMMACompress2()` could crash for `E4M3` and generate incorrect
  results for `FP32` on `SM 9.0`.

================================================================================
cuSPARSELt v0.7.0
================================================================================

*New Features*:

* Introduced Blackwell support (`SM 10.0` and `SM 12.0`).
* Added new *block-scaled* kernels for the following data type combinations on
  the `SM 10.0` and `SM 12.0` architectures:

  * `E4M3` input/output, `FP32` Tensor Core compute; the data type of matrix C
    can be either `FP16` or `BF16`
  * `E4M3` input, `FP16` output, `FP32` Tensor Core compute
  * `E4M3` input, `BF16` output, `FP32` Tensor Core compute
  * `E4M3` input, `FP32` output, `FP32` Tensor Core compute
  * `E2M1` input/output, `FP32` Tensor Core compute; the data type of matrix C
    can be either `FP16` or `BF16`
  * `E2M1` input, `FP16` output, `FP32` Tensor Core compute
  * `E2M1` input, `BF16` output, `FP32` Tensor Core compute
  * `E2M1` input, `FP32` output, `FP32` Tensor Core compute

*API Changes*:

* Introduced the following API changes to set up the scaling factors for the
  block-scaled kernels:

  * Added the new enumerator type `cusparseLtMatmulMatrixScale_t` to specify
    the scaling mode, which defines how the scaling factor pointers are
    interpreted.
  * Added new `cusparseLtMatmulAlgSelection_t` enumerator values:

    * `CUSPARSELT_MATMUL_{A,B,C,D,D_OUT}_SCALE_MODE` to specify the scaling
      mode of type `cusparseLtMatmulMatrixScale_t` for the corresponding
      matrix.
    * `CUSPARSELT_MATMUL_{A,B,C,D,D_OUT}_SCALE_POINTER` to set a device
      pointer to a scalar or a tensor of scaling factors for the corresponding
      matrix, depending on the matrix's scaling mode.

* Added new values to the enumerator type `cusparseLtSplitKMode_t`:

  * `CUSPARSELT_HEURISTIC`
  * `CUSPARSELT_DATAPARALLEL`
  * `CUSPARSELT_SPLITK`
  * `CUSPARSELT_STREAMK`

*Resolved issues*:

* Fixed crashes and hangs in multi-GPU configurations.

*Compatibility notes*:

* *cuSPARSELt* requires CUDA 12.8 or above and a compatible driver (see
  `CUDA Driver Release Notes `_).
* Support for *Ubuntu 18.04*, *RHEL 7*, and *CentOS 7* has been removed.
* The size of the *cuSPARSELt* data types `cusparseLtHandle_t`,
  `cusparseLtMatDescriptor_t`, `cusparseLtMatmulDescAttribute_t`,
  `cusparseLtMatmulAlgSelection_t`, and `cusparseLtMatmulPlan_t` has been
  reduced.
* The static library for `Windows x86-64` is no longer provided. Please use
  the dynamic library instead.

*Known Issues*:

* `cusparseLtSpMMACompressedSize2()` allocates slightly more memory than
  needed.

----

================================================================================
cuSPARSELt v0.6.3
================================================================================

*Resolved issues*:

* Sparse GEMM could produce incorrect results on `Arm64` if
  `cusparseLtSpMMACompressedSize2()` and `cusparseLtSpMMACompress2()` are
  used.

*Compatibility notes*:

* Added support for *Ubuntu 24.04*.

----

================================================================================
cuSPARSELt v0.6.2
================================================================================

*New Features*:

* Introduced Orin support (`SM 8.7`).
* Improved performance of the following kernels for `SM 8.0`:

  * `FP16` input/output, `FP32` Tensor Core accumulate
  * `BF16` input/output, `FP32` Tensor Core accumulate
  * `INT8` input, `FP16` output, `INT32` Tensor Core compute
  * `INT8` input, `BF16` output, `INT32` Tensor Core compute
  * `INT8` input, `INT32` output, `INT32` Tensor Core compute

*API Changes*:

* Added a new enumerator value
  `cusparseLtMatmulDescAttribute_t::CUSPARSELT_MATMUL_SPARSE_MAT_POINTER` to
  set the pointer to the pruned sparse matrix.

----

================================================================================
cuSPARSELt v0.6.1
================================================================================

*Dependencies*:

* The static linking to the CUDA driver library (`libcuda.so` on Linux and
  `cuda.lib` on Windows) has been removed.

*Compatibility notes*:

* The constraints on matrix sizes (`cusparseLtStructuredDescriptorInit` and
  `cusparseLtDenseDescriptorInit`) have been relaxed. The maximum number of
  elements for each dimension (rows and columns) of matrices `C` and `D` is
  limited to 2097120.

*Resolved issues*:

* `cusparseLtSpMMACompressedSize()` and `cusparseLtSpMMACompressedSize2()`
  require slightly less memory.
* Sparse GEMM could produce incorrect results on `SM 8.0`.

----

================================================================================
cuSPARSELt v0.6.0
================================================================================

*New Features*:

* Introduced Hopper support (`SM 9.0`).
* Added new kernels for the following data type combinations for the `SM 9.0`
  architecture:

  * `FP16` input/output, `FP16` Tensor Core compute
  * `E4M3` input/output, `FP32` Tensor Core compute; the data type of matrix C
    can be either `FP16` or `BF16`
  * `E4M3` input, `FP16` output, `FP32` Tensor Core compute
  * `E4M3` input, `BF16` output, `FP32` Tensor Core compute
  * `E4M3` input, `FP32` output, `FP32` Tensor Core compute
  * `E5M2` input/output, `FP32` Tensor Core compute; the data type of matrix C
    can be either `FP16` or `BF16`
  * `E5M2` input, `FP16` output, `FP32` Tensor Core compute
  * `E5M2` input, `BF16` output, `FP32` Tensor Core compute
  * `E5M2` input, `FP32` output, `FP32` Tensor Core compute

*Driver Requirements*:

* *cuSPARSELt* requires CUDA Driver `r535 TRD7`, `r550 TRD1`, or higher.

*API Changes*:

* The following APIs are deprecated: `cusparseLtSpMMAPrune2()`,
  `cusparseLtSpMMAPruneCheck2()`, `cusparseLtSpMMACompressedSize2()`,
  `cusparseLtSpMMACompress2()`.

*Dependencies*:

* *cuSPARSELt* now requires linking against the CUDA driver library
  (`libcuda.so` on Linux and `cuda.lib` on Windows).

*Known Issues*:

* `cusparseLtSpMMACompressedSize()` and `cusparseLtSpMMACompressedSize2()`
  allocate slightly more memory than needed. This issue will be addressed in
  the next release.

----

================================================================================
cuSPARSELt v0.5.2
================================================================================

*New Features*:

* Added a new kernel for the following data type combination: `INT8` input,
  `BF16` output, `INT32` Tensor Core accumulate.
* Symbols are obfuscated in the static library.

*Compatibility notes*:

* Added support for *RHEL 7* and *CentOS 7*.
* Split-K is enabled in `cusparseLtMatmulSearch()` across a broader range of
  problem dimensions.
* The `CUSPARSE_COMPUTE_16F`, `CUSPARSE_COMPUTE_TF32`, and
  `CUSPARSE_COMPUTE_TF32_FAST` enumerators have been removed from the
  `cusparseComputeType` enumerator and replaced with `CUSPARSE_COMPUTE_32F`
  to better express the accuracy of the computation at the Tensor Core level.

----

================================================================================
cuSPARSELt v0.5.0
================================================================================

*New Features*:

* Added a new kernel for the following data type combination: `INT8` input,
  `INT32` output, `INT32` Tensor Core accumulate.

*Compatibility notes*:

* *cuSPARSELt* requires CUDA 12.0 or above and a compatible driver (see
  `CUDA Driver Release Notes `_).
* `cusparseLtMatmulAlgSelectionInit()` does not ensure the same ordering of
  algorithm id `alg` as in v0.4.0.

----

================================================================================
cuSPARSELt v0.4.0
================================================================================

*New Features*:

* Introduced `SM 8.9` compatibility.
* The initialization time of cuSPARSELt descriptors has been significantly
  improved.
* `cusparseLtMatmulSearch()` efficiency has been improved.
* Removed all internal memory allocations.
* Added a new kernel supporting the following data type combination: `INT8`
  input, `INT32` Tensor Core compute, `FP16` output.
* Added the `cusparseLtGetVersion()` and `cusparseLtGetProperty()` functions
  to retrieve the library version.

*API Changes*:

* `cusparseLtSpMMACompressedSize()`, `cusparseLtSpMMACompress()`,
  `cusparseLtSpMMACompressedSize2()`, and `cusparseLtSpMMACompress2()` have a
  new parameter to avoid internal memory allocations and to support a
  user-provided device memory buffer for the compression.

*Compatibility notes*:

* *cuSPARSELt* requires CUDA Driver 470.xx (CUDA 11.4) or above.
* cuSPARSELt now uses the static version of the `cudart` library.
* Support for *Ubuntu 16.04* (gcc-5) has been removed.

----
================================================================================
cuSPARSELt v0.3.0
================================================================================

*New Features*:

* Added support for *vectors of alpha and beta scalars* (per-channel scaling).
* Added support for *GeLU scaling*.
* Added support for *Split-K Mode*.
* Full support for logging functionalities and NVTX ranges.

*API Changes*:

* Added the `cusparseLtMatmulGetWorkspace()` API to get the workspace size
  needed by `cusparseLtMatmul()`.

*Resolved issues*:

* Fixed a documentation issue regarding structured matrix size constraints.

----

================================================================================
cuSPARSELt v0.2.0
================================================================================

*New Features*:

* Added support for *activation functions* and a *bias vector*:

  - ReLU, with upper bound and threshold settings, for all kernels
  - GeLU for `INT8` input/output, `INT32` Tensor Core compute kernels

* Added support for *Batched Sparse GEMM*:

  - Single sparse matrix / multiple dense matrices (*Broadcast*)
  - Multiple sparse and dense matrices
  - Batched bias vector

*Compatibility notes*:

* *cuSPARSELt* no longer requires the `nvrtc` library.
* Support for *Ubuntu 16.04* (gcc-5) is now deprecated and will be removed in
  a future release.

----

================================================================================
cuSPARSELt v0.1.0
================================================================================

*New Features*:

* Added support for the `Windows x86-64` and `Linux Arm64` platforms.
* Introduced `SM 8.6` compatibility.
* Added new kernels:

  - `FP32` input/output, `TF32` Tensor Core compute
  - `TF32` input/output, `TF32` Tensor Core compute

* Better performance for `SM 8.0` kernels (up to 90% SOL).
* New APIs for compression and pruning decoupled from
  `cusparseLtMatmulPlan_t`.

*Compatibility notes*:

* *cuSPARSELt* requires CUDA 11.2 or above.
* `cusparseLtMatDescriptor_t` must
  be destroyed with the `cusparseLtMatDescriptorDestroy()` function.
* Both the *static* and *shared* libraries must be linked with the `nvrtc`
  library.
* On Linux systems, both the *static* and *shared* libraries must be linked
  with the `dl` library.

*Resolved issues*:

* `CUSPARSELT_MATMUL_SEARCH_ITERATIONS` is now handled correctly.

----

================================================================================
cuSPARSELt v0.0.1
================================================================================

*New Features*:

* Initial release.
* Supports `Linux x86_64` and `SM 8.0`.
* Provides the following mixed-precision computation kernels:

  * `FP16` input/output, `FP32` Tensor Core accumulate
  * `BF16` input/output, `FP32` Tensor Core accumulate
  * `INT8` input/output, `INT32` Tensor Core compute

*Compatibility notes*:

* *cuSPARSELt* requires CUDA 11.0 or above.
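All of the kernels listed throughout these releases rely on the 2:4
structured-sparsity pattern that the pruning entry points
(`cusparseLtSpMMAPrune()` and the later `*2()` variants) enforce. As a
conceptual host-side illustration only — not the library's actual
device-side algorithm, which operates on GPU memory and can use tile-based
strategies — applying the basic rule (keep the two largest-magnitude values
in each group of four) might look like:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative 2:4 pruning: in every group of four consecutive elements,
// zero out the two with the smallest magnitude, so exactly two non-zeros
// remain per group. This mimics the effect of the library's pruning step
// on the host; it is not the library's implementation.
void prune24(std::vector<float>& data) {
    for (size_t g = 0; g + 4 <= data.size(); g += 4) {
        size_t idx[4] = {g, g + 1, g + 2, g + 3};
        // Order the group's indices by descending magnitude.
        std::sort(idx, idx + 4, [&](size_t a, size_t b) {
            return std::fabs(data[a]) > std::fabs(data[b]);
        });
        data[idx[2]] = 0.0f; // drop the two smallest-magnitude entries
        data[idx[3]] = 0.0f;
    }
}
```

After pruning, each group of four holds exactly two non-zeros, which is the
precondition for the compression and sparse GEMM paths described above.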