Release Notes#
cuSPARSELt v0.9.0#
New Features:
- Introduced support for B300 (`SM 10.3`).
Compatibility notes:
- CUDA 12.9 is no longer supported.
- Support for the Jetson platform has been discontinued. Jetson Orin and Thor devices should transition to the Arm64 (SBSA) architecture.
Resolved issues
- Fixed a compatibility issue with C compilers in the header `cusparseLt.h`.
- Fixed an `std::string` ABI incompatibility with libstdc++.
Known Issues
- A performance regression is observed on GB203 when the GPU operates at around 2300 MHz.
- An extremely rare SM103 kernel hang has been observed under MPS during concurrent multi-process execution, occurring only in long-running production environments with mixed-priority client streams.
cuSPARSELt v0.8.1#
Resolved issues
- The performance regressions on H100 and L4 GPUs introduced in v0.8.0 have been fixed.
- `#include <cstddef>` is replaced with `#include <stddef.h>` in the header file `cusparseLt.h` to ensure compatibility with C compilers.
cuSPARSELt v0.8.0#
New Features:
- Introduced support for Thor (`SM 10.1` in CUDA 12.9, `SM 11.0` in CUDA 13.0) and DGX Spark (`SM 12.1`).
- Introduced `E4M3` and `E5M2` kernels for `SM 8.9`.
- Added forward compatibility for non-block-scaled data type combinations to ensure support for future architectures.
- Improved performance of `SM 10.0` kernels (up to 18%) and of `FP16` kernels on `SM 9.0` (up to 10%).
API Changes:
- Introduced `cusparseLtGetErrorString()` and `cusparseLtGetErrorName()` to remove the dependency on the cuSPARSE library.
- Introduced `cusparseLtAlgSelectionDestroy()`. This function must be invoked to release the resources used by an instance of an algorithm selection.
Compatibility notes:
- This release is compatible with CUDA 12.9 and CUDA 13.0.
Resolved issues
- Fixed host memory alignment of `cusparseLtMatmulDescAttribute_t` and `cusparseLtMatmulAlgSelection_t` that could cause crashes.
- `cusparseLtSpMMACompressedSize()` and `cusparseLtSpMMACompressedSize2()` returned more memory than needed for `E2M1` kernels.
Known Issues
- If used in pure C code, `#include <cstddef>` has to be replaced with `#include <stddef.h>` in the header file `cusparseLt.h` to ensure compatibility with C compilers.
- Benchmarks show an average 8.5% performance regression on L4 GPUs. An optimization is planned for the next release.
cuSPARSELt v0.7.1#
Resolved issues
- The binary size has been significantly reduced.
- `cusparseLtSpMMACompress2()` could crash for `E4M3` and generate incorrect results for `FP32` on `SM 9.0`.
cuSPARSELt v0.7.0#
New Features:
- Introduced Blackwell support (`SM 10.0` and `SM 12.0`).
- Added new block-scaled kernels for the following data type combinations on the `SM 10.0` and `SM 12.0` architectures:
  - `E4M3` input/output, `FP32` Tensor Core compute; the data type of matrix C can be either `FP16` or `BF16`
  - `E4M3` input, `FP16` output, `FP32` Tensor Core compute
  - `E4M3` input, `BF16` output, `FP32` Tensor Core compute
  - `E4M3` input, `FP32` output, `FP32` Tensor Core compute
  - `E2M1` input/output, `FP32` Tensor Core compute; the data type of matrix C can be either `FP16` or `BF16`
  - `E2M1` input, `FP16` output, `FP32` Tensor Core compute
  - `E2M1` input, `BF16` output, `FP32` Tensor Core compute
  - `E2M1` input, `FP32` output, `FP32` Tensor Core compute
API Changes:
- Introduced the following API changes to set up the scaling factors for the block-scaled kernels:
  - Added the new enumerator type `cusparseLtMatmulMatrixScale_t` to specify the scaling mode, which defines how scaling factor pointers are interpreted.
  - Added the new `cusparseLtMatmulDescAttribute_t` enumerator values:
    - `CUSPARSELT_MATMUL_{A,B,C,D,D_OUT}_SCALE_MODE` to specify the scaling mode of type `cusparseLtMatmulMatrixScale_t` for the corresponding matrix.
    - `CUSPARSELT_MATMUL_{A,B,C,D,D_OUT}_SCALE_POINTER` to set a device pointer to a scalar or a tensor of scaling factors for the corresponding matrix, depending on the matrix's scaling mode.
- Added new values to the enumerator type `cusparseLtSplitKMode_t`:
  - `CUSPARSELT_HEURISTIC`
  - `CUSPARSELT_DATAPARALLEL`
  - `CUSPARSELT_SPLITK`
  - `CUSPARSELT_STREAMK`
Resolved issues
- Fixed multi-GPU configurations causing crashes or hangs.
Compatibility notes:
- cuSPARSELt requires CUDA 12.8 or above, and compatible drivers (see the CUDA Driver Release Notes).
- Support for Ubuntu 18.04, RHEL 7, and CentOS 7 has been removed.
- The size of the cuSPARSELt data types `cusparseLtHandle_t`, `cusparseLtMatDescriptor_t`, `cusparseLtMatmulDescAttribute_t`, `cusparseLtMatmulAlgSelection_t`, and `cusparseLtMatmulPlan_t` has been reduced.
- The static library for Windows x86-64 is no longer provided. Please use the dynamic library instead.
Known Issues
- `cusparseLtSpMMACompressedSize2()` allocates slightly more memory than needed.
cuSPARSELt v0.6.3#
Resolved issues
- Sparse GEMM could produce incorrect results on Arm64 if `cusparseLtSpMMACompressedSize2()` and `cusparseLtSpMMACompress2()` are used.
Compatibility notes:
- Added support for Ubuntu 24.04.
cuSPARSELt v0.6.2#
New Features:
- Introduced Orin support (`SM 8.7`).
- Improved performance of the following kernels for `SM 8.0`:
  - `FP16` input/output, `FP32` Tensor Core accumulate
  - `BF16` input/output, `FP32` Tensor Core accumulate
  - `INT8` input, `FP16` output, `INT32` Tensor Core compute
  - `INT8` input, `BF16` output, `INT32` Tensor Core compute
  - `INT8` input, `INT32` output, `INT32` Tensor Core compute
API Changes:
- Added a new enumerator value `cusparseLtMatmulDescAttribute_t::CUSPARSELT_MATMUL_SPARSE_MAT_POINTER` to set the pointer to the pruned sparse matrix.
cuSPARSELt v0.6.1#
Dependencies:
- Static linking to the CUDA driver library (`libcuda.so` on Linux and `cuda.lib` on Windows) has been removed.
Compatibility notes:
- The constraints on matrix sizes (`cusparseLtStructuredDescriptorInit` and `cusparseLtDenseDescriptorInit`) have been relaxed. The maximum number of elements for each dimension (rows and columns) of matrices C and D is limited to 2097120.
Resolved issues
- `cusparseLtSpMMACompressedSize()` and `cusparseLtSpMMACompressedSize2()` now require slightly less memory.
- Sparse GEMM could produce incorrect results on `SM 8.0`.
cuSPARSELt v0.6.0#
New Features:
- Introduced Hopper support (`SM 9.0`).
- Added new kernels for the following data type combinations for the `SM 9.0` architecture:
  - `FP16` input/output, `FP16` Tensor Core compute
  - `E4M3` input/output, `FP32` Tensor Core compute; the data type of matrix C can be either `FP16` or `BF16`
  - `E4M3` input, `FP16` output, `FP32` Tensor Core compute
  - `E4M3` input, `BF16` output, `FP32` Tensor Core compute
  - `E4M3` input, `FP32` output, `FP32` Tensor Core compute
  - `E5M2` input/output, `FP32` Tensor Core compute; the data type of matrix C can be either `FP16` or `BF16`
  - `E5M2` input, `FP16` output, `FP32` Tensor Core compute
  - `E5M2` input, `BF16` output, `FP32` Tensor Core compute
  - `E5M2` input, `FP32` output, `FP32` Tensor Core compute
Driver Requirements:
- cuSPARSELt requires CUDA Driver `r535 TRD7`, `r550 TRD1`, or higher.
API Changes:
- The following APIs are deprecated: `cusparseLtSpMMAPrune2()`, `cusparseLtSpMMAPruneCheck2()`, `cusparseLtSpMMACompressedSize2()`, `cusparseLtSpMMACompress2()`.
Dependencies:
- cuSPARSELt now requires linking to the CUDA driver library (`libcuda.so` on Linux and `cuda.lib` on Windows).
Known Issues
- `cusparseLtSpMMACompressedSize()` and `cusparseLtSpMMACompressedSize2()` allocate slightly more memory than needed. This issue will be addressed in the next release.
cuSPARSELt v0.5.2#
New Features:
- Added a new kernel for the following data type combination: `INT8` inputs, `BF16` output, `INT32` Tensor Core accumulate.
- Symbols are obfuscated in the static library.
Compatibility notes:
- Added support for RHEL 7 and CentOS 7.
- Split-K is enabled in `cusparseLtMatmulSearch()` across a broader range of problem dimensions.
- The `CUSPARSE_COMPUTE_16F`, `CUSPARSE_COMPUTE_TF32`, and `CUSPARSE_COMPUTE_TF32_FAST` enumerators have been removed from the `cusparseComputeType` enumerator and replaced with `CUSPARSE_COMPUTE_32F` to better express the accuracy of the computation at the Tensor Core level.
cuSPARSELt v0.5.0#
New Features:
- Added a new kernel for the following data type combination: `INT8` inputs, `INT32` output, `INT32` Tensor Core accumulate.
Compatibility notes:
- cuSPARSELt requires CUDA 12.0 or above, and a compatible driver (see the CUDA Driver Release Notes).
- `cusparseLtMatmulAlgSelectionInit()` does not ensure the same ordering of the algorithm id `alg` as in v0.4.0.
cuSPARSELt v0.4.0#
New Features:
- Introduced `SM 8.9` compatibility.
- The initialization time of cuSPARSELt descriptors has been significantly improved.
- `cusparseLtMatmulSearch()` efficiency has been improved.
- Removed all internal memory allocations.
- Added a new kernel supporting the following data type combination: `INT8` input, `INT32` Tensor Core compute, `FP16` output.
- Added the `cusparseLtGetVersion()` and `cusparseLtGetProperty()` functions to retrieve the library version.
API Changes:
- `cusparseLtSpMMACompressedSize()`, `cusparseLtSpMMACompress()`, `cusparseLtSpMMACompressedSize2()`, and `cusparseLtSpMMACompress2()` have a new parameter to avoid internal memory allocations and to support a user-provided device memory buffer for the compression.
Compatibility notes:
- cuSPARSELt requires CUDA Driver 470.xx (CUDA 11.4) or above.
- cuSPARSELt now uses the static version of the `cudart` library.
- Support for Ubuntu 16.04 (gcc-5) has been removed.
cuSPARSELt v0.3.0#
New Features:
- Added support for vectors of alpha and beta scalars (per-channel scaling)
- Added support for GeLU scaling
- Added support for Split-K mode
- Full support for logging functionality and NVTX ranges
API Changes:
- Added the `cusparseLtMatmulGetWorkspace()` API to get the workspace size needed by `cusparseLtMatmul()`.
Resolved issues:
- Fixed a documentation issue regarding structured matrix size constraints.
cuSPARSELt v0.2.0#
New Features:
- Added support for activation functions and bias vector:
  - ReLU + upper bound and threshold setting for all kernels
  - GeLU for `INT8` input/output, `INT32` Tensor Core compute kernels
- Added support for Batched Sparse GEMM:
  - Single sparse matrix / multiple dense matrices (broadcast)
  - Multiple sparse and dense matrices
  - Batched bias vector
Compatibility notes:
- cuSPARSELt no longer requires the `nvrtc` library.
- Support for Ubuntu 16.04 (gcc-5) is now deprecated and will be removed in future releases.
cuSPARSELt v0.1.0#
New Features:
- Added support for the Windows x86-64 and Linux Arm64 platforms.
- Introduced `SM 8.6` compatibility.
- Added new kernels:
  - `FP32` inputs/output, `TF32` Tensor Core compute
  - `TF32` inputs/output, `TF32` Tensor Core compute
- Improved performance of `SM 8.0` kernels (up to 90% SOL).
- New APIs for compression and pruning, decoupled from `cusparseLtMatmulPlan_t`.
Compatibility notes:
- cuSPARSELt requires CUDA 11.2 or above.
- `cusparseLtMatDescriptor_t` must be destroyed with the `cusparseLtMatDescriptorDestroy` function.
- Both static and shared libraries must be linked with the `nvrtc` library.
- On Linux systems, both static and shared libraries must be linked with the `dl` library.
Resolved issues:
- `CUSPARSELT_MATMUL_SEARCH_ITERATIONS` is now handled correctly.
cuSPARSELt v0.0.1#
New Features:
- Initial release.
- Supports Linux x86_64 and `SM 8.0`.
- Provides the following mixed-precision computation kernels:
  - `FP16` inputs/output, `FP32` Tensor Core accumulate
  - `BF16` inputs/output, `FP32` Tensor Core accumulate
  - `INT8` inputs/output, `INT32` Tensor Core compute
Compatibility notes:
- cuSPARSELt requires CUDA 11.0 or above.