Release Notes#
cuSPARSELt v0.8.1#
Resolved issues
The performance regressions on H100 and L4 GPUs introduced in v0.8.0 are fixed.
#include <cstddef> is replaced with #include <stddef.h> in the header file cusparseLt.h to ensure compatibility with C compilers.
cuSPARSELt v0.8.0#
New Features:
- Introduced support for Thor (SM 10.1 in CUDA 12.9, SM 11.0 in CUDA 13.0) and DGX Spark (SM 12.1).
- Introduced E4M3 and E5M2 kernels for SM 8.9.
- Added forward compatibility for non-block-scaled data type combinations to ensure support for future architectures.
- Better performance for SM 10.0 kernels (up to 18%) and FP16 kernels on SM 9.0 (up to 10%).
API Changes:
- Introduced cusparseLtGetErrorString() and cusparseLtGetErrorName() to remove the dependency on the cuSPARSE library.
- Introduced cusparseLtAlgSelectionDestroy(). This function must be invoked to release the resources used by an instance of an algorithm selection.
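
A minimal usage sketch of the two additions above. The exact signatures are assumptions: cusparseLtGetErrorString()/cusparseLtGetErrorName() are assumed to take a cusparseStatus_t and return const char* (mirroring the cuSPARSE-style helpers), and cusparseLtAlgSelectionDestroy() is assumed to take a pointer to the selection object; verify against the API reference.

```c++
// Hedged sketch (v0.8.0): error-string helpers and explicit algorithm-selection cleanup.
// Signatures of the two new error helpers and of cusparseLtAlgSelectionDestroy() are assumed.
#include <cstdio>
#include <cusparseLt.h>

void report(cusparseStatus_t status) {
    // No cuSPARSE dependency is needed anymore to translate status codes.
    if (status != CUSPARSE_STATUS_SUCCESS)
        std::printf("%s: %s\n", cusparseLtGetErrorName(status),
                    cusparseLtGetErrorString(status));
}

void select_algorithm(const cusparseLtHandle_t*           handle,
                      const cusparseLtMatmulDescriptor_t* matmulDesc) {
    cusparseLtMatmulAlgSelection_t algSel;
    report(cusparseLtMatmulAlgSelectionInit(handle, &algSel, matmulDesc,
                                            CUSPARSELT_MATMUL_ALG_DEFAULT));
    // ... build a plan and run the matmul ...
    // New in v0.8.0: release the resources held by the algorithm selection.
    report(cusparseLtAlgSelectionDestroy(&algSel));
}
```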
Compatibility notes:
This release is compatible with CUDA 12.9 and CUDA 13.0.
Resolved issues
- Fixed host memory alignment of cusparseLtMatmulDescAttribute_t and cusparseLtMatmulAlgSelection_t that could cause crashes.
- cusparseLtSpMMACompressedSize() and cusparseLtSpMMACompressedSize2() returned more memory than needed for E2M1 kernels.
Known Issues
- If used in pure C code, #include <cstddef> has to be replaced with #include <stddef.h> in the header file cusparseLt.h to ensure compatibility with C compilers.
- Benchmarks show an average 8.5% performance regression on L4 GPUs. Optimization is planned for the next release.
cuSPARSELt v0.7.1#
Resolved issues
- The binary size has been significantly reduced.
- cusparseLtSpMMACompress2() could crash for E4M3 and generate incorrect results for FP32 on SM 9.0.
cuSPARSELt v0.7.0#
New Features:
- Introduced Blackwell support (SM 10.0 and SM 12.0).
- Added new block-scaled kernels for the following data type combinations for the SM 10.0 and SM 12.0 architectures:
  - E4M3 input/output, FP32 Tensor Core compute; the data type of matrix C can be either FP16 or BF16
  - E4M3 input, FP16 output, FP32 Tensor Core compute
  - E4M3 input, BF16 output, FP32 Tensor Core compute
  - E4M3 input, FP32 output, FP32 Tensor Core compute
  - E2M1 input/output, FP32 Tensor Core compute; the data type of matrix C can be either FP16 or BF16
  - E2M1 input, FP16 output, FP32 Tensor Core compute
  - E2M1 input, BF16 output, FP32 Tensor Core compute
  - E2M1 input, FP32 output, FP32 Tensor Core compute
API Changes:
- Introduced the following API changes to set up the scaling factors for the block-scaled kernels (see the sketch after this list):
  - Added the new enumerator type cusparseLtMatmulMatrixScale_t to specify the scaling mode that defines how scaling-factor pointers are interpreted.
  - Added the new cusparseLtMatmulAlgSelection_t enumerator values:
    - CUSPARSELT_MATMUL_{A,B,C,D,D_OUT}_SCALE_MODE to specify the scaling mode of type cusparseLtMatmulMatrixScale_t for the corresponding matrix.
    - CUSPARSELT_MATMUL_{A,B,C,D,D_OUT}_SCALE_POINTER to set a device pointer to a scalar or a tensor of scaling factors for the corresponding matrix, depending on the matrix's scaling mode.
- Added new values to the enumerator type cusparseLtSplitKMode_t: CUSPARSELT_HEURISTIC, CUSPARSELT_DATAPARALLEL, CUSPARSELT_SPLITK, CUSPARSELT_STREAMK
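
A hedged sketch of how these knobs might be wired up. The enumerator names come from the list above; the setter functions used below (cusparseLtMatmulDescSetAttribute() for the scale pointer and cusparseLtMatmulAlgSetAttribute() with a CUSPARSELT_MATMUL_SPLIT_K_MODE attribute for the split-K mode) are assumptions not stated in these notes and should be checked against the API reference.

```c++
// Hedged sketch (v0.7.0): block-scaling factor pointer and the new split-K modes.
// Which setter applies to each attribute is an assumption; consult the API reference.
#include <cusparseLt.h>

void configure(cusparseLtHandle_t*             handle,
               cusparseLtMatmulDescriptor_t*   matmulDesc,
               cusparseLtMatmulAlgSelection_t* algSel,
               const float*                    dScaleA)  // device pointer to A's scaling factor(s)
{
    // Scaling-factor pointer for matrix A; depending on A's scale mode this is interpreted
    // as a scalar or a tensor of scaling factors (CUSPARSELT_MATMUL_A_SCALE_MODE would be
    // set analogously with a cusparseLtMatmulMatrixScale_t value).
    cusparseLtMatmulDescSetAttribute(handle, matmulDesc,
                                     CUSPARSELT_MATMUL_A_SCALE_POINTER,
                                     &dScaleA, sizeof(dScaleA));

    // Pick one of the new split-K modes, e.g. stream-K.
    // CUSPARSELT_MATMUL_SPLIT_K_MODE is an assumed attribute name.
    cusparseLtSplitKMode_t splitKMode = CUSPARSELT_STREAMK;
    cusparseLtMatmulAlgSetAttribute(handle, algSel,
                                    CUSPARSELT_MATMUL_SPLIT_K_MODE,
                                    &splitKMode, sizeof(splitKMode));
}
```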
Resolved issues
Fixed crashes and hangs in multi-GPU configurations.
Compatibility notes:
- cuSPARSELt requires CUDA 12.8 or above, and compatible drivers (see CUDA Driver Release Notes).
- Support for Ubuntu 18.04, RHEL 7, and CentOS 7 has been removed.
- The size of the cuSPARSELt data types cusparseLtHandle_t, cusparseLtMatDescriptor_t, cusparseLtMatmulDescAttribute_t, cusparseLtMatmulAlgSelection_t, and cusparseLtMatmulPlan_t has been reduced.
- The static library for Windows x86-64 is no longer provided. Please use the dynamic library instead.
Known Issues
cusparseLtSpMMACompressedSize2() allocates slightly more memory than needed.
cuSPARSELt v0.6.3#
Resolved issues
Sparse GEMM could produce incorrect results on Arm64 if cusparseLtSpMMACompressedSize2() and cusparseLtSpMMACompress2() are used.
Compatibility notes:
Added support for Ubuntu 24.04.
cuSPARSELt v0.6.2#
New Features:
- Introduced Orin support (SM 8.7).
- Improved performance of the following kernels for SM 8.0:
  - FP16 input/output, FP32 Tensor Core accumulate
  - BF16 input/output, FP32 Tensor Core accumulate
  - INT8 input, FP16 output, INT32 Tensor Core compute
  - INT8 input, BF16 output, INT32 Tensor Core compute
  - INT8 input, INT32 output, INT32 Tensor Core compute
API Changes:
Added a new enumerator value cusparseLtMatmulDescAttribute_t::CUSPARSELT_MATMUL_SPARSE_MAT_POINTER to set the pointer to the pruned sparse matrix.
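
A brief sketch of updating the sparse-operand pointer through the new attribute. The attribute name is quoted from the note above; the surrounding function, the d_prunedA pointer, and the pass-by-address convention for pointer attributes are illustrative assumptions.

```c++
// Hedged sketch (v0.6.2): point an existing matmul descriptor at a (new) pruned sparse
// matrix without rebuilding the descriptor.
#include <cusparseLt.h>

void set_sparse_operand(const cusparseLtHandle_t*     handle,
                        cusparseLtMatmulDescriptor_t* matmulDesc,
                        const void*                   d_prunedA)  // device pointer, already 2:4 pruned
{
    cusparseLtMatmulDescSetAttribute(handle, matmulDesc,
                                     CUSPARSELT_MATMUL_SPARSE_MAT_POINTER,
                                     &d_prunedA, sizeof(d_prunedA));
}
```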
cuSPARSELt v0.6.1#
Dependencies:
The static linking to the CUDA driver library (libcuda.so on Linux and cuda.lib on Windows) is removed.
Compatibility notes:
The constraints on matrix sizes (cusparseLtStructuredDescriptorInit and cusparseLtDenseDescriptorInit) have been relaxed. The maximum number of elements for each dimension (rows and columns) of matrices C and D is limited to 2097120.
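
For illustration, a hedged sketch of creating a dense descriptor for matrix C at the documented per-dimension limit; the cusparseLtDenseDescriptorInit() argument order (rows, cols, ld, alignment, valueType, order) follows the standard cuSPARSELt workflow and should be verified against the API reference.

```c++
// Hedged sketch (v0.6.1): each dimension of C and D may now have up to 2097120 elements.
#include <cstdint>
#include <cusparseLt.h>

cusparseStatus_t make_c_descriptor(const cusparseLtHandle_t*  handle,
                                   cusparseLtMatDescriptor_t* matC)
{
    const int64_t  rows      = 2097120;  // at the documented per-dimension limit
    const int64_t  cols      = 1024;
    const int64_t  ld        = rows;     // column-major leading dimension
    const uint32_t alignment = 16;
    return cusparseLtDenseDescriptorInit(handle, matC, rows, cols, ld, alignment,
                                         CUDA_R_16F, CUSPARSE_ORDER_COL);
}
```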
Resolved issues
- cusparseLtSpMMACompressedSize() and cusparseLtSpMMACompressedSize2() require slightly less memory.
- Sparse GEMM could produce incorrect results on SM 8.0.
cuSPARSELt v0.6.0#
New Features:
- Introduced Hopper support (SM 9.0).
- Added new kernels for the following data type combinations for the SM 9.0 architecture:
  - FP16 input/output, FP16 Tensor Core compute
  - E4M3 input/output, FP32 Tensor Core compute; the data type of matrix C can be either FP16 or BF16
  - E4M3 input, FP16 output, FP32 Tensor Core compute
  - E4M3 input, BF16 output, FP32 Tensor Core compute
  - E4M3 input, FP32 output, FP32 Tensor Core compute
  - E5M2 input/output, FP32 Tensor Core compute; the data type of matrix C can be either FP16 or BF16
  - E5M2 input, FP16 output, FP32 Tensor Core compute
  - E5M2 input, BF16 output, FP32 Tensor Core compute
  - E5M2 input, FP32 output, FP32 Tensor Core compute
Driver Requirements:
cuSPARSELt requires CUDA Driver r535 TRD7, r550 TRD1, or higher.
API Changes:
The following APIs are deprecated: cusparseLtSpMMAPrune2(), cusparseLtSpMMAPruneCheck2(), cusparseLtSpMMACompressedSize2(), and cusparseLtSpMMACompress2().
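
Since the descriptor-based *2 entry points are deprecated, here is a hedged migration sketch using the plan-based counterparts. The signatures below (two sizes from cusparseLtSpMMACompressedSize() and a user-provided compression buffer for cusparseLtSpMMACompress(), consistent with the v0.4.0 note later in this document) reflect our understanding of the current headers and should be verified against the API reference.

```c++
// Hedged sketch (v0.6.0): prefer the plan-based compression path over the deprecated
// cusparseLtSpMMACompressedSize2()/cusparseLtSpMMACompress2() entry points.
#include <cuda_runtime_api.h>
#include <cusparseLt.h>

void compress_sparse_operand(const cusparseLtHandle_t*     handle,
                             const cusparseLtMatmulPlan_t* plan,
                             const void*                   d_dense,       // pruned dense A on device
                             void**                        d_compressed,  // out: compressed A
                             cudaStream_t                  stream)
{
    size_t compressedSize = 0, compressBufferSize = 0;
    cusparseLtSpMMACompressedSize(handle, plan, &compressedSize, &compressBufferSize);

    void* d_compressBuffer = nullptr;
    cudaMalloc(d_compressed, compressedSize);
    cudaMalloc(&d_compressBuffer, compressBufferSize);

    cusparseLtSpMMACompress(handle, plan, d_dense, *d_compressed, d_compressBuffer, stream);

    cudaStreamSynchronize(stream);   // the temporary buffer is only needed during compression
    cudaFree(d_compressBuffer);
}
```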
Dependencies:
cuSPARSELt now requires linking to the CUDA driver library (libcuda.so on Linux and cuda.lib on Windows).
Known Issues
cusparseLtSpMMACompressedSize() and cusparseLtSpMMACompressedSize2() allocate slightly more memory than needed. This issue will be addressed in the next release.
cuSPARSELt v0.5.2#
New Features:
- Added a new kernel for the following data type combination: INT8 inputs, BF16 output, INT32 Tensor Core accumulate
- Symbols are obfuscated in the static library.
Compatibility notes:
- Added support for RHEL 7 and CentOS 7.
- Split-K is enabled in cusparseLtMatmulSearch() across a broader range of problem dimensions.
- The CUSPARSE_COMPUTE_16F, CUSPARSE_COMPUTE_TF32, and CUSPARSE_COMPUTE_TF32_FAST enumerators have been removed from the cusparseComputeType enumerator and replaced with CUSPARSE_COMPUTE_32F to better express the accuracy of the computation at the Tensor Core level.
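
As an illustration of the enumerator change, a hedged sketch of a matmul descriptor that now passes CUSPARSE_COMPUTE_32F where the removed enumerators would have been used before; the cusparseLtMatmulDescriptorInit() argument order is assumed from the standard cuSPARSELt workflow.

```c++
// Hedged sketch (v0.5.2): FP16/TF32 pipelines now pass CUSPARSE_COMPUTE_32F instead of the
// removed CUSPARSE_COMPUTE_16F / CUSPARSE_COMPUTE_TF32 / CUSPARSE_COMPUTE_TF32_FAST values.
#include <cusparseLt.h>

cusparseStatus_t init_matmul_descriptor(const cusparseLtHandle_t*        handle,
                                        cusparseLtMatmulDescriptor_t*    matmulDesc,
                                        const cusparseLtMatDescriptor_t* matA,
                                        const cusparseLtMatDescriptor_t* matB,
                                        const cusparseLtMatDescriptor_t* matC)
{
    return cusparseLtMatmulDescriptorInit(handle, matmulDesc,
                                          CUSPARSE_OPERATION_NON_TRANSPOSE,
                                          CUSPARSE_OPERATION_NON_TRANSPOSE,
                                          matA, matB, matC, matC,   // D shares C's descriptor here
                                          CUSPARSE_COMPUTE_32F);    // Tensor Core accumulation in FP32
}
```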
cuSPARSELt v0.5.0#
New Features:
Added a new kernel for the following data type combination: INT8 inputs, INT32 output, INT32 Tensor Core accumulate
Compatibility notes:
- cuSPARSELt requires CUDA 12.0 or above, and compatible driver (see CUDA Driver Release Notes).
- cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of algorithm id alg as in v0.4.0.
cuSPARSELt v0.4.0#
New Features:
- Introduced SM 8.9 compatibility
- The initialization time of cuSPARSELt descriptors has been significantly improved
- cusparseLtMatmulSearch() efficiency has been improved
- Removed all internal memory allocations
- Added a new kernel supporting the following data type combination: INT8 input, INT32 Tensor Core compute, FP16 output
- Added the cusparseLtGetVersion() and cusparseLtGetProperty() functions to retrieve the library version
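
A small hedged sketch of the version queries; cusparseLtGetVersion() is assumed to take an initialized handle and an int*, and cusparseLtGetProperty() a libraryPropertyType selector, mirroring the cuSPARSE conventions. The printed version encoding is also an assumption.

```c++
// Hedged sketch (v0.4.0): query the library version at run time.
#include <cstdio>
#include <cusparseLt.h>

int main() {
    cusparseLtHandle_t handle;
    cusparseLtInit(&handle);

    int version = 0, major = 0, minor = 0, patch = 0;
    cusparseLtGetVersion(&handle, &version);        // combined version number (encoding assumed)
    cusparseLtGetProperty(MAJOR_VERSION, &major);
    cusparseLtGetProperty(MINOR_VERSION, &minor);
    cusparseLtGetProperty(PATCH_LEVEL,   &patch);
    std::printf("cuSPARSELt %d (%d.%d.%d)\n", version, major, minor, patch);

    cusparseLtDestroy(&handle);
    return 0;
}
```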
API Changes:
cusparseLtSpMMACompressedSize(), cusparseLtSpMMACompress(), cusparseLtSpMMACompressedSize2(), and cusparseLtSpMMACompress2() have a new parameter to avoid internal memory allocations and to support a user-provided device memory buffer for the compression
Compatibility notes:
- cuSPARSELt requires CUDA Driver 470.xx (CUDA 11.4) or above
- cuSPARSELt now uses the static version of the cudart library
- The support for Ubuntu 16.04 (gcc-5) has been removed
cuSPARSELt v0.3.0#
New Features:
Added support for vectors of alpha and beta scalars (per-channel scaling)
Added support for GeLU scaling
Added support for Split-K Mode
Full support for logging functionalities and NVTX ranges
API Changes:
Added the cusparseLtMatmulGetWorkspace() API to get the workspace size needed by cusparseLtMatmul()
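
A hedged sketch of the intended flow: query the workspace requirement for a prepared plan, allocate it, and hand it to cusparseLtMatmul(). The cusparseLtMatmul() argument list shown (alpha/beta scalars, compressed A, dense B/C/D, workspace, streams) follows the standard cuSPARSELt workflow and should be checked against the API reference.

```c++
// Hedged sketch (v0.3.0): size the device workspace before calling cusparseLtMatmul().
#include <cuda_runtime_api.h>
#include <cusparseLt.h>

void run_matmul(const cusparseLtHandle_t*     handle,
                const cusparseLtMatmulPlan_t* plan,
                const void* dA_compressed, const void* dB,
                const void* dC, void* dD, cudaStream_t stream)
{
    size_t workspaceSize = 0;
    cusparseLtMatmulGetWorkspace(handle, plan, &workspaceSize);

    void* d_workspace = nullptr;
    cudaMalloc(&d_workspace, workspaceSize);

    const float alpha = 1.0f, beta = 0.0f;
    cusparseLtMatmul(handle, plan, &alpha, dA_compressed, dB, &beta, dC, dD,
                     d_workspace, &stream, 1);

    cudaStreamSynchronize(stream);
    cudaFree(d_workspace);
}
```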
Resolved issues:
Fixed documentation issue regarding structured matrix size constraints
cuSPARSELt v0.2.0#
New Features:
- Added support for activation functions and bias vector (see the sketch after this list):
  - ReLU + upper bound and threshold setting for all kernels
  - GeLU for INT8 input/output, INT32 Tensor Core compute kernels
- Added support for Batched Sparse GEMM:
  - Single sparse matrix / Multiple dense matrices (Broadcast)
  - Multiple sparse and dense matrices
  - Batched bias vector
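
A hedged sketch of enabling a ReLU epilogue with a bias vector. The attribute names used below (CUSPARSELT_MATMUL_ACTIVATION_RELU, CUSPARSELT_MATMUL_ACTIVATION_RELU_UPPERBOUND, CUSPARSELT_MATMUL_BIAS_POINTER) are not quoted from these notes; they are assumptions based on the cuSPARSELt attribute naming scheme and should be verified against cusparseLtMatmulDescAttribute_t.

```c++
// Hedged sketch (v0.2.0): ReLU (with an upper bound) plus a bias vector on the matmul epilogue.
// The attribute names below are assumptions; verify them against the API reference.
#include <cusparseLt.h>

void enable_relu_with_bias(const cusparseLtHandle_t*     handle,
                           cusparseLtMatmulDescriptor_t* matmulDesc,
                           const float*                  d_bias)   // device bias vector
{
    int   relu       = 1;       // turn the ReLU epilogue on
    float upperBound = 6.0f;    // clamp outputs, ReLU6-style
    cusparseLtMatmulDescSetAttribute(handle, matmulDesc,
                                     CUSPARSELT_MATMUL_ACTIVATION_RELU,
                                     &relu, sizeof(relu));
    cusparseLtMatmulDescSetAttribute(handle, matmulDesc,
                                     CUSPARSELT_MATMUL_ACTIVATION_RELU_UPPERBOUND,
                                     &upperBound, sizeof(upperBound));
    cusparseLtMatmulDescSetAttribute(handle, matmulDesc,
                                     CUSPARSELT_MATMUL_BIAS_POINTER,
                                     &d_bias, sizeof(d_bias));
}
```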
Compatibility notes:
- cuSPARSELt does not require the nvrtc library anymore
- Support for Ubuntu 16.04 (gcc-5) is now deprecated and it will be removed in future releases
cuSPARSELt v0.1.0#
New Features:
- Added support for Windows x86-64 and Linux Arm64 platforms
- Introduced SM 8.6 compatibility
- Added new kernels:
  - FP32 inputs/output, TF32 Tensor Core compute
  - TF32 inputs/output, TF32 Tensor Core compute
- Better performance for SM 8.0 kernels (up to 90% SOL)
- New APIs for compression and pruning decoupled from cusparseLtMatmulPlan_t
Compatibility notes:
- cuSPARSELt requires CUDA 11.2 or above
- cusparseLtMatDescriptor_t must be destroyed with the cusparseLtMatDescriptorDestroy function
- Both static and shared libraries must be linked with the nvrtc library
- On Linux systems, both static and shared libraries must be linked with the dl library
Resolved issues:
CUSPARSELT_MATMUL_SEARCH_ITERATIONS is now handled correctly
cuSPARSELt v0.0.1#
New Features:
- Initial release
- Support for Linux x86_64 and SM 8.0
- Provide the following mixed-precision computation kernels:
  - FP16 inputs/output, FP32 Tensor Core accumulate
  - BF16 inputs/output, FP32 Tensor Core accumulate
  - INT8 inputs/output, INT32 Tensor Core compute
Compatibility notes:
cuSPARSELt requires CUDA 11.0 or above