Release Notes
cuSPARSELt v0.6.3
Resolved issues:
- Sparse GEMM could produce incorrect results on Arm64 if cusparseLtSpMMACompressedSize2() and cusparseLtSpMMACompress() are used.
Compatibility notes:
- Added support for Ubuntu 24.04.
cuSPARSELt v0.6.2
New Features:
- Introduced Orin support (SM 8.7).
- Improved performance of the following kernels for SM 8.0:
  - FP16 input/output, FP32 Tensor Core accumulate
  - BF16 input/output, FP32 Tensor Core accumulate
  - INT8 input, FP16 output, INT32 Tensor Core compute
  - INT8 input, BF16 output, INT32 Tensor Core compute
  - INT8 input, INT32 output, INT32 Tensor Core compute
API Changes:
- Added a new enumerator value cusparseLtMatmulDescAttribute_t::CUSPARSELT_MATMUL_SPARSE_MAT_POINTER to set the pointer to the pruned sparse matrix (see the sketch below).
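For illustration, a minimal sketch of setting the new attribute through cusparseLtMatmulDescSetAttribute(). The handle, the matmul descriptor, and the device pointer to the pruned matrix are assumed to exist already, and passing the pointer value by address is an assumption that follows the convention of the other pointer-valued matmul attributes.

```c
#include <cusparseLt.h>

// Sketch only: assumes `handle` and `matmul` were initialized earlier with
// cusparseLtInit() / cusparseLtMatmulDescriptorInit(), and that `d_pruned`
// points to the pruned (structured-sparse) matrix in device memory.
void set_sparse_mat_pointer(cusparseLtHandle_t*           handle,
                            cusparseLtMatmulDescriptor_t* matmul,
                            const void*                   d_pruned)
{
    // The attribute value is the device pointer itself, passed by address.
    cusparseLtMatmulDescSetAttribute(handle, matmul,
                                     CUSPARSELT_MATMUL_SPARSE_MAT_POINTER,
                                     &d_pruned, sizeof(d_pruned));
}
```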
cuSPARSELt v0.6.1
Dependencies:
- Static linking to the CUDA driver library (libcuda.so on Linux, cuda.lib on Windows) has been removed.
Compatibility notes:
- The constraints on matrix sizes (cusparseLtStructuredDescriptorInit and cusparseLtDenseDescriptorInit) have been relaxed. The maximum number of elements for each dimension (rows and columns) of matrices C and D is limited to 2097120.
Resolved issues:
- cusparseLtSpMMACompressedSize() and cusparseLtSpMMACompressedSize2() require slightly less memory.
- Sparse GEMM could produce incorrect results on SM 8.0.
cuSPARSELt v0.6.0
New Features:
- Introduced Hopper support (SM 9.0).
- Added new kernels for the following data type combinations for the SM 9.0 architecture:
  - FP16 input/output, FP16 Tensor Core compute
  - E4M3 input/output, FP32 Tensor Core compute; the data type of matrix C can be either FP16 or BF16
  - E4M3 input, FP16 output, FP32 Tensor Core compute
  - E4M3 input, BF16 output, FP32 Tensor Core compute
  - E4M3 input, FP32 output, FP32 Tensor Core compute
  - E5M2 input/output, FP32 Tensor Core compute; the data type of matrix C can be either FP16 or BF16
  - E5M2 input, FP16 output, FP32 Tensor Core compute
  - E5M2 input, BF16 output, FP32 Tensor Core compute
  - E5M2 input, FP32 output, FP32 Tensor Core compute
Driver Requirements:
- cuSPARSELt requires CUDA Driver r535 TRD7, r550 TRD1, or higher.
API Changes:
- The following APIs are deprecated: cusparseLtSpMMAPrune2(), cusparseLtSpMMAPruneCheck2(), cusparseLtSpMMACompressedSize2(), cusparseLtSpMMACompress2().
Dependencies:
- cuSPARSELt now requires linking to the CUDA driver library (libcuda.so on Linux and cuda.lib on Windows).
Known Issues
cusparseLtSpMMAompressedSize()
andcusparseLtSpMMAompressedSize2()
allocate slightly more memory than needed. This issue will be addressed in the next release.
cuSPARSELt v0.5.2
New Features:
- Added a new kernel for the following data type combination: INT8 inputs, BF16 output, INT32 Tensor Core accumulate.
- Symbols are obfuscated in the static library.
Compatibility notes:
- Added support for RHEL 7 and CentOS 7.
- Split-K is enabled in cusparseLtMatmulSearch() across a broader range of problem dimensions.
- The CUSPARSE_COMPUTE_16F, CUSPARSE_COMPUTE_TF32, and CUSPARSE_COMPUTE_TF32_FAST enumerators have been removed from cusparseComputeType and replaced with CUSPARSE_COMPUTE_32F to better express the accuracy of the computation at the Tensor Core level (see the migration sketch below).
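A minimal sketch of what this change looks like at the call site, assuming matrix descriptors that were initialized elsewhere for an FP16 problem; only the final compute-type argument is affected by this release, and error checking is omitted.

```c
#include <cusparseLt.h>

// Sketch only: matA/matB/matC are assumed to be cusparseLtMatDescriptor_t
// objects initialized elsewhere for an FP16 matmul.
void init_matmul_desc(cusparseLtHandle_t*              handle,
                      cusparseLtMatmulDescriptor_t*    matmul,
                      const cusparseLtMatDescriptor_t* matA,
                      const cusparseLtMatDescriptor_t* matB,
                      const cusparseLtMatDescriptor_t* matC)
{
    // Before v0.5.2 an FP16 problem could pass CUSPARSE_COMPUTE_16F here;
    // that enumerator is gone, so CUSPARSE_COMPUTE_32F is used instead.
    cusparseLtMatmulDescriptorInit(handle, matmul,
                                   CUSPARSE_OPERATION_NON_TRANSPOSE,
                                   CUSPARSE_OPERATION_NON_TRANSPOSE,
                                   matA, matB, matC, matC,   // D reuses C's descriptor
                                   CUSPARSE_COMPUTE_32F);
}
```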
cuSPARSELt v0.5.0
New Features:
- Added a new kernel for the following data type combination: INT8 inputs, INT32 output, INT32 Tensor Core accumulate.
Compatibility notes:
- cuSPARSELt requires CUDA 12.0 or above and a compatible driver (see the CUDA Driver Release Notes).
- cusparseLtMatmulAlgSelectionInit() does not ensure the same ordering of the algorithm id alg as in v0.4.0.
cuSPARSELt v0.4.0
New Features:
- Introduced SM 8.9 compatibility.
- The initialization time of cuSPARSELt descriptors has been significantly improved.
- cusparseLtMatmulSearch() efficiency has been improved.
- Removed all internal memory allocations.
- Added a new kernel supporting the following data type combination: INT8 input, INT32 Tensor Core compute, FP16 output.
- Added the cusparseLtGetVersion() and cusparseLtGetProperty() functions to retrieve the library version (see the sketch below).
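For illustration, a minimal sketch of querying the library version with the two new functions. Error checking is omitted, the signatures are assumed from the current API reference, and the integer returned by cusparseLtGetVersion() is a single encoded value whose exact format is not spelled out in these notes.

```c
#include <stdio.h>
#include <cusparseLt.h>

int main(void)
{
    cusparseLtHandle_t handle;
    cusparseLtInit(&handle);

    int version = 0;
    cusparseLtGetVersion(&handle, &version);       // single integer encoding of the version

    int major = 0, minor = 0, patch = 0;
    cusparseLtGetProperty(MAJOR_VERSION, &major);  // libraryPropertyType values from library_types.h
    cusparseLtGetProperty(MINOR_VERSION, &minor);
    cusparseLtGetProperty(PATCH_LEVEL,   &patch);

    printf("cuSPARSELt %d.%d.%d (encoded version %d)\n", major, minor, patch, version);

    cusparseLtDestroy(&handle);
    return 0;
}
```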
API Changes:
cusparseLtSpMMACompressedSize()
,cusparseLtSpMMACompress()
,cusparseLtSpMMACompressedSize2()
,cusparseLtSpMMACompress2()
have a new parameter to avoid internal memory allocations and support user-provided device memory buffer for the compression
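As a rough illustration of the new parameter, the sketch below queries both the compressed size and the temporary-buffer size, allocates the scratch buffer itself, and hands it to the compression call. The handle, plan, and dense-stored sparse operand are assumed to come from the usual cuSPARSELt setup; error checking and cleanup are omitted.

```c
#include <cuda_runtime.h>
#include <cusparseLt.h>

// Sketch only: `handle` and `plan` are assumed to be initialized elsewhere,
// and `d_dense` holds the pruned sparse operand in dense storage on the device.
void compress_with_user_buffer(cusparseLtHandle_t*     handle,
                               cusparseLtMatmulPlan_t* plan,
                               const void*             d_dense,
                               cudaStream_t            stream)
{
    size_t compressedSize     = 0;
    size_t compressBufferSize = 0;   // size of the temporary buffer (the new parameter)
    cusparseLtSpMMACompressedSize(handle, plan, &compressedSize, &compressBufferSize);

    void* d_compressed     = NULL;
    void* d_compressBuffer = NULL;   // user-provided scratch memory, no internal allocation
    cudaMalloc(&d_compressed,     compressedSize);
    cudaMalloc(&d_compressBuffer, compressBufferSize);

    cusparseLtSpMMACompress(handle, plan, d_dense,
                            d_compressed, d_compressBuffer, stream);

    // d_compressed can now be used as the sparse A operand of cusparseLtMatmul();
    // d_compressBuffer can be freed once the compression has completed.
}
```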
Compatibility notes:
- cuSPARSELt requires CUDA Driver 470.xx (CUDA 11.4) or above.
- cuSPARSELt now uses the static version of the cudart library.
- Support for Ubuntu 16.04 (gcc-5) has been removed.
cuSPARSELt v0.3.0
New Features:
- Added support for vectors of alpha and beta scalars (per-channel scaling).
- Added support for GeLU scaling.
- Added support for Split-K mode.
- Full support for logging functionalities and NVTX ranges.
API Changes:
- Added the cusparseLtMatmulGetWorkspace() API to get the workspace size needed by cusparseLtMatmul() (see the sketch below).
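A minimal sketch of where the new workspace query fits, assuming the handle, plan, and device buffers were created by the usual setup steps; error checking and cleanup are omitted.

```c
#include <cuda_runtime.h>
#include <cusparseLt.h>

// Sketch only: `handle`, `plan`, the compressed A operand, and the B/C/D
// buffers are assumed to have been set up earlier in the cuSPARSELt workflow.
void run_matmul(cusparseLtHandle_t*     handle,
                cusparseLtMatmulPlan_t* plan,
                const void* dA_compressed, const void* dB,
                const void* dC, void* dD,
                cudaStream_t stream)
{
    // Query how much scratch memory cusparseLtMatmul() needs for this plan.
    size_t workspaceSize = 0;
    cusparseLtMatmulGetWorkspace(handle, plan, &workspaceSize);

    void* d_workspace = NULL;
    if (workspaceSize > 0)
        cudaMalloc(&d_workspace, workspaceSize);

    float alpha = 1.0f, beta = 0.0f;
    cusparseLtMatmul(handle, plan, &alpha, dA_compressed, dB,
                     &beta, dC, dD, d_workspace, &stream, 1);
}
```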
Resolved issues:
- Fixed a documentation issue regarding structured matrix size constraints.
cuSPARSELt v0.2.0
New Features:
- Added support for activation functions and bias vector (see the sketch after this list):
  - ReLU + upper bound and threshold setting for all kernels
  - GeLU for INT8 input/output, INT32 Tensor Core compute kernels
- Added support for Batched Sparse GEMM:
  - Single sparse matrix / multiple dense matrices (broadcast)
  - Multiple sparse and dense matrices
  - Batched bias vector
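These features are exposed through matmul-descriptor attributes. The sketch below shows one plausible way to enable a clipped ReLU and attach a bias vector via cusparseLtMatmulDescSetAttribute(); the attribute names and value types are taken from the cusparseLtMatmulDescAttribute_t reference rather than from these notes, so treat them as assumptions to be checked against the API documentation.

```c
#include <cusparseLt.h>

// Sketch only: `handle` and `matmul` are assumed to be initialized, and
// `d_bias` to be a device-side bias vector. Attribute names and value types
// are assumptions based on the cusparseLtMatmulDescAttribute_t reference.
void enable_relu_and_bias(cusparseLtHandle_t*           handle,
                          cusparseLtMatmulDescriptor_t* matmul,
                          const void*                   d_bias)
{
    int   relu       = 1;      // turn the ReLU epilogue on
    float upperBound = 6.0f;   // clip values above this bound
    float threshold  = 0.0f;   // values below the threshold are zeroed

    cusparseLtMatmulDescSetAttribute(handle, matmul,
                                     CUSPARSELT_MATMUL_ACTIVATION_RELU,
                                     &relu, sizeof(relu));
    cusparseLtMatmulDescSetAttribute(handle, matmul,
                                     CUSPARSELT_MATMUL_ACTIVATION_RELU_UPPERBOUND,
                                     &upperBound, sizeof(upperBound));
    cusparseLtMatmulDescSetAttribute(handle, matmul,
                                     CUSPARSELT_MATMUL_ACTIVATION_RELU_THRESHOLD,
                                     &threshold, sizeof(threshold));

    // Bias vector, added to the output before the activation is applied.
    cusparseLtMatmulDescSetAttribute(handle, matmul,
                                     CUSPARSELT_MATMUL_BIAS_POINTER,
                                     &d_bias, sizeof(d_bias));
}
```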
Compatibility notes:
- cuSPARSELt no longer requires the nvrtc library.
- Support for Ubuntu 16.04 (gcc-5) is now deprecated and will be removed in future releases.
cuSPARSELt v0.1.0
New Features:
- Added support for Windows x86-64 and Linux Arm64 platforms.
- Introduced SM 8.6 compatibility.
- Added new kernels:
  - FP32 input/output, TF32 Tensor Core compute
  - TF32 input/output, TF32 Tensor Core compute
- Better performance for SM 8.0 kernels (up to 90% SOL).
- New APIs for compression and pruning decoupled from cusparseLtMatmulPlan_t.
Compatibility notes:
- cuSPARSELt requires CUDA 11.2 or above.
- cusparseLtMatDescriptor_t must be destroyed with the cusparseLtMatDescriptorDestroy() function.
- Both the static and shared libraries must be linked with the nvrtc library.
- On Linux systems, both the static and shared libraries must be linked with the dl library.
Resolved issues:
- CUSPARSELT_MATMUL_SEARCH_ITERATIONS is now handled correctly.
cuSPARSELt v0.0.1
New Features:
- Initial release.
- Supports Linux x86_64 and SM 8.0.
- Provides the following mixed-precision computation kernels:
  - FP16 input/output, FP32 Tensor Core accumulate
  - BF16 input/output, FP32 Tensor Core accumulate
  - INT8 input/output, INT32 Tensor Core compute
Compatibility notes:
- cuSPARSELt requires CUDA 11.0 or above.