Release Notes
cuTENSOR v2.1.0
- Added support for Ubuntu 24.04 
- Removed support for OpenPower 
- Added support for trinary tensor contractions 
- Added support for SM 10.0 and SM 10.2 (i.e., NVIDIA Blackwell)
cuTENSOR v2.0.2
- cutensorMg: allow replicated tensors
- Improved performance and support surface
cuTENSOR v2.0.0
- Added support for just-in-time compilation of tensor contraction kernels
  - JIT-compiled kernels can be stored to and loaded from disk
- Added support for 3XTF32 compute type 
- Added support for padding output tensors via cutensorPermute 
- Added support for 64-bit extents 
- The plan cache is now enabled by default (i.e., the default switched from opt-in to opt-out)
- New APIs let users query the workspace size that is actually required (reducing the overall memory footprint)
- cutensorTensorDescriptor_t supports tensors of arbitrary dimensionality
- Key API changes:
  - All tensor operations use the plan-based multi-stage API
  - cutensorOp_t moved from the tensor descriptor to the operation
  - Alignment moved from the operation to the tensor descriptor
  - All APIs use heap-allocated opaque data structures
Compatibility Notes:
- Increased cuBLASLt version requirement: 11.3.1 (from CUDA toolkit 11.2) 
- Removed support for CUDA Toolkit 10.2 
- Removed support for RHEL 7 
cuTENSOR v1.7.0
- Deprecated cutensorInit; please use cutensorCreate and cutensorDestroy instead 
cuTENSOR v1.6.2
- Extended support for data type conversions (e.g. fp16 <-> fp32, see cutensorPermutation). 
- Fixed issues related to CUDA_MODULE_LOADING=LAZY 
Compatibility Notes:
- Deprecated CUDA Toolkit 10.2 support; it will be removed after the following release. 
cuTENSOR v1.6.1
- Added support for SM 9.0
cuTENSOR v1.6.0
- Significantly improved performance of cutensorPermutation() for alpha == 1 (i.e., without scaling).
cuTENSOR v1.5.0
- Further improved support for high-dimensional tensor contractions (with more than 28 modes). 
- Analyzing cuTENSOR via NVIDIA’s compute-sanitizer no longer produces false-positive CUDA API errors.
- CUTENSOR_STATUS_NOT_INITIALIZED is now returned when any opaque data structure (e.g., cutensorTensorDescriptor_t) is not initialized 
- Added support for tensor contractions in which a mode appears in only one of the input tensors (e.g., C[m,n] = A[m,k,c]*B[n,k]; in this case both modes ‘c’ and ‘k’ are contracted)
- Added support for contractions with host tensors in cuTENSORMg for aarch64, ppc64le and x86_64
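The semantics of a mode that appears in only one input can be sketched with plain reference loops. This is illustrative Python, not cuTENSOR code: it assumes, as the note above states, that ‘c’ in C[m,n] = A[m,k,c]*B[n,k] is summed over together with the shared mode ‘k’. The function name `contract_mkc_nk`, the nested-list layout, and the sample values are hypothetical.

```python
def contract_mkc_nk(A, B):
    """Reference loops for C[m][n] = sum over k and c of A[m][k][c] * B[n][k]."""
    M, K, C_EXT = len(A), len(A[0]), len(A[0][0])
    N = len(B)
    C = [[0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            total = 0
            for k in range(K):          # 'k' appears in both A and B
                for c in range(C_EXT):  # 'c' appears only in A
                    total += A[m][k][c] * B[n][k]
            C[m][n] = total
    return C


# Hypothetical inputs, chosen only for illustration.
A = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]]]   # extents: m=2, k=2, c=2
B = [[1, 0],
     [0, 1]]             # extents: n=2, k=2
C = contract_mkc_nk(A, B)
print(C)  # [[3, 7], [11, 15]]
```

Because ‘c’ does not appear in B, summing A over ‘c’ first and then contracting over ‘k’ yields the same values.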
Compatibility notes:
- Deprecated cutensorContractionGetWorkspace; please use cutensorContractionGetWorkspaceSize instead 
- Deprecated cutensorReductionGetWorkspace; please use cutensorReductionGetWorkspaceSize instead 
Resolved issues:
- Fixed a bug related to the reductions of tensors that resulted in a scalar. 
- Significantly improved tensor contraction performance for complex fp32 inputs while using tf32 compute. 
Known Issues:
- cuTENSORMg contraction execution may fail for some block sizes during multi-GPU runs where the GPUs are not connected with NVLink. 
cuTENSOR v1.4.0
- Preview: Added support for distributed, multi-GPU tensor operations 
- Support up to 64-dimensional tensors 
- Added a more accurate, but also more time-consuming, heuristic (accessible via CUTENSOR_ALGO_DEFAULT_PATIENT)
cuTENSOR v1.3.3
- Bugfix for some strided tensor contractions (i.e., those for which all modes are non-contiguous in memory). 
- Deprecated Ubuntu 16.04
cuTENSOR v1.3.2
- Improved tensor contraction performance model (i.e., algo CUTENSOR_ALGO_DEFAULT) 
- Improved performance for tensor contractions that have a tiny contracted dimension (<= 8)
- Improved performance for outer-product-like tensor contractions (e.g., C[a,b,c,d] = A[b,d] * B[a,c]) 
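As a point of reference, an outer-product-like contraction such as C[a,b,c,d] = A[b,d] * B[a,c] has no summed mode at all: every output element is a single product. The plain-Python sketch below only illustrates that index pattern and is not cuTENSOR code. The function name `outer_product_abcd` and the sample extents are hypothetical.

```python
def outer_product_abcd(A, B):
    """Reference for C[a][b][c][d] = A[b][d] * B[a][c]; no mode is summed."""
    nb, nd = len(A), len(A[0])
    na, nc = len(B), len(B[0])
    return [[[[A[b][d] * B[a][c] for d in range(nd)]
              for c in range(nc)]
             for b in range(nb)]
            for a in range(na)]


# Hypothetical inputs, chosen only for illustration.
A = [[1, 2, 3]]             # extents: b=1, d=3
B = [[10, 20], [30, 40]]    # extents: a=2, c=2
C = outer_product_abcd(A, B)
print(C[1][0][1][2])  # A[0][2] * B[1][1] = 120
```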
cuTENSOR v1.3.1
- Improved tensor contraction performance model (i.e., algo CUTENSOR_ALGO_DEFAULT) 
- Improved performance for tensor contractions that have an overall large contracted dimension (i.e., a parallel reduction was added)
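The release note says only that a parallel reduction was added. The sketch below illustrates the general idea, often called split-K, and makes no claim about cuTENSOR's actual kernels: the contracted range is cut into chunks, each chunk produces an independent partial sum, and the partial sums are then reduced. The chunks run sequentially in this plain-Python version, whereas a GPU could compute them concurrently. The function name `dot_split_k` and its `num_chunks` parameter are hypothetical.

```python
def dot_split_k(x, y, num_chunks):
    """Dot product computed as independent partial sums that are then reduced."""
    K = len(x)
    # Ceiling division; at least 1 so the range step below is never zero.
    chunk = max(1, (K + num_chunks - 1) // num_chunks)
    partials = []
    for start in range(0, K, chunk):
        stop = min(start + chunk, K)
        # Each partial sum depends only on its own slice of the contracted mode.
        partials.append(sum(x[k] * y[k] for k in range(start, stop)))
    # Final reduction over the partial results.
    return sum(partials)


# Hypothetical inputs: a long contracted dimension of length 1000.
x = list(range(1, 1001))
y = [2] * 1000
result = dot_split_k(x, y, num_chunks=8)
print(result)  # 1001000
```

Integer inputs are used so that the result does not depend on summation order. With floating-point data, splitting the sum can change rounding slightly.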
Compatibility notes:
- Binaries provided for CUDA 10.2/11.0/11.x (x>0) for x86_64 and OpenPower 
- Binaries provided for CUDA 11.0/11.x (x>0) for ARM64 
cuTENSOR v1.3.0
- Support up to 40-dimensional tensors 
- Support 64-bit strides 
- Up to 2x performance improvement across the library
- Support for BF16 element-wise operations
Compatibility notes:
- Not binary compatible with previous versions, due to added int64 stride support. 
- Binaries provided for CUDA 10.2/11.0/11.x (x>0) for x86_64 and OpenPower 
- Binaries provided for CUDA 11.0/11.x (x>0) for ARM64 
Resolved issues:
- Fixed bug with mixed real-complex contraction and strided data 
cuTENSOR v1.2.2
- Improved performance for element-wise operations
Compatibility notes:
- Binaries provided for CUDA 10.1/10.2/11.x for x86_64 and OpenPower 
- Binaries provided for CUDA 11.x for ARM64 
cuTENSOR v1.2.1
- Added examples to https://github.com/NVIDIA/CUDALibrarySamples 
Compatibility notes:
- Requires a sufficiently recent (GCC 5 or higher) libstdc++ when linking statically 
- Binaries provided for CUDA 10.1/10.2/11.0/11.1 for x86_64 and OpenPower 
- Binaries provided for CUDA 11.0/11.1 for ARM64 
cuTENSOR v1.2.0
- Support for plan caching and autotuning
- Support for BF16 in element-wise and reduction operations
Compatibility notes:
- Binaries provided for CUDA 10.1/10.2/11.0 for x86_64 and OpenPower 
- Binaries provided for CUDA 11.0 for ARM64 
cuTENSOR v1.1.0
- Support for CUDA 11.0 
- Added support for Windows 10 x86_64 and Linux ARM64 platforms
- Added support for SM 8.0
- Support for third-generation Tensor Cores
- Improved performance 
Compatibility notes:
- Binaries provided for CUDA 10.1/10.2/11.0 for x86_64 and OpenPower 
- Binaries provided for CUDA 11.0 for ARM64 
cuTENSOR v1.0.1
- Added support for SM 6.0
cuTENSOR v1.0.0
- Initial release 
- Support for SM 7.0
- Support mixed-precision operations 
- Support device-side alpha and beta 
- Support C != D 
Compatibility notes:
- cuTENSOR requires CUDA 10.1/10.2 for x86_64 and OpenPower