Release Notes
cuTENSOR v2.0.2
cutensorMg:
Allow replicated tensors
Improve performance and expand the supported feature surface
cuTENSOR v2.0.0
Added support for just-in-time compilation of tensor contraction kernels
JIT’ed kernels can be stored/loaded to/from disk
Added support for 3XTF32 compute type
Added support for padding output tensors via cutensorPermute
Added support for 64-bit extents
Plan cache is activated by default now (i.e., switched default from opt-in to opt-out)
New APIs enable users to query the actually-used workspace requirement (leading to a reduced overall memory footprint)
cutensorTensorDescriptor_t supports tensors of arbitrary dimensionality
Key API changes:
All tensor operations use the plan-based multi-stage API (see the sketch at the end of this section)
cutensorOp_t moved from the tensor descriptor to the operation
Alignment moved from the operation to the tensor descriptor
All APIs use heap-allocated opaque data structures
Compatibility Notes:
Increased cuBLASLt version requirement: 11.3.1 (from CUDA toolkit 11.2)
Removed support for CUDA Toolkit 10.2
Removed support for RHEL 7
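The following is a minimal sketch of the new plan-based flow for a contraction D[m,n] = alpha * A[m,k] * B[k,n] + beta * C[m,n], combining several items above: the heap-allocated handle, operators attached to the operation, JIT opt-in through the plan preference, the exact-workspace query, and persisting the JIT kernel cache. Signatures follow the v2.x documentation as best recalled, so verify against the headers; device buffers are assumed to be allocated elsewhere and the file name kernelCache.bin is illustrative.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutensor.h>

#define CHECK(call)                                           \
  do {                                                        \
    cutensorStatus_t s_ = (call);                             \
    if (s_ != CUTENSOR_STATUS_SUCCESS) {                      \
      fprintf(stderr, "%s\n", cutensorGetErrorString(s_));    \
      exit(EXIT_FAILURE);                                     \
    }                                                         \
  } while (0)

void contract(const float* A_d, const float* B_d, float* C_d,
              int64_t m, int64_t n, int64_t k, cudaStream_t stream)
{
    cutensorHandle_t handle;
    CHECK(cutensorCreate(&handle));  // opaque, heap-allocated handle (v2.x)

    const int32_t modeA[] = {'m', 'k'};
    const int32_t modeB[] = {'k', 'n'};
    const int32_t modeC[] = {'m', 'n'};
    const int64_t extA[] = {m, k};
    const int64_t extB[] = {k, n};
    const int64_t extC[] = {m, n};
    const uint32_t kAlign = 128;  // alignment now lives on the tensor descriptor

    cutensorTensorDescriptor_t descA, descB, descC;
    CHECK(cutensorCreateTensorDescriptor(handle, &descA, 2, extA,
                                         NULL /* dense */, CUTENSOR_R_32F, kAlign));
    CHECK(cutensorCreateTensorDescriptor(handle, &descB, 2, extB,
                                         NULL, CUTENSOR_R_32F, kAlign));
    CHECK(cutensorCreateTensorDescriptor(handle, &descC, 2, extC,
                                         NULL, CUTENSOR_R_32F, kAlign));

    // Element-wise operators are attached to the operation, not the descriptor.
    cutensorOperationDescriptor_t op;
    CHECK(cutensorCreateContraction(handle, &op,
                                    descA, modeA, CUTENSOR_OP_IDENTITY,
                                    descB, modeB, CUTENSOR_OP_IDENTITY,
                                    descC, modeC, CUTENSOR_OP_IDENTITY,
                                    descC, modeC, CUTENSOR_COMPUTE_DESC_32F));

    // Opt in to JIT-compiled kernels via the plan preference.
    cutensorPlanPreference_t pref;
    CHECK(cutensorCreatePlanPreference(handle, &pref, CUTENSOR_ALGO_DEFAULT,
                                       CUTENSOR_JIT_MODE_DEFAULT));

    uint64_t wsEstimate = 0;
    CHECK(cutensorEstimateWorkspaceSize(handle, op, pref,
                                        CUTENSOR_WORKSPACE_DEFAULT, &wsEstimate));

    cutensorPlan_t plan;
    CHECK(cutensorCreatePlan(handle, &plan, op, pref, wsEstimate));

    // Query the workspace the selected kernel actually needs (may be smaller
    // than the estimate), shrinking the overall memory footprint.
    uint64_t wsActual = 0;
    CHECK(cutensorPlanGetAttribute(handle, plan, CUTENSOR_PLAN_REQUIRED_WORKSPACE,
                                   &wsActual, sizeof(wsActual)));

    void* ws = NULL;
    if (wsActual > 0) cudaMalloc(&ws, wsActual);

    const float alpha = 1.0f, beta = 0.0f;
    CHECK(cutensorContract(handle, plan, &alpha, A_d, B_d,
                           &beta, C_d, C_d, ws, wsActual, stream));

    // Persist JIT-compiled kernels so later runs can skip recompilation.
    CHECK(cutensorWriteKernelCacheToFile(handle, "kernelCache.bin"));

    cudaFree(ws);
    cutensorDestroyPlan(plan);
    cutensorDestroyPlanPreference(pref);
    cutensorDestroyOperationDescriptor(op);
    cutensorDestroyTensorDescriptor(descA);
    cutensorDestroyTensorDescriptor(descB);
    cutensorDestroyTensorDescriptor(descC);
    cutensorDestroy(handle);
}
```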
cuTENSOR v1.7.0
Deprecated cutensorInit; please use cutensorCreate and cutensorDestroy instead (a minimal sketch follows)
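A minimal sketch of the replacement lifetime pair, shown with the signatures as they appear in 2.x (where the handle is an opaque, heap-allocated object):

```c
#include <cutensor.h>

int main(void)
{
    cutensorHandle_t handle;  // opaque handle; allocated by the library
    if (cutensorCreate(&handle) != CUTENSOR_STATUS_SUCCESS)
        return 1;
    /* ... use the handle with any cuTENSOR call ... */
    cutensorDestroy(handle);  // frees the library-allocated handle
    return 0;
}
```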
cuTENSOR v1.6.2
Extended support for data type conversions (e.g., fp16 <-> fp32; see cutensorPermutation and the sketch after this section).
Fixed issues related to CUDA_MODULE_LOADING=LAZY
Compatibility Notes:
Deprecated CUDA Toolkit 10.2 support; it will be removed after the following release.
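For the extended conversion support above, the sketch below casts an fp32 buffer to fp16 through cutensorPermutation (v1.x API; signatures reproduced from memory, so verify against the docs before use):

```c
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cutensor.h>

// Typed copy: dst[i] = (half) src[i], expressed as a 1D "permutation".
cutensorStatus_t convert_fp32_to_fp16(cutensorHandle_t* handle,
                                      const float* src, __half* dst,
                                      int64_t n, cudaStream_t stream)
{
    const int32_t mode[] = {'i'};
    const int64_t extent[] = {n};

    // In 1.x the element-wise operator is part of the tensor descriptor.
    cutensorTensorDescriptor_t descSrc, descDst;
    cutensorInitTensorDescriptor(handle, &descSrc, 1, extent, NULL,
                                 CUDA_R_32F, CUTENSOR_OP_IDENTITY);
    cutensorInitTensorDescriptor(handle, &descDst, 1, extent, NULL,
                                 CUDA_R_16F, CUTENSOR_OP_IDENTITY);

    const float one = 1.0f;
    // Identical modes on both sides: no transposition, just the type change.
    return cutensorPermutation(handle, &one, src, &descSrc, mode,
                               dst, &descDst, mode, CUDA_R_32F, stream);
}
```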
cuTENSOR v1.6.1
Added support for SM 9.0
cuTENSOR v1.6.0
Significantly improved performance of cutensorPermutation() for alpha == 1 (i.e., without scaling).
cuTENSOR v1.5.0
Further improved support for high-dimensional tensor contractions (with more than 28 modes).
Analyzing cuTENSOR via NVIDIA’s compute-sanitizer no longer produces false-positive CUDA API errors
CUTENSOR_STATUS_NOT_INITIALIZED is now returned when any opaque data structure (e.g., cutensorTensorDescriptor_t) is not initialized
Added support for tensor contractions where a mode appears in only one of the input tensors (e.g., C[m,n] = A[m,k,c]*B[n,k]; in this case both modes ‘c’ and ‘k’ are contracted; see the sketch after this section)
Added support for contractions with host tensors in cuTENSORMg for aarch64, ppc64le and x86_64
Compatibility notes:
Deprecated cutensorContractionGetWorkspace; please use cutensorContractionGetWorkspaceSize instead
Deprecated cutensorReductionGetWorkspace; please use cutensorReductionGetWorkspaceSize instead
Resolved issues:
Fixed a bug related to reductions of tensors that result in a scalar.
Significantly improved tensor contraction performance for complex fp32 inputs while using tf32 compute.
Known Issues:
cuTENSORMg contraction execution may fail for some block sizes during multi-GPU runs where the GPUs are not connected with NVLink.
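For the single-sided contracted mode added above, a sketch with the 1.x descriptor API; the helper name is hypothetical, and tensor descriptors plus alignment requirements are assumed to be created by the caller:

```c
#include <stdint.h>
#include <cutensor.h>

// Hypothetical helper for C[m,n] = A[m,k,c] * B[n,k]: 'k' (shared) is
// contracted as usual, and 'c' (present only in A) is now summed over too.
cutensorStatus_t make_contraction_desc(
    const cutensorHandle_t* handle, cutensorContractionDescriptor_t* desc,
    const cutensorTensorDescriptor_t* descA, uint32_t alignA,
    const cutensorTensorDescriptor_t* descB, uint32_t alignB,
    const cutensorTensorDescriptor_t* descC, uint32_t alignC)
{
    const int32_t modeA[] = {'m', 'k', 'c'};
    const int32_t modeB[] = {'n', 'k'};
    const int32_t modeC[] = {'m', 'n'};
    return cutensorInitContractionDescriptor(handle, desc,
                                             descA, modeA, alignA,
                                             descB, modeB, alignB,
                                             descC, modeC, alignC,
                                             descC, modeC, alignC,  // D aliases C
                                             CUTENSOR_COMPUTE_32F);
}
```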
cuTENSOR v1.4.0
Preview: Added support for distributed, multi-GPU tensor operations
Support up to 64-dimensional tensors
Added a more accurate, but also more time-consuming, heuristic (accessible via CUTENSOR_ALGO_DEFAULT_PATIENT; see the sketch below)
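Selecting the new heuristic is a one-enum change in the 1.x find/plan flow; a sketch, assuming the usual v1.x signatures:

```c
#include <stdint.h>
#include <cutensor.h>

// Build a plan with the slower but more accurate kernel-selection heuristic.
cutensorStatus_t plan_with_patient_heuristic(
    const cutensorHandle_t* handle,
    const cutensorContractionDescriptor_t* desc,
    cutensorContractionPlan_t* plan, uint64_t workspaceSize)
{
    cutensorContractionFind_t find;
    cutensorStatus_t s = cutensorInitContractionFind(handle, &find,
                                                     CUTENSOR_ALGO_DEFAULT_PATIENT);
    if (s != CUTENSOR_STATUS_SUCCESS) return s;
    return cutensorInitContractionPlan(handle, plan, desc, &find, workspaceSize);
}
```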
cuTENSOR v1.3.3
Bugfix for some strided tensor contractions (i.e., those for which all modes are non-contiguous in memory).
Deprecated Ubuntu 16.04 support
cuTENSOR v1.3.2
Improved tensor contraction performance model (i.e., algo CUTENSOR_ALGO_DEFAULT)
Improved performance for tensor contractions that have a tiny contracted dimension (<= 8)
Improved performance for outer-product-like tensor contractions (e.g., C[a,b,c,d] = A[b,d] * B[a,c])
cuTENSOR v1.3.1
Improved tensor contraction performance model (i.e., algo CUTENSOR_ALGO_DEFAULT)
Improved performance for tensor contractions that have an overall large contracted dimension (i.e., a parallel reduction was added)
Compatibility notes:
Binaries provided for CUDA 10.2/11.0/11.x (x>0) for x86_64 and OpenPower
Binaries provided for CUDA 11.0/11.x (x>0) for ARM64
cuTENSOR v1.3.0
Support up to 40-dimensional tensors
Support 64-bit strides
Up to 2x performance improvement across library
Support for BF16 Element-wise operations
Compatibility notes:
Not binary compatible with previous versions, due to added int64 stride support.
Binaries provided for CUDA 10.2/11.0/11.x (x>0) for x86_64 and OpenPower
Binaries provided for CUDA 11.0/11.x (x>0) for ARM64
Resolved issues:
Fixed bug with mixed real-complex contraction and strided data
cuTENSOR v1.2.2
Improved performance for Element-wise operations
Compatibility notes:
Binaries provided for CUDA 10.1/10.2/11.x for x86_64 and OpenPower
Binaries provided for CUDA 11.x for ARM64
cuTENSOR v1.2.1
Added examples to https://github.com/NVIDIA/CUDALibrarySamples
Compatibility notes:
Requires a sufficiently recent libstdc++ (GCC 5 or higher) when linking statically
Binaries provided for CUDA 10.1/10.2/11.0/11.1 for x86_64 and OpenPower
Binaries provided for CUDA 11.0/11.1 for ARM64
cuTENSOR v1.2.0
Support for plan caching and autotuning (see the sketch at the end of this section)
Support for BF16 element-wise and reduction operations
Compatibility notes:
Binaries provided for CUDA 10.1/10.2/11.0 for x86_64 and OpenPower
Binaries provided for CUDA 11.0 for ARM64
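For the plan caching and autotuning above, a sketch with the 1.x API; the identifiers below are as recalled from the 1.x plan-cache documentation and should be verified before use:

```c
#include <stdint.h>
#include <stdlib.h>
#include <cutensor.h>

// Attach a software-managed plan cache to the handle and ask the find to
// autotune incrementally (re-measure candidate kernels across the first few
// calls). The cacheline storage must stay alive until detached via
// cutensorHandleDetachPlanCachelines().
void enable_plan_cache(cutensorHandle_t* handle, cutensorContractionFind_t* find)
{
    const uint32_t numCachelines = 1024;
    cutensorPlanCacheline_t* cachelines =
        (cutensorPlanCacheline_t*) malloc(sizeof(cutensorPlanCacheline_t) * numCachelines);
    cutensorHandleAttachPlanCachelines(handle, cachelines, numCachelines);

    const cutensorAutotuneMode_t mode = CUTENSOR_AUTOTUNE_INCREMENTAL;
    cutensorContractionFindSetAttribute(handle, find,
                                        CUTENSOR_CONTRACTION_FIND_AUTOTUNE_MODE,
                                        &mode, sizeof(mode));
}
```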
cuTENSOR v1.1.0
Support for CUDA 11.0
Added support for Windows 10 x86_64 and Linux ARM64 platforms
Added support for SM 8.0
Support for third-generation Tensor Cores
Improved performance
Compatibility notes:
Binaries provided for CUDA 10.1/10.2/11.0 for x86_64 and OpenPower
Binaries provided for CUDA 11.0 for ARM64
cuTENSOR v1.0.1
Added support for SM 6.0
cuTENSOR v1.0.0
Initial release
Support for SM 7.0
Support mixed-precision operations
Support device-side alpha and beta
Support C != D
Compatibility notes:
cuTENSOR requires CUDA 10.1/10.2 for x86_64 and OpenPower