Release Notes¶
cuTENSOR v2.0.1¶
Fix: Replaced a C++ include with its C counterpart (i.e., cstdint with stdint.h)
Fix: A bug in cutensorEstimateWorkspaceSize that could cause a contraction to be reported as not supported on SM75
cuTENSOR v2.0.0¶
Added support for just-in-time compilation of tensor contraction kernels
JIT-compiled kernels can be stored to and loaded from disk
Added support for 3XTF32 compute type
Added support for padding output tensors via cutensorPermute
Added support for int64 extents
Plan cache is activated by default now (i.e., switched default from opt-in to opt-out)
New APIs enable users to query the actually-used workspace requirement (leading to a reduced overall memory footprint)
cutensorTensorDescriptor_t supports arbitrarily dimensional tensors
Key API changes:
All tensor operations use the plan-based multi-stage API
cutensorOp_t moved from the tensor descriptor to the operation
Alignment moved from the operation to the tensor descriptor
All APIs use heap-allocated opaque data structures
Compatibility Notes:
Increased cuBLASLt version requirement: 11.3.1 (from CUDA toolkit 11.2)
Removed support for CUDA Toolkit 10.2
Removed support for RHEL 7
cuTENSOR v1.7.0¶
Deprecated cutensorInit; please use cutensorCreate and cutensorDestroy instead
cuTENSOR v1.6.2¶
Extended support for data type conversions (e.g. fp16 <-> fp32, see cutensorPermutation).
Fixed issues related to CUDA_MODULE_LOADING=LAZY
Compatibility Notes:
Deprecated CUDA Toolkit 10.2 support; it will be removed after the following release.
cuTENSOR v1.6.1¶
Added support for SM 9.0
cuTENSOR v1.6.0¶
Significantly improved performance of cutensorPermutation() for alpha == 1 (i.e., without scaling).
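An elementwise permutation of this kind computes B = alpha * op(A) with the modes reordered; for alpha == 1 it reduces to a pure transpose. A minimal NumPy sketch of the semantics (illustrative only, not the cuTENSOR API; the `permute` helper is hypothetical):

```python
import numpy as np

def permute(A, perm, alpha=1.0):
    """Illustrative stand-in for a cuTENSOR-style permutation:
    reorders the modes of A and scales the result by alpha."""
    return alpha * np.transpose(A, perm)

A = np.arange(24.0).reshape(2, 3, 4)   # modes (a, b, c)
B = permute(A, (2, 0, 1))              # modes (c, a, b); alpha == 1, no scaling
assert B.shape == (4, 2, 3)
assert B[1, 0, 2] == A[0, 2, 1]
```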
cuTENSOR v1.5.0¶
Further improved support for high-dimensional tensor contractions (with more than 28 modes).
Analyzing cuTENSOR via NVIDIA’s compute-sanitizer will no longer produce false-positive CUDA API errors.
CUTENSOR_STATUS_NOT_INITIALIZED is now returned when any opaque data structure (e.g., cutensorTensorDescriptor_t) is not initialized
Added support for tensor contraction where a mode only appears in one of the input tensors (e.g., C[m,n] = A[m,k,c]*B[n,k], in this case both the modes ‘c’ and ‘k’ will be contracted)
Added support for contractions with host tensors in cuTENSORMg for aarch64, ppc64le, and x86_64
Compatibility notes:
Deprecated cutensorContractionGetWorkspace; please use cutensorContractionGetWorkspaceSize instead
Deprecated cutensorReductionGetWorkspace; please use cutensorReductionGetWorkspaceSize instead
Resolved issues:
Fixed a bug related to reductions of tensors that result in a scalar.
Significantly improved tensor contraction performance for complex fp32 inputs while using tf32 compute.
Known Issues:
cuTENSORMg contraction execution may fail for some block sizes during multi-GPU runs where the GPUs are not connected with NVLink.
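The contraction with a mode that appears in only one input, mentioned above (C[m,n] = A[m,k,c] * B[n,k]), can be sketched with NumPy's einsum to illustrate the semantics (this illustrates the math only, not the cuTENSOR API):

```python
import numpy as np

m, n, k, c = 2, 3, 4, 5
A = np.random.rand(m, k, c)
B = np.random.rand(n, k)

# 'k' appears in both inputs, 'c' only in A; neither appears in the
# output, so both are summed (contracted) over:
C = np.einsum('mkc,nk->mn', A, B)
assert C.shape == (m, n)

# Equivalent reference: summing 'c' out of A first, then a k-contraction.
C_ref = A.sum(axis=2) @ B.T
assert np.allclose(C, C_ref)
```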
cuTENSOR v1.4.0¶
Preview: Added support for distributed, multi-GPU tensor operations
Support up to 64-dimensional tensors
Added more accurate, but also more time-consuming, heuristic (accessible via CUTENSOR_ALGO_DEFAULT_PATIENT)
cuTENSOR v1.3.3¶
Bugfix for some strided tensor contractions (i.e., those for which all modes are non-contiguous in memory).
Deprecated Ubuntu 16.04
cuTENSOR v1.3.2¶
Improved tensor contraction performance model (i.e., algo CUTENSOR_ALGO_DEFAULT)
Improved performance for tensor contractions that have a tiny contracted dimension (<= 8)
Improved performance for outer-product-like tensor contractions (e.g., C[a,b,c,d] = A[b,d] * B[a,c])
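The outer-product-like case above (C[a,b,c,d] = A[b,d] * B[a,c]) shares no modes between the inputs, so nothing is summed over. A NumPy sketch of the semantics (illustrative only, not the cuTENSOR API):

```python
import numpy as np

a, b, c, d = 2, 3, 4, 5
A = np.random.rand(b, d)
B = np.random.rand(a, c)

# No mode is shared between A and B and none is contracted: every
# output element is a single product, i.e., an outer product spread
# across four output modes.
C = np.einsum('bd,ac->abcd', A, B)
assert C.shape == (a, b, c, d)
assert np.isclose(C[1, 2, 3, 4], A[2, 4] * B[1, 3])
```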
cuTENSOR v1.3.1¶
Improved tensor contraction performance model (i.e., algo CUTENSOR_ALGO_DEFAULT)
Improved performance for tensor contractions that have an overall large contracted dimension (i.e., a parallel reduction was added)
Compatibility notes:
Binaries provided for CUDA 10.2/11.0/11.x (x>0) for x86_64 and OpenPower
Binaries provided for CUDA 11.0/11.x (x>0) for ARM64
cuTENSOR v1.3.0¶
Support up to 40-dimensional tensors
Support 64-bit strides
Up to 2x performance improvement across library
Support for BF16 Element-wise operations
Compatibility notes:
Not binary compatible with previous versions, due to added int64 stride support.
Binaries provided for CUDA 10.2/11.0/11.x (x>0) for x86_64 and OpenPower
Binaries provided for CUDA 11.0/11.x (x>0) for ARM64
Resolved issues:
Fixed bug with mixed real-complex contraction and strided data
cuTENSOR v1.2.2¶
Improved performance for Element-wise operations
Compatibility notes:
Binaries provided for CUDA 10.1/10.2/11.x for x86_64 and OpenPower
Binaries provided for CUDA 11.x for ARM64
cuTENSOR v1.2.1¶
Added examples to https://github.com/NVIDIA/CUDALibrarySamples
Compatibility notes:
Requires a sufficiently recent (GCC 5 or higher) libstdc++ when linking statically
Binaries provided for CUDA 10.1/10.2/11.0/11.1 for x86_64 and OpenPower
Binaries provided for CUDA 11.0/11.1 for ARM64
cuTENSOR v1.2.0¶
Support for cache plans and autotuning
Support BF16 for Elementwise and Reduction
Compatibility notes:
Binaries provided for CUDA 10.1/10.2/11.0 for x86_64 and OpenPower
Binaries provided for CUDA 11.0 for ARM64
cuTENSOR v1.1.0¶
Support for CUDA 11.0
Added support for Windows 10 x86_64 and Linux ARM64 platforms
Added support for SM 8.0
Support for third-generation Tensor Cores
Improved performance
Compatibility notes:
Binaries provided for CUDA 10.1/10.2/11.0 for x86_64 and OpenPower
Binaries provided for CUDA 11.0 for ARM64
cuTENSOR v1.0.1¶
Added support for SM 6.0
cuTENSOR v1.0.0¶
Initial release
Support for SM 7.0
Support mixed-precision operations
Support device-side alpha and beta
Support C != D
Compatibility notes:
cuTENSOR requires CUDA 10.1/10.2 for x86_64 and OpenPower