Release Notes

cuTENSOR v2.0.1

  • Fix: Replaced a C++ include with the corresponding C include (i.e., cstdint with stdint.h)

  • Fix: A bug related to cutensorEstimateWorkspaceSize that could cause a contraction to be reported as not supported on SM75

cuTENSOR v2.0.0

  • Added support for just-in-time compilation of tensor contraction kernels

    • JIT’ed kernels can be stored to and loaded from disk

  • Added support for 3XTF32 compute type

  • Added support for padding output tensors via cutensorPermute

  • Added support for int64 extents

  • The plan cache is now activated by default (i.e., the default switched from opt-in to opt-out)

  • New APIs enable users to query the workspace size actually required by a plan, leading to a reduced overall memory footprint

  • cutensorTensorDescriptor_t supports tensors of arbitrary dimensionality

  • Key API changes:

    • All tensor operations use the plan-based multi-stage API (see the sketch after this list)

    • cutensorOp_t moved from the tensor descriptor to the operation

    • Alignment moved from the operation to the tensor descriptor

    • All APIs use heap-allocated opaque data structures
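
  The items above combine into the following minimal sketch of the v2.0 plan-based flow, including the JIT opt-in and the query for the workspace a plan actually needs. Error handling is omitted; the extents, mode labels, and the kernel-cache file name are illustrative assumptions, so verify the exact signatures against the v2.0 headers.

    #include <stdint.h>
    #include <cuda_runtime.h>
    #include <cutensor.h>

    /* Sketch: D[m,n] = alpha * A[m,k] * B[k,n] + beta * C[m,n] (C == D here).
       Device buffers A_d, B_d, C_d and 'stream' are assumed to exist. */
    void contractionSketch(const void *A_d, const void *B_d, void *C_d,
                           cudaStream_t stream)
    {
        cutensorHandle_t handle;
        cutensorCreate(&handle);

        /* Optionally pre-load previously JIT-compiled kernels (fails
           harmlessly on the first run, before the file exists). */
        cutensorReadKernelCacheFromFile(handle, "kernelCache.bin");

        /* Alignment now lives in the tensor descriptor. */
        int64_t extA[] = {64, 32}, extB[] = {32, 64}, extC[] = {64, 64};
        int32_t modeA[] = {'m', 'k'}, modeB[] = {'k', 'n'}, modeC[] = {'m', 'n'};
        cutensorTensorDescriptor_t dA, dB, dC;
        cutensorCreateTensorDescriptor(handle, &dA, 2, extA, NULL, CUTENSOR_R_32F, 128);
        cutensorCreateTensorDescriptor(handle, &dB, 2, extB, NULL, CUTENSOR_R_32F, 128);
        cutensorCreateTensorDescriptor(handle, &dC, 2, extC, NULL, CUTENSOR_R_32F, 128);

        /* The elementwise operator (cutensorOp_t) now attaches to the operation. */
        cutensorOperationDescriptor_t op;
        cutensorCreateContraction(handle, &op,
                                  dA, modeA, CUTENSOR_OP_IDENTITY,
                                  dB, modeB, CUTENSOR_OP_IDENTITY,
                                  dC, modeC, CUTENSOR_OP_IDENTITY,
                                  dC, modeC, CUTENSOR_COMPUTE_DESC_32F);

        /* Opt in to just-in-time kernel compilation for this plan. */
        cutensorPlanPreference_t pref;
        cutensorCreatePlanPreference(handle, &pref, CUTENSOR_ALGO_DEFAULT,
                                     CUTENSOR_JIT_MODE_DEFAULT);

        uint64_t wsEstimate = 0;
        cutensorEstimateWorkspaceSize(handle, op, pref,
                                      CUTENSOR_WORKSPACE_DEFAULT, &wsEstimate);

        cutensorPlan_t plan;
        cutensorCreatePlan(handle, &plan, op, pref, wsEstimate);

        /* Query the workspace the plan actually requires (often below the
           estimate), reducing the overall memory footprint. */
        uint64_t wsActual = 0;
        cutensorPlanGetAttribute(handle, plan, CUTENSOR_PLAN_REQUIRED_WORKSPACE,
                                 &wsActual, sizeof(wsActual));

        void *work = NULL;
        cudaMalloc(&work, wsActual);

        float alpha = 1.0f, beta = 0.0f;
        cutensorContract(handle, plan, &alpha, A_d, B_d, &beta, C_d, C_d,
                         work, wsActual, stream);

        /* Persist JIT-compiled kernels so later runs skip compilation. */
        cutensorWriteKernelCacheToFile(handle, "kernelCache.bin");

        cudaStreamSynchronize(stream);
        cudaFree(work);
        cutensorDestroyPlan(plan);
        cutensorDestroyPlanPreference(pref);
        cutensorDestroyOperationDescriptor(op);
        cutensorDestroyTensorDescriptor(dA);
        cutensorDestroyTensorDescriptor(dB);
        cutensorDestroyTensorDescriptor(dC);
        cutensorDestroy(handle);
    }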

Compatibility notes:

  • Increased cuBLASLt version requirement: 11.3.1 (from CUDA toolkit 11.2)

  • Removed support for CUDA Toolkit 10.2

  • Removed support for RHEL 7

cuTENSOR v1.7.0

  • Deprecated cutensorInit; please use cutensorCreate and cutensorDestroy instead (see the sketch below)
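
  A minimal sketch of the new handle lifecycle. This is written against the v2.0-style signature (cutensorCreate taking a cutensorHandle_t*); the v1.7 header may instead expect a cutensorHandle_t**, so check cutensor.h of the release in use.

    #include <cutensor.h>

    int main(void)
    {
        /* Replaces the deprecated cutensorInit-based initialization. */
        cutensorHandle_t handle;
        if (cutensorCreate(&handle) != CUTENSOR_STATUS_SUCCESS)
            return 1;

        /* ... issue cuTENSOR calls with 'handle' ... */

        cutensorDestroy(handle);
        return 0;
    }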

cuTENSOR v1.6.2

  • Extended support for data type conversions (e.g., fp16 <-> fp32; see cutensorPermutation and the sketch after this list).

  • Fixed issues related to CUDA_MODULE_LOADING=LAZY
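
  A minimal sketch of such a conversion with the v1.x API: copying a fp16 tensor into a fp32 tensor of the same shape via cutensorPermutation. The extents and mode labels are illustrative assumptions; device buffers are assumed to be allocated elsewhere.

    #include <cuda_runtime.h>
    #include <cutensor.h>

    /* Convert A[m,n] (fp16, A_d) into B[m,n] (fp32, B_d). */
    void convertSketch(const void *A_d, void *B_d, cudaStream_t stream)
    {
        cutensorHandle_t handle;
        cutensorInit(&handle);

        int64_t extent[] = {128, 256};
        int32_t mode[] = {'m', 'n'};  /* same mode order: a pure type conversion */

        cutensorTensorDescriptor_t descA, descB;
        cutensorInitTensorDescriptor(&handle, &descA, 2, extent, NULL,
                                     CUDA_R_16F, CUTENSOR_OP_IDENTITY);
        cutensorInitTensorDescriptor(&handle, &descB, 2, extent, NULL,
                                     CUDA_R_32F, CUTENSOR_OP_IDENTITY);

        float alpha = 1.0f;  /* alpha == 1: no scaling applied */
        cutensorPermutation(&handle, &alpha, A_d, &descA, mode,
                            B_d, &descB, mode, CUDA_R_32F, stream);
    }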

Compatibility notes:

  • Deprecated CUDA Toolkit 10.2 support; it will be removed after the following release.

cuTENSOR v1.6.1

cuTENSOR v1.6.0

  • Significantly improved performance of cutensorPermutation() for alpha == 1 (i.e., without scaling).

cuTENSOR v1.5.0

  • Further improved support for high-dimensional tensor contractions (with more than 28 modes).

  • Analyzing cuTENSOR via NVIDIA’s compute-sanitizer no longer produces false-positive CUDA API errors.

  • CUTENSOR_STATUS_NOT_INITIALIZED is now returned when any opaque data structure (e.g., cutensorTensorDescriptor_t) is not initialized

  • Added support for tensor contractions where a mode appears in only one of the input tensors (e.g., C[m,n] = A[m,k,c] * B[n,k]; in this case both modes ‘c’ and ‘k’ are contracted)

  • Added support for contractions with host tensors in cuTENSORMg for aarch64, ppc64le, and x86_64

Compatibility notes:

  • Deprecated cutensorContractionGetWorkspace; please use cutensorContractionGetWorkspaceSize instead

  • Deprecated cutensorReductionGetWorkspace; please use cutensorReductionGetWorkspaceSize instead

Resolved issues:

  • Fixed a bug related to tensor reductions that produce a scalar result.

  • Significantly improved tensor contraction performance for complex fp32 inputs while using tf32 compute.

Known issues:

  • cuTENSORMg contraction execution may fail for some block sizes during multi-GPU runs where the GPUs are not connected with NVLink.

cuTENSOR v1.4.0

  • Preview: Added support for distributed, multi-GPU tensor operations

  • Support up to 64-dimensional tensors

  • Added a more accurate, but also more time-consuming, heuristic (accessible via CUTENSOR_ALGO_DEFAULT_PATIENT; see the sketch below)
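
  The patient heuristic is selected when initializing the find object, as in this small sketch against the v1.x API (the surrounding descriptor and plan creation are unchanged):

    #include <cutensor.h>

    /* 'handle' is assumed to be an initialized cutensorHandle_t. */
    void pickPatientHeuristic(const cutensorHandle_t *handle,
                              cutensorContractionFind_t *find)
    {
        /* Spends more time searching for a kernel, typically yielding
           better contraction performance than CUTENSOR_ALGO_DEFAULT. */
        cutensorInitContractionFind(handle, find, CUTENSOR_ALGO_DEFAULT_PATIENT);
        /* 'find' is then passed to cutensorInitContractionPlan as usual. */
    }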

cuTENSOR v1.3.3

  • Bugfix for some strided tensor contractions (i.e., those for which all modes are non-contiguous in memory).

  • Deprecated support for Ubuntu 16.04

cuTENSOR v1.3.2

  • Improved tensor contraction performance model (i.e., algo CUTENSOR_ALGO_DEFAULT)

  • Improved performance for tensor contractions that have a tiny contracted dimension (<= 8)

  • Improved performance for outer-product-like tensor contractions (e.g., C[a,b,c,d] = A[b,d] * B[a,c])

cuTENSOR v1.3.1

  • Improved tensor contraction performance model (i.e., algo CUTENSOR_ALGO_DEFAULT)

  • Improved performance for tensor contractions that have an overall large contracted dimension (i.e., a parallel reduction was added)

Compatibility notes:

  • Binaries provided for CUDA 10.2/11.0/11.x (x>0) for x86_64 and OpenPower

  • Binaries provided for CUDA 11.0/11.x (x>0) for ARM64

cuTENSOR v1.3.0

  • Support up to 40-dimensional tensors

  • Support 64-bit strides

  • Up to 2x performance improvement across the library

  • Support for BF16 element-wise operations

Compatibility notes:

  • Not binary compatible with previous versions, due to added int64 stride support.

  • Binaries provided for CUDA 10.2/11.0/11.x (x>0) for x86_64 and OpenPower

  • Binaries provided for CUDA 11.0/11.x (x>0) for ARM64

Resolved issues:

  • Fixed a bug with mixed real-complex contractions and strided data

cuTENSOR v1.2.2

  • Improved performance for element-wise operations

Compatibility notes:

  • Binaries provided for CUDA 10.1/10.2/11.x for x86_64 and OpenPower

  • Binaries provided for CUDA 11.x for ARM64

cuTENSOR v1.2.1

Compatibility notes:

  • Requires a sufficiently recent libstdc++ (GCC 5 or higher) when linking statically

  • Binaries provided for CUDA 10.1/10.2/11.0/11.1 for x86_64 and OpenPower

  • Binaries provided for CUDA 11.0/11.1 for ARM64

cuTENSOR v1.2.0

  • Support for plan caching and autotuning (see the sketch after this list)

  • Support for BF16 in element-wise and reduction operations
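
  A minimal sketch of attaching a plan cache under the v1.x API; the cacheline count of 64 is an arbitrary illustrative choice, and the exact names should be checked against the v1.2 headers.

    #include <cutensor.h>

    /* Attach a software-managed plan cache to an initialized handle. */
    void attachPlanCache(cutensorHandle_t *handle)
    {
        /* Static storage: the cachelines must outlive their attachment. */
        static cutensorPlanCacheline_t cachelines[64];
        cutensorHandleAttachPlanCachelines(handle, cachelines, 64);
        /* Subsequent plan creation consults and updates the cache; call
           cutensorHandleDetachPlanCachelines(handle) before teardown. */
    }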

Compatibility notes:

  • Binaries provided for CUDA 10.1/10.2/11.0 for x86_64 and OpenPower

  • Binaries provided for CUDA 11.0 for ARM64

cuTENSOR v1.1.0

  • Support for CUDA 11.0

  • Added support for Windows 10 x86_64 and Linux ARM64 platforms

  • Added support for SM 8.0

  • Support for third-generation Tensor Cores

  • Improved performance

Compatibility notes:

  • Binaries provided for CUDA 10.1/10.2/11.0 for x86_64 and OpenPower

  • Binaries provided for CUDA 11.0 for ARM64

cuTENSOR v1.0.1

  • Added support for SM 6.0

cuTENSOR v1.0.0

  • Initial release

  • Support for SM 7.0

  • Support for mixed-precision operations

  • Support for device-side alpha and beta

  • Support for C != D (i.e., the output tensor may differ from the input C)

Compatibility notes:

  • cuTENSOR requires CUDA 10.1/10.2 for x86_64 and OpenPower