Release Notes

cuTensorNet v2.4.0

Compatibility notes:

  • cuTensorNet requires cuTENSOR v2.0.1 or above.

  • cuQuantum will drop support for RHEL 7 in the next cuQuantum release. Please plan accordingly.

Known issues:

  • For the MPS computations based on cutensornetStateFinalizeMPS() APIs, if the state has different extents on different modes and there are operators applied to two non-adjacent modes, the exact MPS factorization may not be computed.

cuTensorNet v2.3.0

Compatibility notes:

  • cuTensorNet requires cuTENSOR v1.6.1 or above, but is not compatible with v2.x.y. cuTENSOR v1.7.0 is recommended for its performance improvements, bug fixes, and CUDA Lazy Loading support.

cuTensorNet v2.2.1

  • Bugs fixed:

    • Fix a regression leading to a “not supported” error for unary (single-operand) contractions.

cuTensorNet v2.2.0

Compatibility notes:

  • cuTensorNet supports Ubuntu 20.04+.

Known issues:

cuTensorNet v2.1.0

  • New functionalities:

    • Support for caching intermediate tensors for reuse in repeated tensor network contractions. This yields a substantial speedup when a tensor network contraction is executed more than once and a large fraction of the input tensors stays constant while the rest update their values. For example, computing amplitudes of individual bit-strings or small batches of bit-strings can benefit from this feature. Users specify which tensors are constant, and cuTensorNet uses this information to build internal data structures that cache constant intermediate tensors for reuse in repeated executions of the contraction plan. Note that if all input tensors are marked constant, the output tensor is constant as well, so there is no benefit to contracting the network again. In that case the caching mechanism is not triggered, and repeated contractions incur the same execution time.
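
The benefit of marking tensors constant can be illustrated without any GPU library. The following pure-Python sketch is not the cuTensorNet API; the class `CachedChain` and the helper `matmul` are hypothetical names. It contracts a three-tensor chain A·B·C in which A and B are constant and only C changes, so the intermediate A·B is computed once and reused.

```python
from typing import List, Optional

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Contract the shared index of two matrices (plain matrix product)."""
    inner = len(y)
    cols = len(y[0])
    return [
        [sum(row[k] * y[k][j] for k in range(inner)) for j in range(cols)]
        for row in x
    ]


class CachedChain:
    """Hypothetical helper: contracts A @ B @ C with A and B held constant."""

    def __init__(self, a: Matrix, b: Matrix) -> None:
        self._a = a
        self._b = b
        self._ab: Optional[Matrix] = None  # cached constant intermediate
        self.intermediate_builds = 0       # how many times A @ B was computed

    def contract(self, c: Matrix) -> Matrix:
        if self._ab is None:
            # First execution: build and cache the constant intermediate.
            self._ab = matmul(self._a, self._b)
            self.intermediate_builds += 1
        # Every execution: only the contraction involving the varying tensor.
        return matmul(self._ab, c)


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.0, 1.0], [1.0, 0.0]]
chain = CachedChain(a, b)

first = chain.contract([[1.0], [0.0]])
second = chain.contract([[0.0], [1.0]])
```

After both calls, `chain.intermediate_builds` is still 1: the second contraction only pays for the step that involves the updated tensor.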

  • Bugs fixed:

    • Failure of cutensornetTensorQR() when users provide a customized memory pool to compute the QR factorization of double complex data with certain extent combinations.

    • Failed autotune in some corner cases with “insufficient workspace” error.

    • Failed execution of cutensornetTensorSVD() when all singular values are trimmed out. For cuTensorNet v2.1.0, one singular value will be retained in the output for such cases. This behavior may be subject to change in a future release.
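
The retained-value behavior can be pictured with a small pure-Python sketch. It is not the cuTensorNet implementation: the function `truncate_singular_values` and its `abs_cutoff` parameter are hypothetical, and keeping the largest value when everything falls below the cutoff is an assumption, since the note only says that one singular value is retained.

```python
from typing import List


def truncate_singular_values(svals: List[float], abs_cutoff: float) -> List[float]:
    """Drop singular values at or below abs_cutoff, but never return an empty list.

    Assumes svals is non-empty and sorted in descending order, as SVD routines
    conventionally return them.
    """
    kept = [s for s in svals if s > abs_cutoff]
    if not kept:
        # Assumption: when every value is trimmed, keep the largest one so the
        # factorization still has a shared mode of extent 1.
        kept = [svals[0]]
    return kept


partial = truncate_singular_values([3.0, 1.0, 0.1], abs_cutoff=0.5)
all_trimmed = truncate_singular_values([0.3, 0.2, 0.1], abs_cutoff=0.5)
```

Here `partial` keeps the two values above the cutoff, while `all_trimmed` still contains a single entry instead of being empty.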

  • Other changes:

    • The cuTensorNet-MPI wrapper library needs to be linked to the MPI library. If you use our conda-forge packages or the cuQuantum Appliance container, or compile your own using the provided script, this is taken care of for you.

    • Introduce support for CUDA 12.

    • A set of new wheels with the suffix -cu12 is released on PyPI for CUDA 12 users.

      • Example: pip install cutensornet-cu12 for installing cuTensorNet compatible with CUDA 12.

      • The existing cuquantum wheel (without the -cuXX suffix) is turned into an automated installer that will attempt to detect the current CUDA environment and install the appropriate wheels. Please note that this automated detection may encounter conditions under which detection is unsuccessful, especially in a CPU-only environment (such as CI/CD). If detection fails we assume that the target environment is CUDA 11 and proceed. This assumption may be changed in a future release, and in such cases we recommend that users explicitly (manually) install the correct wheels.
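
The fallback described above amounts to the following decision, shown here as a hypothetical sketch rather than the installer's actual code. The function `choose_wheel` and its `detected_cuda_major` argument are illustrative names; only the `-cu11`/`-cu12` suffixes and the CUDA 11 default come from these notes.

```python
from typing import Optional


def choose_wheel(package: str, detected_cuda_major: Optional[int]) -> str:
    """Return the suffixed wheel name for a detected CUDA major version.

    detected_cuda_major is None when detection fails (for example on a
    CPU-only CI machine); in that case CUDA 11 is assumed, per the notes.
    """
    major = 11 if detected_cuda_major is None else detected_cuda_major
    if major not in (11, 12):
        raise ValueError(f"no wheel known for CUDA {major}")
    return f"{package}-cu{major}"


with_cuda12 = choose_wheel("cutensornet", 12)
undetected = choose_wheel("cutensornet", None)
```

In environments where detection is unreliable, naming the suffixed wheel directly (for example `cutensornet-cu12`) avoids the fallback altogether.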

  • Performance enhancements:

    • CUDA Lazy Loading is supported. This can significantly reduce memory footprint by deferring the loading of needed GPU kernels to the first call sites. This feature requires CUDA 11.8 (or above) and cuTENSOR 1.7.0 (or above). Please refer to the CUDA documentation for other requirements and details. Currently this feature requires users to opt in by setting the environment variable CUDA_MODULE_LOADING=LAZY. In a future CUDA version, lazy loading may become the default.
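
One way to opt in from a Python program is to set the variable before any CUDA library is loaded in the process. This is a minimal sketch: the helper name `enable_cuda_lazy_loading` is hypothetical, and it assumes the variable is read when CUDA initializes, so it has to run before importing any GPU package.

```python
import os
from typing import MutableMapping


def enable_cuda_lazy_loading(env: MutableMapping[str, str] = os.environ) -> str:
    """Opt in to CUDA Lazy Loading unless the user already chose a mode.

    Returns the value that is in effect after the call.
    """
    # setdefault leaves an explicit user setting (e.g. EAGER) untouched.
    return env.setdefault("CUDA_MODULE_LOADING", "LAZY")


# Call this before importing any package that initializes CUDA.
mode = enable_cuda_lazy_loading()
```

Using `setdefault` rather than plain assignment is a design choice: it lets a value exported in the shell take precedence over the program's default.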

Compatibility notes:

  • cuTensorNet requires cuTENSOR 1.6.1 or above, but cuTENSOR 1.7.0 or above is recommended for its performance improvements, bug fixes, and CUDA Lazy Loading support.

  • cuTensorNet supports Ubuntu 18.04+

    • In the next release, Ubuntu 18.04 will be dropped. The minimum supported Ubuntu version will be 20.04.

cuTensorNet v2.0.0

  • We are on NVIDIA/cuQuantum GitHub Discussions! For any questions regarding (or exciting works built upon) cuQuantum, please feel free to reach out to us on GitHub Discussions.

  • Major release:

    • A conda package is released on conda-forge: conda install -c conda-forge cutensornet. Users can still obtain both cuTensorNet and cuStateVec with conda install -c conda-forge cuquantum, as before.

    • A pip wheel is released on PyPI: pip install cutensornet-cu11. Users can still obtain both cuTensorNet and cuStateVec with pip install cuquantum, as before.

      • Currently, the cuquantum meta-wheel points to the cuquantum-cu11 meta-wheel (which then points to cutensornet-cu11 and custatevec-cu11 wheels). This may change in a future release when a new CUDA version becomes available. Using wheels with the -cuXX suffix is encouraged.

  • New functionalities:

    • Initial support for Hopper users. This requires CUDA 11.8.

    • New APIs to create, query, and destroy tensor descriptor objects.

    • New APIs and functionalities for approximate tensor network algorithms. cuTensorNet now supports the computational primitives mentioned below to enable users to develop approximate tensor network simulators for quantum circuits including MPS, PEPS, and more:

      • Tensor decomposition via QR or SVD. Both exact and truncated SVD supported.

      • Application of a gate to a pair of connected tensors followed by compression.

    • New APIs to create, tune, query, and destroy tensor SVD truncation settings.

    • New APIs to create, query, and destroy runtime tensor SVD truncation information.

    • Automatic distributed execution: cuTensorNet API is extended to include functions enabling automated distributed parallelization of tensor network contractions across multiple GPUs. Once activated, the parallelization is applied to both tensor network contraction path finding (when hyper-sampling is enabled) and contraction execution, without making any changes to the original serial source code.

  • Functionalities introduced that break previous APIs:

  • Bugs fixed:

    • Memory access error when running cuda-memcheck in a few corner cases.

    • Logging related bug upon setting some attributes.

    • Inaccurate flops computed by cuTensorNet with user-provided path & slicing.

    • “Undefined symbol” error when using cuTensorNet in the NVIDIA HPC SDK container.

    • Incorrect handling of extent-1 modes in the deprecated cutensornetGetOutputTensorDetails() API.

  • Performance enhancements:

    • Improved performance of the contraction path optimization process. An average speedup of about 3X was observed across many problems.

    • Improved performance of the contraction auto-tuning process.

    • Improved the quality of the slicing algorithm. We now select the configuration with the minimum number of slices that has the minimal flops overhead.

    • More auto-tuning heuristics were added that improve tensor contraction performance.

  • Other changes:

Compatibility notes:

  • cuTensorNet requires cuTENSOR 1.6.1 or above, but cuTENSOR 1.6.2 or above is recommended, for performance improvements and bug fixes.

  • cuTensorNet requires CUDA 11.x, but CUDA 11.8 is recommended, for Hopper support, performance improvements, and bug fixes.

Known issues:

  • With CUDA 11.7 or lower, cutensornetTensorQR() can potentially fail for certain extents.

  • cutensornetTensorQR() can potentially fail when users provide a customized memory pool to compute the QR factorization of double complex data with certain extent combinations.

  • With cuTENSOR 1.6.1 and Turing, broadcasting tensor modes with extent-1 might fail in certain cases.

cuTensorNet v1.1.1

  • Bugs fixed:

    • The version constraint cuTENSOR>=1.5,<2, as stated elsewhere in the documentation, was not correctly enforced. Both the code and the various package sources are now fixed.

cuTensorNet v1.1.0

  • New APIs and functionalities introduced:

  • Functionality/performance improvements:

    • Since near optimal paths are easily found for small networks without simplification, and since simplification does not guarantee an optimal path, the simplification phase has been turned OFF by default when the simplified network is sufficiently small.

    • A new slicing algorithm has been developed, leading to potentially more efficient slicing solutions.

    • Improve contraction performance by optimizing intermediate mode-ordering.

    • Improve contraction performance of networks that have many singleton mode labels.

  • Bugs fixed:

    • Previously, in rare circumstances, the slicing algorithm could fail to make progress toward finding a valid solution, resulting in an infinite loop. This has been fixed.

    • A bug in the deprecated cutensornetContraction() API that accepted sliceId >= numSlices.

  • Other changes:

    • Provide a distributed (MPI-based) C sample that shows how easy it is to use cuTensorNet and create parallelism.

    • Update the (non-distributed) C sample by improving memory usage and employing the new contraction API cutensornetContractSlices().

cuTensorNet v1.0.1

  • Bugs fixed:

  • Performance enhancements:

    • This release improved the support for generalized einsum expression to provide a better contraction path.

  • Other changes:

    • The Overview and Examples pages are significantly improved!

    • Clarify in the documentation and sample that the contraction over slices needs to be done in ascending order, and that when parallelizing over the slices the output tensor should be zero-initialized.

    • Clarify in the documentation that the returned FLOP count assumes real-valued inputs.

    • Several issues in the C++ sample (samples/cutensornet/ are fixed.
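
The FLOP-count clarification in this list matters when estimating the cost of complex-valued contractions. The sketch below shows one conventional way to rescale the reported number; the function `complex_flops_estimate` is hypothetical, and the factor of 4 is an assumption that holds when the count is dominated by multiply-adds.

```python
def complex_flops_estimate(real_flops: float) -> float:
    """Scale a real-valued FLOP count to complex-valued inputs.

    Assumption: the count is dominated by multiply-adds. A real multiply-add
    costs 2 operations; a complex one costs 8 (4 multiplies and 4 additions),
    hence the factor of 4.
    """
    return 4.0 * real_flops


estimate = complex_flops_estimate(1.0e9)
```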

cuTensorNet v1.0.0

Compatibility notes:

  • cuTensorNet requires CUDA 11.x.

  • cuTensorNet requires cuTENSOR 1.5.0 or above.

  • cuTensorNet requires OpenMP runtime (GOMP).

  • cuTensorNet no longer requires NVIDIA HPC SDK.

Limitation notes:

  • If multiple slices are created, the order of contracting over slices using cutensornetContraction() should be ascending starting from slice 0. If parallelizing over slices manually (in any fashion: streams, devices, processes, etc.), please make sure the output tensors (that are subject to a global reduction) are zero-initialized.
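
The zero-initialization requirement follows from the fact that each slice contributes a partial sum to the same output. The sketch below is plain Python and does not call cuTensorNet; `contract_slice` is a hypothetical stand-in for one sliced contraction, here a matrix product whose shared index is split into slices of extent 1.

```python
from typing import List

Matrix = List[List[float]]


def contract_slice(a: Matrix, b: Matrix, slice_id: int, out: Matrix) -> None:
    """Accumulate the contribution of one slice of the shared index into out."""
    for i in range(len(a)):
        for j in range(len(b[0])):
            out[i][j] += a[i][slice_id] * b[slice_id][j]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
num_slices = len(b)  # one slice per value of the contracted index

# The output is zero-initialized because every slice adds into it.
out = [[0.0, 0.0], [0.0, 0.0]]
for slice_id in range(num_slices):  # ascending order, starting from slice 0
    contract_slice(a, b, slice_id, out)
```

When slices are distributed over streams, devices, or processes, each worker would accumulate into its own zero-initialized buffer, and the buffers would then be summed in the global reduction.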

cuTensorNet v0.1.0

  • Initial public release

  • Add support for Linux ppc64le

  • Add new APIs and functionalities for:

    • Fine-tuning the slicing algorithm

    • Reconfiguring a tensor network

    • Simplifying a tensor network

    • Optimizing pathfinder parameters using the hyperoptimizer

    • Retrieving the optimizer configuration parameters

  • API changes:

Compatibility notes:

  • cuTensorNet requires cuTENSOR 1.4.0 or above

  • cuTensorNet requires NVIDIA HPC SDK 21.11 or above

cuTensorNet v0.0.1

  • Initial release

  • Support Linux x86_64 and Linux Arm64

  • Support Volta and Ampere architectures (compute capability 7.0+)

Compatibility notes:

  • cuTensorNet requires CUDA 11.4 or above

  • cuTensorNet requires cuTENSOR 1.3.3 or above

  • cuTensorNet supports NVIDIA HPC SDK 21.7 or above

Limitation notes:

  • This release is optimized for NVIDIA A100 and V100 GPUs.