*************
Release Notes
*************

==================
cuTensorNet v2.0.0
==================

* We are on `NVIDIA/cuQuantum GitHub Discussions <https://github.com/NVIDIA/cuQuantum/discussions>`_! For any questions regarding (or exciting work built upon) cuQuantum, please feel free to reach out to us on GitHub Discussions.

  * Bug reports should still go to `our GitHub issue tracker <https://github.com/NVIDIA/cuQuantum/issues>`_.

* Major release:

  * A conda package is released on conda-forge: ``conda install -c conda-forge cutensornet``. Users can still obtain both *cuTensorNet* and *cuStateVec* with ``conda install -c conda-forge cuquantum``, as before.

  * A pip wheel is released on PyPI: ``pip install cutensornet-cu11``. Users can still obtain both *cuTensorNet* and *cuStateVec* with ``pip install cuquantum``, as before.

    * Currently, the ``cuquantum`` meta-wheel points to the ``cuquantum-cu11`` meta-wheel (which then points to ``cutensornet-cu11`` and ``custatevec-cu11`` wheels). This may change in a future release when a new CUDA version becomes available. Using wheels with the ``-cuXX`` suffix is encouraged.

* New functionalities:

  * Initial support for Hopper GPUs. This requires CUDA 11.8.
  * New APIs to create, query, and destroy tensor descriptor objects.
  * New APIs and functionalities for approximate tensor network algorithms. *cuTensorNet* now supports the computational primitives listed below, which enable users to develop approximate tensor network simulators for quantum circuits, including MPS, PEPS, and more:

    * Tensor decomposition via QR or SVD. Both exact and truncated SVD are supported.
    * Application of a gate to a pair of connected tensors followed by compression.

  * New APIs to create, tune, query, and destroy tensor SVD truncation settings.
  * New APIs to create, query, and destroy runtime tensor SVD truncation information.
  * Automatic distributed execution: *cuTensorNet* API is extended to include functions enabling automated distributed parallelization of tensor network contractions across multiple GPUs. Once activated, the parallelization is applied to both tensor network *contraction path finding* (when hyper-sampling is enabled) and *contraction execution*, without making any changes to the original serial source code.

* Functionalities introduced that break previous APIs:

  * Complex conjugation operator on input tensors (this adds an extra parameter that specifies tensor qualifiers in the `cutensornetCreateNetworkDescriptor()` API).
  * A new way to specify the slices as (sliced modes, sliced extents) with one call to `cutensornetContractionOptimizerInfoSetAttribute()`, using the attribute `CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_SLICING_CONFIG`. This removes a limitation of the old API, which required users to set the modes first before the extents.
  * Removed the [in,out] alignment-requirement parameters from the `cutensornetCreateNetworkDescriptor` API. These are no longer required and are now inferred internally.
  * Some enum values are reordered. If your application stores any *cuTensorNet* enum values as plain integers, please make sure to rebuild your application.

* Bugs fixed:

  * Memory access error when running cuda-memcheck in a few corner cases.
  * Logging-related bug upon setting some attributes.
  * Inaccurate flops computed by cuTensorNet with user-provided path and slicing.
  * "Undefined symbol" error when using cuTensorNet in the NVIDIA HPC SDK container.
  * Incorrect handling of extent-1 modes in the deprecated `cutensornetGetOutputTensorDetails` API.

* Performance enhancements:

  * Improved performance of the contraction path optimization process. On average, a speedup of about 3x was observed on many problems.
  * Improved performance of the contraction auto-tuning process.
  * Improved the quality of the slicing algorithm. We now select the configuration with the minimum number of slices that has the minimal flops overhead.
  * Added more auto-tuning heuristics that improve tensor contraction performance.

* Other changes:

  * GNU OpenMP Runtime (gomp) is no longer needed.
  * A new API, `cutensornetWorkspaceComputeContractionSizes`, replaces the `cutensornetWorkspaceComputeSizes` API, which is *deprecated* and will be removed in a future release.
  * Two new APIs, `cutensornetGetOutputTensorDescriptor` and `cutensornetGetTensorDetails`, replace the `cutensornetGetOutputTensorDetails` API, which is *deprecated* and will be removed in a future release.
  * New samples (`samples/cutensornet/ <https://github.com/NVIDIA/cuQuantum/blob/main/samples/cutensornet>`_).
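
The enum-reordering note in the breaking changes above can be illustrated with a minimal, library-free sketch. The two enum classes below are hypothetical and are **not** actual *cuTensorNet* definitions; they only show how an integer saved against one enum layout can silently map to a different member once the values are reordered.

.. code-block:: python

    from enum import IntEnum


    # Hypothetical "old release" layout (illustration only).
    class ExampleOptionV1(IntEnum):
        ALPHA = 0
        BETA = 1
        GAMMA = 2


    # Hypothetical "new release" layout with reordered values.
    class ExampleOptionV2(IntEnum):
        GAMMA = 0
        ALPHA = 1
        BETA = 2


    def reinterpret(stored: int) -> str:
        """Decode an integer saved under the old layout using the new layout."""
        return ExampleOptionV2(stored).name


    stored_value = int(ExampleOptionV1.BETA)      # persisted as a plain int
    print(ExampleOptionV1(stored_value).name)     # BETA
    print(reinterpret(stored_value))              # ALPHA

Referring to enum members by name and rebuilding against the new headers avoids this kind of mismatch.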

*Compatibility notes*:

* *cuTensorNet* requires cuTENSOR 1.6.1 or above, but cuTENSOR 1.6.2 or above is recommended, for performance improvements and bug fixes.
* *cuTensorNet* requires CUDA 11.x, but CUDA 11.8 is recommended, for Hopper support, performance improvements, and bug fixes.

*Known issues*:

* With CUDA 11.7 or lower, `cutensornetTensorQR` can potentially fail for certain extents.
* `cutensornetTensorQR` can potentially fail when users provide a customized memory pool to compute the QR factorization of double complex data with certain extent combinations.
* With cuTENSOR 1.6.1 and Turing, broadcasting tensor modes with extent-1 might fail in certain cases.

==================
cuTensorNet v1.1.1
==================

* Bugs fixed:

  * The version constraint ``cuTENSOR>=1.5,<2`` as promised elsewhere in the documentation was not correctly respected. Both the code and various package sources are now fixed.
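
As a rough illustration of what the ``cuTENSOR>=1.5,<2`` constraint accepts, the sketch below checks a dotted version string against a half-open range. The helper names are hypothetical, only plain ``major.minor.patch`` strings are handled, and this is not how the library or its packages enforce the constraint.

.. code-block:: python

    def parse_version(text: str) -> tuple:
        """Turn '1.6' or '1.6.2' into a 3-tuple of ints, padding with zeros."""
        parts = tuple(int(part) for part in text.split("."))
        return parts + (0,) * (3 - len(parts))


    def satisfies(version: str, lower: str = "1.5", upper: str = "2") -> bool:
        """Return True when lower <= version < upper (default: >=1.5,<2)."""
        return parse_version(lower) <= parse_version(version) < parse_version(upper)


    print(satisfies("1.6.2"))   # True
    print(satisfies("1.4.0"))   # False
    print(satisfies("2.0.0"))   # False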

==================
cuTensorNet v1.1.0
==================

* New APIs and functionalities introduced:

  * A new API, `cutensornetContractionOptimizerInfoPackData()`, that allows users to serialize/pack the optimizerInfo in order to broadcast it to other ranks. Similarly, another new API for unpacking is provided, `cutensornetUpdateContractionOptimizerInfoFromPackedData()`. 
  * New APIs for creating and destroying slice group objects, which include `cutensornetCreateSliceGroupFromIDRange`, `cutensornetCreateSliceGroupFromIDs` and `cutensornetDestroySliceGroup`. These APIs, when combined with the packing/unpacking APIs above, allow users to employ the slicing technique to create independent tasks that can be run on multiple GPUs.
  * A new API, `cutensornetContractSlices()`, for the execution of the contraction. This will replace the `cutensornetContraction()` API, which is *deprecated* and will be removed in a future release.
  * An option to auto-tune intermediate modes through the `cutensornetContractionAutotune()` API, which helps improve network contraction performance. The functionality of this API call can be controlled with the `CUTENSORNET_CONTRACTION_AUTOTUNE_INTERMEDIATE_MODES` attribute.
  * An option to find a path that minimizes estimated time to solution (rather than FLOP count). This experimental feature can be controlled with the configuration attribute `CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_COST_FUNCTION_OBJECTIVE`.
  * An option to retrieve the mode labels for all intermediate tensors through the `CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_INTERMEDIATE_MODES` attribute of the contraction optimizer-info.

* Functionality/performance improvements:

  * Since near-optimal paths are easily found for small networks without simplification, and since simplification does not guarantee an optimal path, the simplification phase is now turned OFF by default when the simplified network is sufficiently small.
  * A new slicing algorithm has been developed, leading to potentially more efficient slicing solutions.
  * Improved contraction performance by optimizing intermediate mode-ordering.
  * Improved contraction performance of networks that have many singleton mode labels.

* Bugs fixed:

  * Previously, in rare circumstances, the slicing algorithm could fail to make progress toward finding a valid solution, resulting in an infinite loop. This has been fixed.
  * A bug in the deprecated `cutensornetContraction()` API that accepted sliceId >= numSlices.

* Other changes:

  * Provide a distributed (MPI-based) C sample that shows how easy it is to use cuTensorNet and create parallelism.
  * Update the (non-distributed) C sample by improving memory usage and employing the new contraction API `cutensornetContractSlices()`.
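
The idea of turning slices into independent per-GPU tasks can be sketched without the library. The helper below is hypothetical: it only splits slice IDs into contiguous, near-equal ranges (one per rank), which is the kind of range a slice group could be built from. How such a range maps onto the arguments of `cutensornetCreateSliceGroupFromIDRange` is not shown here and should be taken from the API reference.

.. code-block:: python

    def slice_ranges(num_slices: int, num_ranks: int) -> list:
        """Split slice IDs 0..num_slices-1 into contiguous (start, stop) ranges.

        ``stop`` is exclusive; the first ``num_slices % num_ranks`` ranks
        receive one extra slice so the load is as even as possible.
        """
        base, extra = divmod(num_slices, num_ranks)
        ranges = []
        start = 0
        for rank in range(num_ranks):
            count = base + (1 if rank < extra else 0)
            ranges.append((start, start + count))
            start += count
        return ranges


    print(slice_ranges(10, 4))   # [(0, 3), (3, 6), (6, 8), (8, 10)]

Each rank would then contract only its own range and the partial outputs would be summed across ranks, for example with an MPI reduction.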

==================
cuTensorNet v1.0.1
==================

* Bugs fixed:

  * A workspace pointer alignment issue.
  * A path optimizer issue that could cause `CUTENSORNET_STATUS_NOT_SUPPORTED` to be returned.

* Performance enhancements:

  * Improved support for generalized einsum expressions, yielding better contraction paths.

* Other changes:

  * The :doc:`./overview` and :doc:`./getting_started` pages are significantly improved!
  * Clarify in the documentation and sample that the contraction over slices needs to be done in ascending order, and that when parallelizing over the slices the output tensor should be zero-initialized.
  * Clarify in the documentation that the returned FLOP count assumes real-valued inputs.
  * Several issues in the C++ sample (`samples/cutensornet/tensornet_example.cu <https://github.com/NVIDIA/cuQuantum/blob/main/samples/cutensornet/tensornet_example.cu>`_) are fixed.

==================
cuTensorNet v1.0.0
==================

* Functionality/performance improvements:

  * Greatly reduced the workspace memory size required.
  * Reduced the execution time of the pathfinder with multithreading and internal optimization.
  * Support for hyperedges in tensor networks.
  * Support for tensor networks described by generalized Einstein summation expressions.

* Add new APIs and functionalities for:

  * Managing workspace (see :ref:`cuTensorNet workspace management API` for details).
  * Binding a user-provided, *stream-ordered* memory pool to the library (see :ref:`cuTensorNet memory management API` for details).
  * Querying the output tensor details (see `cutensornetGetOutputTensorDetails`).
  * Setting the number of threads for the hyperoptimizer (see :ref:`hyperoptimizer` for details).
  * Setting a logger callback with user-provided data (see `cutensornetLoggerSetCallbackData`).

* API changes:

  * Replaced `cutensornetContractionGetWorkspaceSize` with `cutensornetWorkspaceComputeSizes`.
  * `cutensornetCreateContractionPlan`, `cutensornetContractionAutotune`, and `cutensornetContraction` receive a workspace descriptor instead of workspace pointer and size parameters.
  * Renamed `cutensornetGraphAlgo_t` and `cutensornetMemoryModel_t` enumerations' options.

*Compatibility notes*:

* *cuTensorNet* requires CUDA 11.x.
* *cuTensorNet* requires cuTENSOR 1.5.0 or above.
* *cuTensorNet* requires OpenMP runtime (GOMP).
* *cuTensorNet* no longer requires NVIDIA HPC SDK.

*Limitation notes*:

* If multiple slices are created, the order of contracting over slices using `cutensornetContraction` should be ascending starting from slice 0. If parallelizing over slices manually (in any fashion: streams, devices, processes, etc.), please make sure the output tensors (that are subject to a global reduction) are zero-initialized.
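
The zero-initialization requirement can be seen in a small, library-free sketch. It assumes a toy contraction ``C[i][j] = sum_k A[i][k] * B[k][j]`` in which the mode ``k`` is sliced into two ranges; the function name and data are illustrative only and do not use *cuTensorNet*. Each slice adds its partial result into the output, so the output must start at zero for the sum over slices to equal the full contraction.

.. code-block:: python

    def accumulate_slice(out, a, b, k_range):
        """Add the contribution of one slice (a range of the sliced mode k) into out."""
        for i in range(len(a)):
            for j in range(len(b[0])):
                out[i][j] += sum(a[i][k] * b[k][j] for k in k_range)


    a = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
    b = [[1, 0],
         [0, 1],
         [1, 1],
         [2, 0]]

    # Mode k (extent 4) split into two slices, processed in ascending order.
    slices = [range(0, 2), range(2, 4)]

    result = [[0, 0], [0, 0]]            # zero-initialized output
    for k_range in slices:
        accumulate_slice(result, a, b, k_range)

    reference = [[0, 0], [0, 0]]
    accumulate_slice(reference, a, b, range(4))   # unsliced contraction

    print(result)                # [[12, 5], [28, 13]]
    print(result == reference)   # True

If ``result`` held leftover values instead of zeros, those values would be added to every element of the final output.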

==================
cuTensorNet v0.1.0
==================

* Initial public release
* Add support for ``Linux ppc64le``
* Add new APIs and functionalities for:

  * Fine-tuning the slicing algorithm
  * Reconfiguring a tensor network
  * Simplifying a tensor network
  * Optimizing pathfinder parameters using the hyperoptimizer
  * Retrieving the optimizer configuration parameters

* API changes:

  * ``cutensornetContractionGetWorkspace`` is renamed to `cutensornetContractionGetWorkspaceSize`
  * `cutensornetContractionAutotune`'s function signature has changed

*Compatibility notes*:

* *cuTensorNet* requires cuTENSOR 1.4.0 or above
* *cuTensorNet* requires NVIDIA HPC SDK 21.11 or above

==================
cuTensorNet v0.0.1
==================

* Initial release
* Support ``Linux x86_64`` and ``Linux Arm64``
* Support Volta and Ampere architectures (compute capability 7.0+)

*Compatibility notes*:

* *cuTensorNet* requires CUDA 11.4 or above
* *cuTensorNet* requires cuTENSOR 1.3.3 or above
* *cuTensorNet* supports NVIDIA HPC SDK 21.7 or above

*Limitation notes*:

* This release is optimized for NVIDIA A100 and V100 GPUs.