*************
Release Notes
*************

=================
cuStateVec v1.7.0
=================

* Added new API:

  * Distributed index bit swap API using semaphores for synchronization (see `custatevecSVSwapWorkerCreateWithSemaphore` and `custatevecSVSwapWorkerSetSubSVsP2PWithSemaphores`)

* Improved performance/functionality:

  * Improved the performance of `custatevecApplyGeneralizedPermutationMatrix` for the case where the input matrix is diagonal and its elements are 1.
  * Improved the performance of `custatevecSVSwapWorkerExecute` when called with MPI.
  * Updated `custatevecComputeExpectationBatched` to support device pointers for output expectation values.

* Other changes:

  * Dropped support for ``Linux ppc64le``.

=================
cuStateVec v1.6.0
=================

* Added new API:

   * Expectation values for batched state vectors (see `custatevecComputeExpectationBatched`)

* Improved performance/functionality:

   * Reduced API execution latency in `custatevecComputeExpectation`.
     Performance may improve for observables acting on 1 to 4 qubits.

* Resolved issue:

   * Fixed an issue where `custatevecBatchMeasure` could return incorrect results when ``nIndexBits > 20``.

*Compatibility notes*:

* cuQuantum will drop support for RHEL 7 in the next cuQuantum release. Please plan accordingly.

=================
cuStateVec v1.5.0
=================

* Added new API:

   * Migration of sub state vectors (see :doc:`Host state vector migration <./host-state-vector-migration>`)

* Improved performance/functionality:

   * Improved the performance of `custatevecApplyPauliRotation`.

* Resolved issues:

   * Fixed an issue where `custatevecMultiDeviceSwapIndexBits` accepted invalid index bit positions in the ``indexBitSwaps`` argument.

=================
cuStateVec v1.4.1
=================

* Resolve issues:

   * Fix an issue where `custatevecApplyMatrix` does not execute asynchronously when applying 6-qubit gate matrices to a state vector of the double-complex data type.
   * Fix an issue where `custatevecMeasureBatched` can fail on NVIDIA H100 with an "illegal instruction" error.

=================
cuStateVec v1.4.0
=================

* Add new APIs:

   * Gate application for batched state vectors (see `custatevecApplyMatrixBatched`)
   * Measurement for batched state vectors (see `custatevecAbs2SumArrayBatched`, `custatevecCollapseByBitStringBatched`, `custatevecMeasureBatched`)
   * State vector initialization to typical states (see `custatevecInitializeStateVector`)

* Resolve issues:

   * Fix for an issue on the Hopper Architecture wherein `custatevecApplyMatrix` could produce incorrect results with ``nIndexBits + nControls = 12``, ``nTargets = 5``, and ``svDataType = CUDA_C_32F``.

*Compatibility notes*:

* *cuStateVec* supports Ubuntu 20.04+.

=================
cuStateVec v1.3.0
=================

* Add new API:

  * Optimized state vector element swap algorithm on distributed state vectors (see the :doc:`Distributed Index Bit Swap<./distributed-index-bit-swap>` section)

* Improve performance/functionality:

  * Improved the performance of 5-qubit gate application with single precision and 6-qubit gate application with double precision 
    in `custatevecApplyMatrix` on the Hopper Architecture.
  * `CUDA Lazy Loading`_ is supported. It can significantly reduce the memory footprint by deferring the loading of GPU kernels until they are first called. This feature requires CUDA 11.8 or above; please refer to the CUDA documentation for other requirements and details. Currently, users must opt in by setting the environment variable ``CUDA_MODULE_LOADING=LAZY``. In a future CUDA version, lazy loading may become the default.

* Resolve issues:

  * Fix an issue in `custatevecMultiDeviceSwapIndexBits` where CUDA calls were not correctly ordered on streams allocated on multiple GPUs.

* Other changes:

  * Introduce support for CUDA 12.
  * A set of new wheels with the suffix ``-cu12`` is released on PyPI.org for CUDA 12 users.

    - Example: ``pip install custatevec-cu12`` for installing cuStateVec compatible with CUDA 12
    - The existing ``cuquantum`` wheel (without the ``-cuXX`` suffix) is turned into an automated installer
      that attempts to detect the current CUDA environment and install the appropriate wheels. Detection
      can fail, especially in a CPU-only environment (such as CI/CD). If detection fails, the installer
      assumes that the target environment is CUDA 11 and proceeds. This assumption may change in a future
      release, so we recommend that users explicitly (manually) install the correct wheels in such cases.
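
The fallback behavior described above can be sketched in Python. This is only an illustration, not the installer's actual code: the function name ``choose_custatevec_wheel`` and its ``cuda_major`` parameter are hypothetical, and only the ``-cu11``/``-cu12`` suffixes and the CUDA 11 fallback are taken from the notes above.

```python
from typing import Optional

# CUDA major versions for which the notes above mention wheels.
SUPPORTED_CUDA_MAJORS = (11, 12)


def choose_custatevec_wheel(cuda_major: Optional[int]) -> str:
    """Return the wheel name for a detected CUDA major version.

    ``cuda_major`` is None when detection failed (for example on a
    CPU-only CI machine); CUDA 11 is then assumed, as described above.
    """
    if cuda_major is None:
        cuda_major = 11  # documented fallback when detection fails
    if cuda_major not in SUPPORTED_CUDA_MAJORS:
        raise ValueError(f"no custatevec wheel known for CUDA {cuda_major}")
    return f"custatevec-cu{cuda_major}"


# A CPU-only environment where nothing was detected:
print(choose_custatevec_wheel(None))
```

Installing the suffixed wheel directly, for example ``pip install custatevec-cu12``, avoids relying on detection altogether.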

*Compatibility notes*:

* *cuStateVec* requires CUDA 11.x or 12.x.
* *cuStateVec* supports Ubuntu 18.04+.

  - In the next release, Ubuntu 18.04 will be dropped. The minimum supported Ubuntu version will be 20.04.
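
For Python users, the lazy-loading opt-in mentioned in this release can be sketched as follows. This is a minimal example under one assumption: the variable generally has to be set before CUDA is initialized in the process, so the helper should run before any CUDA-using library is imported. The helper name ``enable_cuda_lazy_loading`` is hypothetical; only the variable name and the ``LAZY`` value come from the notes above.

```python
import os


def enable_cuda_lazy_loading(env=None):
    """Opt in to CUDA lazy loading unless a mode is already set.

    Returns the value of CUDA_MODULE_LOADING that is in effect afterwards.
    """
    env = os.environ if env is None else env
    # setdefault leaves an explicit user choice (e.g. EAGER) untouched.
    return env.setdefault("CUDA_MODULE_LOADING", "LAZY")


# Run this before importing or initializing anything that uses CUDA.
mode = enable_cuda_lazy_loading()
print(f"CUDA_MODULE_LOADING={mode}")
```

Setting the variable in the shell or job script before launching the application achieves the same effect without touching the code.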

.. _CUDA Lazy Loading: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading

=================
cuStateVec v1.2.0
=================

* We are on `NVIDIA/cuQuantum GitHub Discussions <https://github.com/NVIDIA/cuQuantum/discussions>`_! For any questions regarding (or exciting work built upon) cuQuantum, please feel free to reach out to us on GitHub Discussions.

  * Bug reports should still go to `our GitHub issue tracker <https://github.com/NVIDIA/cuQuantum/issues>`_.

* This release introduces support for the Hopper GPU family.

* Improve performance/functionality:

  * Improve the performance of 4- and 5-qubit gate application in `custatevecApplyMatrix`.
  * Update `custatevecSamplerSample` to accept state vectors of more than 40 qubits.
  * Add `CUSTATEVEC_STATUS_DEVICE_ALLOCATOR_ERROR` for errors related to the user-provided device memory handler.

* Resolve issues:

  * Fix an issue where `custatevecMultiDeviceSwapIndexBits` could return wrong results for state vectors of 32 qubits or more.
  * Fix register spilling in `custatevecApplyGeneralizedPermutationMatrix` and `custatevecComputeExpectationsOnPauliBasis`.

* Other changes:

  * A conda package is released on conda-forge: ``conda install -c conda-forge custatevec``. Users can still obtain both *cuStateVec* and *cuTensorNet* with ``conda install -c conda-forge cuquantum``, as before.

  * A pip wheel is released on PyPI: ``pip install custatevec-cu11``. Users can still obtain both *cuStateVec* and *cuTensorNet* with ``pip install cuquantum``, as before.

    * Currently, the ``cuquantum`` meta-wheel points to the ``cuquantum-cu11`` meta-wheel (which then points to ``custatevec-cu11`` and ``cutensornet-cu11`` wheels). This may change in a future release when a new CUDA version becomes available. Using wheels with the ``-cuXX`` suffix is encouraged.

=================
cuStateVec v1.1.0
=================

* Add new API:

  * Optimized state vector element swap algorithm on multiple GPUs (see `custatevecMultiDeviceSwapIndexBits`)

* Improve performance/functionality:

  * Performance improvements of `custatevecApplyMatrix` for 4- and 5-qubit gate application with the complex128 data type
  * Performance improvements of `custatevecApplyGeneralizedPermutationMatrix` for 7-qubit or larger diagonal gate application

* Resolve issues:

  * Fix issues where `custatevecComputeExpectation` and `custatevecApplyGeneralizedPermutationMatrix` could return wrong results for state vectors of 32 qubits or more.
  * Fix a glibc symbol issue that disallowed the release of the ``cuquantum`` ppc64le package on conda-forge.

*Compatibility notes*:

* *cuStateVec* requires CUDA 11.x

*Limitation notes*:

* `custatevecMultiDeviceSwapIndexBits` can cause a segmentation fault if a device does not have peer-to-peer (P2P) access to another device.
  If a segmentation fault occurs during the API call, please check that direct access between each pair of devices has been enabled with `cudaDeviceEnablePeerAccess`.
* `custatevecMultiDeviceSwapIndexBits` could return `CUSTATEVEC_STATUS_INVALID_VALUE` if a handle created on the current device is not provided.
  Please refer to `custatevecMultiDeviceSwapIndexBits` for the details.
* ``CUSTATEVEC_STATUS_INTERNAL_ERROR`` may be returned if an incorrect device pointer is passed to a function.
  If a function returns ``CUSTATEVEC_STATUS_INTERNAL_ERROR``, please check that the pointer is correct and that the size is specified correctly.

=================
cuStateVec v1.0.0
=================

* Improve performance/functionality:
  
  * Gate application APIs are reoptimized:

    - The API execution latency of `custatevecApplyMatrix` is reduced.
      Performance with small state vectors may improve for 1- to 4-qubit matrix application in single precision and 1- to 5-qubit matrix application in double precision.
    - The API execution latency of `custatevecApplyGeneralizedPermutationMatrix` is reduced.
      Performance with small state vectors may improve for diagonal matrix cases.

* Resolve issues:

  * Multi-threading issues in `custatevecApplyMatrix`, `custatevecApplyGeneralizedPermutationMatrix`, and `custatevecComputeExpectationsOnPauliBasis` are fixed.
    All the cuStateVec APIs in this version are thread safe as long as each host thread has its own cuStateVec handle.

* Add new API:
  
  * Binding a user-provided, stream-ordered memory pool to the library (see the introduction for :ref:`workspace-label` and :ref:`cuStateVec memory management API` for details).
  * Extensions for the batch-measure and sampler APIs to accept state vector partitions across multiple GPUs
    (see `custatevecBatchMeasureWithOffset`, `custatevecSamplerGetSquaredNorm`, `custatevecSamplerApplySubSVOffset`)
  * Optimized state vector element swap algorithm on single GPU (see `custatevecSwapIndexBits`)
  * Testing whether a given matrix is Hermitian or unitary (see `custatevecTestMatrixType`)
  * Setting a logger callback with user-provided data (see `custatevecLoggerSetCallbackData`)

*API breaking changes*:

* The sampler and accessor descriptors are made completely opaque, just like the library handle `custatevecHandle_t`.
  For both descriptors there is a corresponding destructor API.
  Also, they are now passed by value in various routines. Now the C and Python APIs are unified.

* Some APIs are renamed as follows:

  ====================================================== ========================================================================
  previous version (< 1.0.0)                               new version (= 1.0.0)
  ====================================================== ========================================================================
  custatevecApplyMatrix_bufferSize                       `custatevecApplyMatrixGetWorkspaceSize`
  custatevecApplyExp                                     `custatevecApplyPauliRotation`
  custatevecApplyGeneralizedPermutationMatrix_bufferSize `custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize`
  custatevecExpectation_bufferSize                       `custatevecComputeExpectationGetWorkspaceSize`
  custatevecExpectation                                  `custatevecComputeExpectation`
  custatevecExpectationsOnPauliBasis                     `custatevecComputeExpectationsOnPauliBasis`
  custatevecSampler_create                               `custatevecSamplerCreate`
  custatevecSampler_preprocess                           `custatevecSamplerPreprocess`
  custatevecSampler_sample                               `custatevecSamplerSample`
  custatevecAccessor_create                              `custatevecAccessorCreate`
  custatevecAccessor_createReadOnly                      `custatevecAccessorCreateView`
  custatevecAccessor_setExtraWorkspace                   `custatevecAccessorSetExtraWorkspace`
  custatevecAccessor_set                                 `custatevecAccessorSet`
  custatevecAccessor_get                                 `custatevecAccessorGet`
  ====================================================== ========================================================================

* The arguments of the following APIs are reordered/renamed:
  
  * `custatevecApplyMatrix`
  * `custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize`
  * `custatevecApplyGeneralizedPermutationMatrix`
  * `custatevecComputeExpectationsOnPauliBasis`
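
When porting code written against a pre-1.0.0 release, the renames in the table above can be applied mechanically. The following Python sketch shows one way to do this; the helper name ``migrate_names`` is hypothetical, and the mapping is copied from the table. It rewrites identifiers only: calls to the APIs whose arguments were reordered or renamed still need to be reviewed by hand.

```python
import re

# Old (< 1.0.0) -> new (1.0.0) API names, copied from the table above.
RENAMES = {
    "custatevecApplyMatrix_bufferSize": "custatevecApplyMatrixGetWorkspaceSize",
    "custatevecApplyExp": "custatevecApplyPauliRotation",
    "custatevecApplyGeneralizedPermutationMatrix_bufferSize":
        "custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize",
    "custatevecExpectation_bufferSize": "custatevecComputeExpectationGetWorkspaceSize",
    "custatevecExpectation": "custatevecComputeExpectation",
    "custatevecExpectationsOnPauliBasis": "custatevecComputeExpectationsOnPauliBasis",
    "custatevecSampler_create": "custatevecSamplerCreate",
    "custatevecSampler_preprocess": "custatevecSamplerPreprocess",
    "custatevecSampler_sample": "custatevecSamplerSample",
    "custatevecAccessor_create": "custatevecAccessorCreate",
    "custatevecAccessor_createReadOnly": "custatevecAccessorCreateView",
    "custatevecAccessor_setExtraWorkspace": "custatevecAccessorSetExtraWorkspace",
    "custatevecAccessor_set": "custatevecAccessorSet",
    "custatevecAccessor_get": "custatevecAccessorGet",
}

# Word boundaries ensure that, for example, custatevecExpectation does not
# match inside custatevecExpectation_bufferSize.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_names(source: str) -> str:
    """Replace every pre-1.0.0 API name in ``source`` with its 1.0.0 name."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], source)


print(migrate_names("custatevecSampler_create(handle, sv);"))
```

Names that are not in the table, including the already renamed 1.0.0 names, are left unchanged, so running the helper twice on the same file is harmless.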

*Compatibility notes*:

* *cuStateVec* requires CUDA 11.x

*Limitation notes*:

* ``CUSTATEVEC_STATUS_INTERNAL_ERROR`` may be returned if an incorrect device pointer is passed to a function. If a function returns ``CUSTATEVEC_STATUS_INTERNAL_ERROR``, please check that the pointer is correct and that the size is specified correctly.

=================
cuStateVec v0.1.1
=================

* Support for the NVIDIA cuQuantum Appliance (see :doc:`here <../appliance/index>`):

  * Extensions for the batch-measure and sampler APIs to accept state vector partitions across multiple GPUs
  * Optimized state vector element swap algorithm for multiple GPUs
  * Note: the multi-GPU features & optimizations are currently available only in the cuQuantum Appliance

=================
cuStateVec v0.1.0
=================

* Add support for ``Linux ppc64le``
* Add new APIs:

  * Gate application for generalized permutation matrices
  * Expectation values of Pauli strings
  * Accessor to get/set state vector elements

*Compatibility notes*:

* *cuStateVec* requires CUDA 11.4 or above
* *cuStateVec* requires NVIDIA HPC SDK 21.11 or above

*Limitation notes*:

* ``CUSTATEVEC_STATUS_INTERNAL_ERROR`` may be returned if an incorrect device pointer is passed to a function. If a function returns ``CUSTATEVEC_STATUS_INTERNAL_ERROR``, please check that the pointer is correct and that the size is specified correctly.

=================
cuStateVec v0.0.1
=================

* Initial release
* Support ``Linux x86_64``, ``Linux Arm64``
* Support Volta and Ampere architectures (compute capability 7.0+)

*Compatibility notes*:

* *cuStateVec* requires CUDA 11.4 or above
* *cuStateVec* requires NVIDIA HPC SDK 21.7 or above

*Limitation notes*:

* This release is optimized for NVIDIA A100 and V100 GPUs.
* ``CUSTATEVEC_STATUS_INTERNAL_ERROR`` may be returned if an incorrect device pointer is passed to a function. If a function returns ``CUSTATEVEC_STATUS_INTERNAL_ERROR``, please check that the pointer is correct and that the size is specified correctly.
* Performance optimization is planned in future releases.