Release Notes

cuTensorNet v2.12.2

Bugs fixed:
- Fixed a correctness issue where cutensornetExpectationComputeWithGradientsBackward() would produce incorrect gradients when the tensor network operator expansion contains terms with complex-valued coefficients. This did not affect operators with purely real coefficients.

cuTensorNet v2.12.1

Bugs fixed:
- Fixed a correctness bug where cutensornetStateApplyTensorOperator() with adjoint=1 applied element-wise complex conjugation instead of the documented Hermitian conjugate (complex conjugation with ket/bra mode transposition).
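
The difference between the two operations can be illustrated on a small square matrix. The following is a minimal sketch in plain C, not library code: `cplx_t`, `conjugate_only()` and `hermitian_conjugate()` are hypothetical helpers, and the operator is assumed to be stored as a row-major n x n matrix whose rows stand for the ket modes and whose columns stand for the bra modes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal complex type, used to keep the sketch dependency-free. */
typedef struct { double re, im; } cplx_t;

static cplx_t cplx_conj(cplx_t z)
{
    cplx_t r = { z.re, -z.im };
    return r;
}

/* Element-wise complex conjugation only: out[i][j] = conj(in[i][j]). */
void conjugate_only(size_t n, const cplx_t *in, cplx_t *out)
{
    assert(in != out);
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            out[i * n + j] = cplx_conj(in[i * n + j]);
}

/* Hermitian conjugate: conjugation plus transposition, out[i][j] = conj(in[j][i]). */
void hermitian_conjugate(size_t n, const cplx_t *in, cplx_t *out)
{
    assert(in != out);
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            out[i * n + j] = cplx_conj(in[j * n + i]);
}
```

The two results coincide only when the matrix is symmetric, so an operator with different off-diagonal elements is enough to tell the two behaviors apart.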

cuTensorNet v2.12.0

New functionalities:
- New API cutensornetExpectationComputeWithGradientsBackward() for computing the tensor network state expectation value \(E = \langle\psi|H|\psi\rangle\) and its gradients \(\partial E/\partial G_j\) with respect to tensor operators \(G_j\) in a single call. Tensor operators that participate in gradient computation must be applied via cutensornetStateApplyTensorOperatorWithGradient() before calling cutensornetExpectationPrepare(). After prepare, call cutensornetExpectationComputeWithGradientsBackward() (instead of cutensornetExpectationCompute()) with SCRATCH workspace set; gradients are written to the device buffers specified at apply time.
  
  Limitations:
  - Distributed execution is not supported. Using cutensornetExpectationComputeWithGradientsBackward() in a distributed (e.g. MPI) context will return an error.
  - All gates in the circuit must be unitary. If any gate is non-unitary, cutensornetExpectationPrepare() or cutensornetExpectationComputeWithGradientsBackward() returns CUTENSORNET_STATUS_NOT_SUPPORTED.
  - The state norm and state norm adjoint arguments must be NULL; otherwise CUTENSORNET_STATUS_NOT_SUPPORTED is returned.
- New Projection MPS APIs:
  - cutensornetStateProjectionMPSUpdateCoefficients() allows updating the scalar coefficients of the linear superposition of tensor network states after the projection MPS instance has been created, enabling iterative algorithms where coefficients change over time.
  - cutensornetStateProjectionMPSUpdateDualTensors() allows updating the data pointers for all dual MPS tensors after the projection MPS instance has been created.
Bugs fixed:
- Fixed a possible out-of-bounds access in cutensornetGetOutputStateDetails().
- Fixed a functional bug for State API MPS simulations using CUTENSORNET_STATE_MPS_GAUGE_SIMPLE gauge option when MPS tensors with overcomplete extents are provided.

Compatibility notes:

The strides of MPS output state tensors returned by cutensornetGetOutputStateDetails() may differ from those returned in previous releases. The library does not guarantee any particular stride ordering (e.g., row-major or column-major) for output state tensors, and the returned strides may vary across releases, configurations, or successive calls. Callers should always use the strides returned by cutensornetGetOutputStateDetails() when accessing output tensor data rather than assuming a specific memory layout.
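
Stride-based addressing as recommended above can be sketched as follows. This is an illustrative helper in plain C, not part of the library: `element_offset()` is a hypothetical name, and the strides are assumed to be expressed in elements (not bytes), one per tensor mode.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: linear element offset of a multi-index for a tensor
 * with arbitrary strides. No row-major or column-major layout is assumed;
 * the strides are used exactly as provided. */
int64_t element_offset(int32_t num_modes, const int64_t *index, const int64_t *strides)
{
    int64_t offset = 0;
    for (int32_t m = 0; m < num_modes; ++m) {
        assert(index[m] >= 0);
        offset += index[m] * strides[m];
    }
    return offset;
}
```

For a tensor with extents {2, 3, 2}, the multi-index {1, 0, 0} maps to offset 1 with strides {1, 2, 6} and to offset 6 with strides {6, 2, 1}, which is why code that hard-codes one layout can silently read the wrong elements.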

cuTensorNet requires cuTENSOR v2.5.0 or above.

cuTensorNet v2.11.0

New functionalities:
- New tensor network state API cutensornetStateApplyDiagonalTensorOperator() for applying diagonal tensor operators. This feature works for both Exact TN and MPS simulations.
Bugs fixed:
- Fixed a bug where cutensornetStateCompute() may return incorrect results for exact MPS simulation on qudit systems (qudits with different extents on different modes) with operators acting on non-adjacent qudits.
Performance enhancements:
- Improved performance of cutensornetExpectationPrepare() for network operators that contain a large number of full one-body product terms.

cuTensorNet v2.10.1

Bugs fixed:
- Fixed a bug where cutensornetStatePrepare() may return insufficient workspace required by cutensornetStateCompute() when MPS simulation is performed with fixed bond truncation.
- Fixed a bug where cutensornetNetworkAutotuneContraction() may fail with “insufficient workspace” error even with enough workspace provided.
- Fixed a bug where cutensornetStateCompute() may fail with “internal error” due to insufficient workspace.
- Fixed a bug where cutensornetContractionOptimize() may cause a segmentation fault for certain networks where some modes have extent 1.

Known issues:

In certain MPS simulations with value-based truncation, the workspace size returned by cutensornetStatePrepare() may be lower than the actual required workspace size.

cuTensorNet v2.10.0

New functionalities:
- Enabled the network construction and path generation suites of APIs to run on machines with no GPU device. All dependencies remain the same.

Known issues:

Applying two-qudit gates to non-adjacent, non-homogeneous qudits in MPS simulations with exact SVD enabled may cause representational errors or undefined behavior.

cuTensorNet v2.9.1

Bugs fixed:
- Fixed a bug where cutensornetStatePrepare() may return insufficient workspace required by cutensornetStateCompute() when MPS simulation is performed with value based truncation.
- Fixed a bug where using device memory handler with cutensornetNetworkComputeGradientsBackward() may fail to grab enough scratch memory from the memory pool for large networks.

cuTensorNet v2.9.0

New functionalities:
- Introduced the network-centric API for tensor network definition and execution with lighter signatures and simpler semantics:
  - New flow: define a network with cutensornetCreateNetwork, append inputs via cutensornetNetworkAppendTensor, set output with cutensornetNetworkSetOutputTensor, then prepare and run using cutensornetNetworkPrepareContraction and cutensornetNetworkContract.
  - Gradient workflow is explicitly separated: set adjoint and per-input gradient buffers via cutensornetNetworkSetAdjointTensorMemory and cutensornetNetworkSetGradientTensorMemory, prepare with cutensornetNetworkPrepareGradientsBackward, and execute with cutensornetNetworkComputeGradientsBackward.
  - Workspace management for the network flow is unified via cutensornetWorkspaceSetMemory with CUTENSORNET_WORKSPACE_SCRATCH and CUTENSORNET_WORKSPACE_CACHE kinds.
  - The optimizer round-trip is maintained: create config/info, call cutensornetContractionOptimize, and attach to the network via cutensornetNetworkSetOptimizerInfo.
- The legacy descriptor/plan-based API remains available but is deprecated. Prefer the network-centric API moving forward.
Bugs fixed:
- Fixed a bug where repeated calls to cutensornetStateSetInitialMPS with the same cutensornetState_t object led to a segmentation fault.
- Fixed a bug in cutensornetCreateStateProjectionMPS() which led to an integer overflow error for states in Hilbert spaces larger than \(2^{64}\).
- Fixed a bug in cutensornetStateUpdateTensorOperator() when applied to controlled tensor operators when the underlying computation is not based on matrix product states.
- Fixed a memory corruption in cutensornetStateCompute() for certain matrix product states simulation with CUTENSORNET_STATE_MPS_GAUGE_SIMPLE gauge option.
- Fixed a bug in cutensornetNetworkComputeGradientsBackward() where an illegal memory access was encountered when a gradient tensor memory size is less than 256 bytes.
Known issues:
- For networks with symmetric tensors (tensors that include duplicate modes), cuTensorNet does not update the off-diagonal elements when computing the gradient of that tensor. The user must therefore pre-set the gradient memory of that tensor to 0 before calling cutensornetNetworkComputeGradientsBackward().
- In rare cases, using device memory handler with cutensornetNetworkComputeGradientsBackward() may fail to grab enough scratch memory from the memory pool for large networks. The user is advised to provide scratch memory buffers directly to cutensornetWorkspaceSetMemory() in such cases.
Other changes:
- Contraction optimizer attribute CUTENSORNET_OPTIMIZER_COST_TIME_TUNED falls back to the behaviour of CUTENSORNET_OPTIMIZER_COST_TIME.
- Enabled support for symmetric projection states in cutensornetCreateStateProjectionMPS().
- Enabled support for cache workspace for State projection MPS API, providing improved performance for repeated calls to cutensornetStateProjectionMPSComputeTensorEnv() for the same region.

Compatibility notes:

cuTensorNet requires cuTENSOR v2.3.1 or above.

cuTensorNet v2.8.0

New functionalities:
- Users can now update the target data of a multi-controlled tensor operator using cutensornetStateUpdateTensorOperator() and recompute.
- New Projection Matrix Product State API for creating and manipulating a cutensornetStateProjectionMPS_t. This functionality allows users to express variational algorithms that approximate quantum circuits, provided in the form of cutensornetState_t, as matrix product states.
Bugs fixed:
- Fixed a bug in the sampling functionality cutensornetSamplerSample() where setting CUTENSORNET_SAMPLER_CONFIG_DETERMINISTIC could lead to incorrect results for large tensor network states.
- Fixed a bug affecting general einsum expressions and networks that contain hyperedge indices that are part of the output tensor. The bug manifested as CUTENSORNET_STATUS_ALL_HYPER_SAMPLES_FAILED and, when logging was enabled, was reported as the exception EXCEPTION: internal error einsum.

cuTensorNet v2.7.0

New functionalities:
- New attribute CUTENSORNET_STATE_CONFIG_MPS_GAUGE_OPTION, which allows users to choose between the gauge options offered in the new enum cutensornetStateMPSGaugeOption_t for Matrix Product State (MPS) simulations. Users are encouraged to take advantage of the simple update algorithm via CUTENSORNET_STATE_MPS_GAUGE_SIMPLE for improved accuracy whenever supported.
- New API cutensornetStateApplyGeneralChannel(), which allows users to apply a general channel to the tensor network state. Subsequent computation can be performed using the MPS approach with the CUTENSORNET_STATE_MPS_GAUGE_FREE method. Note that contraction-based tensor network simulation is currently not supported with this API.
Bugs fixed:
- Fixed a memory leak on specific hardware.
Performance enhancements:
- Improved performance of the sampling functionalities cutensornetSamplerSample(). On average, more than 5X speedup was observed on many problems.
Other changes:
- Support for Blackwell architecture.

Known issues:

cutensornetStateFinalizeMPS() currently does not support MPS simulation for systems with the state dimension being 1.

Compatibility notes:

cuTensorNet requires cuTENSOR v2.2.0 or above.

cuTensorNet v2.6.0

New functionalities:
- New API cutensornetStateCaptureMPS(), which allows users to reset the tensor network state to the MPS state previously computed via cutensornetStateCompute().
- New API cutensornetStateApplyUnitaryChannel(), which allows users to apply a unitary channel to the tensor network state. Subsequent computation can be performed using either the contraction-based approach or the MPS approach, as usual.
Bugs fixed:
- Fixed a bug in large networks when the workspace size exceeds the int32 limit. In previous versions, the code returned the error “no path can be found” in this situation. With the fix, the pathfinder continues trying to find a path with slicing.
- Fixed a potential deadlock in the pathfinder ThreadPool.
Other changes:
- Improved the accuracy of MPS simulations using cutensornetState_t object when an initial MPS is set using the cutensornetStateInitializeMPS() API.
- Dropped support for Linux ppc64le.

cuTensorNet v2.5.0

New functionalities:
- New attribute CUTENSORNET_SAMPLER_CONFIG_DETERMINISTIC which, when set to a positive value, makes cutensornetStateSampler_t produce the same results upon each run.
Bugs fixed:
- Fixed a bug in large network planning leading to a segfault.
- Fixed a bug where gradient computation may fail due to insufficient workspace.
- Fixed a corner case bug where contraction results may be incorrect.
- Fixed failed contractions with compute type CUTENSORNET_COMPUTE_3XTF32.
- Fixed the false GPU workspace shortage failure that could show up for certain quantum circuits.
- Fixed the ability to call non-compute API with a different active GPU device than the one associated with the library handle.
- Fixed the check on tensor operator mutability during the tensor operator data update operation.
- Fixed a corner case bug where illegal memory access was encountered.
- Fixed an issue where the cutensornetState_t object in MPS simulations would throw an error when cutensornetStateCompute() was called more than once after cutensornetStatePrepare() step, even if there were no structural changes. Users can now call cutensornetStateCompute() multiple times on the same cutensornetState_t object with just one preceding cutensornetStatePrepare() step without encountering errors.
Performance enhancements:
- Improved performance of contraction and gradient computations (cutensornetContractSlices() and cutensornetComputeGradientsBackward(), respectively) when the user allocates workspace buffers in host memory.
Other changes:
- Lowered CPU overhead of several cuTensorNet functions.
- Enabled support of tensor operators using real data types (CUDA_R_32F and CUDA_R_64F) in the State API.

Compatibility notes:

cuTensorNet requires cuTENSOR v2.0.2 or above.

cuTensorNet v2.4.0

New functionalities:
- New tensor network state API cutensornetNetworkOperatorAppendMPO() for defining the Matrix Product Operator (MPO) inside a tensor network operator object, which can then be applied to a quantum circuit state via the new API cutensornetStateApplyNetworkOperator(). Users may also use existing APIs centered around cutensornetStateExpectation_t to compute the expectation value of the MPO for a given quantum circuit state.
- New tensor network state APIs CUTENSORNET_STATE_CONFIG_MPS_MPO_APPLICATION and cutensornetStateMPOApplication_t for more options regulating the application of an MPO to an MPS quantum state.
- New tensor network state API cutensornetStateInitializeMPS() for specifying the initial quantum circuit state as an MPS that will undergo a subsequent application of tensor operators.
- New tensor network state API cutensornetStateApplyControlledTensorOperator() for applying controlled and multi-controlled single-target tensor gates.
- New tensor network state APIs cutensornetStateGetInfo(), cutensornetAccessorGetInfo(), cutensornetExpectationGetInfo(), cutensornetMarginalGetInfo() and cutensornetSamplerGetInfo() that allow users to query information for the corresponding objects. Currently, these APIs can be used in conjunction with corresponding new enum values CUTENSORNET_STATE_INFO_FLOPS, CUTENSORNET_ACCESSOR_INFO_FLOPS, CUTENSORNET_EXPECTATION_INFO_FLOPS, CUTENSORNET_MARGINAL_INFO_FLOPS and CUTENSORNET_SAMPLER_INFO_FLOPS to get the flops for the computation.
- New compute type CUTENSORNET_COMPUTE_3XTF32, which offers better precision than CUTENSORNET_COMPUTE_TF32.
Bugs fixed:
- Fix the failing gradient computation when the involved input tensors are conjugated.
- Fix a bug that leads to CUTENSORNET_STATE_MPS_SVD_CONFIG_DISCARDED_WEIGHT_CUTOFF being ignored during the MPS state preparation/computation if cutensornetStateFinalizeMPS() is called after cutensornetStateConfigure().

Other changes:

A new API, cutensornetStateApplyTensorOperator(), replaces the cutensornetStateApplyTensor() API, which is deprecated and will be removed in a future release.
A new API, cutensornetStateUpdateTensorOperator(), replaces the cutensornetStateUpdateTensor() API, which is deprecated and will be removed in a future release.

Several enum values are introduced, while deprecating old enum values, for better consistency:

Old	New
`CUTENSORNET_STATE_MPS_CANONICAL_CENTER`	`CUTENSORNET_STATE_CONFIG_MPS_CANONICAL_CENTER`
`CUTENSORNET_STATE_MPS_SVD_CONFIG_ABS_CUTOFF`	`CUTENSORNET_STATE_CONFIG_MPS_SVD_ABS_CUTOFF`
`CUTENSORNET_STATE_MPS_SVD_CONFIG_REL_CUTOFF`	`CUTENSORNET_STATE_CONFIG_MPS_SVD_REL_CUTOFF`
`CUTENSORNET_STATE_MPS_SVD_CONFIG_S_NORMALIZATION`	`CUTENSORNET_STATE_CONFIG_MPS_SVD_S_NORMALIZATION`
`CUTENSORNET_STATE_MPS_SVD_CONFIG_ALGO`	`CUTENSORNET_STATE_CONFIG_MPS_SVD_ALGO`
`CUTENSORNET_STATE_MPS_SVD_CONFIG_ALGO_PARAMS`	`CUTENSORNET_STATE_CONFIG_MPS_SVD_ALGO_PARAMS`
`CUTENSORNET_STATE_MPS_SVD_CONFIG_DISCARDED_WEIGHT_CUTOFF`	`CUTENSORNET_STATE_CONFIG_MPS_SVD_DISCARDED_WEIGHT_CUTOFF`
`CUTENSORNET_STATE_NUM_HYPER_SAMPLES`	`CUTENSORNET_STATE_CONFIG_NUM_HYPER_SAMPLES`
`CUTENSORNET_ACCESSOR_OPT_NUM_HYPER_SAMPLES`	`CUTENSORNET_ACCESSOR_CONFIG_NUM_HYPER_SAMPLES`
`CUTENSORNET_EXPECTATION_OPT_NUM_HYPER_SAMPLES`	`CUTENSORNET_EXPECTATION_CONFIG_NUM_HYPER_SAMPLES`
`CUTENSORNET_MARGINAL_OPT_NUM_HYPER_SAMPLES`	`CUTENSORNET_MARGINAL_CONFIG_NUM_HYPER_SAMPLES`
`CUTENSORNET_SAMPLER_OPT_NUM_HYPER_SAMPLES`	`CUTENSORNET_SAMPLER_CONFIG_NUM_HYPER_SAMPLES`

The old enum values still exist and are functional, but they are considered deprecated and subject to removal in a future release.

Compatibility notes:

cuTensorNet requires cuTENSOR v2.0.1 or above.
cuQuantum will drop support for RHEL 7 in the following cuQuantum release. Please plan ahead with this in mind. Thank you.

Known issues:

For the MPS computations based on cutensornetStateFinalizeMPS() APIs, if the state has different extents on different modes and there are operators applied to two non-adjacent modes, the exact MPS factorization may not be computed.

cuTensorNet v2.3.0

New functionalities:
- New tensor network state APIs for defining tensor network operators and computing their expectation values over given tensor network states.
  - See the introduction at Tensor network state specification and processing.
- New tensor network state APIs for computing arbitrary slices of amplitudes for a given tensor network state.
- New tensor network state APIs for computing the Matrix-Product-State (MPS) factorization of a given tensor network state.
- New truncation option CUTENSORNET_TENSOR_SVD_CONFIG_DISCARDED_WEIGHT_CUTOFF for tensor SVD computation.
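
A truncation rule based on discarded weight can be sketched as below. This is only an illustration in plain C under stated assumptions, not the library's implementation: the discarded weight is assumed to be the sum of squares of the discarded singular values divided by the sum of squares of all singular values, the singular values are assumed to be sorted in descending order, and `retained_count()` is a hypothetical helper that always keeps at least one value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of leading singular values to keep so that the
 * discarded weight (assumed definition: sum of squared discarded values over
 * sum of all squared values) stays strictly below `cutoff`.
 * `s` must be sorted in descending order. */
size_t retained_count(size_t n, const double *s, double cutoff)
{
    assert(n > 0 && cutoff >= 0.0);

    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += s[i] * s[i];
    if (total == 0.0)
        return n; /* nothing meaningful to truncate */

    size_t keep = n;
    double discarded = 0.0;
    while (keep > 1) {
        double next = discarded + s[keep - 1] * s[keep - 1];
        if (next / total >= cutoff)
            break; /* dropping one more value would reach the cutoff */
        discarded = next;
        --keep;
    }
    return keep;
}
```

With singular values {2, 1, 0.5, 0.25} and a cutoff of 0.1, the two smallest values carry a combined weight of about 0.059 and are dropped, while dropping the third would raise the discarded weight to about 0.247.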
Bugs fixed:
- Fix a bug that occurred when the automatic distributed contraction path optimization is invoked with TIME as the cost function; the fix ensures that the optimal path is chosen.
- Fix a bug for potentially inconsistent library handle management when cutensornetWorkspaceComputeSVDSizes(), cutensornetWorkspaceComputeQRSizes() and cutensornetWorkspaceComputeGateSplitSizes() are called.
- Fix a bug for cutensornetTensorQR() when the combined matrix row/column extent of the input tensor equals 1.
- Fix a performance bug when the tensor network contraction path finder is run by multiple processes on the same node.
Other changes:
- Complex-valued gradients computed with the experimental API cutensornetComputeGradientsBackward are complex conjugated now, as compared to what would be returned in the previous release.

Compatibility notes:

cuTensorNet requires cuTENSOR v1.6.1 or above, but is not compatible with v2.x.y. cuTENSOR v1.7.0 is recommended, for performance improvements, bug fixes, and the CUDA Lazy Loading support.

cuTensorNet v2.2.1

Bugs fixed:
- Fix a regression leading to a “not supported” error for unary (single-operand) contractions.

cuTensorNet v2.2.0

New functionalities:
- New experimental API cutensornetComputeGradientsBackward for computing gradients of a tensor network w.r.t. its input tensors.
  - Known limitations: operates on tensor networks with a single slice and no singleton modes, on a single GPU device.
- New tensor network state APIs for facilitating definition of tensor network states, computing arbitrary marginal distributions and performing sampling of those states (support of arbitrary tensor states is provided, for example, qudit-based tensor states).
  - See the introduction at Tensor network state specification and processing.
- New API cutensornetWorkspacePurgeCache() to purge workspace cache.
- New APIs to set/get network attributes.
- New APIs to support more SVD algorithms including GESVD (default), GESVDJ, GESVDP and GESVDR. The SVD algorithm can be set via one call to cutensornetTensorSVDConfigSetAttribute() with the attribute CUTENSORNET_TENSOR_SVD_CONFIG_ALGO. For GESVDJ and GESVDR, user may further set algorithm specific parameters with the attribute CUTENSORNET_TENSOR_SVD_CONFIG_ALGO_PARAMS using new structs cutensornetGesvdjParams_t and cutensornetGesvdrParams_t, respectively.
- New APIs to provide more runtime information for SVD execution in cutensornetTensorSVDInfo_t. The SVD algorithm used can be accessed via one call to cutensornetTensorSVDInfoGetAttribute() with the attribute CUTENSORNET_TENSOR_SVD_INFO_ALGO. For GESVDJ and GESVDP, user may further query execution status with the attribute CUTENSORNET_TENSOR_SVD_INFO_ALGO_STATUS using new structs cutensornetGesvdjStatus_t and cutensornetGesvdpStatus_t, respectively.
- New API, via the CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_CACHE_REUSE_NRUNS attribute, that enables the optimizer to factor in the constant input tensors benefit when a path is run multiple times.
- New API to toggle “smart” optimization settings with attribute CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_SMART_OPTION. This option (turned on by default) will limit the pathfinder elapsed time by avoiding certain configurations as well as adjusting configuration on the fly. The path quality can differ from when the option is turned off. To restore the previous behavior, users should set this to off.
Performance enhancements:
- Improved performance of the contraction path optimization process (e.g., pathfinding). Speedup depends on the tensor network size. A large speedup >10x can be observed for large networks. For medium-size networks (hundreds of tensors) a speedup of almost 5x was observed.
Bugs fixed:
- Failed tensor network contraction involving constant input tensors, in some corner cases, when not enough cache memory was available.

Compatibility notes:

cuTensorNet supports Ubuntu 20.04+.

Known issues:

SVD algorithm CUTENSORNET_TENSOR_SVD_ALGO_GESVDJ may exhibit lower accuracy than default CUTENSORNET_TENSOR_SVD_ALGO_GESVD.
Execution of cutensornetTensorSVD() and cutensornetGateSplit() may potentially fail for certain input tensor operands when SVD algorithm is set to CUTENSORNET_TENSOR_SVD_ALGO_GESVDR or CUTENSORNET_TENSOR_SVD_ALGO_GESVDP.
Under single precision, when the input tensor/matrix has a low rank, CUTENSORNET_TENSOR_SVD_ALGO_GESVDR based tensor SVD may suffer from reduced accuracy.
When the SVD algorithm is set to CUTENSORNET_TENSOR_SVD_ALGO_GESVDP, the user is responsible for checking the CUTENSORNET_TENSOR_SVD_INFO_ALGO_STATUS attribute of the SVDInfo object, with the corresponding struct cutensornetGesvdpStatus_t, to monitor the convergence.

cuTensorNet v2.1.0

New functionalities:
- Support for caching intermediate tensors for subsequent reuse in repeated tensor network contractions. This feature results in a substantial speedup when users perform more than one execution of a tensor network contraction in which a large fraction of the input tensors stays constant while the rest update their values. For example, computing amplitudes of individual bit-strings or small batches of bit-strings can benefit from this feature. Users can specify which tensors are constant. cuTensorNet then uses this information to build internal data structures that cache constant intermediate tensors for reuse in repeated executions of the tensor network contraction plan. Note that if all input tensors are marked constant, the output tensor becomes constant as well, so there is no benefit to contracting the network again. In that case the caching mechanism is not triggered, and repeated contractions incur the same execution time.
Bugs fixed:
- Failure of cutensornetTensorQR() when users provide a customized memory pool to compute the QR factorization of double complex data with certain extent combinations.
- Failed autotune in some corner cases with “insufficient workspace” error.
- Failed execution of cutensornetTensorSVD() when all singular values are trimmed out. For cuTensorNet v2.1.0, one singular value will be retained in the output for such cases. This behavior may be subject to change in a future release.
Other changes:
- The cuTensorNet-MPI wrapper library (libcutensornet_distributed_interface_mpi.so) needs to be linked to the MPI library libmpi.so. If you use our conda-forge packages or cuQuantum Appliance container, or compile your own using the provided activate_mpi.sh script, this is taken care of for you.
- Introduce support for CUDA 12.
- A set of new wheels with suffix -cu12 are released on PyPI.org for CUDA 12 users.
  - Example: pip install cutensornet-cu12 for installing cuTensorNet compatible with CUDA 12.
  - The existing cuquantum wheel (without the -cuXX suffix) is turned into an automated installer that will attempt to detect the current CUDA environment and install the appropriate wheels. Please note that this automated detection may encounter conditions under which detection is unsuccessful, especially in a CPU-only environment (such as CI/CD). If detection fails we assume that the target environment is CUDA 11 and proceed. This assumption may be changed in a future release, and in such cases we recommend that users explicitly (manually) install the correct wheels.
Performance enhancements:
- CUDA Lazy Loading is supported. This can significantly reduce memory footprint by deferring the loading of needed GPU kernels to the first call sites. This feature requires CUDA 11.8 (or above) and cuTENSOR 1.7.0 (or above). Please refer to the CUDA documentation for other requirements and details. Currently this feature requires users to opt in by setting the environment variable CUDA_MODULE_LOADING=LAZY. In a future CUDA version, lazy loading may become the default.
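
The opt-in is normally done in the shell before launching the application. As an alternative, the variable can be set from inside the process, as in the minimal sketch below. It assumes a POSIX system (`setenv`) and that the call happens before anything initializes CUDA in the process; `enable_cuda_lazy_loading()` is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200112L /* for setenv() */
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: opt in to CUDA Lazy Loading for this process.
 * Assumption: must run before the first CUDA or cuTensorNet call, since the
 * variable is expected to be read when CUDA initializes.
 * Returns 0 on success and -1 on failure, like setenv(). */
int enable_cuda_lazy_loading(void)
{
    /* overwrite = 0 keeps any value the user already exported. */
    return setenv("CUDA_MODULE_LOADING", "LAZY", 0);
}
```

Passing 0 as the last argument is a deliberate choice: a value already exported by the user (for example to disable lazy loading while debugging) takes precedence over the in-process default.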

Compatibility notes:

cuTensorNet requires cuTENSOR 1.6.1 or above, but cuTENSOR 1.7.0 or above is recommended, for performance improvements, bug fixes, and the CUDA Lazy Loading support.
cuTensorNet supports Ubuntu 18.04+.
- In the next release, Ubuntu 18.04 will be dropped. The minimum supported Ubuntu version will be 20.04.

cuTensorNet v2.0.0

We are on NVIDIA/cuQuantum GitHub Discussions! For any questions regarding (or exciting works built upon) cuQuantum, please feel free to reach out to us on GitHub Discussions.
- Bug reports should still go to our GitHub issue tracker.
Major release:
- A conda package is released on conda-forge: conda install -c conda-forge cutensornet. Users can still obtain both cuTensorNet and cuStateVec with conda install -c conda-forge cuquantum, as before.
- A pip wheel is released on PyPI: pip install cutensornet-cu11. Users can still obtain both cuTensorNet and cuStateVec with pip install cuquantum, as before.
  - Currently, the cuquantum meta-wheel points to the cuquantum-cu11 meta-wheel (which then points to cutensornet-cu11 and custatevec-cu11 wheels). This may change in a future release when a new CUDA version becomes available. Using wheels with the -cuXX suffix is encouraged.
New functionalities:
- Initial support for Hopper users. This requires CUDA 11.8.
- New APIs to create, query, and destroy tensor descriptor objects.
- New APIs and functionalities for approximate tensor network algorithms. cuTensorNet now supports the computational primitives mentioned below to enable users to develop approximate tensor network simulators for quantum circuits including MPS, PEPS, and more:
  - Tensor decomposition via QR or SVD. Both exact and truncated SVD supported.
  - Application of a gate to a pair of connected tensors followed by compression.
- New APIs to create, tune, query, and destroy tensor SVD truncation settings.
- New APIs to create, query, and destroy runtime tensor SVD truncation information.
- Automatic distributed execution: cuTensorNet API is extended to include functions enabling automated distributed parallelization of tensor network contractions across multiple GPUs. Once activated, the parallelization is applied to both tensor network contraction path finding (when hyper-sampling is enabled) and contraction execution, without making any changes to the original serial source code.
Functionalities introduced that break previous APIs:
- Complex conjugation operator on input tensors (this adds an extra parameter that specifies tensor qualifiers in cutensornetCreateNetworkDescriptor() API).
- Provide new API for users to specify the slices as (sliced modes, sliced extents) with one call to cutensornetContractionOptimizerInfoSetAttribute() by using the attribute CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_SLICING_CONFIG. This will remove a limitation of the old API, which required users to set the modes first before the extents.
- Removed the [in,out] alignment-requirement parameters from the cutensornetCreateNetworkDescriptor() API. These are no longer required and are being inferred internally.
- Some enum values are reordered. If your application stores any of cuTensorNet enum values as plain int, please make sure to rebuild your application.
Bugs fixed:
- Memory access error when running cuda-memcheck in a few corner cases.
- Logging related bug upon setting some attributes.
- Inaccurate flops computed by cuTensorNet with user-provided path & slicing.
- “Undefined symbol” error when using cuTensorNet in the NVIDIA HPC SDK container.
- Incorrect handling of extent-1 modes in the deprecated cutensornetGetOutputTensorDetails() API.
Performance enhancements:
- Improved performance of the contraction path optimization process. On average, about 3X speedup was observed on many problems.
- Improved performance of the contraction auto-tuning process.
- Improved the quality of the slicing algorithm. We now select the configuration with the minimum number of slices that has the minimal flops overhead.
- Added more auto-tuning heuristics that improve tensor contraction performance.
Other changes:
- GNU OpenMP Runtime (gomp) is no longer needed.
- A new API, cutensornetWorkspaceComputeContractionSizes(), replaces the cutensornetWorkspaceComputeSizes() API, which is deprecated and will be removed in a future release.
- Two new APIs, cutensornetGetOutputTensorDescriptor() and cutensornetGetTensorDetails(), replace the cutensornetGetOutputTensorDetails() API, which is deprecated and will be removed in a future release.
- New samples (samples/cutensornet/).

Compatibility notes:

cuTensorNet requires cuTENSOR 1.6.1 or above, but cuTENSOR 1.6.2 or above is recommended, for performance improvements and bug fixes.
cuTensorNet requires CUDA 11.x, but CUDA 11.8 is recommended, for Hopper support, performance improvements, and bug fixes.

Known issues:

With CUDA 11.7 or lower, cutensornetTensorQR() can potentially fail for certain extents.
cutensornetTensorQR() can potentially fail when users provide a customized memory pool to compute the QR factorization of double complex data with certain extent combinations.
With cuTENSOR 1.6.1 and Turing, broadcasting tensor modes with extent-1 might fail in certain cases.

cuTensorNet v1.1.1

Bugs fixed:
- The version constraint cuTENSOR>=1.5,<2 as promised elsewhere in the documentation was not correctly respected. Both the code and various package sources are now fixed.

cuTensorNet v1.1.0

New APIs and functionalities introduced:
- A new API, cutensornetContractionOptimizerInfoPackData(), that allows users to serialize/pack the optimizerInfo in order to broadcast it to other ranks. Similarly, another new API for unpacking is provided, cutensornetUpdateContractionOptimizerInfoFromPackedData().
- New APIs for creating and destroying slice group objects, which include cutensornetCreateSliceGroupFromIDRange(), cutensornetCreateSliceGroupFromIDs() and cutensornetDestroySliceGroup(). These APIs, when combined with the packing/unpacking APIs above, allow users to employ the slicing technique to create independent tasks that can be run on multiple GPUs.
- A new API, cutensornetContractSlices(), for the execution of the contraction. This will replace the cutensornetContraction() API, which is deprecated and will be removed in a future release.
- An option to auto-tune intermediate modes through the cutensornetContractionAutotune() API, which helps improve network contraction performance. The functionality of this API call can be controlled with the CUTENSORNET_CONTRACTION_AUTOTUNE_INTERMEDIATE_MODES attribute.
- An option to find a path that minimizes estimated time to solution (rather than FLOP count). This experimental feature can be controlled with the configuration attribute CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_COST_FUNCTION_OBJECTIVE.
- An option to retrieve the mode labels for all intermediate tensors through the CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_INTERMEDIATE_MODES attribute of the contraction optimizer-info.
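The slice group APIs listed above let each process or GPU work on its own subset of slices. The sketch below shows one possible way to split slice IDs into contiguous, near-equal ranges per rank. It does not call cutensornet; the function name slice_range_for_rank and the half-open [start, stop) convention are assumptions made for this illustration only.

```python
def slice_range_for_rank(num_slices, num_ranks, rank):
    """Return a half-open (start, stop) range of slice IDs for one rank.

    Hypothetical helper: the first `num_slices % num_ranks` ranks receive
    one extra slice so that the load differs by at most one slice.
    """
    base, extra = divmod(num_slices, num_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Example: 10 slices distributed over 4 ranks.
ranges = [slice_range_for_rank(10, 4, r) for r in range(4)]
print(ranges)
```

Each rank would then build a slice group from its own range and contract only those slices, with the partial outputs reduced across ranks afterwards.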
Functionality/performance improvements:
- Since near-optimal paths are easily found for small networks without simplification, and since simplification does not guarantee an optimal path, the simplification phase has been turned OFF by default when the simplified network is sufficiently small.
- A new slicing algorithm has been developed, leading to potentially more efficient slicing solutions.
- Improved contraction performance by optimizing intermediate mode-ordering.
- Improved contraction performance of networks that have many singleton mode labels.
Bugs fixed:
- Previously, in rare circumstances, the slicing algorithm could fail to make progress toward finding a valid solution, resulting in an infinite loop. This has been fixed.
- Fixed a bug in which the deprecated cutensornetContraction() API accepted sliceId >= numSlices.
Other changes:
- Provide a distributed (MPI-based) C sample that shows how easy it is to use cuTensorNet and create parallelism.
- Update the (non-distributed) C sample by improving memory usage and employing the new contraction API cutensornetContractSlices().

cuTensorNet v1.0.1#

Bugs fixed:
- A workspace pointer alignment issue.
- A potential path optimizer issue that could cause CUTENSORNET_STATUS_NOT_SUPPORTED to be returned.
Performance enhancements:
- Improved support for generalized einsum expressions, so that a better contraction path is found.
Other changes:
- The Overview and Examples pages are significantly improved!
- Clarify in the documentation and sample that the contraction over slices needs to be done in ascending order, and that when parallelizing over the slices the output tensor should be zero-initialized.
- Clarify in the documentation that the returned FLOP count assumes real-valued inputs.
- Several issues in the C++ sample (samples/cutensornet/tensornet_example.cu) are fixed.
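Because the returned FLOP count assumes real-valued inputs, an estimate for complex-valued data has to be scaled by the caller. The sketch below uses a common counting convention (an assumption here, not a statement from the cuTensorNet documentation): a real multiply-add costs 2 FLOPs, while a complex multiply-add costs 8 real FLOPs. The function name estimate_complex_flops is hypothetical.

```python
REAL_FLOPS_PER_FMA = 2     # one multiply + one add
COMPLEX_FLOPS_PER_FMA = 8  # 4 multiplies + 2 adds for the product, 2 adds to accumulate


def estimate_complex_flops(real_flop_count):
    """Scale a real-valued FLOP count to complex inputs (assumed convention)."""
    return real_flop_count * (COMPLEX_FLOPS_PER_FMA / REAL_FLOPS_PER_FMA)


# A path reported as 1e9 FLOPs for real inputs.
print(estimate_complex_flops(1.0e9))
```

Under this convention the complex-valued cost is four times the reported value.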

cuTensorNet v1.0.0#

Functionality/performance improvements:
- Greatly reduced the workspace memory size required.
- Reduced the execution time of the pathfinder with multithreading and internal optimization.
- Support for hyperedges in tensor networks.
- Support for tensor networks described by generalized Einstein summation expressions.
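To make the term hyperedge concrete: it is a mode shared by more than two tensors. The following pure-Python sketch contracts three small tensors over a shared mode j, which in generalized einsum notation could be written as "ij,jk,jl->ikl". It is only a reference loop for illustration and does not use cuTensorNet; the function name contract_hyperedge is hypothetical.

```python
def contract_hyperedge(a, b, c):
    """Compute out[i][k][l] = sum_j a[i][j] * b[j][k] * c[j][l].

    Mode j appears in all three inputs, so it forms a hyperedge.
    """
    n_i, n_j = len(a), len(b)
    n_k, n_l = len(b[0]), len(c[0])
    return [[[sum(a[i][j] * b[j][k] * c[j][l] for j in range(n_j))
              for l in range(n_l)]
             for k in range(n_k)]
            for i in range(n_i)]


a = [[1, 2]]          # modes (i, j), shape (1, 2)
b = [[3], [4]]        # modes (j, k), shape (2, 1)
c = [[5, 6], [7, 8]]  # modes (j, l), shape (2, 2)
out = contract_hyperedge(a, b, c)
print(out)
```

Here out[0][0][0] is 1*3*5 + 2*4*7 = 71 and out[0][0][1] is 1*3*6 + 2*4*8 = 82.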
Add new APIs and functionalities for:
- Managing workspace (see Workspace management API for details).
- Binding a user-provided, stream-ordered memory pool to the library (see Memory management API for details).
- Querying the output tensor details (see cutensornetGetOutputTensorDetails()).
- Setting the number of threads for the hyperoptimizer (see Hyper-optimizer for details).
- Setting a logger callback with user-provided data (see cutensornetLoggerSetCallbackData()).
API changes:
- Replaced cutensornetContractionGetWorkspaceSize with cutensornetWorkspaceComputeSizes().
- cutensornetCreateContractionPlan(), cutensornetContractionAutotune(), and cutensornetContraction() receive a workspace descriptor instead of workspace pointer and size params.
- Renamed cutensornetGraphAlgo_t and cutensornetMemoryModel_t enumerations’ options.

Compatibility notes:

cuTensorNet requires CUDA 11.x.
cuTensorNet requires cuTENSOR 1.5.0 or above.
cuTensorNet requires OpenMP runtime (GOMP).
cuTensorNet no longer requires NVIDIA HPC SDK.

Limitation notes:

If multiple slices are created, the order of contracting over slices using cutensornetContraction() should be ascending starting from slice 0. If parallelizing over slices manually (in any fashion: streams, devices, processes, etc.), please make sure the output tensors (that are subject to a global reduction) are zero-initialized.
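The reason for zero-initializing the output can be seen in a toy model of slicing. In the sketch below, a matrix product is sliced over its summed index, and each slice's partial result is added into the caller's buffer. This models the global reduction as plain in-place accumulation and is an assumption for illustration, not a description of cuTensorNet internals; the helper names are hypothetical.

```python
def matmul_slice(a, b, k):
    """Partial product contributed by one value k of the summed index."""
    return [[a[i][k] * b[k][j] for j in range(len(b[0]))]
            for i in range(len(a))]


def contract_over_slices(a, b, out):
    """Accumulate every slice, in ascending order, into the buffer `out`."""
    for k in range(len(b)):  # slice 0, 1, 2, ...
        part = matmul_slice(a, b, k)
        for i, row in enumerate(part):
            for j, value in enumerate(row):
                out[i][j] += value
    return out


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
good = contract_over_slices(a, b, [[0, 0], [0, 0]])  # zero-initialized buffer
bad = contract_over_slices(a, b, [[1, 1], [1, 1]])   # stale data left in buffer
print(good, bad)
```

With a zero-initialized buffer the result equals the full product; with stale data every element is off by the leftover value.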

cuTensorNet v0.1.0#

Initial public release
Add support for Linux ppc64le
Add new APIs and functionalities for:
- Fine-tuning the slicing algorithm
- Reconfiguring a tensor network
- Simplifying a tensor network
- Optimizing pathfinder parameters using the hyperoptimizer
- Retrieving the optimizer configuration parameters
API changes:
- cutensornetContractionGetWorkspace is renamed to cutensornetContractionGetWorkspaceSize
- cutensornetContractionAutotune()’s function signature has changed

Compatibility notes:

cuTensorNet requires cuTENSOR 1.4.0 or above
cuTensorNet requires NVIDIA HPC SDK 21.11 or above

cuTensorNet v0.0.1#

Initial release
Support Linux x86_64 and Linux Arm64
Support Volta and Ampere architectures (compute capability 7.0+)

Compatibility notes:

cuTensorNet requires CUDA 11.4 or above
cuTensorNet requires cuTENSOR 1.3.3 or above
cuTensorNet supports NVIDIA HPC SDK 21.7 or above

Limitation notes:

This release is optimized for NVIDIA A100 and V100 GPUs.