Release Notes
cuTensorNet v1.1.1
Bugs fixed:
The version constraint cuTENSOR>=1.5,<2, as stated elsewhere in the documentation, was not correctly respected. Both the code and various package sources are now fixed.
cuTensorNet v1.1.0
New APIs and functionalities introduced:
A new API, cutensornetContractionOptimizerInfoPackData(), that allows users to serialize/pack the optimizerInfo in order to broadcast it to other ranks. Similarly, another new API, cutensornetUpdateContractionOptimizerInfoFromPackedData(), is provided for unpacking.
New APIs for creating and destroying slice group objects: cutensornetCreateSliceGroupFromIDRange(), cutensornetCreateSliceGroupFromIDs(), and cutensornetDestroySliceGroup(). Combined with the packing/unpacking APIs above, these allow users to employ the slicing technique to create independent tasks that can be run on multiple GPUs.
A new API, cutensornetContractSlices(), for executing the contraction. It replaces the cutensornetContraction() API, which is deprecated and will be removed in a future release.
An option to auto-tune intermediate modes through the cutensornetContractionAutotune() API, which helps improve network contraction performance. This functionality can be controlled with the CUTENSORNET_CONTRACTION_AUTOTUNE_INTERMEDIATE_MODES attribute.
An option to find a path that minimizes the estimated time to solution rather than the FLOP count. This experimental feature can be controlled with the configuration attribute CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_COST_FUNCTION_OBJECTIVE.
An option to retrieve the mode labels of all intermediate tensors through the CUTENSORNET_CONTRACTION_OPTIMIZER_INFO_INTERMEDIATE_MODES attribute of the contraction optimizer info.
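The multi-GPU workflow these APIs enable can be sketched as below. This is an illustrative fragment, not a complete program: it assumes a library handle, contraction plan, workspace descriptor, input/output device buffers, and a stream have already been created (as in the shipped C sample), and the rank/slicesPerRank partitioning variables are hypothetical. Exact signatures should be checked against the API reference.

```c
/* Illustrative sketch only: handle, plan, workDesc, rawDataIn, rawDataOut,
 * and stream are assumed to exist already. Each rank contracts a contiguous
 * range of slice IDs; rank and slicesPerRank are hypothetical names for a
 * partition computed from the optimizer's slice count (e.g., via MPI). */
cutensornetSliceGroup_t sliceGroup;
int64_t first = rank * slicesPerRank;   /* first slice ID for this rank   */
int64_t last  = first + slicesPerRank;  /* exclusive upper bound          */
cutensornetCreateSliceGroupFromIDRange(handle, first, last, 1, &sliceGroup);

/* accumulateOutput = 0: this rank's partial result overwrites rawDataOut.
 * The per-rank partial outputs must then be summed across ranks
 * (e.g., with MPI_Allreduce) to obtain the final tensor. */
cutensornetContractSlices(handle, plan, rawDataIn, rawDataOut,
                          /*accumulateOutput=*/0, workDesc,
                          sliceGroup, stream);

cutensornetDestroySliceGroup(sliceGroup);
```

The distributed C sample mentioned below follows the same pattern with the partitioning driven by the MPI rank and size.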
Functionality/performance improvements:
Since near-optimal paths are easily found for small networks without simplification, and since simplification does not guarantee an optimal path, the simplification phase is now turned off by default when the simplified network is sufficiently small.
A new slicing algorithm has been developed, leading to potentially more efficient slicing solutions.
Improved contraction performance by optimizing the intermediate mode ordering.
Improved contraction performance for networks that have many singleton mode labels.
Bugs fixed:
Previously, in rare circumstances, the slicing algorithm could fail to make progress toward finding a valid solution, resulting in an infinite loop. This has been fixed.
A bug in the deprecated cutensornetContraction() API that accepted sliceId >= numSlices.
Other changes:
Provide a distributed (MPI-based) C sample that shows how easily cuTensorNet can be used to parallelize a contraction across processes.
Update the (non-distributed) C sample by improving memory usage and employing the new contraction API cutensornetContractSlices().
cuTensorNet v1.0.1
Bugs fixed:
A workspace pointer alignment issue.
A potential path optimizer issue that could cause CUTENSORNET_STATUS_NOT_SUPPORTED to be returned.
Performance enhancements:
This release improves support for generalized einsum expressions, providing better contraction paths.
Other changes:
The Overview and Getting Started pages are significantly improved!
Clarify in the documentation and sample that the contraction over slices needs to be done in ascending order, and that when parallelizing over the slices the output tensor should be zero-initialized.
Clarify in the documentation that the returned FLOP count assumes real-valued inputs.
Several issues in the C++ sample (samples/cutensornet/tensornet_example.cu) are fixed.
cuTensorNet v1.0.0
Functionality/performance improvements:
Greatly reduced the workspace memory size required.
Reduced the execution time of the pathfinder with multithreading and internal optimization.
Support for hyperedges in tensor networks.
Support for tensor networks described by generalized Einstein summation expressions.
Add new APIs and functionalities for:
Managing workspace (see Workspace Management API for details).
Binding a user-provided, stream-ordered memory pool to the library (see Memory Management API for details).
Querying the output tensor details (see cutensornetGetOutputTensorDetails()).
Setting the number of threads for the hyperoptimizer (see Hyper-optimizer for details).
Setting a logger callback with user-provided data (see cutensornetLoggerSetCallbackData()).
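The memory pool binding can be sketched roughly as follows. This is an illustrative fragment (not from the release): it routes the library's scratch allocations through the CUDA stream-ordered allocator, assumes a handle already exists, and the struct field names follow the Memory Management API documentation; verify the exact signatures against the cutensornet.h header you build with.

```c
/* Illustrative sketch only: bind the CUDA stream-ordered allocator
 * (cudaMallocAsync/cudaFreeAsync) to cuTensorNet. Error handling omitted. */
static int my_dev_alloc(void* ctx, void** ptr, size_t size, cudaStream_t stream) {
    (void)ctx;
    return (int)cudaMallocAsync(ptr, size, stream);
}

static int my_dev_free(void* ctx, void* ptr, size_t size, cudaStream_t stream) {
    (void)ctx; (void)size;
    return (int)cudaFreeAsync(ptr, stream);
}

/* After creating the library handle: */
cutensornetDeviceMemHandler_t memHandler;
memHandler.ctx          = NULL;
memHandler.device_alloc = my_dev_alloc;
memHandler.device_free  = my_dev_free;
strcpy(memHandler.name, "cudaMallocAsync handler");
cutensornetSetDeviceMemHandler(handle, &memHandler);
```

Once a handler is installed, the library can draw workspace memory from the pool rather than requiring an explicitly sized user buffer; see the Memory Management API section for the exact semantics.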
API changes:
Replaced cutensornetContractionGetWorkspaceSize with cutensornetWorkspaceComputeSizes().
cutensornetCreateContractionPlan(), cutensornetContractionAutotune(), and cutensornetContraction() now receive a workspace descriptor instead of workspace pointer and size parameters.
Renamed the options of the cutensornetGraphAlgo_t and cutensornetMemoryModel_t enumerations.
Compatibility notes:
cuTensorNet requires CUDA 11.x.
cuTensorNet requires cuTENSOR 1.5.0 or above.
cuTensorNet requires OpenMP runtime (GOMP).
cuTensorNet no longer requires NVIDIA HPC SDK.
Limitation notes:
If multiple slices are created, the order of contracting over slices using cutensornetContraction() should be ascending, starting from slice 0. If parallelizing over slices manually (in any fashion: streams, devices, processes, etc.), please make sure the output tensors (that are subject to a global reduction) are zero-initialized.
cuTensorNet v0.1.0
Initial public release
Add support for Linux ppc64le
Add new APIs and functionalities for:
Fine-tuning the slicing algorithm
Reconfiguring a tensor network
Simplifying a tensor network
Optimizing pathfinder parameters using the hyperoptimizer
Retrieving the optimizer config parameters
API changes:
cutensornetContractionGetWorkspace is renamed to cutensornetContractionGetWorkspaceSize
cutensornetContractionAutotune()'s function signature has changed
Compatibility notes:
cuTensorNet requires cuTENSOR 1.4.0 or above
cuTensorNet requires NVIDIA HPC SDK 21.11 or above
cuTensorNet v0.0.1
Initial release
Support Linux x86_64 and Linux Arm64
Support Volta and Ampere architectures (compute capability 7.0+)
Compatibility notes:
cuTensorNet requires CUDA 11.4 or above
cuTensorNet requires cuTENSOR 1.3.3 or above
cuTensorNet supports NVIDIA HPC SDK 21.7 or above
Limitation notes:
This release is optimized for NVIDIA A100 and V100 GPUs.