Release Notes

cuStateVec v1.9.0

Improved performance/functionality:
- Improved the performance of the following APIs on the Blackwell architecture:
  - custatevecApplyMatrixBatched(), custatevecComputeExpectation(), and custatevecComputeExpectationBatched() for 1-4 qubit matrices.
  - custatevecCollapseByBitString() for general cases.
Resolved issue:
- Fixed an issue where Compute Sanitizer could hang when custatevecApplyMatrix() was called with floating-point emulation enabled.

cuStateVec v1.8.0

This release introduces support for the Blackwell GPU family.
Added new API:
- Math mode API to handle floating point emulation (see math mode)
Resolve issues:
- Fix for issues where custatevecComputeExpectation() and custatevecComputeExpectationBatched() could fail with the “misaligned address” error when host pointers were passed for the output expectation values.

cuStateVec v1.7.0

Added new API:
- Distributed index bit swap API using semaphore for synchronization (see custatevecSVSwapWorkerCreateWithSemaphore() and custatevecSVSwapWorkerSetSubSVsP2PWithSemaphores())
Improved performance/functionality:
- Improved the performance of custatevecApplyGeneralizedPermutationMatrix() for the case where the input matrix is diagonal and its elements are 1.
- Improved the performance of custatevecSVSwapWorkerExecute() when called with MPI.
- Updated custatevecComputeExpectationBatched() to support device pointers for output expectation values.
Other changes:
- Dropped support for Linux ppc64le.

cuStateVec v1.6.0

Added new API:
- Expectation values for batched state vectors (see custatevecComputeExpectationBatched())
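As a rough illustration of what an expectation value over batched state vectors means, the plain-Python sketch below evaluates ⟨ψ|O|ψ⟩ for each state vector in a batch. It does not call cuStateVec, and the names `expectation` and `expectation_batched` are hypothetical; the library's actual argument layout is not reproduced here.

```python
def expectation(state, matrix):
    """Return <state|matrix|state> for one state vector (a list of complex)."""
    dim = len(state)
    total = 0j
    for row in range(dim):
        # Element `row` of the product matrix @ state.
        applied = sum(matrix[row][col] * state[col] for col in range(dim))
        total += state[row].conjugate() * applied
    return total


def expectation_batched(states, matrix):
    """Evaluate the same observable on every state vector in the batch."""
    return [expectation(state, matrix) for state in states]


# Pauli-Z on a single qubit, evaluated on |0> and |1>.
PAULI_Z = [[1, 0], [0, -1]]
batch = [[1 + 0j, 0j], [0j, 1 + 0j]]
values = expectation_batched(batch, PAULI_Z)
```

A batched API amortizes launch overhead across many small state vectors, which is the usual motivation for this kind of interface.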
Improved performance/functionality:
- Reduced API execution latencies in custatevecComputeExpectation(). Performance may improve for 1-4 qubit observables.
Resolved issue:
- Fixed an issue where custatevecBatchMeasure() could return incorrect results when nIndexBits > 20.

Compatibility notes:

The next cuQuantum release will drop support for RHEL 7. Please plan accordingly.

cuStateVec v1.5.0

Added new API:
- Migration of sub state vectors (see Host state vector migration)
Improved performance/functionality:
- Improved the performance of custatevecApplyPauliRotation().
Resolved issues:
- Fixed an issue where custatevecMultiDeviceSwapIndexBits() accepted invalid index bit positions in the indexBitSwaps argument.

cuStateVec v1.4.1

Resolve issues:
- Fix an issue where custatevecApplyMatrix() was not executed asynchronously when applying 6-qubit gate matrices to a state vector of the double complex data type.
- Fix an issue where custatevecMeasureBatched() could fail on NVIDIA H100 with the “illegal instruction” error.

cuStateVec v1.4.0

Add new APIs:
- Gate application for batched state vectors (see custatevecApplyMatrixBatched())
- Measurement for batched state vectors (see custatevecAbs2SumArrayBatched(), custatevecCollapseByBitStringBatched(), custatevecMeasureBatched())
- State vector initialization to typical states (see custatevecInitializeStateVector())
Resolve issues:
- Fix for an issue on the Hopper Architecture wherein custatevecApplyMatrix() could produce incorrect results with nIndexBits + nControls = 12, nTargets = 5, and svDataType = CUDA_C_32F.

Compatibility notes:

cuStateVec supports Ubuntu 20.04+.

cuStateVec v1.3.0

Add new API:
- Optimized state vector element swap algorithm on distributed state vectors (see the Distributed Index Bit Swap section)
Improve performance/functionality:
- Improved the performance of 5-qubit gate application with single precision and 6-qubit gate application with double precision in custatevecApplyMatrix() on the Hopper Architecture.
- CUDA Lazy Loading is supported. This can significantly reduce memory footprint by deferring the loading of needed GPU kernels to the first call sites. This feature requires CUDA 11.8 (or above). Please refer to the CUDA documentation for other requirements and details. Currently this feature requires users to opt in by setting the environment variable CUDA_MODULE_LOADING=LAZY. In a future CUDA version, lazy loading may become the default.
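The opt-in described above can also be done from inside a Python process. The sketch below is a minimal example under one assumption: the variable is set before any CUDA-using library is imported or initialized, which is the safe ordering since the setting is read at startup. The variable name `lazy_loading_requested` is illustrative only.

```python
import os

# Opt in to CUDA lazy loading. Set the variable before importing any GPU
# library so it is already in place when CUDA initializes.
# setdefault() keeps a value the user has already exported (e.g. EAGER).
os.environ.setdefault("CUDA_MODULE_LOADING", "LAZY")

# True only if lazy loading is the effective setting for this process.
lazy_loading_requested = os.environ["CUDA_MODULE_LOADING"] == "LAZY"
```

Using `setdefault` rather than plain assignment is a design choice: it respects an explicit setting from the shell instead of silently overriding it.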
Resolve issues:
- Fix for an issue in custatevecMultiDeviceSwapIndexBits() where CUDA calls were not correctly ordered on streams allocated on multiple GPUs.
Other changes:
- Introduce support for CUDA 12.
- A set of new wheels with the suffix -cu12 is released on PyPI.org for CUDA 12 users.
  - Example: pip install custatevec-cu12 for installing cuStateVec compatible with CUDA 12
  - The existing cuquantum wheel (without the -cuXX suffix) is turned into an automated installer that will attempt to detect the current CUDA environment and install the appropriate wheels. Please note that this automated detection may encounter conditions under which detection is unsuccessful, especially in a CPU-only environment (such as CI/CD). If detection fails we assume that the target environment is CUDA 11 and proceed. This assumption may be changed in a future release, and in such cases we recommend that users explicitly (manually) install the correct wheels.
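The fallback behaviour described above can be pictured with the following sketch. It is not the installer's actual code: `pick_wheel` is a hypothetical helper, and raising an error for other CUDA versions is an assumption made here for illustration.

```python
def pick_wheel(package, cuda_major=None):
    """Return the suffixed wheel name for a detected CUDA major version.

    cuda_major is None when detection failed (for example on a CPU-only
    CI machine); the release notes say CUDA 11 is assumed in that case.
    """
    supported = (11, 12)
    if cuda_major is None:
        cuda_major = 11  # documented fallback; may change in a future release
    if cuda_major not in supported:
        # Assumption for this sketch: anything else is rejected outright.
        raise ValueError(f"unsupported CUDA major version: {cuda_major}")
    return f"{package}-cu{cuda_major}"
```

For example, `pick_wheel("custatevec", 12)` yields the name used in the `pip install custatevec-cu12` command above. Installing the suffixed wheel explicitly avoids depending on detection at all.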

Compatibility notes:

cuStateVec requires CUDA 11.x or 12.x.
cuStateVec supports Ubuntu 18.04+
- In the next release, Ubuntu 18.04 will be dropped. The minimum supported Ubuntu version will be 20.04.

cuStateVec v1.2.0

We are on NVIDIA/cuQuantum GitHub Discussions! For any questions regarding (or exciting work built upon) cuQuantum, please feel free to reach out to us on GitHub Discussions.
- Bug reports should still go to our GitHub issue tracker.
This release introduces support for the Hopper GPU family.
Improve performance/functionality:
- Improve the performance of 4- and 5-qubit gate application in custatevecApplyMatrix().
- Update custatevecSamplerSample() to accept state vectors of more than 40 qubits.
- Add CUSTATEVEC_STATUS_DEVICE_ALLOCATOR_ERROR for errors related to the user-provided device memory handler.
Resolve issues:
- Fix for an issue where custatevecMultiDeviceSwapIndexBits() could return wrong results for 32-qubit or larger state vectors.
- Fix register spilling in custatevecApplyGeneralizedPermutationMatrix() and custatevecComputeExpectationsOnPauliBasis().
Other changes:
- A conda package is released on conda-forge: conda install -c conda-forge custatevec. Users can still obtain both cuStateVec and cuTensorNet with conda install -c conda-forge cuquantum, as before.
- A pip wheel is released on PyPI: pip install custatevec-cu11. Users can still obtain both cuStateVec and cuTensorNet with pip install cuquantum, as before.
  - Currently, the cuquantum meta-wheel points to the cuquantum-cu11 meta-wheel (which then points to custatevec-cu11 and cutensornet-cu11 wheels). This may change in a future release when a new CUDA version becomes available. Using wheels with the -cuXX suffix is encouraged.

cuStateVec v1.1.0

Add new API:
- Optimized state vector element swap algorithm on multiple GPUs (see custatevecMultiDeviceSwapIndexBits())
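For readers unfamiliar with the operation, an index bit swap permutes state vector elements so that two bits of the element index trade places. The sketch below shows this on a single in-memory list in plain Python; it ignores the distribution across devices entirely, and the helper names are hypothetical rather than part of cuStateVec.

```python
def swap_bits(index, bit_a, bit_b):
    """Return index with the bits at positions bit_a and bit_b exchanged."""
    if ((index >> bit_a) & 1) != ((index >> bit_b) & 1):
        # The two bits differ, so flipping both exchanges them.
        index ^= (1 << bit_a) | (1 << bit_b)
    return index


def swap_index_bits(state, bit_a, bit_b):
    """Permute state vector elements so that two index bits trade places."""
    return [state[swap_bits(i, bit_a, bit_b)] for i in range(len(state))]


# 2-qubit example: labels stand in for amplitudes, indexed as 0b(q1 q0).
state = ["a00", "a01", "a10", "a11"]
swapped = swap_index_bits(state, 0, 1)
```

Because the permutation is its own inverse, applying the same swap twice restores the original ordering.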
Improve performance/functionality:
- Performance improvements of custatevecApplyMatrix() for 4- and 5-qubit gate application with complex 128
- Performance improvements of custatevecApplyGeneralizedPermutationMatrix() for 7-qubit or larger diagonal gate application
Resolve issues:
- Fix for issues where custatevecComputeExpectation() and custatevecApplyGeneralizedPermutationMatrix() could return wrong results for 32-qubit or larger state vectors.
- Fix a glibc symbol issue that prevented the release of the cuquantum ppc64le package on conda-forge.
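To give a sense of scale for the 32-qubit threshold mentioned in the fixes above, the sketch below computes element counts and memory footprints. The release notes do not state the root cause of those issues; this only shows that at 32 qubits the element count stops fitting in an unsigned 32-bit integer. The function names are illustrative.

```python
def state_vector_elements(n_qubits):
    """Number of complex amplitudes in an n-qubit state vector."""
    return 1 << n_qubits


def state_vector_bytes(n_qubits, bytes_per_element):
    """Memory footprint; 8 bytes for complex64, 16 bytes for complex128."""
    return state_vector_elements(n_qubits) * bytes_per_element


UINT32_MAX = 2**32 - 1
GIB = 2**30

count_fits_uint32_at_31 = state_vector_elements(31) <= UINT32_MAX
count_fits_uint32_at_32 = state_vector_elements(32) <= UINT32_MAX
gib_complex64_at_32 = state_vector_bytes(32, 8) // GIB
gib_complex128_at_32 = state_vector_bytes(32, 16) // GIB
```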

Compatibility notes:

cuStateVec requires CUDA 11.x

Limitation notes:

custatevecMultiDeviceSwapIndexBits() could cause a segmentation fault if a device does not have peer-to-peer (P2P) access to another device. If a segmentation fault occurs during the API call, please check that direct access between every pair of devices is enabled by cudaDeviceEnablePeerAccess.
custatevecMultiDeviceSwapIndexBits() could return CUSTATEVEC_STATUS_INVALID_VALUE if a handle created on the current device is not provided. Please refer to custatevecMultiDeviceSwapIndexBits() for the details.
CUSTATEVEC_STATUS_INTERNAL_ERROR might be returned if an incorrect device pointer is passed to a function. If a function returns CUSTATEVEC_STATUS_INTERNAL_ERROR, please check that the correct pointer is passed and that the size is specified correctly.

cuStateVec v1.0.0

Improve performance/functionality:
- Gate application APIs are reoptimized:
  - Reduced API execution latencies in custatevecApplyMatrix(). Performance with small state vectors may improve for 1-4 qubit matrix application in single precision and 1-5 qubit matrix application in double precision.
  - Reduced API execution latencies in custatevecApplyGeneralizedPermutationMatrix(). Performance with small state vectors may improve for diagonal matrix cases.
Resolve issues:
- Multi-threading issues in custatevecApplyMatrix(), custatevecApplyGeneralizedPermutationMatrix(), and custatevecComputeExpectationsOnPauliBasis() are fixed. All the cuStateVec APIs in this version are thread safe as long as each host thread has its own cuStateVec handle.
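The "one handle per host thread" condition above can be met with thread-local storage. The sketch below shows the pattern in plain Python; `create_handle` is a hypothetical stand-in for the library's handle creation call and returns a dummy object, so nothing here touches cuStateVec itself.

```python
import threading

_local = threading.local()


def create_handle():
    """Hypothetical stand-in for creating a library handle."""
    return object()


def get_thread_handle():
    """Return this thread's handle, creating it on first use."""
    if not hasattr(_local, "handle"):
        _local.handle = create_handle()
    return _local.handle


def collect_handles(n_threads):
    """Start n_threads workers and return the handle each one used."""
    handles = [None] * n_threads

    def worker(slot):
        handles[slot] = get_thread_handle()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handles
```

Each worker ends up with its own handle, while repeated calls within one thread reuse the same one. A real program would also destroy each handle before its thread exits.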
Add new API:
- Binding a user-provided, stream-ordered memory pool to the library (see the introduction for Workspace and Memory management API for details).
- Extensions for the batch-measure and sampler APIs to accept state vector partitions across multiple GPUs (see custatevecBatchMeasureWithOffset(), custatevecSamplerGetSquaredNorm(), custatevecSamplerApplySubSVOffset())
- Optimized state vector element swap algorithm on single GPU (see custatevecSwapIndexBits())
- Testing whether a given matrix is Hermitian or unitary (see custatevecTestMatrixType())
- Setting a logger callback with user-provided data (see custatevecLoggerSetCallbackData())
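To make the Hermitian/unitary test in the list above concrete, the sketch below computes a residual for each property in plain Python. It is an illustration of the mathematical check only: the choice of the largest-entry norm is an assumption, the helper names are hypothetical, and the notes do not say how custatevecTestMatrixType() measures the deviation.

```python
def conjugate_transpose(matrix):
    """Return the conjugate transpose (dagger) of a square matrix."""
    n = len(matrix)
    return [[complex(matrix[c][r]).conjugate() for c in range(n)] for r in range(n)]


def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)] for r in range(n)]


def max_abs_diff(a, b):
    """Largest absolute entrywise difference between two matrices."""
    n = len(a)
    return max(abs(a[r][c] - b[r][c]) for r in range(n) for c in range(n))


def hermitian_residual(matrix):
    """Largest entry of |A - A^dagger|; zero for an exactly Hermitian matrix."""
    return max_abs_diff(matrix, conjugate_transpose(matrix))


def unitary_residual(matrix):
    """Largest entry of |A A^dagger - I|; zero for an exactly unitary matrix."""
    n = len(matrix)
    identity = [[1 if r == c else 0 for c in range(n)] for r in range(n)]
    return max_abs_diff(matmul(matrix, conjugate_transpose(matrix)), identity)


PAULI_Y = [[0, -1j], [1j, 0]]  # Hermitian and unitary
S_GATE = [[1, 0], [0, 1j]]     # unitary but not Hermitian
```

In floating-point arithmetic a real check would compare the residual against a tolerance rather than against zero.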

API breaking changes:

The sampler and accessor descriptors are made completely opaque, just like the library handle custatevecHandle_t. For both descriptors there is a corresponding destructor API. Also, they are now passed by value in various routines. Now the C and Python APIs are unified.

Some APIs are renamed as follows:

previous version (< 1.0.0)	new version (= 1.0.0)
custatevecApplyMatrix_bufferSize	custatevecApplyMatrixGetWorkspaceSize()
custatevecApplyExp	custatevecApplyPauliRotation()
custatevecApplyGeneralizedPermutationMatrix_bufferSize	custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize()
custatevecExpectation_bufferSize	custatevecComputeExpectationGetWorkspaceSize()
custatevecExpectation	custatevecComputeExpectation()
custatevecExpectationsOnPauliBasis	custatevecComputeExpectationsOnPauliBasis()
custatevecSampler_create	custatevecSamplerCreate()
custatevecSampler_preprocess	custatevecSamplerPreprocess()
custatevecSampler_sample	custatevecSamplerSample()
custatevecAccessor_create	custatevecAccessorCreate()
custatevecAccessor_createReadOnly	custatevecAccessorCreateView()
custatevecAccessor_setExtraWorkspace	custatevecAccessorSetExtraWorkspace()
custatevecAccessor_set	custatevecAccessorSet()
custatevecAccessor_get	custatevecAccessorGet()
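When migrating existing code, the renames in the table can be applied mechanically. The sketch below is one possible approach, not an official migration tool: `migrate` is a hypothetical helper that rewrites identifiers in a source string. It only renames functions, so call sites whose arguments were reordered or renamed still need manual review.

```python
import re

# Old (< 1.0.0) name -> new (1.0.0) name, copied from the table above.
RENAMES = {
    "custatevecApplyMatrix_bufferSize": "custatevecApplyMatrixGetWorkspaceSize",
    "custatevecApplyExp": "custatevecApplyPauliRotation",
    "custatevecApplyGeneralizedPermutationMatrix_bufferSize":
        "custatevecApplyGeneralizedPermutationMatrixGetWorkspaceSize",
    "custatevecExpectation_bufferSize": "custatevecComputeExpectationGetWorkspaceSize",
    "custatevecExpectation": "custatevecComputeExpectation",
    "custatevecExpectationsOnPauliBasis": "custatevecComputeExpectationsOnPauliBasis",
    "custatevecSampler_create": "custatevecSamplerCreate",
    "custatevecSampler_preprocess": "custatevecSamplerPreprocess",
    "custatevecSampler_sample": "custatevecSamplerSample",
    "custatevecAccessor_create": "custatevecAccessorCreate",
    "custatevecAccessor_createReadOnly": "custatevecAccessorCreateView",
    "custatevecAccessor_setExtraWorkspace": "custatevecAccessorSetExtraWorkspace",
    "custatevecAccessor_set": "custatevecAccessorSet",
    "custatevecAccessor_get": "custatevecAccessorGet",
}

# The \b anchors match whole identifiers only, so custatevecExpectation does
# not match inside custatevecExpectation_bufferSize or
# custatevecExpectationsOnPauliBasis.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate(source_text):
    """Replace every pre-1.0.0 API name in source_text with its 1.0.0 name."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(1)], source_text)
```

Running `migrate` on already-migrated text leaves it unchanged, because none of the new names contains an old name as a whole identifier.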

The arguments of the following APIs are reordered/renamed:

Compatibility notes:

cuStateVec requires CUDA 11.x

Limitation notes:

CUSTATEVEC_STATUS_INTERNAL_ERROR might be returned if an incorrect device pointer is passed to a function. If a function returns CUSTATEVEC_STATUS_INTERNAL_ERROR, please check that the correct pointer is passed and that the size is specified correctly.

cuStateVec v0.1.1

Support for the NVIDIA cuQuantum Appliance:
- Extensions for the batch-measure and sampler APIs to accept state vector partitions across multiple GPUs
- Optimized state vector element swap algorithm for multiple GPUs
- Note: the multi-GPU features & optimizations are currently available only in the cuQuantum Appliance

cuStateVec v0.1.0

Add support for Linux ppc64le
Add new APIs:
- Gate application for generalized permutation matrices
- Expectation values of Pauli strings
- Accessor to get/set state vector elements

Compatibility notes:

cuStateVec requires CUDA 11.4 or above
cuStateVec requires NVIDIA HPC SDK 21.11 or above

Limitation notes:

CUSTATEVEC_STATUS_INTERNAL_ERROR might be returned if an incorrect device pointer is passed to a function. If a function returns CUSTATEVEC_STATUS_INTERNAL_ERROR, please check that the correct pointer is passed and that the size is specified correctly.

cuStateVec v0.0.1

Initial release
Support Linux x86_64, Linux Arm64
Support Volta and Ampere architectures (compute capability 7.0+)

Compatibility notes:

cuStateVec requires CUDA 11.4 or above
cuStateVec requires NVIDIA HPC SDK 21.7 or above

Limitation notes:

This release is optimized for NVIDIA A100 and V100 GPUs.
CUSTATEVEC_STATUS_INTERNAL_ERROR might be returned if an incorrect device pointer is passed to a function. If a function returns CUSTATEVEC_STATUS_INTERNAL_ERROR, please check that the correct pointer is passed and that the size is specified correctly.
Performance optimization is planned in future releases.