Overview¶
cuQuantum Python aims to bring the full functionality of the NVIDIA cuQuantum SDK to Python. To do so, we adopt a two-layer approach:
Provide 1:1 Python wrappers of the corresponding C APIs in cuQuantum, including both cuStateVec and cuTensorNet.
Provide high-level, pythonic APIs for easier integration with Python applications.
Below we introduce each layer and show examples of the intended usage. Python sample code (such as the examples shown below) can be found in the NVIDIA/cuQuantum repository.
Low-level Python bindings¶
Naming & calling convention¶
All cuQuantum C APIs are exposed under the cuquantum.custatevec and cuquantum.cutensornet modules.
In doing so, we follow the PEP 8 style guide and adopt the following changes:
All library name prefixes are stripped
Function names are split into words and follow snake case (lowercase words joined by underscores)
The first letter in each word in the enum names is capitalized
Each enum’s name prefix is stripped from its value’s names
Common enums that can be used in all submodules are placed in the parent module cuquantum
Whenever applicable, the outputs are stripped away from the function arguments and returned directly as Python objects
Pointers are passed as Python int
Below is a non-exhaustive list of examples of such C-to-Python mappings:
Function: custatevecGetDefaultWorkspaceSize() -> custatevec.get_default_workspace_size().
Function: cutensornetCreateNetworkDescriptor() -> cutensornet.create_network_descriptor().
Enum type: custatevecMatrixLayout_t -> custatevec.MatrixLayout.
Enum type: cutensornetContractionOptimizerConfigAttributes_t -> cutensornet.ContractionOptimizerConfigAttribute.
Enum value name: CUSTATEVEC_MATRIX_LAYOUT_COL -> custatevec.MatrixLayout.COL.
Enum value name: CUTENSORNET_CONTRACTION_OPTIMIZER_CONFIG_HYPER_NUM_SAMPLES -> cutensornet.ContractionOptimizerConfigAttribute.HYPER_NUM_SAMPLES.
Return: The outputs of custatevecSamplerCreate() are the sampler descriptor and the required workspace size, which are wrapped as a 2-tuple in the corresponding custatevec.sampler_create() Python API.
Global enum: custatevecComputeType_t and cutensornetComputeType_t -> cuquantum.ComputeType.
There may be exceptions to the above rules, but they would be self-evident and properly documented. In the next section we discuss pointer passing in Python.
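The regular cases of the naming rules above can be sketched in a few lines of Python. The helper names below (c_function_to_python, c_enum_type_to_python) are hypothetical and are not part of cuQuantum Python; the sketch deliberately ignores the documented exceptions, such as enum names whose Python form drops a plural.

```python
import re

# Hypothetical helpers for illustration only; they are not cuQuantum APIs.
LIBRARY_PREFIXES = ("custatevec", "cutensornet")


def c_function_to_python(c_name):
    """Map a C function name to (module, snake_case_name)."""
    for prefix in LIBRARY_PREFIXES:
        if c_name.startswith(prefix):
            stem = c_name[len(prefix):]
            # Put "_" before every capital letter except the first, then lowercase.
            snake = re.sub(r"(?<!^)(?=[A-Z])", "_", stem).lower()
            return prefix, snake
    raise ValueError(f"unknown library prefix in {c_name!r}")


def c_enum_type_to_python(c_name):
    """Map a C enum type such as custatevecMatrixLayout_t to (module, ClassName)."""
    for prefix in LIBRARY_PREFIXES:
        if c_name.startswith(prefix) and c_name.endswith("_t"):
            return prefix, c_name[len(prefix):-2]
    raise ValueError(f"not a recognized enum type name: {c_name!r}")


module, name = c_function_to_python("custatevecGetDefaultWorkspaceSize")
print(f"{module}.{name}()")  # custatevec.get_default_workspace_size()
```

The sketch only reproduces the mappings listed above; names containing acronyms or other irregularities would need the per-API documentation.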
Memory management¶
Pointer and data lifetime¶
Unlike in C/C++, Python does not provide low-level primitives to allocate/deallocate host memory, not to mention device memory. In order to make the C APIs work with Python, it is important that memory management is properly done through Python proxy objects. In cuQuantum Python, we ask users to address such needs using NumPy (for host memory) and CuPy (for device memory).
Note
It is also possible to use array.array (plus memoryview as needed) to manage host memory; however, it is more tedious than using numpy.ndarray, especially when it comes to array manipulation and computation.
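As a minimal sketch of the standard-library route mentioned in this note, the snippet below obtains the address of an array.array buffer as a plain Python int. The ctypes view is only a stand-in for a C function that writes through the pointer; no cuQuantum API is involved.

```python
import array
import ctypes

# Host buffer holding 5 C ints, managed by the Python standard library.
buf = array.array("i", [0, 1, 3, 5, 6])

# buffer_info() returns (address, number_of_items); the address is a plain int
# and is what would be handed to a low-level wrapper.
address, length = buf.buffer_info()

# Stand-in for a C function writing through the pointer: view the same memory
# with ctypes and modify it in place.
c_view = (ctypes.c_int * length).from_address(address)
c_view[0] = 42

print(buf.tolist())  # [42, 1, 3, 5, 6]
```

The address stays valid only while buf is alive and not resized, which is the same lifetime rule that applies to NumPy and CuPy buffers.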
Note
It is also possible to use CUDA Python to manage device memory, but as of CUDA 11 there is no simple, pythonic way to modify the contents stored on the GPU without writing custom kernels. CuPy is a lightweight, NumPy-compatible array library that addresses this need.
To pass data from Python to C, the pointer addresses (as Python int) of the relevant objects are required. With NumPy/CuPy arrays as the proxy, it is as simple as follows:
import cupy
import numpy

# create a host buffer to hold 5 int
buf = numpy.empty((5,), dtype=numpy.int32)
# pass buf's pointer to the wrapper
# buf could get modified in-place if the function writes to it
my_func(..., buf.ctypes.data, ...)
# examine/use buf's data
print(buf)
# create a device buffer to hold 10 double
buf = cupy.empty((10,), dtype=cupy.float64)
# pass buf's pointer to the wrapper
# buf could get modified in-place if the function writes to it
my_func(..., buf.data.ptr, ...)
# examine/use buf's data
print(buf)
# create an untyped device buffer of 128 bytes
buf = cupy.cuda.alloc(128)
# pass buf's pointer to the wrapper
# buf could get modified in-place if the function writes to it
my_func(..., buf.ptr, ...)
# buf is automatically destroyed when going out of scope
Please be aware that the underlying assumption is that the arrays must be contiguous in memory (unless the C interface allows for specifying the array strides).
Because pointers are passed as plain integers, as of cuQuantum Python v0.1.0 no C struct (including handles and descriptors) is exposed as a Python class; that is, they do not have their own types and are simply cast to plain Python int for passing around. Any downstream consumer should create a wrapper class to hold the pointer address if so desired. In other words, users have full control over (and responsibility for) managing the pointer lifetime.
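One possible shape for such a wrapper class is sketched below. OwnedPointer, fake_create, and fake_destroy are hypothetical names introduced for illustration; the fake functions stand in for a create/destroy pair such as custatevec.create() and custatevec.destroy().

```python
class OwnedPointer:
    """Hypothetical wrapper tying a raw pointer address to its destroy function."""

    def __init__(self, address, destroy):
        self._address = int(address)
        self._destroy = destroy

    @property
    def ptr(self):
        if self._address == 0:
            raise RuntimeError("pointer has already been released")
        return self._address

    def close(self):
        # Release at most once; later calls are no-ops.
        if self._address:
            self._destroy(self._address)
            self._address = 0

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()


# Stand-ins for a real create/destroy pair (assumption for this sketch).
destroyed = []


def fake_create():
    return 0x1000


def fake_destroy(address):
    destroyed.append(address)


with OwnedPointer(fake_create(), fake_destroy) as handle:
    print(hex(handle.ptr))  # handle.ptr is what the low-level APIs would receive

print(destroyed)  # [4096]
```

Using a context manager keeps the release next to the creation, so the handle is destroyed even if an exception is raised in between.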
However, in certain cases we are able to convert Python objects for users (if read-only host arrays are needed) so as to alleviate users’ burden. For example, in functions that require a sequence or a nested sequence, the following operations are equivalent:
# passing a host buffer of int type can be done like this
buf = numpy.array([0, 1, 3, 5, 6], dtype=numpy.int32)
my_func(..., buf.ctypes.data, ...)
# or just this
buf = [0, 1, 3, 5, 6]
my_func(..., buf, ...) # the underlying data type is determined by the C API
which is particularly useful when users need to pass a large amount of tensor metadata to C (e.g., cutensornet.create_network_descriptor()).
User-provided memory pools¶
Starting with cuQuantum v22.03, we offer an interface for users to bring in their memory pool for the cuStateVec/cuTensorNet libraries to use. Once set, users are no longer required to manage any temporary workspace before calling an API; the library will draw memory from the user’s pool (and return it once done). The only requirement for the memory pool is that it must be stream-ordered. See Memory Management API for an introduction. Currently we only support device mempools.
In cuQuantum Python, this interface is exposed with low-level APIs custatevec.set_device_mem_handler() and custatevec.get_device_mem_handler() (likewise for cutensornet). Currently we offer three different ways to set the handler argument:
if an int is given, it is assumed to be a pointer address to a fully initialized custatevecDeviceMemHandler_t struct
if a Python sequence of length 4 is given, it is assumed to be (ctx, device_alloc, device_free, name)
if a Python sequence of length 3 is given, it is assumed to be (malloc, free, name)
See the API reference for further detail. Once set, using the calling convention
setting the workspace (or workspace descriptor) pointer address to 0
setting the workspace size to 0
wherever an API needs a workspace will notify the library that it should use the user mempool. This example demonstrates the usage of this API.
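The length-3 form of the handler can be sketched as follows. Two things here are assumptions rather than facts taken from this page: the callable signatures ptr = malloc(size, stream) and free(ptr, size, stream) with plain int arguments, and the registration call shown in the comment; both should be checked against the API reference. BookkeepingPool is a hypothetical mock that hands out fake addresses so the sketch runs without a GPU; a real pool would wrap a stream-ordered device allocator.

```python
class BookkeepingPool:
    """Mock pool: hands out fake addresses and records outstanding allocations.

    Assumed calling convention (verify against the API reference):
        ptr = malloc(size, stream)
        free(ptr, size, stream)
    with every argument and the returned ptr being a Python int.
    """

    def __init__(self, base=0x10000):
        self._next = base
        self.live = {}

    def malloc(self, size, stream):
        ptr = self._next
        self._next += max(size, 1)
        self.live[ptr] = size
        return ptr

    def free(self, ptr, size, stream):
        # Fail loudly on a double free or a size mismatch.
        assert self.live.pop(ptr) == size


pool = BookkeepingPool()
handler = (pool.malloc, pool.free, "bookkeeping pool")  # (malloc, free, name)

# With cuQuantum Python installed, registration would look roughly like:
#     custatevec.set_device_mem_handler(handle, handler)
# after which APIs are called with workspace pointer 0 and workspace size 0.

# Emulate what the library would do around one API call on stream 0.
ptr = handler[0](256, 0)
handler[1](ptr, 256, 0)
print(pool.live)  # {}
```

Tracking live allocations in the pool makes leaks visible, which is useful when the library rather than the user decides when memory is drawn and returned.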
Usage example¶
The code below is a Python translation of the corresponding cuStateVec example written in C.
import numpy as np
import cupy as cp
from cuquantum import custatevec as cusv
from cuquantum import cudaDataType as cudtype
from cuquantum import ComputeType as ctype
nIndexBits = 3
nSvSize = (1 << nIndexBits)
nTargets = 1
nControls = 2
adjoint = 0
targets = (2,)
controls = (0, 1)
d_sv = cp.asarray([[0.0, 0.0], [0.0, 0.1], [0.1, 0.1], [0.1, 0.2],
[0.2, 0.2], [0.3, 0.3], [0.3, 0.4], [0.4, 0.5]], dtype=np.float64)
d_sv = d_sv.view(np.complex128).reshape(-1)
d_sv_result = cp.asarray([[0.0, 0.0], [0.0, 0.1], [0.1, 0.1], [0.4, 0.5],
[0.2, 0.2], [0.3, 0.3], [0.3, 0.4], [0.1, 0.2]], dtype=np.float64)
d_sv_result = d_sv_result.view(np.complex128).reshape(-1)
d_matrix = cp.asarray([[0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 0.0]], dtype=np.float64)
d_matrix = d_matrix.view(np.complex128).reshape(-1)
# cuStateVec handle initialization
handle = cusv.create()
# check the size of external workspace
extraWorkspaceSizeInBytes = cusv.apply_matrix_get_workspace_size(
    handle, cudtype.CUDA_C_64F, nIndexBits, d_matrix.data.ptr, cudtype.CUDA_C_64F,
    cusv.MatrixLayout.ROW, adjoint, nTargets, nControls, ctype.COMPUTE_64F)
# allocate external workspace if necessary
if extraWorkspaceSizeInBytes > 0:
    workspace = cp.cuda.alloc(extraWorkspaceSizeInBytes)
    workspace_ptr = workspace.ptr
else:
    workspace_ptr = 0
# apply gate
cusv.apply_matrix(
    handle, d_sv.data.ptr, cudtype.CUDA_C_64F, nIndexBits,
    d_matrix.data.ptr, cudtype.CUDA_C_64F, cusv.MatrixLayout.ROW, adjoint,
    targets, len(targets), controls, 0, len(controls), ctype.COMPUTE_64F,
    workspace_ptr, extraWorkspaceSizeInBytes)
# destroy handle
cusv.destroy(handle)
# --------------------------------------------------------------------------
# check if d_sv holds the updated statevector
correct = cp.allclose(d_sv, d_sv_result)
if not correct:
    raise RuntimeError("example FAILED: wrong result")
# if this is a standalone script, everything is cleaned up properly at exit
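To see why d_sv_result is the expected answer, the same operation can be reproduced on the host in pure Python. This is only a reference sketch: apply_controlled_gate is a hypothetical helper, and it assumes that bit i of the state-vector index corresponds to qubit i, which is consistent with the expected result above. The 2x2 matrix passed in row-major layout in the example is the Pauli-X gate, so the gate swaps the amplitudes at indices 3 and 7 (both controls set, target 0 or 1).

```python
def apply_controlled_gate(state, gate, target, controls):
    """Apply a 2x2 gate (row-major nested list) on `target` where all controls are 1."""
    out = list(state)
    target_mask = 1 << target
    control_mask = sum(1 << c for c in controls)
    for index in range(len(state)):
        # Visit each amplitude pair once (target bit clear) and only when
        # every control bit is set.
        if index & target_mask or (index & control_mask) != control_mask:
            continue
        lo, hi = index, index | target_mask
        out[lo] = gate[0][0] * state[lo] + gate[0][1] * state[hi]
        out[hi] = gate[1][0] * state[lo] + gate[1][1] * state[hi]
    return out


state = [0.0 + 0.0j, 0.0 + 0.1j, 0.1 + 0.1j, 0.1 + 0.2j,
         0.2 + 0.2j, 0.3 + 0.3j, 0.3 + 0.4j, 0.4 + 0.5j]
pauli_x = [[0, 1], [1, 0]]
expected = [0.0 + 0.0j, 0.0 + 0.1j, 0.1 + 0.1j, 0.4 + 0.5j,
            0.2 + 0.2j, 0.3 + 0.3j, 0.3 + 0.4j, 0.1 + 0.2j]

result = apply_controlled_gate(state, pauli_x, target=2, controls=(0, 1))
print(all(abs(r - e) < 1e-12 for r, e in zip(result, expected)))  # True
```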
High-level pythonic APIs¶
Introduction¶
The goal behind the high-level APIs is to provide an interface to the cuTensorNet library that feels natural for Python programmers. The APIs support ndarray-like objects from NumPy, CuPy, and PyTorch and support specification of the tensor network as an Einstein summation expression.
The high-level APIs can be further categorized into two levels:
The “coarse-grained” level, where the user deals with Python functions like contract(), contract_path(), einsum(), and einsum_path(). The coarse-grained level is an abstraction layer that is typically meant for single contraction operations.
The “fine-grained” level, where the interaction is through operations on a Network object. The fine-grained level allows the user to invest significant resources into finding an optimal contraction path and autotuning the network where repeated contractions on the same network object allow for amortization of the cost (see also Resource management in stateful objects).
The APIs also allow for interoperability between the cuTensorNet library and external packages. For example, the user can specify a contraction order obtained from a different package (perhaps a research project). Alternatively, the user can obtain the contraction order and the sliced modes from cuTensorNet for downstream use elsewhere.
Usage example¶
Contracting the same tensor network demonstrated in the cuTensorNet C example is as simple as:
from cuquantum import contract
from numpy.random import rand
a = rand(96,64,64,96)
b = rand(96,64,64)
c = rand(64,96,64)
r = contract("mhkn,ukh,xuy->mxny", a, b, c)
If desired, various options can be provided for the contraction.
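To make the Einstein summation expression in the example concrete, the following pure-Python sketch derives the shape of the result from the expression and the operand shapes. The helper contraction_output_shape is hypothetical and only handles explicit expressions (those containing "->"); it performs no contraction.

```python
def contraction_output_shape(expression, *shapes):
    """Return the output shape implied by an explicit einsum expression."""
    inputs, output = expression.split("->")
    input_modes = inputs.split(",")
    if len(input_modes) != len(shapes):
        raise ValueError("number of operands does not match the expression")
    extents = {}
    for modes, shape in zip(input_modes, shapes):
        if len(modes) != len(shape):
            raise ValueError(f"operand {modes!r} has rank {len(shape)}")
        for mode, extent in zip(modes, shape):
            # Every occurrence of a mode label must have the same extent.
            if extents.setdefault(mode, extent) != extent:
                raise ValueError(f"mode {mode!r} has inconsistent extents")
    return tuple(extents[mode] for mode in output)


shape = contraction_output_shape(
    "mhkn,ukh,xuy->mxny", (96, 64, 64, 96), (96, 64, 64), (64, 96, 64))
print(shape)  # (96, 64, 96, 64)
```

The modes h, k, and u appear only on the input side and are summed over; m, x, n, and y survive into the result.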
For PyTorch tensors, starting with cuQuantum Python v23.10 the contract() function works like a native PyTorch operator that can be recorded in the autograd graph and supports backward-mode automatic differentiation. See contract() for more details and examples.
Starting with cuQuantum v22.11 / cuTensorNet v2.0.0, cuTensorNet supports automatic MPI parallelism if users bind an MPI communicator to the library handle, among other requirements as outlined here. To illustrate, assuming all processes hold the same set of input tensors (on distinct GPUs) and the same network expression, this should work out of the box:
import cupy as cp
from cupy.cuda.runtime import getDeviceCount
from mpi4py import MPI
from cuquantum import contract
from cuquantum import cutensornet as cutn
# bind comm to cuTensorNet handle
handle = cutn.create()
comm = MPI.COMM_WORLD
cutn.distributed_reset_configuration(
handle, *cutn.get_mpi_comm_pointer(comm))
# make each process run on different GPU
rank = comm.Get_rank()
device_id = rank % getDeviceCount()
cp.cuda.Device(device_id).use()
# 1. assuming input tensors a, b, and c are created on the right GPU
# 2. passing handle explicitly allows reusing it to reduce the handle creation overhead
r = contract(
    "mhkn,ukh,xuy->mxny", a, b, c,
    options={'device_id' : device_id, 'handle': handle})
An end-to-end Python example of such auto-MPI usage can be found at https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/cutensornet/coarse/example22_mpi_auto.py.
Note
As of cuQuantum v22.11 / cuTensorNet v2.0.0, the Python wheel does not have the required MPI wrapper library included. Users need to either build it from source (included in the wheel), or use the Conda package from conda-forge instead.
Finally, for users seeking full control over the tensor network operations and parallelization, we offer fine-grained APIs as illustrated by the examples in the documentation for Network. A complete example illustrating a parallel implementation of tensor network contraction using the fine-grained API is shown below:
from cupy.cuda.runtime import getDeviceCount
from mpi4py import MPI
import numpy as np
from cuquantum import Network
root = 0
comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
expr = 'ehl,gj,edhg,bif,d,c,k,iklj,cf,a->ba'
shapes = [(8, 2, 5), (5, 7), (8, 8, 2, 5), (8, 6, 3), (8,), (6,), (5,), (6, 5, 5, 7), (6, 3), (3,)]
# Set the operand data on root.
operands = [np.random.rand(*shape) for shape in shapes] if rank == root else None
# Broadcast the operand data.
operands = comm.bcast(operands, root)
# Assign the device for each process.
device_id = rank % getDeviceCount()
# Create network object.
network = Network(expr, *operands, options={'device_id' : device_id})
# Compute the path on all ranks with 8 samples for hyperoptimization. Force slicing to enable parallel contraction.
path, info = network.contract_path(optimize={'samples': 8, 'slicing': {'min_slices': max(16, size)}})
# Select the best path from all ranks.
opt_cost, sender = comm.allreduce(sendobj=(info.opt_cost, rank), op=MPI.MINLOC)
if rank == root:
    print(f"Process {sender} has the path with the lowest FLOP count {opt_cost}.")
# Broadcast info from the sender to all other ranks.
info = comm.bcast(info, sender)
# Set path and slices.
path, info = network.contract_path(optimize={'path': info.path, 'slicing': info.slices})
# Calculate this process's share of the slices.
num_slices = info.num_slices
chunk, extra = num_slices // size, num_slices % size
slice_begin = rank * chunk + min(rank, extra)
slice_end = num_slices if rank == size - 1 else (rank + 1) * chunk + min(rank + 1, extra)
slices = range(slice_begin, slice_end)
print(f"Process {rank} is processing slice range: {slices}.")
# Contract the group of slices the process is responsible for.
result = network.contract(slices=slices)
# Sum the partial contribution from each process on root.
result = comm.reduce(sendobj=result, op=MPI.SUM, root=root)
# Check correctness.
if rank == root:
    result_np = np.einsum(expr, *operands, optimize=True)
    print("Does the cuQuantum parallel contraction result match the numpy.einsum result?", np.allclose(result, result_np))
This “manual” MPI Python example can be found in the NVIDIA/cuQuantum repository (here).
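The slice assignment used in the example can be checked in isolation, without MPI. The sketch below uses a hypothetical helper, slice_range, that applies the same chunk/extra arithmetic for every rank; for the last rank the formula already evaluates to num_slices, so no special case is needed here.

```python
def slice_range(rank, size, num_slices):
    """Contiguous slice range for `rank`; the first `extra` ranks get one more slice."""
    chunk, extra = divmod(num_slices, size)
    begin = rank * chunk + min(rank, extra)
    end = (rank + 1) * chunk + min(rank + 1, extra)
    return range(begin, end)


size, num_slices = 3, 16
ranges = [slice_range(rank, size, num_slices) for rank in range(size)]
for rank, slices in enumerate(ranges):
    print(f"rank {rank}: {slices}")
# rank 0: range(0, 6)
# rank 1: range(6, 11)
# rank 2: range(11, 16)
```

Every slice is assigned to exactly one rank and the per-rank counts differ by at most one, which keeps the partial contractions balanced before the final reduction.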
Call blocking behavior¶
By default, calls to the execution APIs (Network.autotune() and Network.contract() on the Network object as well as the function contract()) block and do not return until the operation is completed. This behavior can be changed by setting NetworkOptions.blocking and passing in the options to Network. When NetworkOptions.blocking is set to 'auto', calls to the execution APIs will return immediately after the operation is launched on the GPU without waiting for it to complete if the input tensors are on the device. If the input tensors are on the host, the execution API calls will always block since the result of the contraction is a tensor that will also reside on the host.
APIs that execute on the host (such as Network.contract_path() on the Network object, and the contract_path() and einsum_path() functions) always block.
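The rule for the execution APIs can be summarized as a small decision function. This is a model of the behavior described above, not a cuQuantum API; call_blocks is a hypothetical name, and the sketch assumes the option takes the values True (the default, blocking behavior) or 'auto'.

```python
def call_blocks(blocking, operands_on_device):
    """Return True if an execution API call waits for the operation to finish."""
    if blocking is True:
        return True
    if blocking == "auto":
        # Host operands force blocking because the result must land on the host.
        return not operands_on_device
    raise ValueError("blocking must be True or 'auto' in this sketch")


for blocking in (True, "auto"):
    for on_device in (True, False):
        print(f"blocking={blocking!r}, operands on device={on_device}: "
              f"blocks={call_blocks(blocking, on_device)}")
```

Only the combination of 'auto' with device operands returns before the GPU work completes, and that is the case in which the stream-ordering responsibilities discussed in the next section apply.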
Stream semantics¶
The stream semantics depend on whether the behavior of the execution APIs is chosen to be blocking or non-blocking (see Call blocking behavior).
For blocking behavior, stream ordering is automatically handled by the cuQuantum Python high-level APIs for operations that are performed within the package. A stream can be provided for two reasons:
1. When the computation that prepares the input tensors is not already complete by the time the execution APIs are called. This is a correctness requirement for user-provided data.
2. To enable parallel computations across multiple streams if the device has sufficient resources and the current stream (which is the default) has concomitant operations. This can be done for performance reasons.
For non-blocking behavior, it is the user’s responsibility to ensure correct stream ordering between the execution API calls.
In any case, the execution APIs are launched on the provided stream.
Resource management in stateful objects¶
An important aspect of the fine-grained, stateful object APIs (e.g. Network) is resource management. We need to make sure that the internal resources, including library resources, memory resources, and user-provided input operands, are safely managed throughout the object’s lifetime. As such, there are caveats that users should be aware of, due to their impact on the memory watermark.
The stateful object APIs allow users to prepare an object and reuse it for multiple contractions or gradient calculations to amortize the preparation cost. Depending on the specific problem, investment in preparations that lead to shorter execution time may be the ideal solution. During the preparation step, an object would inevitably hold references to device memory for later reuse. However, such problems oftentimes imply high memory usage, making it impossible to hold multiple objects at the same time. Interleaving contractions of multiple large tensor networks is an example of this.
To address this use case, starting with cuQuantum Python v24.03, two new features were added:
Every execution method now accepts a release_workspace option. When this option is set to True (default is False), the memory needed to perform an operation is freed before the method returns, making this memory available for other tasks. The next time the same (or different) method is called, memory is allocated on demand. Therefore, there is a small overhead associated with release_workspace=True, as allocating/deallocating memory could take time depending on the implementation of the underlying memory allocator (see next section); however, making multiple Network objects coexist becomes possible, see, e.g., the example6_resource_mgmt_contraction.py sample.
The reset_operands() method now accepts setting operands=None to free the internal reference to the input operands after the execution. This reduces potential memory contention and thereby allows contracting multiple networks with large input tensors in an interleaved fashion. In such cases, before a subsequent execution on the same Network object is called, the reset_operands() method should be called again with the new operands, see, e.g., the example8_reset_operand_none.py sample.
These two features, used separately or jointly as the problem requires, make it possible to prepare and use a large number of Network
objects when the device memory available is not enough to fit all problems at once.
External memory management¶
Starting with cuQuantum Python v22.03, we support an EMM-like interface as proposed and supported by Numba for users to set their Python mempool. Users set the option NetworkOptions.allocator to a Python object complying with the cuquantum.BaseCUDAMemoryManager protocol, and pass the options to the high-level APIs like contract() or Network. Temporary memory allocations will then be done through this interface. (Internally, we use the same interface to use CuPy or PyTorch’s mempool depending on the input tensor operands.)
Note
cuQuantum’s BaseCUDAMemoryManager
protocol is slightly different from Numba’s EMM interface
(numba.cuda.BaseCUDAMemoryManager
), but duck typing with an existing EMM instance (not type!) at runtime
should be possible.
Circuit to tensor network converter¶
Introduction¶
Starting with cuQuantum Python v22.07, we provide a CircuitToEinsum converter that takes either a qiskit.QuantumCircuit or a cirq.Circuit and generates the corresponding tensor network contraction for the target operation. The goal of the converter is to allow Qiskit and Cirq users to easily explore the functionalities of the cuTensorNet library. As mentioned in the tensor network introduction, quantum circuits can be viewed as tensor networks. For any quantum circuit, CircuitToEinsum can construct the corresponding tensor network to compute various quantities of interest. The output tensor network is returned as an Einstein summation expression with tensor operands.
We support the following operations:
state_vector(): The contraction of this Einstein summation expression yields the final state coefficients as an N-dimensional tensor where N is the number of qubits in the circuit. The mode labels of the tensor correspond to the CircuitToEinsum.qubits.
amplitude(): The contraction of this Einstein summation expression yields the amplitude coefficient for a given bitstring.
batched_amplitudes(): The contraction of this Einstein summation expression yields the amplitude coefficients for a subset of qubits while the others are fixed at certain states.
reduced_density_matrix(): The contraction of this Einstein summation expression yields the reduced density matrix for a subset of qubits, optionally with another subset of qubits set to a fixed state.
expectation(): The contraction of this Einstein summation expression yields the expectation value for a given Pauli string.
The CircuitToEinsum class also allows users to specify a desired tensor backend (cupy, torch, numpy) via the backend argument when constructing the converter object.
The returned Einstein summation expression and tensor operands can then directly serve as the input arguments for cuquantum.contract()
or the corresponding backend’s einsum
function.
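To illustrate what "a circuit viewed as a tensor network" means, the sketch below contracts a two-qubit Bell circuit (H on qubit 0 followed by CNOT from qubit 0 to qubit 1) by brute force in pure Python. It does not use CircuitToEinsum; the tensor layouts, the expression 'a,b,ca,decb,d,e->', and the convention that the first character of the bitstring refers to qubit 0 are choices made for this sketch only.

```python
import itertools
import math

S = 1 / math.sqrt(2)
ZERO = [1.0, 0.0]        # |0> as a rank-1 tensor
H = [[S, S], [S, -S]]    # H[out][in]
# CNOT[out_c][out_t][in_c][in_t]: the control passes through, the target is XOR-ed.
CNOT = [[[[1.0 if (out_c == in_c and out_t == (in_t ^ in_c)) else 0.0
           for in_t in range(2)]
          for in_c in range(2)]
         for out_t in range(2)]
        for out_c in range(2)]


def amplitude(bitstring):
    """Contract 'a,b,ca,decb,d,e->' for the circuit H(q0); CNOT(q0, q1)."""
    bra0 = [1.0 if value == int(bitstring[0]) else 0.0 for value in range(2)]
    bra1 = [1.0 if value == int(bitstring[1]) else 0.0 for value in range(2)]
    total = 0.0
    # Sum over every mode label; each factor corresponds to one tensor operand.
    for a, b, c, d, e in itertools.product(range(2), repeat=5):
        total += (ZERO[a] * ZERO[b] * H[c][a] * CNOT[d][e][c][b]
                  * bra0[d] * bra1[e])
    return total


for bits in ("00", "01", "10", "11"):
    print(bits, round(amplitude(bits), 6))
```

The amplitudes for 00 and 11 come out as 1/sqrt(2) and the other two vanish, as expected for a Bell state. A brute-force sum like this scales exponentially with the number of mode labels, which is why finding a good contraction path matters for larger circuits.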
Usage example¶
import cirq
import cupy
from cuquantum import contract, CircuitToEinsum
# create a random cirq.Circuit
circuit = cirq.testing.random_circuit(qubits=4, n_moments=4, op_density=0.9, random_state=1)
# same task can be achieved with qiskit.circuit.random.random_circuit
# construct the CircuitToEinsum converter targeting double precision and cupy operands
converter = CircuitToEinsum(circuit, dtype='complex128', backend='cupy')
# generate the Einstein summation expression and tensor operands for computing the amplitude coefficient of bitstring 0000
expression, operands = converter.amplitude(bitstring='0000')
assert all([isinstance(op, cupy.ndarray) for op in operands])
# contract the network to compute the amplitude
amplitude = contract(expression, *operands)
amplitude_cupy = cupy.einsum(expression, *operands)
assert cupy.allclose(amplitude, amplitude_cupy)
Multiple Jupyter notebooks are available for Cirq and Qiskit users to easily build up their tensor network based simulations using cuTensorNet.
Tensor network simulator¶
Introduction¶
Starting from cuQuantum Python v24.08, we provide new APIs that enable Python users to easily leverage cuTensorNet state APIs for tensor network simulation.
These APIs are now available under the cuquantum.cutensornet.experimental
module and may be subject to change in future releases.
Please share your feedback with us on NVIDIA/cuQuantum GitHub Discussions!
The new set of APIs is centered around the NetworkState class and is designed to support the following groups of users:
Quantum Computing Framework Users: These users can directly initialize a tensor network state from quantum circuit objects such as cirq.Circuit or qiskit.QuantumCircuit via the NetworkState.from_circuit() method.
Tensor Network Framework Developers and Researchers: These users can build any state of interest by applying tensor operators and matrix product operators (MPOs), and by setting the initial state to a matrix product state (MPS). The methods involved here include NetworkState.apply_tensor_operator(), NetworkState.update_tensor_operator(), NetworkState.set_initial_mps(), NetworkState.apply_mpo(), and NetworkState.apply_network_operator(). Ndarray-like objects from NumPy, CuPy, and PyTorch are all supported as input operands.
Note
The NetworkState class supports arbitrary state dimensions beyond regular quantum circuit states with qubits (d=2). An example of simulating a complex state with non-uniform state dimensions can be found in the arbitrary state example.
For both MPS and MPO, only the open boundary condition is supported.
Users can further specify the tensor network simulation method as one of the following:
Contraction-Based Simulations: Specify config as a TNConfig object.
MPS-Based Simulations: Specify config as an MPSConfig object, offering detailed control over truncation extents, canonical centers, SVD algorithms, and normalization options.
Once the problem is fully specified, users can take advantage of the following execution APIs to compute various properties:
NetworkState.compute_state_vector(): Computes the final state coefficients as an N-dimensional tensor with extents matching the specified state.
NetworkState.compute_amplitude(): Computes the amplitude coefficient for a given bitstring.
NetworkState.compute_batched_amplitudes(): Computes the batched amplitude coefficients for a subset of state dimensions while others are fixed at certain states.
NetworkState.compute_reduced_density_matrix(): Computes the reduced density matrix for a subset of state dimensions, optionally fixing another subset to specific states.
NetworkState.compute_expectation(): Computes the expectation value for a given tensor network operator, which can be specified as a sum of tensor products (such as Pauli operators) or MPOs with coefficients.
NetworkState.compute_sampling(): Draws samples from the underlying state, with options to sample just a subset of all state dimensions.
NetworkState.compute_norm(): Computes the norm of the tensor network state.
Additionally, the NetworkOperator
class allows users to create a network operator object as a sum of tensor products (via NetworkOperator.append_product()
) or MPOs (via NetworkOperator.append_mpo()
) with coefficients.
This object can then interact with the NetworkState
class, enabling users to apply an MPO to the state or compute the expectation value of the operator on the state using methods like NetworkState.apply_network_operator()
and NetworkState.compute_expectation()
.
Caching feature¶
As of cuQuantum v24.08, NetworkState offers preliminary caching support for all execution methods with a compute_ prefix when contraction-based tensor network simulation or MPS simulation without value-based truncation is used.
During the first call to these methods, the underlying cuTensorNet C object for these properties will be created, prepared, cached, and then executed to compute the final output.
On subsequent calls to the same method using compatible parameters, and without updating the state with NetworkState.apply_tensor_operator(), NetworkState.apply_mpo(), NetworkState.set_initial_mps(), or NetworkState.apply_network_operator() (it’s okay to call NetworkState.update_tensor_operator()), the cached C object will be reused to compute the final output, thus reducing the overhead of C object creation and preparation.
Compatible parameters have different contexts for different execution methods:
For NetworkState.compute_state_vector(), NetworkState.compute_amplitude(), and NetworkState.compute_norm(), any parameters will result in using the same cached object.
For NetworkState.compute_batched_amplitudes(), the set of state dimensions specified by fixed must be identical while the fixed state for each dimension may differ.
For NetworkState.compute_reduced_density_matrix(), the where parameter and the set of state dimensions specified by fixed must be identical while the fixed state for each dimension may differ.
For NetworkState.compute_expectation(), the same NetworkOperator object with unchanged underlying components must be used. Providing operators as a string of Pauli operators or as a dictionary mapping Pauli strings to coefficients will not activate the caching mechanism.
For NetworkState.compute_sampling(), the same modes parameter is required to activate the caching mechanism.
For more details, please refer to our cirq caching example and qiskit caching example.
Additionally, users can leverage the caching feature along with the NetworkState.update_tensor_operator()
method to reduce the overhead for variational workflows where the same computation needs to be performed on numerous states with identical topologies.
For more details, please refer to our variational workflow example.
MPI support¶
As of cuQuantum v24.08, NetworkState offers preliminary distributed parallel support for all execution methods with a compute_ prefix when contraction-based tensor network simulation is used (i.e., with TNConfig).
To activate distributed parallel execution, users must perform the following tasks:
Explicitly set the device ID to use in cuquantum.NetworkOptions.device_id and provide it to NetworkState via the options parameter.
Explicitly create the library handle on the corresponding device using cuquantum.cutensornet.create(), bind an MPI communicator to the library handle using cuquantum.cutensornet.distributed_reset_configuration(), and provide it to NetworkState via the options parameter.
For more details, please refer to our cirq mpi sampling example and qiskit mpi sampling example.
Tensor and tensor network decomposition¶
Introduction¶
Decomposition methods such as QR and SVD are prevalent in tensor network algorithms, as they allow one to exploit the sparsity of the network and thus reduce the computational cost.
Starting with cuQuantum Python v23.03, we provide these functionalities at both the tensor and tensor network levels.
The tensor level decomposition routines are implemented inside the module cuquantum.cutensornet.tensor
with the following features:
QR decomposition can be performed using cuquantum.cutensornet.tensor.decompose() with cuquantum.cutensornet.tensor.QRMethod.
Both exact and truncated SVD can be performed using cuquantum.cutensornet.tensor.decompose() with cuquantum.cutensornet.tensor.SVDMethod.
Decomposition options can be specified by cuquantum.cutensornet.tensor.DecompositionOptions.
As of cuQuantum Python v23.03, the tensor network level decomposition routines are implemented in the experimental subpackage cuquantum.cutensornet.experimental
with the main API cuquantum.cutensornet.experimental.contract_decompose()
.
Given an input tensor network, this function can perform a full contraction followed by a QR or SVD decomposition. This can be specified via cuquantum.cutensornet.experimental.ContractDecomposeAlgorithm
.
If the contract and decompose problem amounts to a ternary-operand gate split problem, commonly seen in quantum circuit simulation (see Gate Split Algorithm for details),
the user can potentially leverage QR decompositions to speed up the execution of contraction and SVD. This can be achieved by setting both cuquantum.cutensornet.experimental.ContractDecomposeAlgorithm.qr_method
and cuquantum.cutensornet.experimental.ContractDecomposeAlgorithm.svd_method
.
Note
The APIs inside cuquantum.cutensornet.experimental
are subject to change and may be integrated into the main package cuquantum.cutensornet
in a future release.
Users are encouraged to leave feedback on NVIDIA/cuQuantum GitHub Discussions.
Usage example¶
import cupy
from cuquantum import contract
from cuquantum.cutensornet.tensor import decompose
from cuquantum.cutensornet.experimental import contract_decompose
# create a random rank-4 tensor
a = cupy.random.random((2,2,2,2)) + cupy.random.random((2,2,2,2)) * 1j
# perform QR decomposition such that A[i,j,k,l] = \sum_{x} Q[i,x,k] R[x,j,l]
q, r = decompose('ijkl->ixk,xjl', a) # QR by default
# check the unitary property of q
identity = contract('ixk,iyk->xy', q, q.conj())
identity_reference = cupy.eye(identity.shape[0])
assert cupy.allclose(identity, identity_reference)
# check if the contraction of the decomposition outputs yields the input
a_reference = contract('ixk,xjl->ijkl', q, r)
assert cupy.allclose(a, a_reference)
More examples on tensor decompositions are available in our sample directory to demonstrate the use of QR and SVD in different settings.
For tensor network decompositions, please refer to this directory for more detailed examples. We have also provided a Jupyter notebook to demonstrate how to easily implement basic MPS algorithms using these new APIs.
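The decomposition expression in the usage example can be read as a matricization followed by a QR factorization. The sketch below only computes the shapes this implies, assuming a reduced QR in which the extent of the new shared mode is the smaller of the two matricized dimensions; qr_output_shapes is a hypothetical helper, not a cuQuantum API.

```python
import math


def qr_output_shapes(expression, shape):
    """Shapes of Q and R for an expression of the form 'in->q,r' (reduced QR assumed)."""
    modes_in, outputs = expression.split("->")
    modes_q, modes_r = outputs.split(",")
    extents = dict(zip(modes_in, shape))
    # The shared mode is the one label that appears in both outputs but not in the input.
    shared = (set(modes_q) & set(modes_r)) - set(modes_in)
    if len(shared) != 1:
        raise ValueError("expected exactly one new shared mode")
    (shared_mode,) = shared
    rows = math.prod(extents[m] for m in modes_q if m != shared_mode)
    cols = math.prod(extents[m] for m in modes_r if m != shared_mode)
    extents[shared_mode] = min(rows, cols)
    return (tuple(extents[m] for m in modes_q),
            tuple(extents[m] for m in modes_r))


q_shape, r_shape = qr_output_shapes("ijkl->ixk,xjl", (2, 2, 2, 2))
print(q_shape, r_shape)  # (2, 4, 2) (4, 2, 2)
```

For the rank-4 tensor in the example, modes i and k form the rows (extent 4) and modes j and l form the columns (extent 4), so the shared mode x has extent 4, matching the 4x4 identity checked in the example.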
Compatibility policy¶
cuQuantum Python is no different from any Python package, in that we would not succeed without depending on, collaborating with, and evolving alongside the Python community. Given these considerations, we strive to meet the following commitments:
For the low-level Python bindings, we support the latest cuQuantum SDK.
For the high-level pythonic APIs, we keep the APIs backward-compatible to the maximum extent possible. When a breaking change must happen, we raise a run-time warning in the release version YY.MM, also notify users in the release notes, and break it in the next release (after YY.MM). The only exception is the experimental submodule cuquantum.cutensornet.experimental where all APIs are subject to change between releases without prior notice.
We comply with NEP-29 and support a community-defined set of core dependencies (CPython, NumPy, etc).
Citing cuQuantum¶
Bayraktar et al., “cuQuantum SDK: A High-Performance Library for Accelerating Quantum Science,” 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), Bellevue, WA, USA, 2023, pp. 1050-1061, doi: 10.1109/QCE57702.2023.00119.