Examples#

In this section, we show basic examples on how to define a quantum operator, quantum state, and then compute the action of the quantum operator on a quantum state, and, optionally, backward-differentiate the operator action (compute gradients) with respect to user-provided real parameters parameterizing the operator. We also show an example on how to compute the extreme eigenspectrum of a given operator. For clarity, the quantum operator for each example is defined inside a separate C++ header, specifically transverse_ising_full_fused.h, transverse_ising_full_fused_noisy.h and transverse_ising_full_fused_noisy_grad.h, where it is wrapped in a helper C++ class UserDefinedLiouvillian. We also provide a utility header helpers.h containing convenient GPU array creation/destruction, initialization, copying, and printing helper functions.

Building code#

Assuming cuQuantum has been extracted in CUQUANTUM_ROOT and cuTENSOR is in CUTENSOR_ROOT, we update the library path as follows:

export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${CUTENSOR_ROOT}/lib:${LD_LIBRARY_PATH}

A serial sample code discussed below (operator_action_example.cpp) can be built via the following command:

nvcc operator_action_example.cpp -I${CUQUANTUM_ROOT}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib -lcudensitymat -lcutensornet -lcutensor -lcusolver -lcublas -lcurand -o operator_action_example

For static linking against the cuDensityMat library, use the following command:

nvcc operator_action_example.cpp -I${CUQUANTUM_ROOT}/include ${CUQUANTUM_ROOT}/lib/libcudensitymat_static.a ${CUQUANTUM_ROOT}/lib/libcutensornet_static.a -L${CUTENSOR_ROOT}/lib -lcutensor -lcusolver -lcublas -lcurand -o operator_action_example

In order to build a parallel (MPI) version of the example operator_action_mpi_example.cpp, one will need to have a CUDA-aware MPI library installed (e.g., recent OpenMPI, MPICH or MVAPICH) and then set the environment variable $CUDENSITYMAT_COMM_LIB to the path to the MPI interface wrapper shared library libcudensitymat_distributed_interface_mpi.so. The MPI interface wrapper shared library libcudensitymat_distributed_interface_mpi.so can be built inside the ${CUQUANTUM_ROOT}/distributed_interfaces folder by calling the build script provided there. In order to link the executable to a CUDA-aware MPI library, one will need to add -I${MPI_PATH}/include and -L${MPI_PATH}/lib -lmpi to the build command:

nvcc operator_action_mpi_example.cpp -I${CUQUANTUM_ROOT}/include -I${MPI_PATH}/include -L${CUQUANTUM_ROOT}/lib -L${CUTENSOR_ROOT}/lib -L${MPI_PATH}/lib -lcudensitymat -lcutensornet -lcutensor -lcusolver -lcublas -lcurand -lmpi -o operator_action_mpi_example

Warning

When running operator_action_mpi_example.cpp with a non-CUDA-aware MPI library, the program will crash.

Note

Depending on the installation of the cuQuantum SDK package, you may need to replace lib above by lib64, depending which folder name is used inside your cuQuantum SDK package.

Code example (serial execution on a single GPU)#

The following code example illustrates the common steps necessary to use the cuDensityMat library to compute the action of a quantum many-body operator on a quantum state. The full sample code can be found in the NVIDIA/cuQuantum repository (main serial code and operator definition as well as the utility code).

First let’s introduce a helper class to construct a specific quantum many-body operator, for example, the transverse field Ising Hamiltonian with fused ZZ terms and an additional noise term. Here we choose to make the f(t) coefficient depend on time and a single user-provided real parameter Omega. We use a CPU-side user-defined scalar callback function to define the dependence of the f(t) coefficient on time and the user-provided real parameter Omega. Note that inside the callback function definition, we explicitly expect the data type to be CUDA_C_64F (double-precision complex numbers), which applies to the scalar coefficient f(t) set by the callback function in-place.

  1/* Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES.
  2 *
  3 * SPDX-License-Identifier: BSD-3-Clause
  4 */
  5
  6#pragma once
  7
  8#include <cudensitymat.h> // cuDensityMat library header
  9#include "helpers.h"      // GPU helper functions
 10
 11#include <cmath>
 12#include <complex>
 13#include <vector>
 14#include <iostream>
 15#include <cassert>
 16
 17
 18/* DESCRIPTION:
 19   Time-dependent transverse-field Ising Hamiltonian operator
 20   with ordered and fused ZZ terms, plus fused unitary dissipation terms:
 21    H = sum_{i} {h_i * X_i}                // transverse field sum of X_i operators with static h_i coefficients 
 22      + f(t) * sum_{i < j} {g_ij * ZZ_ij}  // modulated sum of the fused ordered {Z_i * Z_j} terms with static g_ij coefficients
 23      + d * sum_{i} {Y_i * {..} * Y_i}     // scaled sum of the dissipation terms {Y_i * {..} * Y_i} fused into the YY_ii super-operators
 24   where {..} is the placeholder for the density matrix to show that the Y_i operators act from different sides.
 25*/
 26
 27/** Define the numerical type and data type for the GPU computations (same) */
 28using NumericalType = std::complex<double>;      // do not change
 29constexpr cudaDataType_t dataType = CUDA_C_64F;  // do not change
 30
 31
 32/** Example of a user-provided scalar CPU callback C function
 33 *  defining a time-dependent coefficient inside the Hamiltonian:
 34 *  f(t) = exp(i * Omega * t) = cos(Omega * t) + i * sin(Omega * t)
 35 */
 36extern "C"
 37int32_t fCoefComplex64(
 38  double time,             //in: time point
 39  int64_t batchSize,       //in: user-defined batch size (number of coefficients in the batch)
 40  int32_t numParams,       //in: number of external user-provided Hamiltonian parameters (this function expects one parameter, Omega)
 41  const double * params,   //in: params[0:numParams-1][0:batchSize-1]: GPU-accessible F-ordered array of user-provided Hamiltonian parameters for all instances of the batch
 42  cudaDataType_t dataType, //in: data type (expecting CUDA_C_64F in this specific callback function)
 43  void * scalarStorage,    //inout: CPU-accessible storage for the returned coefficient value(s) of shape [0:batchSize-1]
 44  cudaStream_t stream)     //in: CUDA stream (default is 0x0)
 45{
 46  if (dataType == CUDA_C_64F) {
 47    auto * tdCoef = static_cast<cuDoubleComplex*>(scalarStorage); // casting to cuDoubleComplex because this callback function expects CUDA_C_64F data type
 48    for (int64_t i = 0; i < batchSize; ++i) {
 49      const auto omega = params[i * numParams + 0]; // params[0][i]: 0-th parameter for i-th instance of the batch
 50      tdCoef[i] = make_cuDoubleComplex(std::cos(omega * time), std::sin(omega * time)); // value of the i-th instance of the coefficients batch
 51    }
 52  } else {
 53    return 1; // error code (1: Error)
 54  }
 55  return 0; // error code (0: Success)
 56}
 57
 58
 59/** Convenience class which encapsulates a user-defined Liouvillian operator (system Hamiltonian + dissipation terms):
 60 *  - Constructor constructs the desired Liouvillian operator (`cudensitymatOperator_t`)
 61 *  - Method `get()` returns a reference to the constructed Liouvillian operator
 62 *  - Destructor releases all resources used by the Liouvillian operator
 63 */
 64class UserDefinedLiouvillian final
 65{
 66private:
 67  // Data members
 68  cudensitymatHandle_t handle;             // library context handle
 69  int64_t stateBatchSize;                  // quantum state batch size
 70  const std::vector<int64_t> spaceShape;   // Hilbert space shape (extents of the modes of the composite Hilbert space)
 71  void * spinXelems {nullptr};             // elements of the X spin operator in GPU RAM (F-order storage)
 72  void * spinYYelems {nullptr};            // elements of the fused YY two-spin operator in GPU RAM (F-order storage)
 73  void * spinZZelems {nullptr};            // elements of the fused ZZ two-spin operator in GPU RAM (F-order storage)
 74  cudensitymatElementaryOperator_t spinX;  // X spin operator (elementary tensor operator)
 75  cudensitymatElementaryOperator_t spinYY; // fused YY two-spin operator (elementary tensor operator)
 76  cudensitymatElementaryOperator_t spinZZ; // fused ZZ two-spin operator (elementary tensor operator)
 77  cudensitymatOperatorTerm_t oneBodyTerm;  // operator term: H1 = sum_{i} {h_i * X_i} (one-body term)
 78  cudensitymatOperatorTerm_t twoBodyTerm;  // operator term: H2 = f(t) * sum_{i < j} {g_ij * ZZ_ij} (two-body term)
 79  cudensitymatOperatorTerm_t noiseTerm;    // operator term: D1 = d * sum_{i} {YY_ii}  // Y_i operators act from different sides on the density matrix (two-body mixed term)
 80  cudensitymatOperator_t liouvillian;      // full operator: (-i * (H1 + H2) * {..}) + (i * {..} * (H1 + H2)) + D1{..} (super-operator)
 81
 82public:
 83
 84  // Constructor constructs a user-defined Liouvillian operator
 85  UserDefinedLiouvillian(cudensitymatHandle_t contextHandle,             // library context handle
 86                         const std::vector<int64_t> & hilbertSpaceShape, // Hilbert space shape
 87                         int64_t batchSize):                             // batch size for the quantum state
 88    handle(contextHandle), stateBatchSize(batchSize), spaceShape(hilbertSpaceShape)
 89  {
 90    // Define the necessary operator tensors in GPU memory (F-order storage!)
 91    spinXelems = createInitializeArrayGPU<NumericalType>(  // X[i0; j0]
 92                  {{0.0, 0.0}, {1.0, 0.0},   // 1st column of matrix X
 93                   {1.0, 0.0}, {0.0, 0.0}}); // 2nd column of matrix X
 94
 95    spinYYelems = createInitializeArrayGPU<NumericalType>(  // YY[i0, i1; j0, j1] := Y[i0; j0] * Y[i1; j1]
 96                    {{0.0, 0.0},  {0.0, 0.0}, {0.0, 0.0}, {-1.0, 0.0},  // 1st column of matrix YY
 97                     {0.0, 0.0},  {0.0, 0.0}, {1.0, 0.0}, {0.0, 0.0},   // 2nd column of matrix YY
 98                     {0.0, 0.0},  {1.0, 0.0}, {0.0, 0.0}, {0.0, 0.0},   // 3rd column of matrix YY
 99                     {-1.0, 0.0}, {0.0, 0.0}, {0.0, 0.0}, {0.0, 0.0}}); // 4th column of matrix YY
100
101    spinZZelems = createInitializeArrayGPU<NumericalType>(  // ZZ[i0, i1; j0, j1] := Z[i0; j0] * Z[i1; j1]
102                    {{1.0, 0.0}, {0.0, 0.0},  {0.0, 0.0},  {0.0, 0.0},   // 1st column of matrix ZZ
103                     {0.0, 0.0}, {-1.0, 0.0}, {0.0, 0.0},  {0.0, 0.0},   // 2nd column of matrix ZZ
104                     {0.0, 0.0}, {0.0, 0.0},  {-1.0, 0.0}, {0.0, 0.0},   // 3rd column of matrix ZZ
105                     {0.0, 0.0}, {0.0, 0.0},  {0.0, 0.0},  {1.0, 0.0}}); // 4th column of matrix ZZ
106
107    // Construct the necessary Elementary Tensor Operators
108    //  X_i operator
109    HANDLE_CUDM_ERROR(cudensitymatCreateElementaryOperator(handle,
110                        1,                                   // one-body operator
111                        std::vector<int64_t>({2}).data(),    // acts in tensor space of shape {2}
112                        CUDENSITYMAT_OPERATOR_SPARSITY_NONE, // dense tensor storage
113                        0,                                   // 0 for dense tensors
114                        nullptr,                             // nullptr for dense tensors
115                        dataType,                            // data type
116                        spinXelems,                          // tensor elements in GPU memory
117                        cudensitymatTensorCallbackNone,      // no tensor callback function (tensor is not time-dependent)
118                        cudensitymatTensorGradientCallbackNone, // no tensor gradient callback function
119                        &spinX));                            // the created elementary tensor operator
120    //  ZZ_ij = Z_i * Z_j fused operator
121    HANDLE_CUDM_ERROR(cudensitymatCreateElementaryOperator(handle,
122                        2,                                   // two-body operator
123                        std::vector<int64_t>({2,2}).data(),  // acts in tensor space of shape {2,2}
124                        CUDENSITYMAT_OPERATOR_SPARSITY_NONE, // dense tensor storage
125                        0,                                   // 0 for dense tensors
126                        nullptr,                             // nullptr for dense tensors
127                        dataType,                            // data type
128                        spinZZelems,                         // tensor elements in GPU memory
129                        cudensitymatTensorCallbackNone,      // no tensor callback function (tensor is not time-dependent)
130                        cudensitymatTensorGradientCallbackNone, // no tensor gradient callback function
131                        &spinZZ));                           // the created elementary tensor operator
132    //  YY_ii = Y_i * {..} * Y_i fused operator (note action from different sides)
133    HANDLE_CUDM_ERROR(cudensitymatCreateElementaryOperator(handle,
134                        2,                                   // two-body operator
135                        std::vector<int64_t>({2,2}).data(),  // acts in tensor space of shape {2,2}
136                        CUDENSITYMAT_OPERATOR_SPARSITY_NONE, // dense tensor storage
137                        0,                                   // 0 for dense tensors
138                        nullptr,                             // nullptr for dense tensors
139                        dataType,                            // data type
140                        spinYYelems,                         // tensor elements in GPU memory
141                        cudensitymatTensorCallbackNone,      // no tensor callback function (tensor is not time-dependent)
142                        cudensitymatTensorGradientCallbackNone, // no tensor gradient callback function
143                        &spinYY));                           // the created elementary tensor operator
144
145    // Construct the necessary Operator Terms from tensor products of Elementary Tensor Operators
146    //  Create an empty operator term
147    HANDLE_CUDM_ERROR(cudensitymatCreateOperatorTerm(handle,
148                        spaceShape.size(),                   // Hilbert space rank (number of modes)
149                        spaceShape.data(),                   // Hilbert space shape (mode extents)
150                        &oneBodyTerm));                      // the created empty operator term
151    //  Define the operator term: H1 = sum_{i} {h_i * X_i}
152    for (int32_t i = 0; i < spaceShape.size(); ++i) {
153      const double h_i = 1.0 / static_cast<double>(i+1); // assign some value to the time-independent h_i coefficient
154      HANDLE_CUDM_ERROR(cudensitymatOperatorTermAppendElementaryProduct(handle,
155                          oneBodyTerm,
156                          1,                                                             // number of elementary tensor operators in the product
157                          std::vector<cudensitymatElementaryOperator_t>({spinX}).data(), // elementary tensor operators forming the product
158                          std::vector<int32_t>({i}).data(),                              // space modes acted on by the operator product
159                          std::vector<int32_t>({0}).data(),                              // space mode action duality (0: from the left; 1: from the right)
160                          make_cuDoubleComplex(h_i, 0.0),                                // h_i constant coefficient: Always 64-bit-precision complex number
161                          cudensitymatScalarCallbackNone,                                // no time-dependent coefficient associated with this operator product
162                          cudensitymatScalarGradientCallbackNone));                      // no coefficient gradient associated with this operator product
163    }
164    //  Create an empty operator term
165    HANDLE_CUDM_ERROR(cudensitymatCreateOperatorTerm(handle,
166                        spaceShape.size(),                   // Hilbert space rank (number of modes)
167                        spaceShape.data(),                   // Hilbert space shape (mode extents)
168                        &twoBodyTerm));                      // the created empty operator term
169    //  Define the operator term: H2 = f(t) * sum_{i < j} {g_ij * ZZ_ij}
170    for (int32_t i = 0; i < spaceShape.size() - 1; ++i) {
171      for (int32_t j = (i + 1); j < spaceShape.size(); ++j) {
172        const double g_ij = -1.0 / static_cast<double>(i + j + 1); // assign some value to the time-independent g_ij coefficient
173        HANDLE_CUDM_ERROR(cudensitymatOperatorTermAppendElementaryProduct(handle,
174                            twoBodyTerm,
175                            1,                                                              // number of elementary tensor operators in the product
176                            std::vector<cudensitymatElementaryOperator_t>({spinZZ}).data(), // elementary tensor operators forming the product
177                            std::vector<int32_t>({i, j}).data(),                            // space modes acted on by the operator product
178                            std::vector<int32_t>({0, 0}).data(),                            // space mode action duality (0: from the left; 1: from the right)
179                            make_cuDoubleComplex(g_ij, 0.0),                                // g_ij constant coefficient: Always 64-bit-precision complex number
180                            cudensitymatScalarCallbackNone,                                 // no time-dependent coefficient associated with this operator product
181                            cudensitymatScalarGradientCallbackNone));                       // no coefficient gradient associated with this operator product
182      }
183    }
184    //  Create an empty operator term
185    HANDLE_CUDM_ERROR(cudensitymatCreateOperatorTerm(handle,
186                        spaceShape.size(),                   // Hilbert space rank (number of modes)
187                        spaceShape.data(),                   // Hilbert space shape (mode extents)
188                        &noiseTerm));                        // the created empty operator term
189    //  Define the operator term: D1 = d * sum_{i} {YY_ii}
190    for (int32_t i = 0; i < spaceShape.size(); ++i) {
191      HANDLE_CUDM_ERROR(cudensitymatOperatorTermAppendElementaryProduct(handle,
192                          noiseTerm,
193                          1,                                                              // number of elementary tensor operators in the product
194                          std::vector<cudensitymatElementaryOperator_t>({spinYY}).data(), // elementary tensor operators forming the product
195                          std::vector<int32_t>({i, i}).data(),                            // space modes acted on by the operator product (from different sides)
196                          std::vector<int32_t>({0, 1}).data(),                            // space mode action duality (0: from the left; 1: from the right)
197                          make_cuDoubleComplex(1.0, 0.0),                                 // default coefficient: Always 64-bit-precision complex number
198                          cudensitymatScalarCallbackNone,                                 // no time-dependent coefficient associated with this operator product
199                          cudensitymatScalarGradientCallbackNone));                       // no coefficient gradient associated with this operator product
200    }
201
202    // Construct the full Liouvillian operator as a sum of the operator terms
203    //  Create an empty operator (super-operator)
204    HANDLE_CUDM_ERROR(cudensitymatCreateOperator(handle,
205                        spaceShape.size(),                // Hilbert space rank (number of modes)
206                        spaceShape.data(),                // Hilbert space shape (modes extents)
207                        &liouvillian));                   // the created empty operator (super-operator)
208    //  Append an operator term to the operator (super-operator)
209    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTerm(handle,
210                        liouvillian,
211                        oneBodyTerm,                      // appended operator term
212                        0,                                // operator term action duality as a whole (0: acting from the left; 1: acting from the right)
213                        make_cuDoubleComplex(0.0, -1.0),  // -i constant
214                        cudensitymatScalarCallbackNone,   // no time-dependent coefficient associated with the operator term as a whole
215                        cudensitymatScalarGradientCallbackNone)); // no coefficient gradient associated with the operator term as a whole
216    //  Append an operator term to the operator (super-operator)
217    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTerm(handle,
218                        liouvillian,
219                        twoBodyTerm,                     // appended operator term
220                        0,                               // operator term action duality as a whole (0: acting from the left; 1: acting from the right)
221                        make_cuDoubleComplex(0.0, -1.0), // -i constant
222                        {fCoefComplex64, CUDENSITYMAT_CALLBACK_DEVICE_CPU, nullptr}, // CPU scalar callback function defining the time-dependent coefficient associated with this operator term as a whole
223                        cudensitymatScalarGradientCallbackNone)); // no coefficient gradient associated with this operator term as a whole
224    //  Append an operator term to the operator (super-operator)
225    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTerm(handle,
226                        liouvillian,
227                        oneBodyTerm,                      // appended operator term
228                        1,                                // operator term action duality as a whole (0: acting from the left; 1: acting from the right)
229                        make_cuDoubleComplex(0.0, 1.0),   // i constant
230                        cudensitymatScalarCallbackNone,   // no time-dependent coefficient associated with the operator term as a whole
231                        cudensitymatScalarGradientCallbackNone)); // no coefficient gradient associated with the operator term as a whole
232    //  Append an operator term to the operator (super-operator)
233    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTerm(handle,
234                        liouvillian,
235                        twoBodyTerm,                     // appended operator term
236                        1,                               // operator term action duality as a whole (0: acting from the left; 1: acting from the right)
237                        make_cuDoubleComplex(0.0, 1.0),  // i constant
238                        {fCoefComplex64, CUDENSITYMAT_CALLBACK_DEVICE_CPU, nullptr}, // CPU scalar callback function defining the time-dependent coefficient associated with this operator term as a whole
239                        cudensitymatScalarGradientCallbackNone)); // no coefficient gradient associated with this operator term as a whole
240    //  Append an operator term to the operator (super-operator)
241    const double d = 0.42; // assign some value to the time-independent coefficient
242    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTerm(handle,
243                        liouvillian,
244                        noiseTerm,                        // appended operator term
245                        0,                                // operator term action duality as a whole (no duality reversing in this case)
246                        make_cuDoubleComplex(d, 0.0),     // constant coefficient associated with the operator term as a whole
247                        cudensitymatScalarCallbackNone,   // no time-dependent coefficient associated with the operator term as a whole
248                        cudensitymatScalarGradientCallbackNone)); // no coefficient gradient associated with the operator term as a whole
249  }
250
251  // Destructor destructs the user-defined Liouvillian operator
252  ~UserDefinedLiouvillian()
253  {
254    // Destroy the Liouvillian operator
255    HANDLE_CUDM_ERROR(cudensitymatDestroyOperator(liouvillian));
256
257    // Destroy operator terms
258    HANDLE_CUDM_ERROR(cudensitymatDestroyOperatorTerm(noiseTerm));
259    HANDLE_CUDM_ERROR(cudensitymatDestroyOperatorTerm(twoBodyTerm));
260    HANDLE_CUDM_ERROR(cudensitymatDestroyOperatorTerm(oneBodyTerm));
261
262    // Destroy elementary tensor operators
263    HANDLE_CUDM_ERROR(cudensitymatDestroyElementaryOperator(spinYY));
264    HANDLE_CUDM_ERROR(cudensitymatDestroyElementaryOperator(spinZZ));
265    HANDLE_CUDM_ERROR(cudensitymatDestroyElementaryOperator(spinX));
266
267    // Destroy operator tensors
268    destroyArrayGPU(spinYYelems);
269    destroyArrayGPU(spinZZelems);
270    destroyArrayGPU(spinXelems);
271  }
272
273  // Disable copy constructor/assignment (GPU resources are private, no deep copy)
274  UserDefinedLiouvillian(const UserDefinedLiouvillian &) = delete;
275  UserDefinedLiouvillian & operator=(const UserDefinedLiouvillian &) = delete;
276  UserDefinedLiouvillian(UserDefinedLiouvillian &&) = delete;
277  UserDefinedLiouvillian & operator=(UserDefinedLiouvillian &&) = delete;
278
279  /** Returns the number of externally provided Hamiltonian parameters. */
280  int32_t getNumParameters() const
281  {
282    return 1; // one parameter Omega
283  }
284
285  /** Get access to the constructed Liouvillian operator. */
286  cudensitymatOperator_t & get()
287  {
288    return liouvillian;
289  }
290
291};

Now we can use this parameterized quantum many-body operator in our main code to compute the action of the operator on a mixed quantum state (density matrix).

  1/* Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES.
  2 *
  3 * SPDX-License-Identifier: BSD-3-Clause
  4 */
  5
  6#include <cudensitymat.h>  // cuDensityMat library header
  7#include "helpers.h"       // helper functions
  8
  9
 10// Transverse Ising Hamiltonian with double summation ordering
 11// and spin-operator fusion, plus fused dissipation terms
 12#include "transverse_ising_full_fused_noisy.h"  // user-defined Liouvillian operator example
 13
 14#include <cmath>
 15#include <complex>
 16#include <vector>
 17#include <chrono>
 18#include <iostream>
 19#include <cassert>
 20
 21
 22// Number of times to perform operator action on a quantum state
 23constexpr int NUM_REPEATS = 2;
 24
 25// Logging verbosity
 26bool verbose = true;
 27
 28
 29// Example workflow
 30void exampleWorkflow(cudensitymatHandle_t handle)
 31{
 32  // Define the composite Hilbert space shape and
 33  // quantum state batch size (number of individual quantum states in a batched simulation)
 34  const std::vector<int64_t> spaceShape({2,2,2,2,2,2,2,2}); // dimensions of quantum degrees of freedom
 35  const int64_t batchSize = 1;                              // number of quantum states per batch (default is 1)
 36
 37  if (verbose) {
 38    std::cout << "Hilbert space rank = " << spaceShape.size() << "; Shape = (";
 39    for (const auto & dimsn: spaceShape)
 40      std::cout << dimsn << ",";
 41    std::cout << ")" << std::endl;
 42    std::cout << "Quantum state batch size = " << batchSize << std::endl;
 43  }
 44
 45  // Construct a user-defined Liouvillian operator using a convenience C++ class
 46  UserDefinedLiouvillian liouvillian(handle, spaceShape, batchSize);
 47  if (verbose)
 48    std::cout << "Constructed the Liouvillian operator\n";
 49
 50  // Set and place external user-provided Hamiltonian parameters in GPU memory
 51  const int32_t numParams = liouvillian.getNumParameters(); // number of external user-provided Hamiltonian parameters
 52  std::vector<double> cpuHamParams(numParams * batchSize);
 53  for (int64_t j = 0; j < batchSize; ++j) {
 54    for (int32_t i = 0; i < numParams; ++i) {
 55      cpuHamParams[j * numParams + i] = double(i+1) / double(j+1); // just setting some parameter values for each instance of the batch
 56    }
 57  }
 58  auto * hamiltonianParams = static_cast<double *>(createInitializeArrayGPU(cpuHamParams));
 59
 60  // Declare the input quantum state
 61  cudensitymatState_t inputState;
 62  HANDLE_CUDM_ERROR(cudensitymatCreateState(handle,
 63                      CUDENSITYMAT_STATE_PURITY_MIXED,  // pure (state vector) or mixed (density matrix) state
 64                      spaceShape.size(),
 65                      spaceShape.data(),
 66                      batchSize,
 67                      dataType,
 68                      &inputState));
 69
 70  // Query the size of the quantum state storage
 71  std::size_t storageSize {0}; // only one storage component (tensor) is needed (no tensor factorization)
 72  HANDLE_CUDM_ERROR(cudensitymatStateGetComponentStorageSize(handle,
 73                      inputState,
 74                      1,               // only one storage component (tensor)
 75                      &storageSize));  // storage size in bytes
 76  const std::size_t stateVolume = storageSize / sizeof(NumericalType);  // quantum state tensor volume (number of elements)
 77  if (verbose)
 78    std::cout << "Quantum state storage size (bytes) = " << storageSize << std::endl;
 79
 80  // Prepare some initial value for the input quantum state batch
 81  std::vector<NumericalType> inputStateValue(stateVolume);
 82  if constexpr (std::is_same_v<NumericalType, float>) {
 83    for (std::size_t i = 0; i < stateVolume; ++i) {
 84      inputStateValue[i] = 1.0f / float(i+1); // just some value
 85    }
 86  } else if constexpr (std::is_same_v<NumericalType, double>) {
 87    for (std::size_t i = 0; i < stateVolume; ++i) {
 88      inputStateValue[i] = 1.0 / double(i+1); // just some value
 89    }
 90  } else if constexpr (std::is_same_v<NumericalType, std::complex<float>>) {
 91    for (std::size_t i = 0; i < stateVolume; ++i) {
 92      inputStateValue[i] = NumericalType{1.0f / float(i+1), -1.0f / float(i+2)}; // just some value
 93    }
 94  } else if constexpr (std::is_same_v<NumericalType, std::complex<double>>) {
 95    for (std::size_t i = 0; i < stateVolume; ++i) {
 96      inputStateValue[i] = NumericalType{1.0 / double(i+1), -1.0 / double(i+2)}; // just some value
 97    }
 98  } else {
 99    std::cerr << "Error: Unsupported data type!\n";
100    std::exit(1);
101  }
102  // Allocate initialized GPU storage for the input quantum state with prepared values
103  auto * inputStateElems = createInitializeArrayGPU(inputStateValue);
104  if (verbose)
105    std::cout << "Allocated input quantum state storage and initialized it to some value\n";
106
107  // Attach initialized GPU storage to the input quantum state
108  HANDLE_CUDM_ERROR(cudensitymatStateAttachComponentStorage(handle,
109                      inputState,
110                      1,                                                 // only one storage component (tensor)
111                      std::vector<void*>({inputStateElems}).data(),      // pointer to the GPU storage for the quantum state
112                      std::vector<std::size_t>({storageSize}).data()));  // size of the GPU storage for the quantum state
113  if (verbose)
114    std::cout << "Constructed input quantum state\n";
115
116  // Declare the output quantum state of the same shape
117  cudensitymatState_t outputState;
118  HANDLE_CUDM_ERROR(cudensitymatCreateState(handle,
119                      CUDENSITYMAT_STATE_PURITY_MIXED,  // pure (state vector) or mixed (density matrix) state
120                      spaceShape.size(),
121                      spaceShape.data(),
122                      batchSize,
123                      dataType,
124                      &outputState));
125
126  // Allocate initialized GPU storage for the output quantum state
127  auto * outputStateElems = createArrayGPU<NumericalType>(stateVolume);
128  if (verbose)
129    std::cout << "Allocated output quantum state storage\n";
130
131  // Attach GPU storage to the output quantum state
132  HANDLE_CUDM_ERROR(cudensitymatStateAttachComponentStorage(handle,
133                      outputState,
134                      1,                                                 // only one storage component (tensor)
135                      std::vector<void*>({outputStateElems}).data(),     // pointer to the GPU storage for the quantum state
136                      std::vector<std::size_t>({storageSize}).data()));  // size of the GPU storage for the quantum state
137  if (verbose)
138    std::cout << "Constructed output quantum state\n";
139
140  // Declare a workspace descriptor
141  cudensitymatWorkspaceDescriptor_t workspaceDescr;
142  HANDLE_CUDM_ERROR(cudensitymatCreateWorkspace(handle, &workspaceDescr));
143
144  // Query free GPU memory
145  std::size_t freeMem = 0, totalMem = 0;
146  HANDLE_CUDA_ERROR(cudaMemGetInfo(&freeMem, &totalMem));
147  freeMem = static_cast<std::size_t>(static_cast<double>(freeMem) * 0.95); // take 95% of the free memory for the workspace buffer
148  if (verbose)
149    std::cout << "Max workspace buffer size (bytes) = " << freeMem << std::endl;
150
151  // Prepare the Liouvillian operator action on a quantum state (needs to be done only once)
152  const auto startTime = std::chrono::high_resolution_clock::now();
153  HANDLE_CUDM_ERROR(cudensitymatOperatorPrepareAction(handle,
154                      liouvillian.get(),
155                      inputState,
156                      outputState,
157                      CUDENSITYMAT_COMPUTE_64F,  // GPU compute type
158                      freeMem,                   // max available GPU free memory for the workspace
159                      workspaceDescr,            // workspace descriptor
160                      0x0));                     // default CUDA stream
161  const auto finishTime = std::chrono::high_resolution_clock::now();
162  const std::chrono::duration<double> timeSec = finishTime - startTime;
163  if (verbose)
164    std::cout << "Operator action preparation time (sec) = " << timeSec.count() << std::endl;
165
166  // Query the required workspace buffer size (bytes)
167  std::size_t requiredBufferSize {0};
168  HANDLE_CUDM_ERROR(cudensitymatWorkspaceGetMemorySize(handle,
169                      workspaceDescr,
170                      CUDENSITYMAT_MEMSPACE_DEVICE,
171                      CUDENSITYMAT_WORKSPACE_SCRATCH,
172                      &requiredBufferSize));
173  if (verbose)
174    std::cout << "Required workspace buffer size (bytes) = " << requiredBufferSize << std::endl;
175
176  // Allocate GPU storage for the workspace buffer
177  const std::size_t bufferVolume = requiredBufferSize / sizeof(NumericalType);
178  auto * workspaceBuffer = createArrayGPU<NumericalType>(bufferVolume);
179  if (verbose)
180    std::cout << "Allocated workspace buffer of size (bytes) = " << requiredBufferSize << std::endl;
181
182  // Attach the workspace buffer to the workspace descriptor
183  HANDLE_CUDM_ERROR(cudensitymatWorkspaceSetMemory(handle,
184                      workspaceDescr,
185                      CUDENSITYMAT_MEMSPACE_DEVICE,
186                      CUDENSITYMAT_WORKSPACE_SCRATCH,
187                      workspaceBuffer,
188                      requiredBufferSize));
189  if (verbose)
190    std::cout << "Attached workspace buffer of size (bytes) = " << requiredBufferSize << std::endl;
191
192  // Apply the Liouvillian operator to the input quatum state
193  // and accumulate its action into the output quantum state (note accumulative += semantics)
194  for (int32_t repeat = 0; repeat < NUM_REPEATS; ++repeat) { // repeat multiple times for accurate timing
195    // Zero out the output quantum state
196    HANDLE_CUDM_ERROR(cudensitymatStateInitializeZero(handle,
197                        outputState,
198                        0x0));
199    if (verbose)
200      std::cout << "Initialized the output quantum state to zero\n";
201    HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
202    const auto startTime = std::chrono::high_resolution_clock::now();
203    HANDLE_CUDM_ERROR(cudensitymatOperatorComputeAction(handle,
204                        liouvillian.get(),
205                        0.3,                                   // time point (some value)
206                        batchSize,                             // user-defined batch size
207                        numParams,                             // number of external user-defined Hamiltonian parameters
208                        hamiltonianParams,                     // external Hamiltonian parameters in GPU memory
209                        inputState,                            // input quantum state
210                        outputState,                           // output quantum state
211                        workspaceDescr,                        // workspace descriptor
212                        0x0));                                 // default CUDA stream
213    HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
214    const auto finishTime = std::chrono::high_resolution_clock::now();
215    const std::chrono::duration<double> timeSec = finishTime - startTime;
216    if (verbose)
217      std::cout << "Operator action computation time (sec) = " << timeSec.count() << std::endl;
218  }
219
220  // Compute the squared norm of the output quantum state
221  void * norm2 = createInitializeArrayGPU(std::vector<double>(batchSize, 0.0));
222  HANDLE_CUDM_ERROR(cudensitymatStateComputeNorm(handle,
223                      outputState,
224                      norm2,
225                      0x0));
226  if (verbose) {
227    std::cout << "Computed the output quantum state norm:\n";
228    printArrayGPU<double>(norm2, batchSize);
229  }
230
231  HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
232
233  // Destroy the norm2 array
234  destroyArrayGPU(norm2);
235
236  // Destroy workspace descriptor
237  HANDLE_CUDM_ERROR(cudensitymatDestroyWorkspace(workspaceDescr));
238
239  // Destroy workspace buffer storage
240  destroyArrayGPU(workspaceBuffer);
241
242  // Destroy quantum states
243  HANDLE_CUDM_ERROR(cudensitymatDestroyState(outputState));
244  HANDLE_CUDM_ERROR(cudensitymatDestroyState(inputState));
245
246  // Destroy quantum state storage
247  destroyArrayGPU(outputStateElems);
248  destroyArrayGPU(inputStateElems);
249
250  // Destroy external Hamiltonian parameters
251  destroyArrayGPU(static_cast<void *>(hamiltonianParams));
252
253  if (verbose)
254    std::cout << "Destroyed resources\n" << std::flush;
255}
256
257
258int main(int argc, char ** argv)
259{
260  // Assign a GPU to the process
261  HANDLE_CUDA_ERROR(cudaSetDevice(0));
262  if (verbose)
263    std::cout << "Set active device\n";
264
265  // Create a library handle
266  cudensitymatHandle_t handle;
267  HANDLE_CUDM_ERROR(cudensitymatCreate(&handle));
268  if (verbose)
269    std::cout << "Created a library handle\n";
270
271  // Run the example
272  exampleWorkflow(handle);
273
274  // Destroy the library handle
275  HANDLE_CUDM_ERROR(cudensitymatDestroy(handle));
276  if (verbose)
277    std::cout << "Destroyed the library handle\n";
278
279  HANDLE_CUDA_ERROR(cudaDeviceReset());
280
281  // Done
282  return 0;
283}

Code example (parallel execution on multiple GPUs)#

It is straightforward to adapt the main serial code and enable parallel execution across multiple/many GPU devices (across multiple/many nodes). Two distributed communication backends are supported: MPI and NCCL (experimental).

MPI backend#

We will illustrate parallel execution with an example using the Message Passing Interface (MPI) as the communication layer. Below we show the minor additions that need to be made in order to enable distributed parallel execution without making any changes to the original serial source code.

The full sample code can be found in the NVIDIA/cuQuantum repository (main MPI code and operator definition as well as the utility code).

Here is the updated main code for multi-GPU runs using MPI.

  1/* Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES.
  2 *
  3 * SPDX-License-Identifier: BSD-3-Clause
  4 */
  5
  6#include <cudensitymat.h>  // cuDensityMat library header
  7#include "helpers.h"       // helper functions
  8
  9
 10// Transverse Ising Hamiltonian with double summation ordering
 11// and spin-operator fusion, plus fused dissipation terms
 12#include "transverse_ising_full_fused_noisy.h"  // user-defined Liouvillian operator example
 13
 14
 15// MPI library (optional)
 16#ifdef MPI_ENABLED
 17#include <mpi.h>
 18#endif
 19
 20#include <cmath>
 21#include <complex>
 22#include <vector>
 23#include <chrono>
 24#include <iostream>
 25#include <cassert>
 26
 27
 28// Number of times to perform operator action on a quantum state
 29constexpr int NUM_REPEATS = 2;
 30
 31// Logging verbosity
 32bool verbose = true;
 33
 34
 35// Example workflow
 36void exampleWorkflow(cudensitymatHandle_t handle)
 37{
 38  // Define the composite Hilbert space shape and
 39  // quantum state batch size (number of individual quantum states in a batched simulation)
 40  const std::vector<int64_t> spaceShape({2,2,2,2,2,2,2,2}); // dimensions of quantum degrees of freedom
 41  const int64_t batchSize = 1;                              // number of quantum states per batch (default is 1)
 42
 43  if (verbose) {
 44    std::cout << "Hilbert space rank = " << spaceShape.size() << "; Shape = (";
 45    for (const auto & dimsn: spaceShape)
 46      std::cout << dimsn << ",";
 47    std::cout << ")" << std::endl;
 48    std::cout << "Quantum state batch size = " << batchSize << std::endl;
 49  }
 50
 51  // Construct a user-defined Liouvillian operator using a convenience C++ class
 52  UserDefinedLiouvillian liouvillian(handle, spaceShape, batchSize);
 53  if (verbose)
 54    std::cout << "Constructed the Liouvillian operator\n";
 55
 56  // Set and place external user-provided Hamiltonian parameters in GPU memory
 57  const int32_t numParams = liouvillian.getNumParameters(); // number of external user-provided Hamiltonian parameters
 58  std::vector<double> cpuHamParams(numParams * batchSize);
 59  for (int64_t j = 0; j < batchSize; ++j) {
 60    for (int32_t i = 0; i < numParams; ++i) {
 61      cpuHamParams[j * numParams + i] = double(i+1) / double(j+1); // just setting some parameter values for each instance of the batch
 62    }
 63  }
 64  auto * hamiltonianParams = static_cast<double *>(createInitializeArrayGPU(cpuHamParams));
 65
 66  // Declare the input quantum state
 67  cudensitymatState_t inputState;
 68  HANDLE_CUDM_ERROR(cudensitymatCreateState(handle,
 69                      CUDENSITYMAT_STATE_PURITY_MIXED,  // pure (state vector) or mixed (density matrix) state
 70                      spaceShape.size(),
 71                      spaceShape.data(),
 72                      batchSize,
 73                      dataType,
 74                      &inputState));
 75
 76  // Query the size of the quantum state storage
 77  std::size_t storageSize {0}; // only one storage component (tensor) is needed (no tensor factorization)
 78  HANDLE_CUDM_ERROR(cudensitymatStateGetComponentStorageSize(handle,
 79                      inputState,
 80                      1,               // only one storage component (tensor)
 81                      &storageSize));  // storage size in bytes
 82  const std::size_t stateVolume = storageSize / sizeof(NumericalType);  // quantum state tensor volume (number of elements)
 83  if (verbose)
 84    std::cout << "Quantum state storage size (bytes) = " << storageSize << std::endl;
 85
 86  // Prepare some initial value for the input quantum state batch
 87  std::vector<NumericalType> inputStateValue(stateVolume);
 88  if constexpr (std::is_same_v<NumericalType, float>) {
 89    for (std::size_t i = 0; i < stateVolume; ++i) {
 90      inputStateValue[i] = 1.0f / float(i+1); // just some value
 91    }
 92  } else if constexpr (std::is_same_v<NumericalType, double>) {
 93    for (std::size_t i = 0; i < stateVolume; ++i) {
 94      inputStateValue[i] = 1.0 / double(i+1); // just some value
 95    }
 96  } else if constexpr (std::is_same_v<NumericalType, std::complex<float>>) {
 97    for (std::size_t i = 0; i < stateVolume; ++i) {
 98      inputStateValue[i] = NumericalType{1.0f / float(i+1), -1.0f / float(i+2)}; // just some value
 99    }
100  } else if constexpr (std::is_same_v<NumericalType, std::complex<double>>) {
101    for (std::size_t i = 0; i < stateVolume; ++i) {
102      inputStateValue[i] = NumericalType{1.0 / double(i+1), -1.0 / double(i+2)}; // just some value
103    }
104  } else {
105    std::cerr << "Error: Unsupported data type!\n";
106    std::exit(1);
107  }
108  // Allocate initialized GPU storage for the input quantum state with prepared values
109  auto * inputStateElems = createInitializeArrayGPU(inputStateValue);
110  if (verbose)
111    std::cout << "Allocated input quantum state storage and initialized it to some value\n";
112
113  // Attach initialized GPU storage to the input quantum state
114  HANDLE_CUDM_ERROR(cudensitymatStateAttachComponentStorage(handle,
115                      inputState,
116                      1,                                                 // only one storage component (tensor)
117                      std::vector<void*>({inputStateElems}).data(),      // pointer to the GPU storage for the quantum state
118                      std::vector<std::size_t>({storageSize}).data()));  // size of the GPU storage for the quantum state
119  if (verbose)
120    std::cout << "Constructed input quantum state\n";
121
122  // Declare the output quantum state of the same shape
123  cudensitymatState_t outputState;
124  HANDLE_CUDM_ERROR(cudensitymatCreateState(handle,
125                      CUDENSITYMAT_STATE_PURITY_MIXED,  // pure (state vector) or mixed (density matrix) state
126                      spaceShape.size(),
127                      spaceShape.data(),
128                      batchSize,
129                      dataType,
130                      &outputState));
131
132  // Allocate initialized GPU storage for the output quantum state
133  auto * outputStateElems = createArrayGPU<NumericalType>(stateVolume);
134  if (verbose)
135    std::cout << "Allocated output quantum state storage\n";
136
137  // Attach GPU storage to the output quantum state
138  HANDLE_CUDM_ERROR(cudensitymatStateAttachComponentStorage(handle,
139                      outputState,
140                      1,                                                 // only one storage component (tensor)
141                      std::vector<void*>({outputStateElems}).data(),     // pointer to the GPU storage for the quantum state
142                      std::vector<std::size_t>({storageSize}).data()));  // size of the GPU storage for the quantum state
143  if (verbose)
144    std::cout << "Constructed output quantum state\n";
145
146  // Declare a workspace descriptor
147  cudensitymatWorkspaceDescriptor_t workspaceDescr;
148  HANDLE_CUDM_ERROR(cudensitymatCreateWorkspace(handle, &workspaceDescr));
149
150  // Query free GPU memory
151  std::size_t freeMem = 0, totalMem = 0;
152  HANDLE_CUDA_ERROR(cudaMemGetInfo(&freeMem, &totalMem));
153  freeMem = static_cast<std::size_t>(static_cast<double>(freeMem) * 0.95); // take 95% of the free memory for the workspace buffer
154  if (verbose)
155    std::cout << "Max workspace buffer size (bytes) = " << freeMem << std::endl;
156
157  // Prepare the Liouvillian operator action on a quantum state (needs to be done only once)
158  const auto startTime = std::chrono::high_resolution_clock::now();
159  HANDLE_CUDM_ERROR(cudensitymatOperatorPrepareAction(handle,
160                      liouvillian.get(),
161                      inputState,
162                      outputState,
163                      CUDENSITYMAT_COMPUTE_64F,  // GPU compute type
164                      freeMem,                   // max available GPU free memory for the workspace
165                      workspaceDescr,            // workspace descriptor
166                      0x0));                     // default CUDA stream
167  const auto finishTime = std::chrono::high_resolution_clock::now();
168  const std::chrono::duration<double> timeSec = finishTime - startTime;
169  if (verbose)
170    std::cout << "Operator action preparation time (sec) = " << timeSec.count() << std::endl;
171
172  // Query the required workspace buffer size (bytes)
173  std::size_t requiredBufferSize {0};
174  HANDLE_CUDM_ERROR(cudensitymatWorkspaceGetMemorySize(handle,
175                      workspaceDescr,
176                      CUDENSITYMAT_MEMSPACE_DEVICE,
177                      CUDENSITYMAT_WORKSPACE_SCRATCH,
178                      &requiredBufferSize));
179  if (verbose)
180    std::cout << "Required workspace buffer size (bytes) = " << requiredBufferSize << std::endl;
181
182  // Allocate GPU storage for the workspace buffer
183  const std::size_t bufferVolume = requiredBufferSize / sizeof(NumericalType);
184  auto * workspaceBuffer = createArrayGPU<NumericalType>(bufferVolume);
185  if (verbose)
186    std::cout << "Allocated workspace buffer of size (bytes) = " << requiredBufferSize << std::endl;
187
188  // Attach the workspace buffer to the workspace descriptor
189  HANDLE_CUDM_ERROR(cudensitymatWorkspaceSetMemory(handle,
190                      workspaceDescr,
191                      CUDENSITYMAT_MEMSPACE_DEVICE,
192                      CUDENSITYMAT_WORKSPACE_SCRATCH,
193                      workspaceBuffer,
194                      requiredBufferSize));
195  if (verbose)
196    std::cout << "Attached workspace buffer of size (bytes) = " << requiredBufferSize << std::endl;
197
198  // Apply the Liouvillian operator to the input quatum state
199  // and accumulate its action into the output quantum state (note accumulative += semantics)
200  for (int32_t repeat = 0; repeat < NUM_REPEATS; ++repeat) { // repeat multiple times for accurate timing
201    // Zero out the output quantum state
202    HANDLE_CUDM_ERROR(cudensitymatStateInitializeZero(handle,
203                        outputState,
204                        0x0));
205    if (verbose)
206      std::cout << "Initialized the output quantum state to zero\n";
207    HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
208    const auto startTime = std::chrono::high_resolution_clock::now();
209    HANDLE_CUDM_ERROR(cudensitymatOperatorComputeAction(handle,
210                        liouvillian.get(),
211                        0.3,                                   // time point (some value)
212                        batchSize,                             // user-defined batch size
213                        numParams,                             // number of external user-defined Hamiltonian parameters
214                        hamiltonianParams,                     // external Hamiltonian parameters in GPU memory
215                        inputState,                            // input quantum state
216                        outputState,                           // output quantum state
217                        workspaceDescr,                        // workspace descriptor
218                        0x0));                                 // default CUDA stream
219    HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
220    const auto finishTime = std::chrono::high_resolution_clock::now();
221    const std::chrono::duration<double> timeSec = finishTime - startTime;
222    if (verbose)
223      std::cout << "Operator action computation time (sec) = " << timeSec.count() << std::endl;
224  }
225
226  // Compute the squared norm of the output quantum state
227  void * norm2 = createInitializeArrayGPU(std::vector<double>(batchSize, 0.0));
228  HANDLE_CUDM_ERROR(cudensitymatStateComputeNorm(handle,
229                      outputState,
230                      norm2,
231                      0x0));
232  if (verbose) {
233    std::cout << "Computed the output quantum state norm:\n";
234    printArrayGPU<double>(norm2, batchSize);
235  }
236
237  HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
238
239  // Destroy the norm2 array
240  destroyArrayGPU(norm2);
241
242  // Destroy workspace descriptor
243  HANDLE_CUDM_ERROR(cudensitymatDestroyWorkspace(workspaceDescr));
244
245  // Destroy workspace buffer storage
246  destroyArrayGPU(workspaceBuffer);
247
248  // Destroy quantum states
249  HANDLE_CUDM_ERROR(cudensitymatDestroyState(outputState));
250  HANDLE_CUDM_ERROR(cudensitymatDestroyState(inputState));
251
252  // Destroy quantum state storage
253  destroyArrayGPU(outputStateElems);
254  destroyArrayGPU(inputStateElems);
255
256  // Destroy external Hamiltonian parameters
257  destroyArrayGPU(static_cast<void *>(hamiltonianParams));
258
259  if (verbose)
260    std::cout << "Destroyed resources\n" << std::flush;
261}
262
263
264int main(int argc, char ** argv)
265{
266  // Initialize MPI library (if needed)
267#ifdef MPI_ENABLED
268  HANDLE_MPI_ERROR(MPI_Init(&argc, &argv));
269  int procRank {-1};
270  HANDLE_MPI_ERROR(MPI_Comm_rank(MPI_COMM_WORLD, &procRank));
271  int numProcs {0};
272  HANDLE_MPI_ERROR(MPI_Comm_size(MPI_COMM_WORLD, &numProcs));
273  if (procRank != 0) verbose = false;
274  if (verbose)
275    std::cout << "Initialized MPI library\n";
276#else
277  const int procRank {0};
278  const int numProcs {1};
279#endif
280
281  // Assign a GPU to the process
282  int numDevices {0};
283  HANDLE_CUDA_ERROR(cudaGetDeviceCount(&numDevices));
284  const int deviceId = procRank % numDevices;
285  HANDLE_CUDA_ERROR(cudaSetDevice(deviceId));
286  if (verbose)
287    std::cout << "Set active device\n";
288
289  // Create a library handle
290  cudensitymatHandle_t handle;
291  HANDLE_CUDM_ERROR(cudensitymatCreate(&handle));
292  if (verbose)
293    std::cout << "Created a library handle\n";
294
295  // Reset distributed configuration (once)
296#ifdef MPI_ENABLED
297  MPI_Comm comm;
298  HANDLE_MPI_ERROR(MPI_Comm_dup(MPI_COMM_WORLD, &comm));
299  HANDLE_CUDM_ERROR(cudensitymatResetDistributedConfiguration(handle,
300                      CUDENSITYMAT_DISTRIBUTED_PROVIDER_MPI,
301                      &comm, sizeof(comm)));
302#endif
303
304  // Run the example
305  exampleWorkflow(handle);
306
307  // Synchronize MPI processes
308#ifdef MPI_ENABLED
309  HANDLE_MPI_ERROR(MPI_Barrier(MPI_COMM_WORLD));
310#endif
311
312  // Destroy the library handle
313  HANDLE_CUDM_ERROR(cudensitymatDestroy(handle));
314  if (verbose)
315    std::cout << "Destroyed the library handle\n";
316
317  HANDLE_CUDA_ERROR(cudaDeviceReset());
318
319  // Finalize the MPI library
320#ifdef MPI_ENABLED
321  HANDLE_MPI_ERROR(MPI_Finalize());
322  if (verbose)
323    std::cout << "Finalized MPI library\n";
324#endif
325
326  // Done
327  return 0;
328}

NCCL backend (experimental)#

NCCL (NVIDIA Collective Communications Library) can provide better performance for GPU-to-GPU communication, especially within a single node with NVLink connectivity. Below we show an example using NCCL as the communication layer. Note that the NCCL backend is currently experimental. MPI is used for process spawning and bootstrapping (e.g., broadcasting the ncclUniqueId), but all GPU-to-GPU communication uses NCCL.

The full sample code can be found in the NVIDIA/cuQuantum repository (main NCCL code and operator definition as well as the utility code).

Here is the main code for multi-GPU runs using NCCL.

  1/* Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES.
  2 *
  3 * SPDX-License-Identifier: BSD-3-Clause
  4 */
  5
  6#include <cudensitymat.h>  // cuDensityMat library header
  7#include "helpers.h"       // helper functions
  8
  9
 10// Transverse Ising Hamiltonian with double summation ordering
 11// and spin-operator fusion, plus fused dissipation terms
 12#include "transverse_ising_full_fused_noisy.h"  // user-defined Liouvillian operator example
 13
 14
 15// NCCL library (required for this example)
 16#ifdef NCCL_ENABLED
 17#include <nccl.h>
 18#endif
 19
 20// MPI library (used for bootstrapping NCCL communicator)
 21#ifdef MPI_ENABLED
 22#include <mpi.h>
 23#endif
 24
 25#include <cmath>
 26#include <complex>
 27#include <vector>
 28#include <chrono>
 29#include <iostream>
 30#include <cassert>
 31
 32
 33// Number of times to perform operator action on a quantum state
 34constexpr int NUM_REPEATS = 2;
 35
 36// Logging verbosity
 37bool verbose = true;
 38
 39
 40#ifdef NCCL_ENABLED
 41// Error handling macro for NCCL
 42#define HANDLE_NCCL_ERROR(x)                                 \
 43{                                                            \
 44  const ncclResult_t err = x;                                \
 45  if (err != ncclSuccess)                                    \
 46  {                                                          \
 47    printf("NCCL Error: %s in line %d\n",                    \
 48           ncclGetErrorString(err), __LINE__);               \
 49    fflush(stdout);                                          \
 50    std::abort();                                            \
 51  }                                                          \
 52};
 53#endif
 54
 55
 56// Example workflow
 57void exampleWorkflow(cudensitymatHandle_t handle)
 58{
 59  // Define the composite Hilbert space shape and
 60  // quantum state batch size (number of individual quantum states in a batched simulation)
 61  const std::vector<int64_t> spaceShape({2,2,2,2,2,2,2,2}); // dimensions of quantum degrees of freedom
 62  const int64_t batchSize = 1;                              // number of quantum states per batch (default is 1)
 63
 64  if (verbose) {
 65    std::cout << "Hilbert space rank = " << spaceShape.size() << "; Shape = (";
 66    for (const auto & dimsn: spaceShape)
 67      std::cout << dimsn << ",";
 68    std::cout << ")" << std::endl;
 69    std::cout << "Quantum state batch size = " << batchSize << std::endl;
 70  }
 71
 72  // Construct a user-defined Liouvillian operator using a convenience C++ class
 73  UserDefinedLiouvillian liouvillian(handle, spaceShape, batchSize);
 74  if (verbose)
 75    std::cout << "Constructed the Liouvillian operator\n";
 76
 77  // Set and place external user-provided Hamiltonian parameters in GPU memory
 78  const int32_t numParams = liouvillian.getNumParameters(); // number of external user-provided Hamiltonian parameters
 79  std::vector<double> cpuHamParams(numParams * batchSize);
 80  for (int64_t j = 0; j < batchSize; ++j) {
 81    for (int32_t i = 0; i < numParams; ++i) {
 82      cpuHamParams[j * numParams + i] = double(i+1) / double(j+1); // just setting some parameter values for each instance of the batch
 83    }
 84  }
 85  auto * hamiltonianParams = static_cast<double *>(createInitializeArrayGPU(cpuHamParams));
 86
 87  // Declare the input quantum state
 88  cudensitymatState_t inputState;
 89  HANDLE_CUDM_ERROR(cudensitymatCreateState(handle,
 90                      CUDENSITYMAT_STATE_PURITY_MIXED,  // pure (state vector) or mixed (density matrix) state
 91                      spaceShape.size(),
 92                      spaceShape.data(),
 93                      batchSize,
 94                      dataType,
 95                      &inputState));
 96
 97  // Query the size of the quantum state storage
 98  std::size_t storageSize {0}; // only one storage component (tensor) is needed (no tensor factorization)
 99  HANDLE_CUDM_ERROR(cudensitymatStateGetComponentStorageSize(handle,
100                      inputState,
101                      1,               // only one storage component (tensor)
102                      &storageSize));  // storage size in bytes
103  const std::size_t stateVolume = storageSize / sizeof(NumericalType);  // quantum state tensor volume (number of elements)
104  if (verbose)
105    std::cout << "Quantum state storage size (bytes) = " << storageSize << std::endl;
106
107  // Prepare some initial value for the input quantum state batch
108  std::vector<NumericalType> inputStateValue(stateVolume);
109  if constexpr (std::is_same_v<NumericalType, float>) {
110    for (std::size_t i = 0; i < stateVolume; ++i) {
111      inputStateValue[i] = 1.0f / float(i+1); // just some value
112    }
113  } else if constexpr (std::is_same_v<NumericalType, double>) {
114    for (std::size_t i = 0; i < stateVolume; ++i) {
115      inputStateValue[i] = 1.0 / double(i+1); // just some value
116    }
117  } else if constexpr (std::is_same_v<NumericalType, std::complex<float>>) {
118    for (std::size_t i = 0; i < stateVolume; ++i) {
119      inputStateValue[i] = NumericalType{1.0f / float(i+1), -1.0f / float(i+2)}; // just some value
120    }
121  } else if constexpr (std::is_same_v<NumericalType, std::complex<double>>) {
122    for (std::size_t i = 0; i < stateVolume; ++i) {
123      inputStateValue[i] = NumericalType{1.0 / double(i+1), -1.0 / double(i+2)}; // just some value
124    }
125  } else {
126    std::cerr << "Error: Unsupported data type!\n";
127    std::exit(1);
128  }
129  // Allocate initialized GPU storage for the input quantum state with prepared values
130  auto * inputStateElems = createInitializeArrayGPU(inputStateValue);
131  if (verbose)
132    std::cout << "Allocated input quantum state storage and initialized it to some value\n";
133
134  // Attach initialized GPU storage to the input quantum state
135  HANDLE_CUDM_ERROR(cudensitymatStateAttachComponentStorage(handle,
136                      inputState,
137                      1,                                                 // only one storage component (tensor)
138                      std::vector<void*>({inputStateElems}).data(),      // pointer to the GPU storage for the quantum state
139                      std::vector<std::size_t>({storageSize}).data()));  // size of the GPU storage for the quantum state
140  if (verbose)
141    std::cout << "Constructed input quantum state\n";
142
143  // Declare the output quantum state of the same shape
144  cudensitymatState_t outputState;
145  HANDLE_CUDM_ERROR(cudensitymatCreateState(handle,
146                      CUDENSITYMAT_STATE_PURITY_MIXED,  // pure (state vector) or mixed (density matrix) state
147                      spaceShape.size(),
148                      spaceShape.data(),
149                      batchSize,
150                      dataType,
151                      &outputState));
152
153  // Allocate initialized GPU storage for the output quantum state
154  auto * outputStateElems = createArrayGPU<NumericalType>(stateVolume);
155  if (verbose)
156    std::cout << "Allocated output quantum state storage\n";
157
158  // Attach GPU storage to the output quantum state
159  HANDLE_CUDM_ERROR(cudensitymatStateAttachComponentStorage(handle,
160                      outputState,
161                      1,                                                 // only one storage component (tensor)
162                      std::vector<void*>({outputStateElems}).data(),     // pointer to the GPU storage for the quantum state
163                      std::vector<std::size_t>({storageSize}).data()));  // size of the GPU storage for the quantum state
164  if (verbose)
165    std::cout << "Constructed output quantum state\n";
166
167  // Declare a workspace descriptor
168  cudensitymatWorkspaceDescriptor_t workspaceDescr;
169  HANDLE_CUDM_ERROR(cudensitymatCreateWorkspace(handle, &workspaceDescr));
170
171  // Query free GPU memory
172  std::size_t freeMem = 0, totalMem = 0;
173  HANDLE_CUDA_ERROR(cudaMemGetInfo(&freeMem, &totalMem));
174  freeMem = static_cast<std::size_t>(static_cast<double>(freeMem) * 0.95); // take 95% of the free memory for the workspace buffer
175  if (verbose)
176    std::cout << "Max workspace buffer size (bytes) = " << freeMem << std::endl;
177
178  // Prepare the Liouvillian operator action on a quantum state (needs to be done only once)
179  const auto startTime = std::chrono::high_resolution_clock::now();
180  HANDLE_CUDM_ERROR(cudensitymatOperatorPrepareAction(handle,
181                      liouvillian.get(),
182                      inputState,
183                      outputState,
184                      CUDENSITYMAT_COMPUTE_64F,  // GPU compute type
185                      freeMem,                   // max available GPU free memory for the workspace
186                      workspaceDescr,            // workspace descriptor
187                      0x0));                     // default CUDA stream
188  const auto finishTime = std::chrono::high_resolution_clock::now();
189  const std::chrono::duration<double> timeSec = finishTime - startTime;
190  if (verbose)
191    std::cout << "Operator action preparation time (sec) = " << timeSec.count() << std::endl;
192
193  // Query the required workspace buffer size (bytes)
194  std::size_t requiredBufferSize {0};
195  HANDLE_CUDM_ERROR(cudensitymatWorkspaceGetMemorySize(handle,
196                      workspaceDescr,
197                      CUDENSITYMAT_MEMSPACE_DEVICE,
198                      CUDENSITYMAT_WORKSPACE_SCRATCH,
199                      &requiredBufferSize));
200  if (verbose)
201    std::cout << "Required workspace buffer size (bytes) = " << requiredBufferSize << std::endl;
202
203  // Allocate GPU storage for the workspace buffer
204  const std::size_t bufferVolume = requiredBufferSize / sizeof(NumericalType);
205  auto * workspaceBuffer = createArrayGPU<NumericalType>(bufferVolume);
206  if (verbose)
207    std::cout << "Allocated workspace buffer of size (bytes) = " << requiredBufferSize << std::endl;
208
209  // Attach the workspace buffer to the workspace descriptor
210  HANDLE_CUDM_ERROR(cudensitymatWorkspaceSetMemory(handle,
211                      workspaceDescr,
212                      CUDENSITYMAT_MEMSPACE_DEVICE,
213                      CUDENSITYMAT_WORKSPACE_SCRATCH,
214                      workspaceBuffer,
215                      requiredBufferSize));
216  if (verbose)
217    std::cout << "Attached workspace buffer of size (bytes) = " << requiredBufferSize << std::endl;
218
219  // Apply the Liouvillian operator to the input quantum state
220  // and accumulate its action into the output quantum state (note accumulative += semantics)
221  for (int32_t repeat = 0; repeat < NUM_REPEATS; ++repeat) { // repeat multiple times for accurate timing
222    // Zero out the output quantum state
223    HANDLE_CUDM_ERROR(cudensitymatStateInitializeZero(handle,
224                        outputState,
225                        0x0));
226    if (verbose)
227      std::cout << "Initialized the output quantum state to zero\n";
228    HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
229    const auto startTime = std::chrono::high_resolution_clock::now();
230    HANDLE_CUDM_ERROR(cudensitymatOperatorComputeAction(handle,
231                        liouvillian.get(),
232                        0.3,                                   // time point (some value)
233                        batchSize,                             // user-defined batch size
234                        numParams,                             // number of external user-defined Hamiltonian parameters
235                        hamiltonianParams,                     // external Hamiltonian parameters in GPU memory
236                        inputState,                            // input quantum state
237                        outputState,                           // output quantum state
238                        workspaceDescr,                        // workspace descriptor
239                        0x0));                                 // default CUDA stream
240    HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
241    const auto finishTime = std::chrono::high_resolution_clock::now();
242    const std::chrono::duration<double> timeSec = finishTime - startTime;
243    if (verbose)
244      std::cout << "Operator action computation time (sec) = " << timeSec.count() << std::endl;
245  }
246
247  // Compute the squared norm of the output quantum state
248  void * norm2 = createInitializeArrayGPU(std::vector<double>(batchSize, 0.0));
249  HANDLE_CUDM_ERROR(cudensitymatStateComputeNorm(handle,
250                      outputState,
251                      norm2,
252                      0x0));
253  if (verbose) {
254    std::cout << "Computed the output quantum state norm:\n";
255    printArrayGPU<double>(norm2, batchSize);
256  }
257
258  HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
259
260  // Destroy the norm2 array
261  destroyArrayGPU(norm2);
262
263  // Destroy workspace descriptor
264  HANDLE_CUDM_ERROR(cudensitymatDestroyWorkspace(workspaceDescr));
265
266  // Destroy workspace buffer storage
267  destroyArrayGPU(workspaceBuffer);
268
269  // Destroy quantum states
270  HANDLE_CUDM_ERROR(cudensitymatDestroyState(outputState));
271  HANDLE_CUDM_ERROR(cudensitymatDestroyState(inputState));
272
273  // Destroy quantum state storage
274  destroyArrayGPU(outputStateElems);
275  destroyArrayGPU(inputStateElems);
276
277  // Destroy external Hamiltonian parameters
278  destroyArrayGPU(static_cast<void *>(hamiltonianParams));
279
280  if (verbose)
281    std::cout << "Destroyed resources\n" << std::flush;
282}
283
284
285int main(int argc, char ** argv)
286{
287#if defined(NCCL_ENABLED) && defined(MPI_ENABLED)
288  // Initialize MPI library (used to bootstrap NCCL)
289  HANDLE_MPI_ERROR(MPI_Init(&argc, &argv));
290  int procRank {-1};
291  HANDLE_MPI_ERROR(MPI_Comm_rank(MPI_COMM_WORLD, &procRank));
292  int numProcs {0};
293  HANDLE_MPI_ERROR(MPI_Comm_size(MPI_COMM_WORLD, &numProcs));
294  if (procRank != 0) verbose = false;
295  if (verbose)
296    std::cout << "Initialized MPI library (for NCCL bootstrap)\n";
297
298  // Assign a GPU to the process
299  int numDevices {0};
300  HANDLE_CUDA_ERROR(cudaGetDeviceCount(&numDevices));
301  const int deviceId = procRank % numDevices;
302  HANDLE_CUDA_ERROR(cudaSetDevice(deviceId));
303  if (verbose)
304    std::cout << "Set active device to GPU " << deviceId << "\n";
305
306  // Initialize NCCL communicator
307  // Step 1: Generate unique ID on rank 0 and broadcast to all ranks
308  ncclUniqueId ncclId;
309  if (procRank == 0) {
310    HANDLE_NCCL_ERROR(ncclGetUniqueId(&ncclId));
311  }
312  HANDLE_MPI_ERROR(MPI_Bcast(&ncclId, sizeof(ncclId), MPI_BYTE, 0, MPI_COMM_WORLD));
313
314  // Step 2: Initialize NCCL communicator with the shared unique ID
315  ncclComm_t ncclComm;
316  HANDLE_NCCL_ERROR(ncclCommInitRank(&ncclComm, numProcs, ncclId, procRank));
317  if (verbose)
318    std::cout << "Initialized NCCL communicator with " << numProcs << " ranks\n";
319
320  // Create a library handle
321  cudensitymatHandle_t handle;
322  HANDLE_CUDM_ERROR(cudensitymatCreate(&handle));
323  if (verbose)
324    std::cout << "Created a library handle\n";
325
326  // Reset distributed configuration with NCCL communicator
327  // The barrier buffer is now managed internally by cuDensityMat
328  HANDLE_CUDM_ERROR(cudensitymatResetDistributedConfiguration(handle,
329                      CUDENSITYMAT_DISTRIBUTED_PROVIDER_NCCL,
330                      &ncclComm, sizeof(ncclComm)));
331  if (verbose)
332    std::cout << "Configured distributed execution with NCCL\n";
333
334  // Run the example
335  exampleWorkflow(handle);
336
337  // Synchronize processes (device synchronization followed by an MPI barrier)
338  HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
339  HANDLE_MPI_ERROR(MPI_Barrier(MPI_COMM_WORLD));
340
341  // Destroy the library handle
342  HANDLE_CUDM_ERROR(cudensitymatDestroy(handle));
343  if (verbose)
344    std::cout << "Destroyed the library handle\n";
345
346  // Finalize NCCL communicator
347  HANDLE_NCCL_ERROR(ncclCommFinalize(ncclComm));
348  HANDLE_NCCL_ERROR(ncclCommDestroy(ncclComm));
349  if (verbose)
350    std::cout << "Finalized NCCL communicator\n";
351
352  HANDLE_CUDA_ERROR(cudaDeviceReset());
353
354  // Finalize the MPI library
355  HANDLE_MPI_ERROR(MPI_Finalize());
356  if (verbose)
357    std::cout << "Finalized MPI library\n";
358
359#else
360  // Fallback for when NCCL or MPI is not enabled
361  (void)argc;
362  (void)argv;
363  std::cerr << "This example requires both NCCL_ENABLED and MPI_ENABLED to be defined.\n";
364  std::cerr << "NCCL uses MPI for bootstrapping (sharing ncclUniqueId across processes).\n";
365  std::cerr << "Build with: -DENABLE_NCCL=TRUE -DENABLE_MPI=TRUE\n";
366  return 1;
367#endif
368
369  // Done
370  return 0;
371}

Code example (serial execution with backward differentiation)#

The following code example illustrates how to use the cuDensityMat library to not only compute the action of a quantum many-body operator on a quantum state, but also backward-differentiate it (compute gradients) with respect to user-provided real parameters parameterizing the operator (one real parameter Omega in this example). The full sample code can be found in the NVIDIA/cuQuantum repository (main serial gradient code and operator definition for gradient as well as the utility code).

First let’s construct a specific quantum many-body operator which, in this case, is a slightly modified version of the quantum many-body operator used in the main serial code. Here we make both the h(t) and f(t) scalar coefficients depend on time and a single user-provided real parameter Omega via different (made-up) functional forms. To backward-differentiate the operator action with respect to this single user-provided real parameter Omega, we need to manually define a gradient callback function for each regular callback function we have (for h(t) and f(t) in this example). In our example, we define two CPU-side scalar gradient callback functions which compute the vector-Jacobian product (VJP) of the adjoint of h(t) and f(t), respectively, with respect to the user-provided real parameter Omega. A gradient callback function is expected to accumulate the VJP result into the paramsGrad output array, whose final value will be the gradient(s) of the user-defined cost function with respect to the user-provided real parameters parameterizing the operator. As before, all regular and gradient callback functions used in our example explicitly expect the data type to be CUDA_C_64F (double-precision complex numbers).
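
As a quick sanity check on the analytic derivatives implemented in the gradient callbacks, one can compare them against a central finite difference. The standalone C++ snippet below is not part of the sample; it merely verifies that dCoef/dOmega = i * time * exp(i * Omega * time) for the f(t) coefficient defined in the header that follows:

#include <cmath>
#include <complex>
#include <iostream>

int main() {
  const double t = 0.3, omega = 1.0, eps = 1e-6;
  // f = exp(i * Omega * t), as computed by fCoefComplex64 below
  auto f = [t](double w) { return std::exp(std::complex<double>(0.0, w * t)); };
  // Analytic derivative: dCoef/dOmega = i * t * exp(i * Omega * t)
  const std::complex<double> dfAnalytic = std::complex<double>(0.0, t) * f(omega);
  // Central finite difference for comparison
  const std::complex<double> dfNumeric = (f(omega + eps) - f(omega - eps)) / (2.0 * eps);
  std::cout << "analytic = " << dfAnalytic << "  numeric = " << dfNumeric << "\n";
  return 0;
}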

  1/* Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES.
  2 *
  3 * SPDX-License-Identifier: BSD-3-Clause
  4 */
  5
  6#pragma once
  7
  8#include <cudensitymat.h> // cuDensityMat library header
  9#include "helpers.h"      // GPU helper functions
 10
 11#include <cmath>
 12#include <complex>
 13#include <vector>
 14#include <iostream>
 15#include <cassert>
 16
 17
 18/* DESCRIPTION:
 19   Time-dependent transverse-field Ising Hamiltonian operator
 20   with ordered and fused ZZ terms, plus fused unitary dissipation terms:
 21    H = sum_{i} {h_i(t) * X_i}             // transverse field sum of X_i operators with time-dependent h_i(t) coefficients 
 22      + f(t) * sum_{i < j} {g_ij * ZZ_ij}  // modulated sum of the fused ordered {Z_i * Z_j} terms with static g_ij coefficients
 23      + d * sum_{i} {Y_i * {..} * Y_i}     // scaled sum of the dissipation terms {Y_i * {..} * Y_i} fused into the YY_ii super-operators
 24   where {..} is the placeholder for the density matrix to show that the Y_i operators act from different sides.
 25*/
 26
 27/** Define the numerical type and data type for the GPU computations (same) */
 28using NumericalType = std::complex<double>;      // do not change
 29constexpr cudaDataType_t dataType = CUDA_C_64F;  // do not change
 30
 31
 32/** Example of a user-provided scalar CPU callback C function
 33 *  defining a time-dependent coefficient h_i(t) inside the Hamiltonian:
 34 *  h_i(t) = exp(-Omega * t)
 35 */
 36extern "C"
 37int32_t hCoefComplex64(
 38  double time,             //in: time point
 39  int64_t batchSize,       //in: user-defined batch size (number of coefficients in the batch)
 40  int32_t numParams,       //in: number of external user-provided Hamiltonian parameters (this function expects one parameter, Omega)
 41  const double * params,   //in: params[0:numParams-1][0:batchSize-1]: GPU-accessible F-ordered array of user-provided Hamiltonian parameters for all instances of the batch
 42  cudaDataType_t dataType, //in: data type (expecting CUDA_C_64F in this specific callback function)
 43  void * scalarStorage,    //inout: CPU-accessible storage for the returned coefficient value(s) of shape [0:batchSize-1]
 44  cudaStream_t stream)     //in: CUDA stream (default is 0x0)
 45{
 46  if (dataType == CUDA_C_64F) {
 47    auto * tdCoef = static_cast<cuDoubleComplex *>(scalarStorage); // casting to cuDoubleComplex because this callback function expects CUDA_C_64F data type
 48    for (int64_t i = 0; i < batchSize; ++i) {
 49      const auto omega = params[i * numParams + 0]; // params[0][i]: 0-th parameter for i-th instance of the batch
 50      tdCoef[i] = make_cuDoubleComplex(std::exp((-omega) * time), 0.0); // value of the i-th instance of the coefficients batch
 51    }
 52  } else {
 53    return 1; // error code (1: Error)
 54  }
 55  return 0; // error code (0: Success)
 56}
 57
 58
 59/** User-provided gradient callback function for the user-provided
 60 *  scalar callback function with respect to its single parameter Omega.
 61 *  It accumulates the partial derivative dCost/dOmega = 2*Re(dCost/dCoef * dCoef/dOmega),
 62 *  where:
 63 *  - Cost is some user-defined real scalar cost function,
 64 *  - dCost/dCoef is the adjoint of the cost function with respect to the coefficient (or their batch) associated with the callback function,
 65 *  - dCoef/dOmega is the gradient of the coefficient (or their batch) with respect to the parameter Omega:
 66 *    dCoef/dOmega = -time * exp(-Omega * time)
 67 */
 68extern "C"
 69int32_t hCoefGradComplex64(
 70  double time,             //in: time point
 71  int64_t batchSize,       //in: user-defined batch size (number of coefficients in the batch)
 72  int32_t numParams,       //in: number of external user-provided Hamiltonian parameters (this function expects one parameter, Omega)
 73  const double * params,   //in: params[0:numParams-1][0:batchSize-1]: GPU-accessible F-ordered array of user-provided Hamiltonian parameters for all instances of the batch
 74  cudaDataType_t dataType, //in: data type (expecting CUDA_C_64F in this specific callback function)
 75  void * scalarGrad,       //inout: CPU-accessible storage for the adjoint(s) of the coefficient(s) of shape [0:batchSize-1]
 76  double * paramsGrad,     //inout: params[0:numParams-1][0:batchSize-1]: GPU-accessible F-ordered array of the returned gradient(s) of the parameter(s)
 77  cudaStream_t stream)     //in: CUDA stream (default is 0x0)
 78{
 79  if (dataType == CUDA_C_64F) {
 80    const auto * tdCoefAdjoint = static_cast<const cuDoubleComplex *>(scalarGrad); // casting to cuDoubleComplex because this callback function expects CUDA_C_64F data type
 81    for (int64_t i = 0; i < batchSize; ++i) {
 82      const auto omega = params[i * numParams + 0]; // params[0][i]: 0-th parameter for i-th instance of the batch
 83      paramsGrad[i * numParams + 0] += // IMPORTANT: Accumulate the partial derivative for the i-th instance of the batch, not overwrite it!
 84        2.0 * cuCreal(cuCmul(tdCoefAdjoint[i], make_cuDoubleComplex(std::exp((-omega) * time) * (-time), 0.0)));
 85    }
 86  } else {
 87    return 1; // error code (1: Error)
 88  }
 89  return 0; // error code (0: Success)
 90}
 91
 92
 93/** Example of a user-provided scalar CPU callback C function
 94 *  defining a time-dependent coefficient f(t) inside the Hamiltonian:
 95 *  f(t) = exp(i * Omega * t) = cos(Omega * t) + i * sin(Omega * t)
 96 */
 97extern "C"
 98int32_t fCoefComplex64(
 99  double time,             //in: time point
100  int64_t batchSize,       //in: user-defined batch size (number of coefficients in the batch)
101  int32_t numParams,       //in: number of external user-provided Hamiltonian parameters (this function expects one parameter, Omega)
102  const double * params,   //in: params[0:numParams-1][0:batchSize-1]: GPU-accessible F-ordered array of user-provided Hamiltonian parameters for all instances of the batch
103  cudaDataType_t dataType, //in: data type (expecting CUDA_C_64F in this specific callback function)
104  void * scalarStorage,    //inout: CPU-accessible storage for the returned coefficient value(s) of shape [0:batchSize-1]
105  cudaStream_t stream)     //in: CUDA stream (default is 0x0)
106{
107  if (dataType == CUDA_C_64F) {
108    auto * tdCoef = static_cast<cuDoubleComplex *>(scalarStorage); // casting to cuDoubleComplex because this callback function expects CUDA_C_64F data type
109    for (int64_t i = 0; i < batchSize; ++i) {
110      const auto omega = params[i * numParams + 0]; // params[0][i]: 0-th parameter for i-th instance of the batch
111      tdCoef[i] = make_cuDoubleComplex(std::cos(omega * time), std::sin(omega * time)); // value of the i-th instance of the coefficients batch
112    }
113  } else {
114    return 1; // error code (1: Error)
115  }
116  return 0; // error code (0: Success)
117}
118
119
120/** User-provided gradient callback function for the user-provided
121 *  scalar callback function with respect to its single parameter Omega.
122 *  It accumulates the partial derivative dCost/dOmega = 2*Re(dCost/dCoef * dCoef/dOmega),
123 *  where:
124 *  - Cost is some user-defined real scalar cost function,
125 *  - dCost/dCoef is the adjoint of the cost function with respect to the coefficient associated with the callback function,
126 *  - dCoef/dOmega is the gradient of the coefficient with respect to the parameter Omega:
127 *    dCoef/dOmega = i * time * exp(i * Omega * time)
128 *                 = i * time * (cos(Omega * time) + i * sin(Omega * time))
129 *                 = -time * sin(Omega * time) + i * time * cos(Omega * time)
130 */
131extern "C"
132int32_t fCoefGradComplex64(
133  double time,             //in: time point
134  int64_t batchSize,       //in: user-defined batch size (number of coefficients in the batch)
135  int32_t numParams,       //in: number of external user-provided Hamiltonian parameters (this function expects one parameter, Omega)
136  const double * params,   //in: params[0:numParams-1][0:batchSize-1]: GPU-accessible F-ordered array of user-provided Hamiltonian parameters for all instances of the batch
137  cudaDataType_t dataType, //in: data type (expecting CUDA_C_64F in this specific callback function)
138  void * scalarGrad,       //inout: CPU-accessible storage for the adjoint(s) of the coefficient(s) of shape [0:batchSize-1]
139  double * paramsGrad,     //inout: params[0:numParams-1][0:batchSize-1]: GPU-accessible F-ordered array of the returned gradient(s) of the parameter(s)
140  cudaStream_t stream)     //in: CUDA stream (default is 0x0)
141{
142  if (dataType == CUDA_C_64F) {
143    const auto * tdCoefAdjoint = static_cast<const cuDoubleComplex *>(scalarGrad); // casting to cuDoubleComplex because this callback function expects CUDA_C_64F data type
144    for (int64_t i = 0; i < batchSize; ++i) {
145      const auto omega = params[i * numParams + 0]; // params[0][i]: 0-th parameter for i-th instance of the batch
146      paramsGrad[i * numParams + 0] += // IMPORTANT: Accumulate the partial derivative for the i-th instance of the batch, not overwrite it!
147        2.0 * cuCreal(cuCmul(tdCoefAdjoint[i], make_cuDoubleComplex(-std::sin(omega * time) * time, std::cos(omega * time) * time)));
148    }
149  } else {
150    return 1; // error code (1: Error)
151  }
152  return 0; // error code (0: Success)
153}
154
155
156/** Convenience class which encapsulates a user-defined Liouvillian operator (system Hamiltonian + dissipation terms):
157 *  - Constructor constructs the desired Liouvillian operator (`cudensitymatOperator_t`)
158 *  - Method `get()` returns a reference to the constructed Liouvillian operator
159 *  - Destructor releases all resources used by the Liouvillian operator
160 */
161class UserDefinedLiouvillian final
162{
163private:
164  // Data members
165  cudensitymatHandle_t handle;             // library context handle
166  int64_t stateBatchSize;                  // quantum state batch size
167  const std::vector<int64_t> spaceShape;   // Hilbert space shape (extents of the modes of the composite Hilbert space)
168  void * spinXelems {nullptr};             // elements of the X spin operator in GPU RAM (F-order storage)
169  void * spinYYelems {nullptr};            // elements of the fused YY two-spin operator in GPU RAM (F-order storage)
170  void * spinZZelems {nullptr};            // elements of the fused ZZ two-spin operator in GPU RAM (F-order storage)
171  cudensitymatElementaryOperator_t spinX;  // X spin operator (elementary tensor operator)
172  cudensitymatElementaryOperator_t spinYY; // fused YY two-spin operator (elementary tensor operator)
173  cudensitymatElementaryOperator_t spinZZ; // fused ZZ two-spin operator (elementary tensor operator)
174  cudensitymatOperatorTerm_t oneBodyTerm;  // operator term: H1 = sum_{i} {h_i(t) * X_i} (one-body term)
175  cudensitymatOperatorTerm_t twoBodyTerm;  // operator term: H2 = f(t) * sum_{i < j} {g_ij * ZZ_ij} (two-body term)
176  cudensitymatOperatorTerm_t noiseTerm;    // operator term: D1 = d * sum_{i} {YY_ii}  // Y_i operators act from different sides on the density matrix (two-body mixed term)
177  cudensitymatOperator_t liouvillian;      // full operator: (-i * (H1 + H2) * {..}) + (i * {..} * (H1 + H2)) + D1{..} (super-operator)
178
179public:
180
181  // Constructor constructs a user-defined Liouvillian operator
182  UserDefinedLiouvillian(cudensitymatHandle_t contextHandle,             // library context handle
183                         const std::vector<int64_t> & hilbertSpaceShape, // Hilbert space shape
184                         int64_t batchSize):                             // batch size for the quantum state
185    handle(contextHandle), stateBatchSize(batchSize), spaceShape(hilbertSpaceShape)
186  {
187    // Define the necessary operator tensors in GPU memory (F-order storage!)
188    spinXelems = createInitializeArrayGPU<NumericalType>(  // X[i0; j0]
189                  {{0.0, 0.0}, {1.0, 0.0},   // 1st column of matrix X
190                   {1.0, 0.0}, {0.0, 0.0}}); // 2nd column of matrix X
191
192    spinYYelems = createInitializeArrayGPU<NumericalType>(  // YY[i0, i1; j0, j1] := Y[i0; j0] * Y[i1; j1]
193                    {{0.0, 0.0},  {0.0, 0.0}, {0.0, 0.0}, {-1.0, 0.0},  // 1st column of matrix YY
194                     {0.0, 0.0},  {0.0, 0.0}, {1.0, 0.0}, {0.0, 0.0},   // 2nd column of matrix YY
195                     {0.0, 0.0},  {1.0, 0.0}, {0.0, 0.0}, {0.0, 0.0},   // 3rd column of matrix YY
196                     {-1.0, 0.0}, {0.0, 0.0}, {0.0, 0.0}, {0.0, 0.0}}); // 4th column of matrix YY
197
198    spinZZelems = createInitializeArrayGPU<NumericalType>(  // ZZ[i0, i1; j0, j1] := Z[i0; j0] * Z[i1; j1]
199                    {{1.0, 0.0}, {0.0, 0.0},  {0.0, 0.0},  {0.0, 0.0},   // 1st column of matrix ZZ
200                     {0.0, 0.0}, {-1.0, 0.0}, {0.0, 0.0},  {0.0, 0.0},   // 2nd column of matrix ZZ
201                     {0.0, 0.0}, {0.0, 0.0},  {-1.0, 0.0}, {0.0, 0.0},   // 3rd column of matrix ZZ
202                     {0.0, 0.0}, {0.0, 0.0},  {0.0, 0.0},  {1.0, 0.0}}); // 4th column of matrix ZZ
203
204    // Construct the necessary Elementary Tensor Operators
205    //  X_i operator
206    HANDLE_CUDM_ERROR(cudensitymatCreateElementaryOperator(handle,
207                        1,                                   // one-body operator
208                        std::vector<int64_t>({2}).data(),    // acts in tensor space of shape {2}
209                        CUDENSITYMAT_OPERATOR_SPARSITY_NONE, // dense tensor storage
210                        0,                                   // 0 for dense tensors
211                        nullptr,                             // nullptr for dense tensors
212                        dataType,                            // data type
213                        spinXelems,                          // tensor elements in GPU memory
214                        cudensitymatTensorCallbackNone,      // no tensor callback function (tensor is not time-dependent)
215                        cudensitymatTensorGradientCallbackNone, // no tensor gradient callback function
216                        &spinX));                            // the created elementary tensor operator
217    //  ZZ_ij = Z_i * Z_j fused operator
218    HANDLE_CUDM_ERROR(cudensitymatCreateElementaryOperator(handle,
219                        2,                                   // two-body operator
220                        std::vector<int64_t>({2,2}).data(),  // acts in tensor space of shape {2,2}
221                        CUDENSITYMAT_OPERATOR_SPARSITY_NONE, // dense tensor storage
222                        0,                                   // 0 for dense tensors
223                        nullptr,                             // nullptr for dense tensors
224                        dataType,                            // data type
225                        spinZZelems,                         // tensor elements in GPU memory
226                        cudensitymatTensorCallbackNone,      // no tensor callback function (tensor is not time-dependent)
227                        cudensitymatTensorGradientCallbackNone, // no tensor gradient callback function
228                        &spinZZ));                           // the created elementary tensor operator
229    //  YY_ii = Y_i * {..} * Y_i fused operator (note action from different sides)
230    HANDLE_CUDM_ERROR(cudensitymatCreateElementaryOperator(handle,
231                        2,                                   // two-body operator
232                        std::vector<int64_t>({2,2}).data(),  // acts in tensor space of shape {2,2}
233                        CUDENSITYMAT_OPERATOR_SPARSITY_NONE, // dense tensor storage
234                        0,                                   // 0 for dense tensors
235                        nullptr,                             // nullptr for dense tensors
236                        dataType,                            // data type
237                        spinYYelems,                         // tensor elements in GPU memory
238                        cudensitymatTensorCallbackNone,      // no tensor callback function (tensor is not time-dependent)
239                        cudensitymatTensorGradientCallbackNone, // no tensor gradient callback function
240                        &spinYY));                           // the created elementary tensor operator
241
242    // Construct the necessary Operator Terms from tensor products of Elementary Tensor Operators
243    //  Create an empty operator term
244    HANDLE_CUDM_ERROR(cudensitymatCreateOperatorTerm(handle,
245                        spaceShape.size(),                   // Hilbert space rank (number of modes)
246                        spaceShape.data(),                   // Hilbert space shape (mode extents)
247                        &oneBodyTerm));                      // the created empty operator term
248    //  Define the operator term: H1 = sum_{i} {h_i(t) * X_i}
249    for (int32_t i = 0; i < spaceShape.size(); ++i) {
250      HANDLE_CUDM_ERROR(cudensitymatOperatorTermAppendElementaryProduct(handle,
251                          oneBodyTerm,
252                          1,                                                             // number of elementary tensor operators in the product
253                          std::vector<cudensitymatElementaryOperator_t>({spinX}).data(), // elementary tensor operators forming the product
254                          std::vector<int32_t>({i}).data(),                              // space modes acted on by the operator product
255                          std::vector<int32_t>({0}).data(),                              // space mode action duality (0: from the left; 1: from the right)
256                          make_cuDoubleComplex(1.0, 0.0),                                // static coefficient part: Always 64-bit-precision complex number
257                          {hCoefComplex64, CUDENSITYMAT_CALLBACK_DEVICE_CPU, nullptr},   // CPU scalar callback function defining the time-dependent coefficient associated with this operator product
258                          {hCoefGradComplex64, CUDENSITYMAT_CALLBACK_DEVICE_CPU, nullptr,
259                           CUDENSITYMAT_DIFFERENTIATION_DIR_BACKWARD})); // CPU scalar gradient callback function defining the gradient of the coefficient with respect to the parameter Omega
260    }
261    //  Create an empty operator term
262    HANDLE_CUDM_ERROR(cudensitymatCreateOperatorTerm(handle,
263                        spaceShape.size(),                   // Hilbert space rank (number of modes)
264                        spaceShape.data(),                   // Hilbert space shape (mode extents)
265                        &twoBodyTerm));                      // the created empty operator term
266    //  Define the operator term: H2 = f(t) * sum_{i < j} {g_ij * ZZ_ij}
267    for (int32_t i = 0; i < spaceShape.size() - 1; ++i) {
268      for (int32_t j = (i + 1); j < spaceShape.size(); ++j) {
269        const double g_ij = -1.0 / static_cast<double>(i + j + 1); // assign some value to the time-independent g_ij coefficient
270        HANDLE_CUDM_ERROR(cudensitymatOperatorTermAppendElementaryProduct(handle,
271                            twoBodyTerm,
272                            1,                                                              // number of elementary tensor operators in the product
273                            std::vector<cudensitymatElementaryOperator_t>({spinZZ}).data(), // elementary tensor operators forming the product
274                            std::vector<int32_t>({i, j}).data(),                            // space modes acted on by the operator product
275                            std::vector<int32_t>({0, 0}).data(),                            // space mode action duality (0: from the left; 1: from the right)
276                            make_cuDoubleComplex(g_ij, 0.0),                                // g_ij static coefficient: Always 64-bit-precision complex number
277                            cudensitymatScalarCallbackNone,                                 // no time-dependent coefficient associated with this operator product
278                            cudensitymatScalarGradientCallbackNone));                       // no coefficient gradient associated with this operator product
279      }
280    }
281    //  Create an empty operator term
282    HANDLE_CUDM_ERROR(cudensitymatCreateOperatorTerm(handle,
283                        spaceShape.size(),                   // Hilbert space rank (number of modes)
284                        spaceShape.data(),                   // Hilbert space shape (mode extents)
285                        &noiseTerm));                        // the created empty operator term
286    //  Define the operator term: D1 = d * sum_{i} {YY_ii}
287    for (int32_t i = 0; i < spaceShape.size(); ++i) {
288      HANDLE_CUDM_ERROR(cudensitymatOperatorTermAppendElementaryProduct(handle,
289                          noiseTerm,
290                          1,                                                              // number of elementary tensor operators in the product
291                          std::vector<cudensitymatElementaryOperator_t>({spinYY}).data(), // elementary tensor operators forming the product
292                          std::vector<int32_t>({i, i}).data(),                            // space modes acted on by the operator product (from different sides)
293                          std::vector<int32_t>({0, 1}).data(),                            // space mode action duality (0: from the left; 1: from the right)
294                          make_cuDoubleComplex(1.0, 0.0),                                 // default coefficient: Always 64-bit-precision complex number
295                          cudensitymatScalarCallbackNone,                                 // no time-dependent coefficient associated with this operator product
296                          cudensitymatScalarGradientCallbackNone));                       // no coefficient gradient associated with this operator product
297    }
298
299    // Construct the full Liouvillian operator as a sum of the created operator terms
300    //  Create an empty operator (super-operator)
301    HANDLE_CUDM_ERROR(cudensitymatCreateOperator(handle,
302                        spaceShape.size(),                // Hilbert space rank (number of modes)
303                        spaceShape.data(),                // Hilbert space shape (modes extents)
304                        &liouvillian));                   // the created empty operator (super-operator)
305    //  Append an operator term to the operator (super-operator)
306    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTerm(handle,
307                        liouvillian,
308                        oneBodyTerm,                      // appended operator term
309                        0,                                // operator term action duality as a whole (0: acting from the left; 1: acting from the right)
310                        make_cuDoubleComplex(0.0, -1.0),  // -i constant
311                        cudensitymatScalarCallbackNone,   // no time-dependent coefficient associated with the operator term as a whole
312                        cudensitymatScalarGradientCallbackNone)); // no coefficient gradient associated with the operator term as a whole
313    //  Append an operator term to the operator (super-operator)
314    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTerm(handle,
315                        liouvillian,
316                        twoBodyTerm,                      // appended operator term
317                        0,                                // operator term action duality as a whole (0: acting from the left; 1: acting from the right)
318                        make_cuDoubleComplex(0.0, -1.0),  // -i constant
319                        {fCoefComplex64, CUDENSITYMAT_CALLBACK_DEVICE_CPU, nullptr}, // CPU scalar callback function defining the time-dependent coefficient associated with this operator term as a whole
320                        {fCoefGradComplex64, CUDENSITYMAT_CALLBACK_DEVICE_CPU, nullptr,
321                         CUDENSITYMAT_DIFFERENTIATION_DIR_BACKWARD})); // CPU scalar gradient callback function defining the gradient of the coefficient with respect to the parameter Omega
322    //  Append an operator term to the operator (super-operator)
323    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTerm(handle,
324                        liouvillian,
325                        oneBodyTerm,                      // appended operator term
326                        1,                                // operator term action duality as a whole (0: acting from the left; 1: acting from the right)
327                        make_cuDoubleComplex(0.0, +1.0),  // +i constant
328                        cudensitymatScalarCallbackNone,   // no time-dependent coefficient associated with the operator term as a whole
329                        cudensitymatScalarGradientCallbackNone)); // no coefficient gradient associated with the operator term as a whole
330    //  Append an operator term to the operator (super-operator)
331    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTerm(handle,
332                        liouvillian,
333                        twoBodyTerm,                      // appended operator term
334                        1,                                // operator term action duality as a whole (0: acting from the left; 1: acting from the right)
335                        make_cuDoubleComplex(0.0, 1.0),   // +i constant
336                        {fCoefComplex64, CUDENSITYMAT_CALLBACK_DEVICE_CPU, nullptr}, // CPU scalar callback function defining the time-dependent coefficient associated with this operator term as a whole
337                        {fCoefGradComplex64, CUDENSITYMAT_CALLBACK_DEVICE_CPU, nullptr,
338                         CUDENSITYMAT_DIFFERENTIATION_DIR_BACKWARD})); // CPU scalar gradient callback function defining the gradient of the coefficient with respect to the parameter Omega
339    //  Append an operator term to the operator (super-operator)
340    const double d = 1.0; // assign some value to the time-independent coefficient
341    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTerm(handle,
342                        liouvillian,
343                        noiseTerm,                        // appended operator term
344                        0,                                // operator term action duality as a whole (no duality reversing in this case)
345                        make_cuDoubleComplex(d, 0.0),     // static coefficient associated with the operator term as a whole
346                        cudensitymatScalarCallbackNone,   // no time-dependent coefficient associated with the operator term as a whole
347                        cudensitymatScalarGradientCallbackNone)); // no coefficient gradient associated with the operator term as a whole
348  }
349
350  // Destructor destructs the user-defined Liouvillian operator
351  ~UserDefinedLiouvillian()
352  {
353    // Destroy the Liouvillian operator
354    HANDLE_CUDM_ERROR(cudensitymatDestroyOperator(liouvillian));
355
356    // Destroy operator terms
357    HANDLE_CUDM_ERROR(cudensitymatDestroyOperatorTerm(noiseTerm));
358    HANDLE_CUDM_ERROR(cudensitymatDestroyOperatorTerm(twoBodyTerm));
359    HANDLE_CUDM_ERROR(cudensitymatDestroyOperatorTerm(oneBodyTerm));
360
361    // Destroy elementary tensor operators
362    HANDLE_CUDM_ERROR(cudensitymatDestroyElementaryOperator(spinYY));
363    HANDLE_CUDM_ERROR(cudensitymatDestroyElementaryOperator(spinZZ));
364    HANDLE_CUDM_ERROR(cudensitymatDestroyElementaryOperator(spinX));
365
366    // Destroy operator tensors
367    destroyArrayGPU(spinYYelems);
368    destroyArrayGPU(spinZZelems);
369    destroyArrayGPU(spinXelems);
370  }
371
372  // Disable copy constructor/assignment (GPU resources are private, no deep copy)
373  UserDefinedLiouvillian(const UserDefinedLiouvillian &) = delete;
374  UserDefinedLiouvillian & operator=(const UserDefinedLiouvillian &) = delete;
375  UserDefinedLiouvillian(UserDefinedLiouvillian &&) = delete;
376  UserDefinedLiouvillian & operator=(UserDefinedLiouvillian &&) = delete;
377
378  /** Returns the number of externally provided Hamiltonian parameters. */
379  int32_t getNumParameters() const
380  {
381    return 1; // one parameter Omega
382  }
383
384  /** Get access to the constructed Liouvillian operator. */
385  cudensitymatOperator_t & get()
386  {
387    return liouvillian;
388  }
389
390};

Now we can use the defined quantum many-body operator in our main code to compute its action on a mixed quantum state and then backward-differentiate it (compute gradients) with respect to the user-provided real parameter Omega. For simplicity, we pass a made-up adjoint of the output quantum state to the cudensitymatOperatorComputeActionBackwardDiff() call, namely the output quantum state itself; in real scenarios, the adjoint of the output quantum state depends on the user-chosen cost function and is provided by the user. Upon completion of the cudensitymatOperatorComputeActionBackwardDiff() call, the paramsGrad output argument will contain the gradient of the user-defined cost function with respect to the user-provided real parameter Omega. The backward-differentiation API call also returns the adjoint of the input quantum state, for cases where the input quantum state implicitly depends on the user-provided real parameters (for example, when the input quantum state comes from a previous operator action step, as is typical for time integration of quantum dynamics master equations). Note that both output arguments, namely paramsGrad and stateInAdj, are accumulative: results are accumulated into them, and it is the user's responsibility to zero them out before the first call!
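
A minimal sketch of the required zero-initialization, using the variable names from the full example below (the cudaMemset call is one possible way to zero the parameter-gradient array; the example itself allocates it already zero-initialized via createInitializeArrayGPU()):

// Zero out the accumulative backward-differentiation outputs before the first call
HANDLE_CUDA_ERROR(cudaMemset(hamiltonianParamsGrad, 0,
                             sizeof(double) * numParams * batchSize)); // paramsGrad := 0
HANDLE_CUDM_ERROR(cudensitymatStateInitializeZero(handle,
                    inputStateAdj,  // adjoint of the input quantum state (stateInAdj)
                    0x0));          // default CUDA stream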

  1/* Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES.
  2 *
  3 * SPDX-License-Identifier: BSD-3-Clause
  4 */
  5
  6#include <cudensitymat.h>  // cuDensityMat library header
  7#include "helpers.h"       // helper functions
  8
  9
 10// Transverse Ising Hamiltonian with double summation ordering
 11// and spin-operator fusion, plus fused dissipation terms
 12#include "transverse_ising_full_fused_noisy_grad.h" // user-defined Liouvillian operator example
 13
 14#include <cmath>
 15#include <complex>
 16#include <vector>
 17#include <chrono>
 18#include <iostream>
 19#include <cassert>
 20
 21
 22// Number of times to perform operator action on a quantum state
 23constexpr int NUM_REPEATS = 2;
 24
 25// Logging verbosity
 26bool verbose = true;
 27
 28
 29// Example workflow
 30void exampleWorkflow(cudensitymatHandle_t handle)
 31{
 32  // Define the composite Hilbert space shape and
 33  // quantum state batch size (number of individual quantum states in a batched simulation)
 34  const std::vector<int64_t> spaceShape({2,2,2,2}); // dimensions of quantum degrees of freedom
 35  const int64_t batchSize = 1;                      // number of quantum states per batch
 36
 37  if (verbose) {
 38    std::cout << "Hilbert space rank = " << spaceShape.size() << "; Shape = (";
 39    for (const auto & dimsn: spaceShape)
 40      std::cout << dimsn << ",";
 41    std::cout << ")" << std::endl;
 42    std::cout << "Quantum state batch size = " << batchSize << std::endl;
 43  }
 44
 45  // Construct a user-defined Liouvillian operator using a convenience C++ class
 46  UserDefinedLiouvillian liouvillian(handle, spaceShape, batchSize);
 47  if (verbose)
 48    std::cout << "Constructed the Liouvillian operator\n";
 49
 50  // Set and place external user-provided Hamiltonian parameters in GPU memory
 51  const int32_t numParams = liouvillian.getNumParameters(); // number of external user-provided Hamiltonian parameters
 52  if (verbose)
 53    std::cout << "Number of external user-provided Hamiltonian parameters = " << numParams << std::endl;
 54  std::vector<double> cpuHamParams(numParams * batchSize);
 55  for (int64_t j = 0; j < batchSize; ++j) {
 56    for (int32_t i = 0; i < numParams; ++i) {
 57      cpuHamParams[j * numParams + i] = double(i+1) / double(j+1); // just setting some parameter values for each instance of the batch
 58    }
 59  }
 60  auto * hamiltonianParams = static_cast<double *>(createInitializeArrayGPU(cpuHamParams));
 61  if (verbose)
 62    std::cout << "Created an array of external user-provided Hamiltonian parameters in GPU memory\n";
 63
 64  // Create an array of gradients for the user-provided Hamiltonian parameters in GPU memory
 65  std::vector<double> cpuHamParamsGrad(numParams * batchSize, 0.0);
 66  auto * hamiltonianParamsGrad = static_cast<double *>(createInitializeArrayGPU(cpuHamParamsGrad));
 67  if (verbose)
 68    std::cout << "Created an array of gradients for the external user-provided Hamiltonian parameters in GPU memory\n";
 69
 70  // Declare the input quantum state
 71  cudensitymatState_t inputState;
 72  HANDLE_CUDM_ERROR(cudensitymatCreateState(handle,
 73                      CUDENSITYMAT_STATE_PURITY_MIXED,  // pure (state vector) or mixed (density matrix) state
 74                      spaceShape.size(),
 75                      spaceShape.data(),
 76                      batchSize,
 77                      dataType,
 78                      &inputState));
 79
 80  // Query the size of the quantum state storage
 81  std::size_t storageSize {0}; // only one storage component (tensor) is needed (no tensor factorization)
 82  HANDLE_CUDM_ERROR(cudensitymatStateGetComponentStorageSize(handle,
 83                      inputState,
 84                      1,               // only one storage component (tensor)
 85                      &storageSize));  // storage size in bytes
 86  const std::size_t stateVolume = storageSize / sizeof(NumericalType);  // quantum state tensor volume (number of elements)
 87  if (verbose)
 88    std::cout << "Quantum state storage size (bytes) = " << storageSize << std::endl;
 89
 90  // Prepare some initial value for the input quantum state batch
 91  std::vector<NumericalType> inputStateValue(stateVolume);
 92  if constexpr (std::is_same_v<NumericalType, float>) {
 93    for (std::size_t i = 0; i < stateVolume; ++i) {
 94      inputStateValue[i] = 1.0f / float(i+1); // just some value
 95    }
 96  } else if constexpr (std::is_same_v<NumericalType, double>) {
 97    for (std::size_t i = 0; i < stateVolume; ++i) {
 98      inputStateValue[i] = 1.0 / double(i+1); // just some value
 99    }
100  } else if constexpr (std::is_same_v<NumericalType, std::complex<float>>) {
101    for (std::size_t i = 0; i < stateVolume; ++i) {
102      inputStateValue[i] = NumericalType{1.0f / float(i+1), -1.0f / float(i+2)}; // just some value
103    }
104  } else if constexpr (std::is_same_v<NumericalType, std::complex<double>>) {
105    for (std::size_t i = 0; i < stateVolume; ++i) {
106      inputStateValue[i] = NumericalType{1.0 / double(i+1), -1.0 / double(i+2)}; // just some value
107    }
108  } else {
109    std::cerr << "Error: Unsupported data type!\n";
110    std::exit(1);
111  }
112  // Allocate initialized GPU storage for the input quantum state with prepared values
113  auto * inputStateElems = createInitializeArrayGPU(inputStateValue);
114  if (verbose)
115    std::cout << "Allocated input quantum state storage and initialized it to some value\n";
116
117  // Attach initialized GPU storage to the input quantum state
118  HANDLE_CUDM_ERROR(cudensitymatStateAttachComponentStorage(handle,
119                      inputState,
120                      1,                                                 // only one storage component (tensor)
121                      std::vector<void*>({inputStateElems}).data(),      // pointer to the GPU storage for the quantum state
122                      std::vector<std::size_t>({storageSize}).data()));  // size of the GPU storage for the quantum state
123  if (verbose)
124    std::cout << "Constructed input quantum state\n";
125
126  // Declare the output quantum state of the same shape
127  cudensitymatState_t outputState;
128  HANDLE_CUDM_ERROR(cudensitymatCreateState(handle,
129                      CUDENSITYMAT_STATE_PURITY_MIXED,  // pure (state vector) or mixed (density matrix) state
130                      spaceShape.size(),
131                      spaceShape.data(),
132                      batchSize,
133                      dataType,
134                      &outputState));
135
136  // Allocate GPU storage for the output quantum state
137  auto * outputStateElems = createArrayGPU<NumericalType>(stateVolume);
138  if (verbose)
139    std::cout << "Allocated output quantum state storage\n";
140
141  // Attach GPU storage to the output quantum state
142  HANDLE_CUDM_ERROR(cudensitymatStateAttachComponentStorage(handle,
143                      outputState,
144                      1,                                                 // only one storage component (tensor)
145                      std::vector<void*>({outputStateElems}).data(),     // pointer to the GPU storage for the quantum state
146                      std::vector<std::size_t>({storageSize}).data()));  // size of the GPU storage for the quantum state
147  if (verbose)
148    std::cout << "Constructed output quantum state\n";
149
150  // Declare the adjoint input quantum state of the same shape
151  cudensitymatState_t inputStateAdj;
152  HANDLE_CUDM_ERROR(cudensitymatCreateState(handle,
153                      CUDENSITYMAT_STATE_PURITY_MIXED,  // pure (state vector) or mixed (density matrix) state
154                      spaceShape.size(),
155                      spaceShape.data(),
156                      batchSize,
157                      dataType,  // data type must match
158                      &inputStateAdj));
159
160  // Allocate GPU storage for the adjoint input quantum state
161  auto * inputStateAdjElems = createArrayGPU<NumericalType>(stateVolume);
162  if (verbose)
163    std::cout << "Allocated adjoint input quantum state storage\n";
164
165  // Attach GPU storage to the adjoint input quantum state
166  HANDLE_CUDM_ERROR(cudensitymatStateAttachComponentStorage(handle,
167                      inputStateAdj,
168                      1,                                                 // only one storage component (tensor)
169                      std::vector<void*>({inputStateAdjElems}).data(),   // pointer to the GPU storage for the quantum state
170                      std::vector<std::size_t>({storageSize}).data()));  // size of the GPU storage for the quantum state
171  if (verbose)
172    std::cout << "Constructed adjoint input quantum state\n";
173
174  // Declare a workspace descriptor
175  cudensitymatWorkspaceDescriptor_t workspaceDescr;
176  HANDLE_CUDM_ERROR(cudensitymatCreateWorkspace(handle, &workspaceDescr));
177
178  // Query free GPU memory
179  std::size_t freeMem = 0, totalMem = 0;
180  HANDLE_CUDA_ERROR(cudaMemGetInfo(&freeMem, &totalMem));
181  freeMem = static_cast<std::size_t>(static_cast<double>(freeMem) * 0.45); // take 45% of the free memory for the workspace buffer
182  if (verbose)
183    std::cout << "Max workspace buffer size (bytes) = " << freeMem << std::endl;
184
185  // Allocate GPU storage for the workspace buffer
186  const std::size_t bufferVolume = freeMem / sizeof(NumericalType);
187  auto * workspaceBuffer = createArrayGPU<NumericalType>(bufferVolume);
188  if (verbose)
189    std::cout << "Allocated workspace buffer of size (bytes) = " << freeMem << std::endl;
190
191  // Prepare the Liouvillian operator action on a quantum state (needs to be done only once)
192  auto startTime = std::chrono::high_resolution_clock::now();
193  HANDLE_CUDM_ERROR(cudensitymatOperatorPrepareAction(handle,
194                      liouvillian.get(),
195                      inputState,
196                      outputState,
197                      CUDENSITYMAT_COMPUTE_64F,  // GPU compute type
198                      freeMem,                   // max available GPU free memory for the workspace
199                      workspaceDescr,            // workspace descriptor
200                      0x0));                     // default CUDA stream
201  auto finishTime = std::chrono::high_resolution_clock::now();
202  std::chrono::duration<double> timeSec = finishTime - startTime;
203  if (verbose)
204    std::cout << "Operator action preparation time (sec) = " << timeSec.count() << std::endl;
205
206  // Query the required workspace buffer size (bytes)
207  std::size_t requiredBufferSize {0};
208  HANDLE_CUDM_ERROR(cudensitymatWorkspaceGetMemorySize(handle,
209                      workspaceDescr,
210                      CUDENSITYMAT_MEMSPACE_DEVICE,
211                      CUDENSITYMAT_WORKSPACE_SCRATCH,
212                      &requiredBufferSize));
213  if (verbose)
214    std::cout << "Required workspace buffer size (bytes) = " << requiredBufferSize << std::endl;
215
216  if (requiredBufferSize > freeMem) {
217    std::cerr << "Error: Required workspace buffer size is greater than the available GPU free memory!\n";
218    std::exit(1);
219  }
220
221  // Attach the workspace buffer to the workspace descriptor
222  HANDLE_CUDM_ERROR(cudensitymatWorkspaceSetMemory(handle,
223                      workspaceDescr,
224                      CUDENSITYMAT_MEMSPACE_DEVICE,
225                      CUDENSITYMAT_WORKSPACE_SCRATCH,
226                      workspaceBuffer,
227                      requiredBufferSize));
228  if (verbose)
229    std::cout << "Attached workspace buffer of size (bytes) = " << requiredBufferSize << std::endl;
230
231  // Apply the Liouvillian operator to the input quantum state
232  // and accumulate its action into the output quantum state (note the accumulative += semantics)
233  for (int32_t repeat = 0; repeat < NUM_REPEATS; ++repeat) { // repeat multiple times for accurate timing
234    // Zero out the output quantum state
235    HANDLE_CUDM_ERROR(cudensitymatStateInitializeZero(handle,
236                        outputState,
237                        0x0));
238    if (verbose)
239      std::cout << "Initialized the output quantum state to zero\n";
240    HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
241    startTime = std::chrono::high_resolution_clock::now();
242    HANDLE_CUDM_ERROR(cudensitymatOperatorComputeAction(handle,
243                        liouvillian.get(),
244                        0.3,                // time point (some value)
245                        batchSize,          // user-defined batch size
246                        numParams,          // number of external user-defined Hamiltonian parameters
247                        hamiltonianParams,  // external Hamiltonian parameters in GPU memory
248                        inputState,         // input quantum state
249                        outputState,        // output quantum state
250                        workspaceDescr,     // workspace descriptor
251                        0x0));              // default CUDA stream
252    HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
253    finishTime = std::chrono::high_resolution_clock::now();
254    timeSec = finishTime - startTime;
255    if (verbose)
256      std::cout << "Operator action computation time (sec) = " << timeSec.count() << std::endl;
257  }
258
259  // Compute the squared norm of the output quantum state
260  void * norm2 = createInitializeArrayGPU(std::vector<double>(batchSize, 0.0));
261  HANDLE_CUDM_ERROR(cudensitymatStateComputeNorm(handle,
262                      outputState,
263                      norm2,
264                      0x0));
265  if (verbose) {
266    std::cout << "Computed the output quantum state norm:\n";
267    printArrayGPU<double>(norm2, batchSize);
268  }
269
270  HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
271
272  // Prepare the Liouvillian operator action backward differentiation (needs to be done only once)
273  startTime = std::chrono::high_resolution_clock::now();
274  HANDLE_CUDM_ERROR(cudensitymatOperatorPrepareActionBackwardDiff(handle,
275                      liouvillian.get(),
276                      inputState,
277                      outputState,               // adjoint output quantum state is always congruent to the output quantum state
278                      CUDENSITYMAT_COMPUTE_64F,  // GPU compute type
279                      freeMem,                   // max available GPU free memory for the workspace buffer
280                      workspaceDescr,            // workspace descriptor
281                      0x0));                     // default CUDA stream
282  finishTime = std::chrono::high_resolution_clock::now();
283  timeSec = finishTime - startTime;
284  if (verbose)
285    std::cout << "Operator action backward differentiation preparation time (sec) = " << timeSec.count() << std::endl;
286
287  // Query the required workspace buffer size (bytes)
288  requiredBufferSize = 0;
289  HANDLE_CUDM_ERROR(cudensitymatWorkspaceGetMemorySize(handle,
290                      workspaceDescr,
291                      CUDENSITYMAT_MEMSPACE_DEVICE,
292                      CUDENSITYMAT_WORKSPACE_SCRATCH,
293                      &requiredBufferSize));
294  if (verbose)
295    std::cout << "Required workspace buffer size (bytes) = " << requiredBufferSize << std::endl;
296
297  if (requiredBufferSize > freeMem) {
298    std::cerr << "Error: Required workspace buffer size is greater than the available GPU free memory!\n";
299    std::exit(1);
300  }
301
302  // Attach the workspace buffer to the workspace descriptor
303  HANDLE_CUDM_ERROR(cudensitymatWorkspaceSetMemory(handle,
304                      workspaceDescr,
305                      CUDENSITYMAT_MEMSPACE_DEVICE,
306                      CUDENSITYMAT_WORKSPACE_SCRATCH,
307                      workspaceBuffer,
308                      requiredBufferSize));
309  if (verbose)
310    std::cout << "Attached workspace buffer of size (bytes) = " << requiredBufferSize << std::endl;
311
312  // Liouvillian operator action backward differentiation:
313  // The adjoint output quantum state, which is always congruent to the output quantum state,
314  // depends on the user-defined cost function, so here we simply pass the previously computed output quantum state.
315  // In real-life applications, the user will pass their adjoint output quantum state, computed for their cost function.
316  for (int32_t repeat = 0; repeat < NUM_REPEATS; ++repeat) { // repeat multiple times for accurate timing
317    // Zero out the adjoint input quantum state and gradients
318    HANDLE_CUDM_ERROR(cudensitymatStateInitializeZero(handle,
319                        inputStateAdj,
320                        0x0));
321    initializeArrayGPU(std::vector<double>(numParams * batchSize, 0.0), hamiltonianParamsGrad);
322    if (verbose)
323      std::cout << "Initialized the adjoint input quantum state and gradients to zero\n";
324    HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
325    startTime = std::chrono::high_resolution_clock::now();
326    HANDLE_CUDM_ERROR(cudensitymatOperatorComputeActionBackwardDiff(handle,
327                        liouvillian.get(),
328                        0.3,                    // time point (some value)
329                        batchSize,              // user-defined batch size
330                        numParams,              // number of external user-defined Hamiltonian parameters
331                        hamiltonianParams,      // external Hamiltonian parameters in GPU memory
332                        inputState,             // input quantum state
333                        outputState,            // adjoint output quantum state (here we just pass the previously computed output quantum state for simplicity)
334                        inputStateAdj,          // adjoint input quantum state
335                        hamiltonianParamsGrad,  // partial gradients with respect to the user-defined real parameters
336                        workspaceDescr,         // workspace descriptor
337                        0x0));                  // default CUDA stream
338    HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
339    finishTime = std::chrono::high_resolution_clock::now();
340    timeSec = finishTime - startTime;
341    if (verbose)
342      std::cout << "Operator action backward differentiation computation time (sec) = " << timeSec.count() << std::endl;
343  }
344
345  // Compute the squared norm of the adjoint input quantum state
346  initializeArrayGPU(std::vector<double>(batchSize, 0.0), norm2);
347  HANDLE_CUDM_ERROR(cudensitymatStateComputeNorm(handle,
348                      inputStateAdj,
349                      norm2,
350                      0x0));
351  if (verbose) {
352    std::cout << "Computed the adjoint input quantum state norm:\n";
353    printArrayGPU<double>(norm2, batchSize);
354    std::cout << "Hamiltonian parameters gradients:\n";
355    printArrayGPU<double>(hamiltonianParamsGrad, numParams * batchSize);
356  }
357
358  HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
359
360  // Destroy the norm2 array
361  destroyArrayGPU(norm2);
362
363  // Destroy workspace descriptor
364  HANDLE_CUDM_ERROR(cudensitymatDestroyWorkspace(workspaceDescr));
365
366  // Destroy workspace buffer storage
367  destroyArrayGPU(workspaceBuffer);
368
369  // Destroy quantum states
370  HANDLE_CUDM_ERROR(cudensitymatDestroyState(inputStateAdj));
371  HANDLE_CUDM_ERROR(cudensitymatDestroyState(outputState));
372  HANDLE_CUDM_ERROR(cudensitymatDestroyState(inputState));
373
374  // Destroy quantum state storage
375  destroyArrayGPU(inputStateAdjElems);
376  destroyArrayGPU(outputStateElems);
377  destroyArrayGPU(inputStateElems);
378
379  // Destroy external Hamiltonian parameters
380  destroyArrayGPU(static_cast<void *>(hamiltonianParamsGrad));
381  destroyArrayGPU(static_cast<void *>(hamiltonianParams));
382
383  if (verbose)
384    std::cout << "Destroyed resources\n" << std::flush;
385}
386
387
388int main(int argc, char ** argv)
389{
390  // Assign a GPU to the process
391  HANDLE_CUDA_ERROR(cudaSetDevice(0));
392  if (verbose)
393    std::cout << "Set active device\n";
394
395  // Create a library handle
396  cudensitymatHandle_t handle;
397  HANDLE_CUDM_ERROR(cudensitymatCreate(&handle));
398  if (verbose)
399    std::cout << "Created a library handle\n";
400
401  // Run the example
402  exampleWorkflow(handle);
403
404  // Destroy the library handle
405  HANDLE_CUDM_ERROR(cudensitymatDestroy(handle));
406  if (verbose)
407    std::cout << "Destroyed the library handle\n";
408
409  HANDLE_CUDA_ERROR(cudaDeviceReset());
410
411  // Done
412  return 0;
413}

Code example (serial batched execution with backward differentiation)#

The following code example extends backward differentiation to batched operators and quantum states. Here the Hamiltonian operator contains batched coefficients, so that each instance of the batched quantum state is acted on by a different instance of the batched operator (with different coefficient values). The full sample code can be found in the NVIDIA/cuQuantum repository (the main serial batched-gradient code and the batched-gradient operator definition, as well as the utility code).
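
Schematically, for batch size batchSize the operator action computed below is

  outputState[k] += L[k](t, Omega[k]) * inputState[k],   k = 0, ..., batchSize-1,

where L[k] denotes the k-th instance of the batched Liouvillian operator (note the accumulative += semantics, also used in the non-batched example).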

The Hamiltonian definition makes both the h(t) and f(t) scalar coefficients batched, which requires user-supplied vector storage for their static and dynamic (total) values. The corresponding scalar and scalar-gradient callback functions likewise operate on an entire batch rather than on a single instance. Furthermore, the params and paramsGrad arrays become genuinely two-dimensional, with the first dimension running over the user-provided real parameters and the second dimension running over the batch instances.
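
To make this layout concrete, below is a minimal self-contained sketch of the F-ordered [numParams][batchSize] addressing that the batched callbacks rely on; the getParam helper and the parameter values are illustrative only and are not part of the sample code.

#include <cstdint>
#include <vector>

// Illustrative helper (not part of the sample): read parameter i of batch
// instance k from an F-ordered [numParams][batchSize] array, in which the
// parameter index runs fastest in memory.
inline double getParam(const double * params, int32_t i, int64_t k, int32_t numParams)
{
  return params[k * numParams + i]; // element params[i][k] in F-order
}

int main()
{
  const int32_t numParams = 1; // a single parameter, Omega
  const int64_t batchSize = 3;
  std::vector<double> params(numParams * batchSize);
  for (int64_t k = 0; k < batchSize; ++k)
    params[k * numParams + 0] = 1.0 / double(k + 1); // Omega[k]: one value per batch instance
  const double omega2 = getParam(params.data(), 0, 2, numParams); // Omega of batch instance 2
  return (omega2 > 0.0) ? 0 : 1; // exit code 0 on the expected value
}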

  1/* Copyright (c) 2026-2026, NVIDIA CORPORATION & AFFILIATES.
  2 *
  3 * SPDX-License-Identifier: BSD-3-Clause
  4 */
  5
  6#pragma once
  7
  8#include <cudensitymat.h> // cuDensityMat library header
  9#include "helpers.h"      // GPU helper functions
 10
 11#include <cmath>
 12#include <complex>
 13#include <vector>
 14#include <iostream>
 15#include <cassert>
 16
 17
 18/* DESCRIPTION:
 19   Batched time-dependent transverse-field Ising Hamiltonian operator
 20   with ordered and fused ZZ terms, plus fused unitary dissipation terms:
 21    H[k] = sum_{i} {h_i(t)[k] * X_i}          // transverse-field sum of X_i operators with batched time-dependent h_i(t)[k] coefficients 
 22      + f(t)[k] * sum_{i < j} {g_ij * ZZ_ij}  // batched modulated sum of the fused ordered {Z_i * Z_j} terms with static g_ij coefficients
 23      + d * sum_{i} {Y_i * {..} * Y_i}        // scaled sum of the dissipation terms {Y_i * {..} * Y_i} fused into the YY_ii super-operators
 24   where {..} is the placeholder for the density matrix to show that the Y_i operators act from different sides.
 25*/
 26
 27/** Define the numerical type and the corresponding GPU data type for the computations (these must match) */
 28using NumericalType = std::complex<double>;      // do not change
 29constexpr cudaDataType_t dataType = CUDA_C_64F;  // do not change
 30
 31
 32/** User-provided batched scalar CPU callback C function
 33 *  defining a batched time-dependent coefficient h_i(t) for all instances
 34 *  of the batch inside the Hamiltonian:
 35 *  h_i(t)[k] = exp(-Omega[k] * t) for k = 0, ..., batchSize-1
 36 */
 37extern "C"
 38int32_t hCoefBatchComplex64(
 39  double time,             //in: time point
 40  int64_t batchSize,       //in: user-defined batch size (number of coefficients in the batch)
 41  int32_t numParams,       //in: number of external user-provided Hamiltonian parameters (this function expects one parameter, Omega)
 42  const double * params,   //in: params[0:numParams-1][0:batchSize-1]: GPU-accessible F-ordered array of user-provided Hamiltonian parameters for all instances of the batch
 43  cudaDataType_t dataType, //in: data type (expecting CUDA_C_64F in this specific callback function)
 44  void * scalarStorage,    //inout: CPU-accessible storage for the returned batched coefficient values of shape [0:batchSize-1]
 45  cudaStream_t stream)     //in: CUDA stream (default is 0x0)
 46{
 47  if (dataType == CUDA_C_64F) {
 48    auto * tdCoef = static_cast<cuDoubleComplex *>(scalarStorage); // casting to cuDoubleComplex because this callback function expects CUDA_C_64F data type
 49    for (int64_t k = 0; k < batchSize; ++k) { // for each instance of the batch
 50      const auto omega = params[k * numParams + 0]; // params[0][k]: 0-th parameter for k-th instance of the batch
 51      tdCoef[k] = make_cuDoubleComplex(std::exp((-omega) * time), 0.0); // value of the k-th instance of the batched coefficients
 52    }
 53  } else {
 54    return 1; // error code (1: Error)
 55  }
 56  return 0; // error code (0: Success)
 57}
 58
 59
 60/** User-provided batched scalar gradient callback function (CPU-side) for the user-provided
 61 *  batched scalar callback function hCoefBatchComplex64, defining the gradients with respect
 62 *  to its single (batched) parameter Omega. It accumulates a partial derivative:
 63 *    2*Re(dCost/dOmega[k]) = 2*Re(dCost/dCoef[k] * dCoef[k]/dOmega[k]),
 64 *  where:
 65 *  - Cost is some user-defined real scalar cost function,
 66 *  - dCost/dCoef[k] is the adjoint of the cost function with respect to the k-th instance of the batched coefficient associated with the callback function,
 67 *  - dCoef[k]/dOmega[k] is the gradient of the k-th instance of the batched coefficient with respect to the parameter Omega[k]:
 68 *    dCoef[k]/dOmega[k] = -t * exp(-Omega[k] * t) for k = 0, ..., batchSize-1
 69 */
 70extern "C"
 71int32_t hCoefBatchGradComplex64(
 72  double time,             //in: time point
 73  int64_t batchSize,       //in: user-defined batch size (number of coefficients in the batch)
 74  int32_t numParams,       //in: number of external user-provided Hamiltonian parameters (this function expects one parameter, Omega)
 75  const double * params,   //in: params[0:numParams-1][0:batchSize-1]: GPU-accessible F-ordered array of user-provided Hamiltonian parameters for all instances of the batch
 76  cudaDataType_t dataType, //in: data type (expecting CUDA_C_64F in this specific callback function)
 77  void * scalarGrad,       //in: CPU-accessible storage for the batched adjoint of the batched coefficient of shape [0:batchSize-1]
 78  double * paramsGrad,     //inout: params[0:numParams-1][0:batchSize-1]: GPU-accessible F-ordered array of the returned gradients of the parameter(s) for all instances of the batch
 79  cudaStream_t stream)     //in: CUDA stream (default is 0x0)
 80{
 81  if (dataType == CUDA_C_64F) {
 82    const auto * tdCoefAdjoint = static_cast<const cuDoubleComplex *>(scalarGrad); // casting to cuDoubleComplex because this callback function expects CUDA_C_64F data type
 83    for (int64_t k = 0; k < batchSize; ++k) { // for each instance of the batch
 84      const auto omega = params[k * numParams + 0]; // params[0][k]: 0-th parameter for k-th instance of the batch
 85      paramsGrad[k * numParams + 0] += // IMPORTANT: Accumulate the partial derivative for the k-th instance of the batch, not overwrite it!
 86        2.0 * cuCreal(cuCmul(tdCoefAdjoint[k], make_cuDoubleComplex(std::exp((-omega) * time) * (-time), 0.0)));
 87    }
 88  } else {
 89    return 1; // error code (1: Error)
 90  }
 91  return 0; // error code (0: Success)
 92}
 93
 94
 95/** User-provided batched scalar CPU callback C function
 96 *  defining a batched time-dependent coefficient f(t) for all instances
 97 *  of the batch inside the Hamiltonian:
 98 *  f(t)[k] = exp(i * Omega[k] * t)
 99 *          = cos(Omega[k] * t) + i * sin(Omega[k] * t) for k = 0, ..., batchSize-1
100 */
101extern "C"
102int32_t fCoefBatchComplex64(
103  double time,             //in: time point
104  int64_t batchSize,       //in: user-defined batch size (number of coefficients in the batch)
105  int32_t numParams,       //in: number of external user-provided Hamiltonian parameters (this function expects one parameter, Omega)
106  const double * params,   //in: params[0:numParams-1][0:batchSize-1]: GPU-accessible F-ordered array of user-provided Hamiltonian parameters for all instances of the batch
107  cudaDataType_t dataType, //in: data type (expecting CUDA_C_64F in this specific callback function)
108  void * scalarStorage,    //inout: CPU-accessible storage for the returned batched coefficient values of shape [0:batchSize-1]
109  cudaStream_t stream)     //in: CUDA stream (default is 0x0)
110{
111  if (dataType == CUDA_C_64F) {
112    auto * tdCoef = static_cast<cuDoubleComplex *>(scalarStorage); // casting to cuDoubleComplex because this callback function expects CUDA_C_64F data type
113    for (int64_t k = 0; k < batchSize; ++k) { // for each instance of the batch
114      const auto omega = params[k * numParams + 0]; // params[0][k]: 0-th parameter for k-th instance of the batch
115      tdCoef[k] = make_cuDoubleComplex(std::cos(omega * time), std::sin(omega * time)); // value of the k-th instance of the batched coefficients
116    }
117  } else {
118    return 1; // error code (1: Error)
119  }
120  return 0; // error code (0: Success)
121}
122
123
124/** User-provided batched scalar gradient callback function (CPU-side) for the user-provided
125 *  batched scalar callback function fCoefBatchComplex64, defining the gradients with respect
126 *  to its single (batched) parameter Omega. It accumulates a partial derivative:
127 *    2*Re(dCost/dOmega[k]) = 2*Re(dCost/dCoef[k] * dCoef[k]/dOmega[k]),
128 *  where:
129 *  - Cost is some user-defined real scalar cost function,
130 *  - dCost/dCoef[k] is the adjoint of the cost function with respect to the k-th instance of the batched coefficient associated with the callback function,
131 *  - dCoef[k]/dOmega[k] is the gradient of the k-th instance of the batched coefficient with respect to the parameter Omega[k]:
132 *    dCoef[k]/dOmega[k] = i * t * exp(i * Omega[k] * t)
133 *                       = i * t * cos(Omega[k] * t) - t * sin(Omega[k] * t) for k = 0, ..., batchSize-1
134 */
135extern "C"
136int32_t fCoefBatchGradComplex64(
137  double time,             //in: time point
138  int64_t batchSize,       //in: user-defined batch size (number of coefficients in the batch)
139  int32_t numParams,       //in: number of external user-provided Hamiltonian parameters (this function expects one parameter, Omega)
140  const double * params,   //in: params[0:numParams-1][0:batchSize-1]: GPU-accessible F-ordered array of user-provided Hamiltonian parameters for all instances of the batch
141  cudaDataType_t dataType, //in: data type (expecting CUDA_C_64F in this specific callback function)
142  void * scalarGrad,       //in: CPU-accessible storage for the batched adjoint of the batched coefficient of shape [0:batchSize-1]
143  double * paramsGrad,     //inout: params[0:numParams-1][0:batchSize-1]: GPU-accessible F-ordered array of the returned gradients of the parameter(s) for all instances of the batch
144  cudaStream_t stream)     //in: CUDA stream (default is 0x0)
145{
146  if (dataType == CUDA_C_64F) {
147    const auto * tdCoefAdjoint = static_cast<const cuDoubleComplex *>(scalarGrad); // casting to cuDoubleComplex because this callback function expects CUDA_C_64F data type
148    for (int64_t k = 0; k < batchSize; ++k) {
149      const auto omega = params[k * numParams + 0]; // params[0][k]: 0-th parameter for k-th instance of the batch
150      paramsGrad[k * numParams + 0] += // IMPORTANT: Accumulate the partial derivative for the k-th instance of the batch, not overwrite it!
151        2.0 * cuCreal(cuCmul(tdCoefAdjoint[k], make_cuDoubleComplex(-std::sin(omega * time) * time, std::cos(omega * time) * time)));
152    }
153  } else {
154    return 1; // error code (1: Error)
155  }
156  return 0; // error code (0: Success)
157}
158
159
160/** Convenience class which encapsulates a user-defined Liouvillian operator (system Hamiltonian + dissipation terms):
161 *  - Constructor constructs the desired Liouvillian operator (`cudensitymatOperator_t`)
162 *  - Method `get()` returns a reference to the constructed Liouvillian operator
163 *  - Destructor releases all resources used by the Liouvillian operator
164 */
165class UserDefinedLiouvillian final
166{
167private:
168  // Data members
169  cudensitymatHandle_t handle;             // library context handle
170  int64_t operBatchSize;                   // batch size for the super-operator
171  const std::vector<int64_t> spaceShape;   // Hilbert space shape (extents of the modes of the composite Hilbert space)
172  void * spinXelems {nullptr};             // elements of the X spin operator in GPU RAM (F-order storage)
173  void * spinYYelems {nullptr};            // elements of the fused YY two-spin operator in GPU RAM (F-order storage)
174  void * spinZZelems {nullptr};            // elements of the fused ZZ two-spin operator in GPU RAM (F-order storage)
175  cudensitymatElementaryOperator_t spinX;  // X spin operator (elementary tensor operator)
176  cudensitymatElementaryOperator_t spinYY; // fused YY two-spin operator (elementary tensor operator)
177  cudensitymatElementaryOperator_t spinZZ; // fused ZZ two-spin operator (elementary tensor operator)
178  cudensitymatOperatorTerm_t oneBodyTerm;  // operator term: H1 = sum_{i} {h_i(t) * X_i} (one-body term)
179  cudensitymatOperatorTerm_t twoBodyTerm;  // operator term: H2 = f(t) * sum_{i < j} {g_ij * ZZ_ij} (two-body term)
180  cudensitymatOperatorTerm_t noiseTerm;    // operator term: D1 = d * sum_{i} {YY_ii}  // Y_i operators act from different sides on the density matrix (two-body mixed term)
181  // Batched coefficients
182  cuDoubleComplex * hCoefsStatic {nullptr}; // static part of the h(t) batched coefficients in the one-body term (of length operBatchSize)
183  cuDoubleComplex * hCoefsTotal {nullptr};  // total h(t) batched coefficients in the one-body term (of length operBatchSize)
184  cuDoubleComplex * fCoefsStaticMinus {nullptr}; // static part of the f(t) batched coefficients in the two-body term (of length operBatchSize)
185  cuDoubleComplex * fCoefsTotalMinus {nullptr};  // total f(t) batched coefficients in the two-body term (of length operBatchSize)
186  cuDoubleComplex * fCoefsStaticPlus {nullptr};  // static part of the f(t) batched coefficients in the dual two-body term (of length operBatchSize)
187  cuDoubleComplex * fCoefsTotalPlus {nullptr};   // total f(t) batched coefficients in the dual two-body term (of length operBatchSize)
188  // Final Liouvillian operator
189  cudensitymatOperator_t liouvillian; // full operator: (-i * (H1 + H2) * {..}) + (i * {..} * (H1 + H2)) + D1{..} (super-operator)
190
191public:
192
193  // Constructor constructs a user-defined Liouvillian operator
194  UserDefinedLiouvillian(cudensitymatHandle_t contextHandle,             // library context handle
195                         const std::vector<int64_t> & hilbertSpaceShape, // Hilbert space shape
196                         int64_t batchSize):                             // batch size for the super-operator
197    handle(contextHandle), operBatchSize(batchSize), spaceShape(hilbertSpaceShape)
198  {
199    // Define the necessary operator tensors in GPU memory (F-order storage!)
200    spinXelems = createInitializeArrayGPU<NumericalType>(  // X[i0; j0]
201                  {{0.0, 0.0}, {1.0, 0.0},   // 1st column of matrix X
202                   {1.0, 0.0}, {0.0, 0.0}}); // 2nd column of matrix X
203
204    spinYYelems = createInitializeArrayGPU<NumericalType>(  // YY[i0, i1; j0, j1] := Y[i0; j0] * Y[i1; j1]
205                    {{0.0, 0.0},  {0.0, 0.0}, {0.0, 0.0}, {-1.0, 0.0},  // 1st column of matrix YY
206                     {0.0, 0.0},  {0.0, 0.0}, {1.0, 0.0}, {0.0, 0.0},   // 2nd column of matrix YY
207                     {0.0, 0.0},  {1.0, 0.0}, {0.0, 0.0}, {0.0, 0.0},   // 3rd column of matrix YY
208                     {-1.0, 0.0}, {0.0, 0.0}, {0.0, 0.0}, {0.0, 0.0}}); // 4th column of matrix YY
209
210    spinZZelems = createInitializeArrayGPU<NumericalType>(  // ZZ[i0, i1; j0, j1] := Z[i0; j0] * Z[i1; j1]
211                    {{1.0, 0.0}, {0.0, 0.0},  {0.0, 0.0},  {0.0, 0.0},   // 1st column of matrix ZZ
212                     {0.0, 0.0}, {-1.0, 0.0}, {0.0, 0.0},  {0.0, 0.0},   // 2nd column of matrix ZZ
213                     {0.0, 0.0}, {0.0, 0.0},  {-1.0, 0.0}, {0.0, 0.0},   // 3rd column of matrix ZZ
214                     {0.0, 0.0}, {0.0, 0.0},  {0.0, 0.0},  {1.0, 0.0}}); // 4th column of matrix ZZ
215
216    // Construct the necessary Elementary Tensor Operators
217    //  X_i operator
218    HANDLE_CUDM_ERROR(cudensitymatCreateElementaryOperator(handle,
219                        1,                                   // one-body operator
220                        std::vector<int64_t>({2}).data(),    // acts in tensor space of shape {2}
221                        CUDENSITYMAT_OPERATOR_SPARSITY_NONE, // dense tensor storage
222                        0,                                   // 0 for dense tensors
223                        nullptr,                             // nullptr for dense tensors
224                        dataType,                            // data type
225                        spinXelems,                          // tensor elements in GPU memory
226                        cudensitymatTensorCallbackNone,      // no tensor callback function (tensor is not time-dependent)
227                        cudensitymatTensorGradientCallbackNone, // no tensor gradient callback function
228                        &spinX));                            // the created elementary tensor operator
229    //  ZZ_ij = Z_i * Z_j fused operator
230    HANDLE_CUDM_ERROR(cudensitymatCreateElementaryOperator(handle,
231                        2,                                   // two-body operator
232                        std::vector<int64_t>({2,2}).data(),  // acts in tensor space of shape {2,2}
233                        CUDENSITYMAT_OPERATOR_SPARSITY_NONE, // dense tensor storage
234                        0,                                   // 0 for dense tensors
235                        nullptr,                             // nullptr for dense tensors
236                        dataType,                            // data type
237                        spinZZelems,                         // tensor elements in GPU memory
238                        cudensitymatTensorCallbackNone,      // no tensor callback function (tensor is not time-dependent)
239                        cudensitymatTensorGradientCallbackNone, // no tensor gradient callback function
240                        &spinZZ));                           // the created elementary tensor operator
241    //  YY_ii = Y_i * {..} * Y_i fused operator (note action from different sides)
242    HANDLE_CUDM_ERROR(cudensitymatCreateElementaryOperator(handle,
243                        2,                                   // two-body operator
244                        std::vector<int64_t>({2,2}).data(),  // acts in tensor space of shape {2,2}
245                        CUDENSITYMAT_OPERATOR_SPARSITY_NONE, // dense tensor storage
246                        0,                                   // 0 for dense tensors
247                        nullptr,                             // nullptr for dense tensors
248                        dataType,                            // data type
249                        spinYYelems,                         // tensor elements in GPU memory
250                        cudensitymatTensorCallbackNone,      // no tensor callback function (tensor is not time-dependent)
251                        cudensitymatTensorGradientCallbackNone, // no tensor gradient callback function
252                        &spinYY));                           // the created elementary tensor operator
253
254    // Construct the necessary Operator Terms from tensor products of Elementary Tensor Operators
255    //  Create an empty operator term
256    HANDLE_CUDM_ERROR(cudensitymatCreateOperatorTerm(handle,
257                        spaceShape.size(),                   // Hilbert space rank (number of modes)
258                        spaceShape.data(),                   // Hilbert space shape (mode extents)
259                        &oneBodyTerm));                      // the created empty operator term
260    //  Define the batched operator term: H1[k] = sum_{i} {h_i(t)[k] * X_i}
261    hCoefsStatic = static_cast<cuDoubleComplex *>(createInitializeArrayGPU(
262      std::vector<cuDoubleComplex>(operBatchSize, make_cuDoubleComplex(1.0, 0.0)))); // 1.0 constant for all coefficient instances in the batch
263    hCoefsTotal = static_cast<cuDoubleComplex *>(createInitializeArrayGPU(
264      std::vector<cuDoubleComplex>(operBatchSize, make_cuDoubleComplex(0.0, 0.0)))); // storage for the total coefficients for all instances of the batch
265    for (int32_t i = 0; i < spaceShape.size(); ++i) {
266      HANDLE_CUDM_ERROR(cudensitymatOperatorTermAppendElementaryProductBatch(handle,
267                          oneBodyTerm,
268                          1,                                                             // number of elementary tensor operators in the product
269                          std::vector<cudensitymatElementaryOperator_t>({spinX}).data(), // elementary tensor operators forming the product
270                          std::vector<int32_t>({i}).data(),                              // space modes acted on by the operator product
271                          std::vector<int32_t>({0}).data(),                              // space mode action duality (0: from the left; 1: from the right)
272                          operBatchSize,                                                 // batch size
273                          hCoefsStatic,                                                  // static part of the h(t) batched coefficients in the one-body term
274                          hCoefsTotal,                                                   // total h(t) batched coefficients in the one-body term
275                          {hCoefBatchComplex64, CUDENSITYMAT_CALLBACK_DEVICE_CPU, nullptr}, // CPU batched scalar callback function defining the time-dependent coefficient associated with this operator product
276                          {hCoefBatchGradComplex64, CUDENSITYMAT_CALLBACK_DEVICE_CPU, nullptr})); // CPU batched scalar gradient callback function defining the gradient of the coefficient with respect to the parameter Omega
277    }
278    //  Create an empty operator term
279    HANDLE_CUDM_ERROR(cudensitymatCreateOperatorTerm(handle,
280                        spaceShape.size(),                   // Hilbert space rank (number of modes)
281                        spaceShape.data(),                   // Hilbert space shape (mode extents)
282                        &twoBodyTerm));                      // the created empty operator term
283    //  Define the operator term: H2 = f(t) * sum_{i < j} {g_ij * ZZ_ij}
284    for (int32_t i = 0; i < spaceShape.size() - 1; ++i) {
285      for (int32_t j = (i + 1); j < spaceShape.size(); ++j) {
286        const double g_ij = -1.0 / static_cast<double>(i + j + 1); // assign some value to the time-independent g_ij coefficient
287        HANDLE_CUDM_ERROR(cudensitymatOperatorTermAppendElementaryProduct(handle,
288                            twoBodyTerm,
289                            1,                                                              // number of elementary tensor operators in the product
290                            std::vector<cudensitymatElementaryOperator_t>({spinZZ}).data(), // elementary tensor operators forming the product
291                            std::vector<int32_t>({i, j}).data(),                            // space modes acted on by the operator product
292                            std::vector<int32_t>({0, 0}).data(),                            // space mode action duality (0: from the left; 1: from the right)
293                            make_cuDoubleComplex(g_ij, 0.0),                                // g_ij static coefficient: Always 64-bit-precision complex number
294                            cudensitymatScalarCallbackNone,                                 // no time-dependent coefficient associated with this operator product
295                            cudensitymatScalarGradientCallbackNone));                       // no coefficient gradient associated with this operator product
296      }
297    }
298    //  Create an empty operator term
299    HANDLE_CUDM_ERROR(cudensitymatCreateOperatorTerm(handle,
300                        spaceShape.size(),                   // Hilbert space rank (number of modes)
301                        spaceShape.data(),                   // Hilbert space shape (mode extents)
302                        &noiseTerm));                        // the created empty operator term
303    //  Define the operator term: D1 = d * sum_{i} {YY_ii}
304    for (int32_t i = 0; i < spaceShape.size(); ++i) {
305      HANDLE_CUDM_ERROR(cudensitymatOperatorTermAppendElementaryProduct(handle,
306                          noiseTerm,
307                          1,                                                              // number of elementary tensor operators in the product
308                          std::vector<cudensitymatElementaryOperator_t>({spinYY}).data(), // elementary tensor operators forming the product
309                          std::vector<int32_t>({i, i}).data(),                            // space modes acted on by the operator product (from different sides)
310                          std::vector<int32_t>({0, 1}).data(),                            // space mode action duality (0: from the left; 1: from the right)
311                          make_cuDoubleComplex(1.0, 0.0),                                 // default coefficient: Always 64-bit-precision complex number
312                          cudensitymatScalarCallbackNone,                                 // no time-dependent coefficient associated with this operator product
313                          cudensitymatScalarGradientCallbackNone));                       // no coefficient gradient associated with this operator product
314    }
315
316    // Construct the full Liouvillian operator as a sum of the created operator terms
317    //  Create an empty operator (super-operator)
318    HANDLE_CUDM_ERROR(cudensitymatCreateOperator(handle,
319                        spaceShape.size(),                // Hilbert space rank (number of modes)
320                        spaceShape.data(),                // Hilbert space shape (mode extents)
321                        &liouvillian));                   // the created empty operator (super-operator)
322    //  Append an operator term to the operator (super-operator)
323    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTerm(handle,
324                        liouvillian,
325                        oneBodyTerm,                      // appended operator term
326                        0,                                // operator term action duality as a whole (0: acting from the left; 1: acting from the right)
327                        make_cuDoubleComplex(0.0, -1.0),  // -i constant
328                        cudensitymatScalarCallbackNone,   // no time-dependent coefficient associated with the operator term as a whole
329                        cudensitymatScalarGradientCallbackNone)); // no coefficient gradient associated with the operator term as a whole
330    //  Append an operator term to the operator (super-operator)
331    fCoefsStaticMinus = static_cast<cuDoubleComplex *>(createInitializeArrayGPU(
332      std::vector<cuDoubleComplex>(operBatchSize, make_cuDoubleComplex(0.0, -1.0)))); // -i constant for all coefficient instances in the batch
333    fCoefsTotalMinus = static_cast<cuDoubleComplex *>(createInitializeArrayGPU(
334      std::vector<cuDoubleComplex>(operBatchSize, make_cuDoubleComplex(0.0, 0.0)))); // storage for the total coefficients for all instances of the batch
335    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTermBatch(handle,
336                        liouvillian,
337                        twoBodyTerm,                      // appended operator term
338                        0,                                // operator term action duality as a whole (0: acting from the left; 1: acting from the right)
339                        operBatchSize,                    // number of instances of the operator term in the batch (they differ by the coefficient value)
340                        fCoefsStaticMinus,                // static part of the f(t) batched coefficients in the two-body term
341                        fCoefsTotalMinus,                 // total f(t) batched coefficients in the two-body term
342                        {fCoefBatchComplex64, CUDENSITYMAT_CALLBACK_DEVICE_CPU, nullptr}, // CPU batched scalar callback function defining the time-dependent coefficient associated with this operator term as a whole
343                        {fCoefBatchGradComplex64, CUDENSITYMAT_CALLBACK_DEVICE_CPU, nullptr})); // CPU batched scalar gradient callback function defining the gradient of the coefficient with respect to parameter Omega
344    //  Append an operator term to the operator (super-operator)
345    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTerm(handle,
346                        liouvillian,
347                        oneBodyTerm,                      // appended operator term
348                        1,                                // operator term action duality as a whole (0: acting from the left; 1: acting from the right)
349                        make_cuDoubleComplex(0.0, +1.0),  // +i constant
350                        cudensitymatScalarCallbackNone,   // no time-dependent coefficient associated with the operator term as a whole
351                        cudensitymatScalarGradientCallbackNone)); // no coefficient gradient associated with the operator term as a whole
352    //  Append an operator term to the operator (super-operator)
353    fCoefsStaticPlus = static_cast<cuDoubleComplex *>(createInitializeArrayGPU(
354      std::vector<cuDoubleComplex>(operBatchSize, make_cuDoubleComplex(0.0, +1.0)))); // +i constant for all coefficient instances in the batch
355    fCoefsTotalPlus = static_cast<cuDoubleComplex *>(createInitializeArrayGPU(
356      std::vector<cuDoubleComplex>(operBatchSize, make_cuDoubleComplex(0.0, 0.0)))); // storage for the total coefficients for all instances of the batch
357    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTermBatch(handle,
358                        liouvillian,
359                        twoBodyTerm,                      // appended operator term
360                        1,                                // operator term action duality as a whole (0: acting from the left; 1: acting from the right)
361                        operBatchSize,                    // number of instances of the operator term in the batch (they differ by the coefficient value)
362                        fCoefsStaticPlus,                 // static part of the f(t) batched coefficients in the dual two-body term
363                        fCoefsTotalPlus,                  // total f(t) batched coefficients in the dual two-body term
364                        {fCoefBatchComplex64, CUDENSITYMAT_CALLBACK_DEVICE_CPU, nullptr}, // CPU batched scalar callback function defining the time-dependent coefficient associated with this operator term as a whole
365                        {fCoefBatchGradComplex64, CUDENSITYMAT_CALLBACK_DEVICE_CPU, nullptr})); // CPU batched scalar gradient callback function defining the gradient of the coefficient with respect to parameter Omega
366    //  Append an operator term to the operator (super-operator)
367    const double d = 1.0; // assign some value to the time-independent coefficient
368    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTerm(handle,
369                        liouvillian,
370                        noiseTerm,                        // appended operator term
371                        0,                                // operator term action duality as a whole (no duality reversing in this case)
372                        make_cuDoubleComplex(d, 0.0),     // static coefficient associated with the operator term as a whole
373                        cudensitymatScalarCallbackNone,   // no time-dependent coefficient associated with the operator term as a whole
374                        cudensitymatScalarGradientCallbackNone)); // no coefficient gradient associated with the operator term as a whole
375  }
376
377  // Destructor releases all resources used by the Liouvillian operator
378  ~UserDefinedLiouvillian()
379  {
380    // Destroy the Liouvillian operator
381    HANDLE_CUDM_ERROR(cudensitymatDestroyOperator(liouvillian));
382
383    // Destroy operator terms
384    HANDLE_CUDM_ERROR(cudensitymatDestroyOperatorTerm(noiseTerm));
385    HANDLE_CUDM_ERROR(cudensitymatDestroyOperatorTerm(twoBodyTerm));
386    HANDLE_CUDM_ERROR(cudensitymatDestroyOperatorTerm(oneBodyTerm));
387
388    // Destroy elementary tensor operators
389    HANDLE_CUDM_ERROR(cudensitymatDestroyElementaryOperator(spinYY));
390    HANDLE_CUDM_ERROR(cudensitymatDestroyElementaryOperator(spinZZ));
391    HANDLE_CUDM_ERROR(cudensitymatDestroyElementaryOperator(spinX));
392
393    // Destroy the batched coefficients
394    destroyArrayGPU(fCoefsTotalPlus);
395    destroyArrayGPU(fCoefsStaticPlus);
396    destroyArrayGPU(fCoefsTotalMinus);
397    destroyArrayGPU(fCoefsStaticMinus);
398    destroyArrayGPU(hCoefsTotal);
399    destroyArrayGPU(hCoefsStatic);
400
401    // Destroy operator tensors
402    destroyArrayGPU(spinYYelems);
403    destroyArrayGPU(spinZZelems);
404    destroyArrayGPU(spinXelems);
405  }
406
407  // Disable copy constructor/assignment (GPU resources are private, no deep copy)
408  UserDefinedLiouvillian(const UserDefinedLiouvillian &) = delete;
409  UserDefinedLiouvillian & operator=(const UserDefinedLiouvillian &) = delete;
410  UserDefinedLiouvillian(UserDefinedLiouvillian &&) = delete;
411  UserDefinedLiouvillian & operator=(UserDefinedLiouvillian &&) = delete;
412
413  /** Returns the number of externally provided Hamiltonian parameters. */
414  int32_t getNumParameters() const
415  {
416    return 1; // one parameter Omega
417  }
418
419  /** Get access to the constructed Liouvillian operator. */
420  cudensitymatOperator_t & get()
421  {
422    return liouvillian;
423  }
424
425};

Once the batched Liouvillian operator has been defined, the rest of the code logic is largely identical to the non-batched case.
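
The example prints the accumulated parameter gradients with printArrayGPU; in practice one typically also copies them back to the host, e.g. for an optimizer step. A minimal sketch of such a copy, assuming the variable names from the listing below and the HANDLE_CUDA_ERROR macro from helpers.h:

// Illustrative only: copy the accumulated parameter gradients from GPU to CPU
// after cudensitymatOperatorComputeActionBackwardDiff has completed.
std::vector<double> cpuGrads(numParams * batchSize);
HANDLE_CUDA_ERROR(cudaMemcpy(cpuGrads.data(), hamiltonianParamsGrad,
                             sizeof(double) * numParams * batchSize,
                             cudaMemcpyDeviceToHost));
// cpuGrads[k * numParams + i]: accumulated 2*Re(dCost/dParam_i[k]) for batch instance k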

  1/* Copyright (c) 2026-2026, NVIDIA CORPORATION & AFFILIATES.
  2 *
  3 * SPDX-License-Identifier: BSD-3-Clause
  4 */
  5
  6#include <cudensitymat.h>  // cuDensityMat library header
  7#include "helpers.h"       // helper functions
  8
  9
 10// Batched time-dependent transverse-field Ising Hamiltonian operator
 11// with ordered and fused ZZ terms, plus fused unitary dissipation terms
 12#include "transverse_ising_full_fused_noisy_batch_grad.h" // user-defined batched Liouvillian operator example
 13
 14#include <cmath>
 15#include <complex>
 16#include <vector>
 17#include <chrono>
 18#include <iostream>
 19#include <cassert>
 20
 21
 22// Number of times to perform operator action on a quantum state
 23constexpr int NUM_REPEATS = 2;
 24
 25// Logging verbosity
 26bool verbose = true;
 27
 28
 29// Example workflow
 30void exampleWorkflow(cudensitymatHandle_t handle)
 31{
 32  // Define the composite Hilbert space shape and
 33  // quantum state batch size (number of individual quantum states in a batched simulation)
 34  const std::vector<int64_t> spaceShape({2,2,2,2}); // dimensions of quantum degrees of freedom
 35  const int64_t batchSize = 3;                      // number of quantum states per batch
 36
 37  if (verbose) {
 38    std::cout << "Hilbert space rank = " << spaceShape.size() << "; Shape = (";
 39    for (const auto & dimsn: spaceShape)
 40      std::cout << dimsn << ",";
 41    std::cout << ")" << std::endl;
 42    std::cout << "Quantum state batch size = " << batchSize << std::endl;
 43  }
 44
 45  // Construct a user-defined batched Liouvillian operator using a convenience C++ class
 46  // Note that the constructed Liouvillian operator has some batched coefficients
 47  UserDefinedLiouvillian liouvillian(handle, spaceShape, batchSize);
 48  if (verbose)
 49    std::cout << "Constructed the Liouvillian operator\n";
 50
 51  // Set and place external user-provided Hamiltonian parameters in GPU memory
 52  const int32_t numParams = liouvillian.getNumParameters(); // number of external user-provided Hamiltonian parameters
 53  if (verbose)
 54    std::cout << "Number of external user-provided Hamiltonian parameters = " << numParams << std::endl;
 55  std::vector<double> cpuHamParams(numParams * batchSize);
 56  for (int64_t j = 0; j < batchSize; ++j) {
 57    for (int32_t i = 0; i < numParams; ++i) {
 58      cpuHamParams[j * numParams + i] = double(i+1) / double(j+1); // just setting some parameter values for each instance of the batch
 59    }
 60  }
 61  auto * hamiltonianParams = static_cast<double *>(createInitializeArrayGPU(cpuHamParams));
 62  if (verbose)
 63    std::cout << "Created an array of external user-provided Hamiltonian parameters in GPU memory\n";
 64
 65  // Create an array of gradients for the user-provided Hamiltonian parameters in GPU memory
 66  std::vector<double> cpuHamParamsGrad(numParams * batchSize, 0.0);
 67  auto * hamiltonianParamsGrad = static_cast<double *>(createInitializeArrayGPU(cpuHamParamsGrad));
 68  if (verbose)
 69    std::cout << "Created an array of gradients for the external user-provided Hamiltonian parameters in GPU memory\n";
 70
 71  // Declare the input quantum state
 72  cudensitymatState_t inputState;
 73  HANDLE_CUDM_ERROR(cudensitymatCreateState(handle,
 74                      CUDENSITYMAT_STATE_PURITY_MIXED,  // pure (state vector) or mixed (density matrix) state
 75                      spaceShape.size(),
 76                      spaceShape.data(),
 77                      batchSize,
 78                      dataType,
 79                      &inputState));
 80
 81  // Query the size of the quantum state storage
 82  std::size_t storageSize {0}; // only one storage component (tensor) is needed (no tensor factorization)
 83  HANDLE_CUDM_ERROR(cudensitymatStateGetComponentStorageSize(handle,
 84                      inputState,
 85                      1,               // only one storage component (tensor)
 86                      &storageSize));  // storage size in bytes
 87  const std::size_t stateVolume = storageSize / sizeof(NumericalType);  // quantum state tensor volume (number of elements)
 88  if (verbose)
 89    std::cout << "Quantum state storage size (bytes) = " << storageSize << std::endl;
 90
 91  // Prepare some initial value for the input quantum state batch
 92  std::vector<NumericalType> inputStateValue(stateVolume);
 93  if constexpr (std::is_same_v<NumericalType, float>) {
 94    for (std::size_t i = 0; i < stateVolume; ++i) {
 95      inputStateValue[i] = 1.0f / float(i+1); // just some value
 96    }
 97  } else if constexpr (std::is_same_v<NumericalType, double>) {
 98    for (std::size_t i = 0; i < stateVolume; ++i) {
 99      inputStateValue[i] = 1.0 / double(i+1); // just some value
100    }
101  } else if constexpr (std::is_same_v<NumericalType, std::complex<float>>) {
102    for (std::size_t i = 0; i < stateVolume; ++i) {
103      inputStateValue[i] = NumericalType{1.0f / float(i+1), -1.0f / float(i+2)}; // just some value
104    }
105  } else if constexpr (std::is_same_v<NumericalType, std::complex<double>>) {
106    for (std::size_t i = 0; i < stateVolume; ++i) {
107      inputStateValue[i] = NumericalType{1.0 / double(i+1), -1.0 / double(i+2)}; // just some value
108    }
109  } else {
110    std::cerr << "Error: Unsupported data type!\n";
111    std::exit(1);
112  }
113  // Allocate initialized GPU storage for the input quantum state with prepared values
114  auto * inputStateElems = createInitializeArrayGPU(inputStateValue);
115  if (verbose)
116    std::cout << "Allocated input quantum state storage and initialized it to some value\n";
117
118  // Attach initialized GPU storage to the input quantum state
119  HANDLE_CUDM_ERROR(cudensitymatStateAttachComponentStorage(handle,
120                      inputState,
121                      1,                                                 // only one storage component (tensor)
122                      std::vector<void*>({inputStateElems}).data(),      // pointer to the GPU storage for the quantum state
123                      std::vector<std::size_t>({storageSize}).data()));  // size of the GPU storage for the quantum state
124  if (verbose)
125    std::cout << "Constructed input quantum state\n";
126
127  // Declare the output quantum state of the same shape
128  cudensitymatState_t outputState;
129  HANDLE_CUDM_ERROR(cudensitymatCreateState(handle,
130                      CUDENSITYMAT_STATE_PURITY_MIXED,  // pure (state vector) or mixed (density matrix) state
131                      spaceShape.size(),
132                      spaceShape.data(),
133                      batchSize,
134                      dataType,
135                      &outputState));
136
137  // Allocate GPU storage for the output quantum state
138  auto * outputStateElems = createArrayGPU<NumericalType>(stateVolume);
139  if (verbose)
140    std::cout << "Allocated output quantum state storage\n";
141
142  // Attach GPU storage to the output quantum state
143  HANDLE_CUDM_ERROR(cudensitymatStateAttachComponentStorage(handle,
144                      outputState,
145                      1,                                                 // only one storage component (tensor)
146                      std::vector<void*>({outputStateElems}).data(),     // pointer to the GPU storage for the quantum state
147                      std::vector<std::size_t>({storageSize}).data()));  // size of the GPU storage for the quantum state
148  if (verbose)
149    std::cout << "Constructed output quantum state\n";
150
151  // Declare the adjoint input quantum state of the same shape
152  cudensitymatState_t inputStateAdj;
153  HANDLE_CUDM_ERROR(cudensitymatCreateState(handle,
154                      CUDENSITYMAT_STATE_PURITY_MIXED,  // pure (state vector) or mixed (density matrix) state
155                      spaceShape.size(),
156                      spaceShape.data(),
157                      batchSize,
158                      dataType,  // data type must match
159                      &inputStateAdj));
160
161  // Allocate GPU storage for the adjoint input quantum state
162  auto * inputStateAdjElems = createArrayGPU<NumericalType>(stateVolume);
163  if (verbose)
164    std::cout << "Allocated adjoint input quantum state storage\n";
165
166  // Attach GPU storage to the adjoint input quantum state
167  HANDLE_CUDM_ERROR(cudensitymatStateAttachComponentStorage(handle,
168                      inputStateAdj,
169                      1,                                                 // only one storage component (tensor)
170                      std::vector<void*>({inputStateAdjElems}).data(),   // pointer to the GPU storage for the quantum state
171                      std::vector<std::size_t>({storageSize}).data()));  // size of the GPU storage for the quantum state
172  if (verbose)
173    std::cout << "Constructed adjoint input quantum state\n";
174
175  // Declare a workspace descriptor
176  cudensitymatWorkspaceDescriptor_t workspaceDescr;
177  HANDLE_CUDM_ERROR(cudensitymatCreateWorkspace(handle, &workspaceDescr));
178
179  // Query free GPU memory
180  std::size_t freeMem = 0, totalMem = 0;
181  HANDLE_CUDA_ERROR(cudaMemGetInfo(&freeMem, &totalMem));
182  freeMem = static_cast<std::size_t>(static_cast<double>(freeMem) * 0.45); // take 45% of the free memory for the workspace buffer
183  if (verbose)
184    std::cout << "Max workspace buffer size (bytes) = " << freeMem << std::endl;
185
186  // Allocate GPU storage for the workspace buffer
187  const std::size_t bufferVolume = freeMem / sizeof(NumericalType);
188  auto * workspaceBuffer = createArrayGPU<NumericalType>(bufferVolume);
189  if (verbose)
190    std::cout << "Allocated workspace buffer of size (bytes) = " << freeMem << std::endl;
191
192  // Prepare the Liouvillian operator action on a quantum state (needs to be done only once)
193  auto startTime = std::chrono::high_resolution_clock::now();
194  HANDLE_CUDM_ERROR(cudensitymatOperatorPrepareAction(handle,
195                      liouvillian.get(),
196                      inputState,
197                      outputState,
198                      CUDENSITYMAT_COMPUTE_64F,  // GPU compute type
199                      freeMem,                   // max available GPU free memory for the workspace
200                      workspaceDescr,            // workspace descriptor
201                      0x0));                     // default CUDA stream
202  auto finishTime = std::chrono::high_resolution_clock::now();
203  std::chrono::duration<double> timeSec = finishTime - startTime;
204  if (verbose)
205    std::cout << "Operator action preparation time (sec) = " << timeSec.count() << std::endl;
206
207  // Query the required workspace buffer size (bytes)
208  std::size_t requiredBufferSize {0};
209  HANDLE_CUDM_ERROR(cudensitymatWorkspaceGetMemorySize(handle,
210                      workspaceDescr,
211                      CUDENSITYMAT_MEMSPACE_DEVICE,
212                      CUDENSITYMAT_WORKSPACE_SCRATCH,
213                      &requiredBufferSize));
214  if (verbose)
215    std::cout << "Required workspace buffer size (bytes) = " << requiredBufferSize << std::endl;
216
217  if (requiredBufferSize > freeMem) {
218    std::cerr << "Error: Required workspace buffer size is greater than the available GPU free memory!\n";
219    std::exit(1);
220  }
221
222  // Attach the workspace buffer to the workspace descriptor
223  HANDLE_CUDM_ERROR(cudensitymatWorkspaceSetMemory(handle,
224                      workspaceDescr,
225                      CUDENSITYMAT_MEMSPACE_DEVICE,
226                      CUDENSITYMAT_WORKSPACE_SCRATCH,
227                      workspaceBuffer,
228                      requiredBufferSize));
229  if (verbose)
230    std::cout << "Attached workspace buffer of size (bytes) = " << requiredBufferSize << std::endl;
231
232  // Apply the Liouvillian operator to the input quantum state
233  // and accumulate its action into the output quantum state (note the accumulative += semantics)
234  for (int32_t repeat = 0; repeat < NUM_REPEATS; ++repeat) { // repeat multiple times for accurate timing
235    // Zero out the output quantum state
236    HANDLE_CUDM_ERROR(cudensitymatStateInitializeZero(handle,
237                        outputState,
238                        0x0));
239    if (verbose)
240      std::cout << "Initialized the output quantum state to zero\n";
241    HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
242    startTime = std::chrono::high_resolution_clock::now();
243    HANDLE_CUDM_ERROR(cudensitymatOperatorComputeAction(handle,
244                        liouvillian.get(),
245                        0.3,                // time point (some value)
246                        batchSize,          // user-defined batch size
247                        numParams,          // number of external user-defined Hamiltonian parameters
248                        hamiltonianParams,  // external Hamiltonian parameters in GPU memory
249                        inputState,         // input quantum state
250                        outputState,        // output quantum state
251                        workspaceDescr,     // workspace descriptor
252                        0x0));              // default CUDA stream
253    HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
254    finishTime = std::chrono::high_resolution_clock::now();
255    timeSec = finishTime - startTime;
256    if (verbose)
257      std::cout << "Operator action computation time (sec) = " << timeSec.count() << std::endl;
258  }
259
260  // Compute the squared norm of the output quantum state
261  void * norm2 = createInitializeArrayGPU(std::vector<double>(batchSize, 0.0));
262  HANDLE_CUDM_ERROR(cudensitymatStateComputeNorm(handle,
263                      outputState,
264                      norm2,
265                      0x0));
266  if (verbose) {
267    std::cout << "Computed the output quantum state norm:\n";
268    printArrayGPU<double>(norm2, batchSize);
269  }
270
271  HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
272
273  // Prepare the Liouvillian operator action backward differentiation (needs to be done only once)
274  startTime = std::chrono::high_resolution_clock::now();
275  HANDLE_CUDM_ERROR(cudensitymatOperatorPrepareActionBackwardDiff(handle,
276                      liouvillian.get(),
277                      inputState,
278                      outputState,               // adjoint output quantum state is always congruent to the output quantum state
279                      CUDENSITYMAT_COMPUTE_64F,  // GPU compute type
280                      freeMem,                   // max available GPU free memory for the workspace buffer
281                      workspaceDescr,            // workspace descriptor
282                      0x0));                     // default CUDA stream
283  finishTime = std::chrono::high_resolution_clock::now();
284  timeSec = finishTime - startTime;
285  if (verbose)
286    std::cout << "Operator action backward differentiation preparation time (sec) = " << timeSec.count() << std::endl;
287
288  // Query the required workspace buffer size (bytes)
289  requiredBufferSize = 0;
290  HANDLE_CUDM_ERROR(cudensitymatWorkspaceGetMemorySize(handle,
291                      workspaceDescr,
292                      CUDENSITYMAT_MEMSPACE_DEVICE,
293                      CUDENSITYMAT_WORKSPACE_SCRATCH,
294                      &requiredBufferSize));
295  if (verbose)
296    std::cout << "Required workspace buffer size (bytes) = " << requiredBufferSize << std::endl;
297
298  if (requiredBufferSize > freeMem) {
299    std::cerr << "Error: Required workspace buffer size is greater than the available GPU free memory!\n";
300    std::exit(1);
301  }
302
303  // Attach the workspace buffer to the workspace descriptor
304  HANDLE_CUDM_ERROR(cudensitymatWorkspaceSetMemory(handle,
305                      workspaceDescr,
306                      CUDENSITYMAT_MEMSPACE_DEVICE,
307                      CUDENSITYMAT_WORKSPACE_SCRATCH,
308                      workspaceBuffer,
309                      requiredBufferSize));
310  if (verbose)
311    std::cout << "Attached workspace buffer of size (bytes) = " << requiredBufferSize << std::endl;
312
313  // Liouvillian operator action backward differentiation:
314  // The adjoint output quantum state, which is always congruent to the output quantum state,
315  // depends on the user-defined cost function, so here we simply pass the previously computed output quantum state.
316  // In real-life applications, the user will pass their adjoint output quantum state, computed for their cost function.
317  for (int32_t repeat = 0; repeat < NUM_REPEATS; ++repeat) { // repeat multiple times for accurate timing
318    // Zero out the adjoint input quantum state and gradients
319    HANDLE_CUDM_ERROR(cudensitymatStateInitializeZero(handle,
320                        inputStateAdj,
321                        0x0));
322    initializeArrayGPU(std::vector<double>(numParams * batchSize, 0.0), hamiltonianParamsGrad);
323    if (verbose)
324      std::cout << "Initialized the adjoint input quantum state and gradients to zero\n";
325    HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
326    startTime = std::chrono::high_resolution_clock::now();
327    HANDLE_CUDM_ERROR(cudensitymatOperatorComputeActionBackwardDiff(handle,
328                        liouvillian.get(),
329                        0.3,                    // time point (some value)
330                        batchSize,              // user-defined batch size
331                        numParams,              // number of external user-defined Hamiltonian parameters
332                        hamiltonianParams,      // external Hamiltonian parameters in GPU memory
333                        inputState,             // input quantum state
334                        outputState,            // adjoint output quantum state (here we just pass the previously computed output quantum state for simplicity)
335                        inputStateAdj,          // adjoint input quantum state
336                        hamiltonianParamsGrad,  // partial gradients with respect to the user-defined real parameters
337                        workspaceDescr,         // workspace descriptor
338                        0x0));                  // default CUDA stream
339    HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
340    finishTime = std::chrono::high_resolution_clock::now();
341    timeSec = finishTime - startTime;
342    if (verbose)
343      std::cout << "Operator action backward differentiation computation time (sec) = " << timeSec.count() << std::endl;
344  }
345
346  // Compute the squared norm of the adjoint input quantum state
347  initializeArrayGPU(std::vector<double>(batchSize, 0.0), norm2);
348  HANDLE_CUDM_ERROR(cudensitymatStateComputeNorm(handle,
349                      inputStateAdj,
350                      norm2,
351                      0x0));
352  if (verbose) {
353    std::cout << "Computed the adjoint input quantum state norm:\n";
354    printArrayGPU<double>(norm2, batchSize);
355    std::cout << "Hamiltonian parameters gradients:\n";
356    printArrayGPU<double>(hamiltonianParamsGrad, numParams * batchSize);
357  }
358
359  HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
360
361  // Destroy the norm2 array
362  destroyArrayGPU(norm2);
363
364  // Destroy workspace descriptor
365  HANDLE_CUDM_ERROR(cudensitymatDestroyWorkspace(workspaceDescr));
366
367  // Destroy workspace buffer storage
368  destroyArrayGPU(workspaceBuffer);
369
370  // Destroy quantum states
371  HANDLE_CUDM_ERROR(cudensitymatDestroyState(inputStateAdj));
372  HANDLE_CUDM_ERROR(cudensitymatDestroyState(outputState));
373  HANDLE_CUDM_ERROR(cudensitymatDestroyState(inputState));
374
375  // Destroy quantum state storage
376  destroyArrayGPU(inputStateAdjElems);
377  destroyArrayGPU(outputStateElems);
378  destroyArrayGPU(inputStateElems);
379
380  // Destroy external Hamiltonian parameters
381  destroyArrayGPU(static_cast<void *>(hamiltonianParamsGrad));
382  destroyArrayGPU(static_cast<void *>(hamiltonianParams));
383
384  if (verbose)
385    std::cout << "Destroyed resources\n" << std::flush;
386}
387
388
389int main(int argc, char ** argv)
390{
391  // Assign a GPU to the process
392  HANDLE_CUDA_ERROR(cudaSetDevice(0));
393  if (verbose)
394    std::cout << "Set active device\n";
395
396  // Create a library handle
397  cudensitymatHandle_t handle;
398  HANDLE_CUDM_ERROR(cudensitymatCreate(&handle));
399  if (verbose)
400    std::cout << "Created a library handle\n";
401
402  // Run the example
403  exampleWorkflow(handle);
404
405  // Destroy the library handle
406  HANDLE_CUDM_ERROR(cudensitymatDestroy(handle));
407  if (verbose)
408    std::cout << "Destroyed the library handle\n";
409
410  HANDLE_CUDA_ERROR(cudaDeviceReset());
411
412  // Done
413  return 0;
414}

Code example (serial execution of operator eigenspectrum computation)#

The following code example illustrates how to use the cuDensityMat library for computing the extreme eigenspectrum of a given operator. The full sample code can be found in the NVIDIA/cuQuantum repository (main serial eigenspectrum code and operator definition as well as the utility code).

First, similarly to the previous examples, we define a transverse-field Ising Hamiltonian with fused ZZ terms, whose eigenspectrum we want to compute. Specifically, in this case, we want to compute a number of the smallest real eigenvalues together with their corresponding eigenvectors (pure quantum states). For simplicity, the operator is time-independent (static).
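Concretely, reading off the coefficient values assigned in the header below (with 0-based site indices i, j matching the loops in the code), the operator is

$$
H \;=\; \sum_{i} h_i\, X_i \;+\; \sum_{i<j} g_{ij}\, Z_i Z_j,
\qquad h_i = \frac{1}{i+1},
\qquad g_{ij} = -\frac{1}{i+j+1}.
$$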

  1/* Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES.
  2 *
  3 * SPDX-License-Identifier: BSD-3-Clause
  4 */
  5
  6#pragma once
  7
  8#include <cudensitymat.h> // cuDensityMat library header
  9#include "helpers.h"      // GPU helper functions
 10
 11#include <cmath>
 12#include <complex>
 13#include <vector>
 14#include <iostream>
 15#include <cassert>
 16
 17
 18/* DESCRIPTION:
 19   Transverse-field Ising Hamiltonian operator with ordered and fused ZZ terms:
 20    H = sum_{i} {h_i * X_i}         // transverse field sum of X_i operators with static h_i coefficients 
 21      + sum_{i < j} {g_ij * ZZ_ij}  // sum of the fused ordered {Z_i * Z_j} terms with static g_ij coefficients
 22*/
 23
 24/** Define the numerical type and data type for the GPU computations (same) */
 25using NumericalType = std::complex<double>;      // do not change
 26constexpr cudaDataType_t dataType = CUDA_C_64F;  // do not change
 27
 28
 29/** Convenience class which encapsulates a user-defined Liouvillian operator (system Hamiltonian + dissipation terms):
 30 *  - Constructor constructs the desired Liouvillian operator (`cudensitymatOperator_t`)
 31 *  - Method `get()` returns a reference to the constructed Liouvillian operator
 32 *  - Destructor releases all resources used by the Liouvillian operator
 33 */
 34class UserDefinedLiouvillian final
 35{
 36private:
 37  // Data members
 38  cudensitymatHandle_t handle;             // library context handle
 39  int64_t stateBatchSize;                  // quantum state batch size
 40  const std::vector<int64_t> spaceShape;   // Hilbert space shape (extents of the modes of the composite Hilbert space)
 41  void * spinXelems {nullptr};             // elements of the X spin operator in GPU RAM (F-order storage)
 42  void * spinZZelems {nullptr};            // elements of the fused ZZ two-spin operator in GPU RAM (F-order storage)
 43  cudensitymatElementaryOperator_t spinX;  // X spin operator (elementary tensor operator)
 44  cudensitymatElementaryOperator_t spinZZ; // fused ZZ two-spin operator (elementary tensor operator)
 45  cudensitymatOperatorTerm_t oneBodyTerm;  // operator term: H1 = sum_{i} {h_i * X_i} (one-body term)
 46  cudensitymatOperatorTerm_t twoBodyTerm;  // operator term: H2 = sum_{i < j} {g_ij * ZZ_ij} (two-body term)
 47  cudensitymatOperator_t liouvillian;      // full operator: H = H1 + H2
 48
 49public:
 50
 51  // Constructor constructs a user-defined Liouvillian operator
 52  UserDefinedLiouvillian(cudensitymatHandle_t contextHandle,             // library context handle
 53                         const std::vector<int64_t> & hilbertSpaceShape, // Hilbert space shape
 54                         int64_t batchSize):                             // batch size for the quantum state
 55    handle(contextHandle), stateBatchSize(batchSize), spaceShape(hilbertSpaceShape)
 56  {
 57    // Define the necessary operator tensors in GPU memory (F-order storage!)
 58    spinXelems = createInitializeArrayGPU<NumericalType>(  // X[i0; j0]
 59                  {{0.0, 0.0}, {1.0, 0.0},   // 1st column of matrix X
 60                   {1.0, 0.0}, {0.0, 0.0}}); // 2nd column of matrix X
 61
 62    spinZZelems = createInitializeArrayGPU<NumericalType>(  // ZZ[i0, i1; j0, j1] := Z[i0; j0] * Z[i1; j1]
 63                    {{1.0, 0.0}, {0.0, 0.0},  {0.0, 0.0},  {0.0, 0.0},   // 1st column of matrix ZZ
 64                     {0.0, 0.0}, {-1.0, 0.0}, {0.0, 0.0},  {0.0, 0.0},   // 2nd column of matrix ZZ
 65                     {0.0, 0.0}, {0.0, 0.0},  {-1.0, 0.0}, {0.0, 0.0},   // 3rd column of matrix ZZ
 66                     {0.0, 0.0}, {0.0, 0.0},  {0.0, 0.0},  {1.0, 0.0}}); // 4th column of matrix ZZ
 67
 68    // Construct the necessary Elementary Tensor Operators
 69    //  X_i operator
 70    HANDLE_CUDM_ERROR(cudensitymatCreateElementaryOperator(handle,
 71                        1,                                   // one-body operator
 72                        std::vector<int64_t>({2}).data(),    // acts in tensor space of shape {2}
 73                        CUDENSITYMAT_OPERATOR_SPARSITY_NONE, // dense tensor storage
 74                        0,                                   // 0 for dense tensors
 75                        nullptr,                             // nullptr for dense tensors
 76                        dataType,                            // data type
 77                        spinXelems,                          // tensor elements in GPU memory
 78                        cudensitymatTensorCallbackNone,      // no tensor callback function (tensor is not time-dependent)
 79                        cudensitymatTensorGradientCallbackNone, // no tensor gradient callback function
 80                        &spinX));                            // the created elementary tensor operator
 81    //  ZZ_ij = Z_i * Z_j fused operator
 82    HANDLE_CUDM_ERROR(cudensitymatCreateElementaryOperator(handle,
 83                        2,                                   // two-body operator
 84                        std::vector<int64_t>({2,2}).data(),  // acts in tensor space of shape {2,2}
 85                        CUDENSITYMAT_OPERATOR_SPARSITY_NONE, // dense tensor storage
 86                        0,                                   // 0 for dense tensors
 87                        nullptr,                             // nullptr for dense tensors
 88                        dataType,                            // data type
 89                        spinZZelems,                         // tensor elements in GPU memory
 90                        cudensitymatTensorCallbackNone,      // no tensor callback function (tensor is not time-dependent)
 91                        cudensitymatTensorGradientCallbackNone, // no tensor gradient callback function
 92                        &spinZZ));                           // the created elementary tensor operator
 93
 94    // Construct the necessary Operator Terms from tensor products of Elementary Tensor Operators
 95    //  Create an empty operator term
 96    HANDLE_CUDM_ERROR(cudensitymatCreateOperatorTerm(handle,
 97                        spaceShape.size(),                   // Hilbert space rank (number of modes)
 98                        spaceShape.data(),                   // Hilbert space shape (mode extents)
 99                        &oneBodyTerm));                      // the created empty operator term
100    //  Define the operator term: H1 = sum_{i} {h_i * X_i}
101    for (int32_t i = 0; i < spaceShape.size(); ++i) {
102      const double h_i = 1.0 / static_cast<double>(i+1); // assign some value to the time-independent h_i coefficient
103      HANDLE_CUDM_ERROR(cudensitymatOperatorTermAppendElementaryProduct(handle,
104                          oneBodyTerm,
105                          1,                                                             // number of elementary tensor operators in the product
106                          std::vector<cudensitymatElementaryOperator_t>({spinX}).data(), // elementary tensor operators forming the product
107                          std::vector<int32_t>({i}).data(),                              // space modes acted on by the operator product
108                          std::vector<int32_t>({0}).data(),                              // space mode action duality (0: from the left; 1: from the right)
109                          make_cuDoubleComplex(h_i, 0.0),                                // h_i constant coefficient: Always 64-bit-precision complex number
110                          cudensitymatScalarCallbackNone,                                // no time-dependent coefficient associated with this operator product
111                          cudensitymatScalarGradientCallbackNone));                      // no coefficient gradient associated with this operator product
112    }
113    //  Create an empty operator term
114    HANDLE_CUDM_ERROR(cudensitymatCreateOperatorTerm(handle,
115                        spaceShape.size(),                   // Hilbert space rank (number of modes)
116                        spaceShape.data(),                   // Hilbert space shape (mode extents)
117                        &twoBodyTerm));                      // the created empty operator term
118    //  Define the operator term: H2 = sum_{i < j} {g_ij * ZZ_ij}
119    for (int32_t i = 0; i < spaceShape.size() - 1; ++i) {
120      for (int32_t j = (i + 1); j < spaceShape.size(); ++j) {
121        const double g_ij = -1.0 / static_cast<double>(i + j + 1); // assign some value to the time-independent g_ij coefficient
122        HANDLE_CUDM_ERROR(cudensitymatOperatorTermAppendElementaryProduct(handle,
123                            twoBodyTerm,
124                            1,                                                              // number of elementary tensor operators in the product
125                            std::vector<cudensitymatElementaryOperator_t>({spinZZ}).data(), // elementary tensor operators forming the product
126                            std::vector<int32_t>({i, j}).data(),                            // space modes acted on by the operator product
127                            std::vector<int32_t>({0, 0}).data(),                            // space mode action duality (0: from the left; 1: from the right)
128                            make_cuDoubleComplex(g_ij, 0.0),                                // g_ij constant coefficient: Always 64-bit-precision complex number
129                            cudensitymatScalarCallbackNone,                                 // no time-dependent coefficient associated with this operator product
130                            cudensitymatScalarGradientCallbackNone));                       // no coefficient gradient associated with this operator product
131      }
132    }
133
134    // Construct the full Liouvillian operator as a sum of the operator terms
135    //  Create an empty operator
136    HANDLE_CUDM_ERROR(cudensitymatCreateOperator(handle,
137                        spaceShape.size(),                // Hilbert space rank (number of modes)
138                        spaceShape.data(),                // Hilbert space shape (mode extents)
139                        &liouvillian));                   // the created empty operator
140    //  Append an operator term to the operator
141    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTerm(handle,
142                        liouvillian,
143                        oneBodyTerm,                      // appended operator term
144                        0,                                // operator term action duality as a whole (0: acting from the left; 1: acting from the right)
145                        make_cuDoubleComplex(1.0, 0.0),   // constant coefficient associated with the operator term as a whole
146                        cudensitymatScalarCallbackNone,   // no time-dependent coefficient associated with the operator term as a whole
147                        cudensitymatScalarGradientCallbackNone)); // no coefficient gradient associated with the operator term as a whole
148    //  Append the second operator term to the operator
149    HANDLE_CUDM_ERROR(cudensitymatOperatorAppendTerm(handle,
150                        liouvillian,
151                        twoBodyTerm,                      // appended operator term
152                        0,                                // operator term action duality as a whole (0: acting from the left; 1: acting from the right)
153                        make_cuDoubleComplex(1.0, 0.0),   // constant coefficient associated with the operator term as a whole
154                        cudensitymatScalarCallbackNone,   // no time-dependent coefficient associated with the operator term as a whole
155                        cudensitymatScalarGradientCallbackNone)); // no coefficient gradient associated with this operator term as a whole
156  }
157
158  // Destructor destructs the user-defined Liouvillian operator
159  ~UserDefinedLiouvillian()
160  {
161    // Destroy the Liouvillian operator
162    HANDLE_CUDM_ERROR(cudensitymatDestroyOperator(liouvillian));
163
164    // Destroy operator terms
165    HANDLE_CUDM_ERROR(cudensitymatDestroyOperatorTerm(twoBodyTerm));
166    HANDLE_CUDM_ERROR(cudensitymatDestroyOperatorTerm(oneBodyTerm));
167
168    // Destroy elementary tensor operators
169    HANDLE_CUDM_ERROR(cudensitymatDestroyElementaryOperator(spinZZ));
170    HANDLE_CUDM_ERROR(cudensitymatDestroyElementaryOperator(spinX));
171
172    // Destroy operator tensors
173    destroyArrayGPU(spinZZelems);
174    destroyArrayGPU(spinXelems);
175  }
176
177  // Disable copy constructor/assignment (GPU resources are private, no deep copy)
178  UserDefinedLiouvillian(const UserDefinedLiouvillian &) = delete;
179  UserDefinedLiouvillian & operator=(const UserDefinedLiouvillian &) = delete;
180  UserDefinedLiouvillian(UserDefinedLiouvillian &&) = delete;
181  UserDefinedLiouvillian & operator=(UserDefinedLiouvillian &&) = delete;
182
183  /** Returns the number of externally provided Hamiltonian parameters. */
184  int32_t getNumParameters() const
185  {
186    return 0; // no free parameters
187  }
188
189  /** Get access to the constructed Liouvillian operator. */
190  cudensitymatOperator_t & get()
191  {
192    return liouvillian;
193  }
194
195};

Once the operator has been defined, we follow the standard steps to create the quantum states that will store its eigenstates, then prepare the operator eigenspectrum computation, and, finally, compute the eigenspectrum. Note that the quantum states passed to the eigenspectrum compute call to receive the computed eigenstates also serve as the initial guesses for the first Krylov subspace block; if the block size is smaller than the number of requested eigenstates, only the leading quantum states are used as guesses.

  1/* Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES.
  2 *
  3 * SPDX-License-Identifier: BSD-3-Clause
  4 */
  5
  6#include <cudensitymat.h>  // cuDensityMat library header
  7#include "helpers.h"       // helper functions
  8
  9
 10// Transverse Ising Hamiltonian with double summation ordering and spin-operator fusion
 11#include "transverse_ising_full_fused.h"  // user-defined Liouvillian operator example
 12
 13#include <cmath>
 14#include <complex>
 15#include <vector>
 16#include <chrono>
 17#include <iostream>
 18#include <cassert>
 19
 20
 21// Logging verbosity
 22bool verbose = true;
 23
 24
 25// Example workflow
 26void exampleWorkflow(cudensitymatHandle_t handle)
 27{
 28  // Define the composite Hilbert space shape and
 29  // quantum state batch size (number of individual quantum states in a batched simulation)
 30  const std::vector<int64_t> spaceShape({2,2,2,2,2,2,2,2,2,2}); // dimensions of quantum degrees of freedom
 31  const int64_t batchSize = 1;        // number of quantum states per batch (currently only 1 state per batch)
 32  const int32_t numEigenStates = 4;   // number of eigenstates to compute
 33
 34  if (verbose) {
 35    std::cout << "Hilbert space rank = " << spaceShape.size() << "; Shape = (";
 36    for (const auto & dimsn: spaceShape)
 37      std::cout << dimsn << ",";
 38    std::cout << ")" << std::endl;
 39    std::cout << "Quantum state batch size = " << batchSize << std::endl;
 40  }
 41
 42  // Construct a user-defined Liouvillian operator using a convenience C++ class
 43  UserDefinedLiouvillian liouvillian(handle, spaceShape, batchSize);
 44  if (verbose)
 45    std::cout << "Constructed the Liouvillian operator\n";
 46
 47  // Create quantum states to store the eigenstates
 48  std::size_t stateVolume {0};
 49  std::vector<cudensitymatState_t> eigenStates(numEigenStates);
 50  std::vector<void *> eigenStatesElems(numEigenStates);
 51  for (int32_t id = 0; id < numEigenStates; ++id) {
 52
 53    // Declare the quantum state
 54    HANDLE_CUDM_ERROR(cudensitymatCreateState(handle,
 55                        CUDENSITYMAT_STATE_PURITY_PURE,  // pure (state vector)
 56                        spaceShape.size(),
 57                        spaceShape.data(),
 58                        batchSize,
 59                        dataType,
 60                        &eigenStates[id]));
 61
 62    // Query the size of the quantum state storage
 63    std::size_t storageSize {0}; // only one storage component (tensor) is needed (no tensor factorization)
 64    HANDLE_CUDM_ERROR(cudensitymatStateGetComponentStorageSize(handle,
 65                        eigenStates[id],
 66                        1,               // only one storage component (tensor)
 67                        &storageSize));  // storage size in bytes
 68    stateVolume = storageSize / sizeof(NumericalType);  // quantum state tensor volume (number of elements)
 69    if (verbose)
 70      std::cout << "Quantum state storage size (bytes) = " << storageSize << std::endl;
 71
 72    // Prepare some initial value for the quantum state
 73    std::vector<NumericalType> stateValue(stateVolume);
 74    if constexpr (std::is_same_v<NumericalType, double>) {
 75      for (std::size_t i = 0; i < stateVolume; ++i) {
 76        stateValue[i] = 1.0 / double(id*5 + i+1); // just some value
 77      }
 78    } else if constexpr (std::is_same_v<NumericalType, std::complex<double>>) {
 79      for (std::size_t i = 0; i < stateVolume; ++i) {
 80        stateValue[i] = NumericalType{1.0 / double(id*5 + i+1), -1.0 / double(id*3 + i+2)}; // just some value
 81      }
 82    } else {
 83      std::cerr << "Error: Unsupported data type!\n";
 84      std::exit(1);
 85    }
 86    // Allocate initialized GPU storage for the quantum state with prepared values
 87    eigenStatesElems[id] = createInitializeArrayGPU(stateValue);
 88    if (verbose)
 89      std::cout << "Allocated quantum state storage and initialized it to some value\n";
 90
 91    // Attach initialized GPU storage to the quantum state
 92    HANDLE_CUDM_ERROR(cudensitymatStateAttachComponentStorage(handle,
 93                        eigenStates[id],
 94                        1,                                                 // only one storage component (tensor)
 95                        std::vector<void*>({eigenStatesElems[id]}).data(), // pointer to the GPU storage for the quantum state
 96                        std::vector<std::size_t>({storageSize}).data()));  // size of the GPU storage for the quantum state
 97    if (verbose)
 98      std::cout << "Constructed quantum state\n";
 99  }
100
101  // Allocate storage for the eigenvalues and convergence tolerances
102  void * eigenvalues = createArrayGPU<NumericalType>(numEigenStates * batchSize);
103  std::vector<double> tolerances(numEigenStates * batchSize, 1e-6);
104
105  // Declare a workspace descriptor
106  cudensitymatWorkspaceDescriptor_t workspaceDescr;
107  HANDLE_CUDM_ERROR(cudensitymatCreateWorkspace(handle, &workspaceDescr));
108
109  // Query free GPU memory
110  std::size_t freeMem = 0, totalMem = 0;
111  HANDLE_CUDA_ERROR(cudaMemGetInfo(&freeMem, &totalMem));
112  freeMem = static_cast<std::size_t>(static_cast<double>(freeMem) * 0.95); // take 95% of the free memory for the workspace buffer
113  if (verbose)
114    std::cout << "Max workspace buffer size (bytes) = " << freeMem << std::endl;
115
116  // Create the operator eigenspectrum computation object
117  cudensitymatOperatorSpectrum_t spectrum;
118  HANDLE_CUDM_ERROR(cudensitymatCreateOperatorSpectrum(handle,
119                      liouvillian.get(),                             // operator whose eigenspectrum is requested
120                      1,                                             // Hermitian operator
121                      CUDENSITYMAT_OPERATOR_SPECTRUM_SMALLEST_REAL,  // request the smallest real eigenvalues
122                      &spectrum));                                   // the created operator eigenspectrum computation object
123
124  // Prepare the operator eigenspectrum computation (needs to be done only once)
125  auto startTime = std::chrono::high_resolution_clock::now();
126  HANDLE_CUDM_ERROR(cudensitymatOperatorSpectrumPrepare(handle,
127                      spectrum,                  // operator eigenspectrum computation object
128                      numEigenStates,            // number of eigenstates to compute
129                      eigenStates[0],            // representative quantum state (defines the state shape and data type)
130                      CUDENSITYMAT_COMPUTE_64F,  // GPU compute type
131                      freeMem,                   // max available GPU free memory for the workspace
132                      workspaceDescr,            // workspace descriptor
133                      0x0));                     // default CUDA stream
134  auto finishTime = std::chrono::high_resolution_clock::now();
135  std::chrono::duration<double> timeSec = finishTime - startTime;
136  if (verbose)
137    std::cout << "Operator eigenspectrum preparation time (sec) = " << timeSec.count() << std::endl;
138
139  // Query the required workspace buffer size (bytes)
140  std::size_t requiredBufferSize {0};
141  HANDLE_CUDM_ERROR(cudensitymatWorkspaceGetMemorySize(handle,
142                      workspaceDescr,
143                      CUDENSITYMAT_MEMSPACE_DEVICE,
144                      CUDENSITYMAT_WORKSPACE_SCRATCH,
145                      &requiredBufferSize));
146  if (verbose)
147    std::cout << "Required workspace buffer size (bytes) = " << requiredBufferSize << std::endl;
148
149  // Allocate GPU storage for the workspace buffer
150  const std::size_t bufferVolume = requiredBufferSize / sizeof(NumericalType);
151  auto * workspaceBuffer = createArrayGPU<NumericalType>(bufferVolume);
152  if (verbose)
153    std::cout << "Allocated workspace buffer of size (bytes) = " << requiredBufferSize << std::endl;
154
155  // Attach the workspace buffer to the workspace descriptor
156  HANDLE_CUDM_ERROR(cudensitymatWorkspaceSetMemory(handle,
157                      workspaceDescr,
158                      CUDENSITYMAT_MEMSPACE_DEVICE,
159                      CUDENSITYMAT_WORKSPACE_SCRATCH,
160                      workspaceBuffer,
161                      requiredBufferSize));
162  if (verbose)
163    std::cout << "Attached workspace buffer of size (bytes) = " << requiredBufferSize << std::endl;
164
165  // Compute the operator eigenspectrum
166  HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
167  startTime = std::chrono::high_resolution_clock::now();
168  HANDLE_CUDM_ERROR(cudensitymatOperatorSpectrumCompute(handle,
169                      spectrum,            // operator eigenspectrum computation object
170                      0.0,                 // time point (the operator is time-independent)
171                      batchSize,           // user-defined batch size
172                      0,                   // number of external user-defined Hamiltonian parameters (none here)
173                      nullptr,             // external Hamiltonian parameters (none here)
174                      numEigenStates,      // number of requested eigenstates
175                      eigenStates.data(),  // quantum states: initial guesses on input, computed eigenstates on output
176                      eigenvalues,         // computed eigenvalues in GPU memory
177                      tolerances.data(),   // convergence tolerances on input, achieved residual norms on output
178                      workspaceDescr,      // workspace descriptor
179                      0x0));               // default CUDA stream
180  HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
181  finishTime = std::chrono::high_resolution_clock::now();
182  timeSec = finishTime - startTime;
183  if (verbose)
184    std::cout << "Operator eigenspectrum computation time (sec) = " << timeSec.count() << std::endl;
185
186  // Print the eigenvalues
187  if (verbose) {
188    std::cout << "Eigenvalues:\n";
189    printArrayGPU<NumericalType>(eigenvalues, numEigenStates);
190  }
191
192  // Print the residual norms
193  if (verbose) {
194    std::cout << "Residual norms:\n";
195    printArrayCPU<double>(tolerances.data(), numEigenStates);
196  }
197
198  HANDLE_CUDA_ERROR(cudaDeviceSynchronize());
199
200  // Destroy workspace descriptor
201  HANDLE_CUDM_ERROR(cudensitymatDestroyWorkspace(workspaceDescr));
202
203  // Destroy workspace buffer storage
204  destroyArrayGPU(workspaceBuffer);
205
206  // Destroy operator eigenspectrum computation object
207  HANDLE_CUDM_ERROR(cudensitymatDestroyOperatorSpectrum(spectrum));
208
209  // Destroy eigenvalues storage
210  destroyArrayGPU(eigenvalues);
211
212  // Destroy quantum states
213  for (int32_t id = 0; id < numEigenStates; ++id)
214    HANDLE_CUDM_ERROR(cudensitymatDestroyState(eigenStates[id]));
215
216  // Destroy quantum state storage
217  for (int32_t id = 0; id < numEigenStates; ++id)
218    destroyArrayGPU(eigenStatesElems[id]);
219
220  if (verbose)
221    std::cout << "Destroyed resources\n" << std::flush;
222}
223
224
225int main(int argc, char ** argv)
226{
227  // Assign a GPU to the process
228  HANDLE_CUDA_ERROR(cudaSetDevice(0));
229  if (verbose)
230    std::cout << "Set active device\n";
231
232  // Create a library handle
233  cudensitymatHandle_t handle;
234  HANDLE_CUDM_ERROR(cudensitymatCreate(&handle));
235  if (verbose)
236    std::cout << "Created a library handle\n";
237
238  // Run the example
239  exampleWorkflow(handle);
240
241  // Destroy the library handle
242  HANDLE_CUDM_ERROR(cudensitymatDestroy(handle));
243  if (verbose)
244    std::cout << "Destroyed the library handle\n";
245
246  HANDLE_CUDA_ERROR(cudaDeviceReset());
247
248  // Done
249  return 0;
250}

Useful tips#

  • For debugging, one can set the environment variable CUDENSITYMAT_LOG_LEVEL=n, where the level n = 0, 1, …, 5 corresponds to the logger verbosity described in the table below. The environment variable CUDENSITYMAT_LOG_FILE=<filepath> can be used to redirect the log output to a custom file at <filepath> instead of stdout; a minimal usage example is shown after the table.

| Level | Summary           | Long Description                                                                                      |
|-------|-------------------|-------------------------------------------------------------------------------------------------------|
| 0     | Off               | Logging is disabled (default)                                                                           |
| 1     | Errors            | Only errors will be logged                                                                              |
| 2     | Performance Trace | API calls that launch CUDA kernels will log their parameters and important information                  |
| 3     | Performance Hints | Hints that can potentially improve the application’s performance                                        |
| 4     | Heuristics Trace  | Provides general information about the library execution, may contain details about heuristic status    |
| 5     | API Trace         | API calls will log their parameters and important information                                           |
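
For example, to capture a full API trace of the serial sample in a log file (a minimal illustration; the log file path is arbitrary):

export CUDENSITYMAT_LOG_LEVEL=5
export CUDENSITYMAT_LOG_FILE=/tmp/cudensitymat_api_trace.log
./operator_action_example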