Getting Started

In this section, we show how to implement quantum computing simulation using cuStateVec. First, we describe how to install the library and how to compile code against it. Then, we present example code that performs the common steps in cuStateVec.

Installation and Compilation

Install cuStateVec from conda-forge

If you already have a Conda environment set up (if not, Miniforge/Mambaforge is a great starting point), cuStateVec can be easily installed from the conda-forge channel:

conda install -c conda-forge custatevec

Alternatively, you can install cuQuantum, which contains both cuStateVec and cuTensorNet, from the conda-forge channel:

conda install -c conda-forge cuquantum

If you need to select the target CUDA version, use the new cuda-version package:

conda install -c conda-forge cuquantum cuda-version=11

In any case, the conda solver will install all required dependencies for you. The package is installed to the current $CONDA_PREFIX, so you can simply point CUQUANTUM_ROOT, the variable used by the compile command below, at it:

export CUQUANTUM_ROOT=$CONDA_PREFIX


Be aware that it is not recommended to include Conda environment paths (such as $CONDA_PREFIX) as part of your LD_LIBRARY_PATH. It might be unsafe depending on your use case.

Install cuStateVec from NVIDIA DevZone

The cuQuantum package (which cuStateVec is part of) can be downloaded from the NVIDIA DevZone.

Taking the tarball distribution as an example, once the tarball file is downloaded to the CUQUANTUM_ROOT directory, you can unpack it via:

tar zxvf

and update the environment variable:

export CUQUANTUM_ROOT=/path/to/where/tarball/is/unpacked


Assuming cuQuantum has been extracted in CUQUANTUM_ROOT, we update the library path accordingly:

export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${LD_LIBRARY_PATH}

We can compile the sample code discussed below via the following command:

nvcc -I${CUQUANTUM_ROOT}/include -L${CUQUANTUM_ROOT}/lib -lcustatevec -o statevec_example


Depending on the source of the cuQuantum package, you may need to replace lib above by lib64.

Code Example

The following code example shows the common steps to use cuStateVec. Here we apply a Toffoli gate, which inverts the third bit (the target, index bit 2) when the first two bits (the controls, index bits 0 and 1) are both 1.

#include <cuda_runtime_api.h> // cudaMalloc, cudaMemcpy, etc.
#include <cuComplex.h>        // cuDoubleComplex
#include <custatevec.h>       // custatevecApplyMatrix
#include <stdio.h>            // printf
#include <stdlib.h>           // EXIT_FAILURE

int main(void) {

   const int nIndexBits = 3;
   const int nSvSize    = (1 << nIndexBits);
   const int nTargets   = 1;
   const int nControls  = 2;
   const int adjoint    = 0;

   int targets[]  = {2};
   int controls[] = {0, 1};

   // input state vector and the expected result (elements 3 and 7 swapped)
   cuDoubleComplex h_sv[]        = {{ 0.0, 0.0}, { 0.0, 0.1}, { 0.1, 0.1},
                                    { 0.1, 0.2}, { 0.2, 0.2}, { 0.3, 0.3},
                                    { 0.3, 0.4}, { 0.4, 0.5}};
   cuDoubleComplex h_sv_result[] = {{ 0.0, 0.0}, { 0.0, 0.1}, { 0.1, 0.1},
                                    { 0.4, 0.5}, { 0.2, 0.2}, { 0.3, 0.3},
                                    { 0.3, 0.4}, { 0.1, 0.2}};
   // Pauli-X matrix acting on the target bit
   cuDoubleComplex matrix[] = {{0.0, 0.0}, {1.0, 0.0},
                               {1.0, 0.0}, {0.0, 0.0}};

   // copy the state vector to the device
   cuDoubleComplex *d_sv;
   cudaMalloc((void**)&d_sv, nSvSize * sizeof(cuDoubleComplex));

   cudaMemcpy(d_sv, h_sv, nSvSize * sizeof(cuDoubleComplex),
              cudaMemcpyHostToDevice);

   // custatevec handle initialization
   custatevecHandle_t handle;
   custatevecCreate(&handle);

   void* extraWorkspace = nullptr;
   size_t extraWorkspaceSizeInBytes = 0;

   // check the size of external workspace
   custatevecApplyMatrixGetWorkspaceSize(
       handle, CUDA_C_64F, nIndexBits, matrix, CUDA_C_64F,
       CUSTATEVEC_MATRIX_LAYOUT_ROW, adjoint, nTargets, nControls,
       CUSTATEVEC_COMPUTE_64F, &extraWorkspaceSizeInBytes);

   // allocate external workspace if necessary
   if (extraWorkspaceSizeInBytes > 0)
       cudaMalloc(&extraWorkspace, extraWorkspaceSizeInBytes);

   // apply gate
   custatevecApplyMatrix(
       handle, d_sv, CUDA_C_64F, nIndexBits, matrix, CUDA_C_64F,
       CUSTATEVEC_MATRIX_LAYOUT_ROW, adjoint, targets, nTargets, controls,
       nullptr, nControls, CUSTATEVEC_COMPUTE_64F,
       extraWorkspace, extraWorkspaceSizeInBytes);

   // destroy handle
   custatevecDestroy(handle);

   // copy the result back to the host
   cudaMemcpy(h_sv, d_sv, nSvSize * sizeof(cuDoubleComplex),
              cudaMemcpyDeviceToHost);

   // compare against the expected state vector
   bool correct = true;
   for (int i = 0; i < nSvSize; i++) {
       if ((h_sv[i].x != h_sv_result[i].x) ||
           (h_sv[i].y != h_sv_result[i].y)) {
           correct = false;
           break;
       }
   }

   if (correct)
       printf("example PASSED\n");
   else
       printf("example FAILED: wrong result\n");

   // release device memory
   cudaFree(d_sv);
   if (extraWorkspaceSizeInBytes)
       cudaFree(extraWorkspace);

   return EXIT_SUCCESS;
}

More samples can be found in the NVIDIA/cuQuantum repository.

Useful tips

  • For debugging, the environment variable CUSTATEVEC_LOG_LEVEL=n can be set. The level n = 0, 1, …, 5 corresponds to the logger level as described and used in custatevecLoggerSetLevel(). The environment variable CUSTATEVEC_LOG_FILE=<filepath> can be used to direct the log output to a custom file at <filepath> instead of stdout.