cuDSS Tips and tricks#

Types of errors#

Default mode#

It is helpful to understand that there are two different types of errors which can occur when using cuDSS:

  • host-side errors

    All cuDSS API calls return a value of type cudssStatus_t which can be checked for CUDSS_STATUS_SUCCESS.
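    For illustration, a minimal host-side check might look as follows (the handle, config, data, and matrix names are illustrative):

      cudssStatus_t status = cudssExecute(handle, CUDSS_PHASE_ANALYSIS,
                                          config, data, A, x, b);
      if (status != CUDSS_STATUS_SUCCESS) {
          fprintf(stderr, "cudssExecute (analysis) failed: status = %d\n",
                  (int)status);
      }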

    In hybrid execution mode, one can check for host-side errors after a cudssExecute() call via cudssDataGet() with the parameter CUDSS_DATA_INFO.

  • device-side errors

    These can occur asynchronously and thus require extra effort to track them. The main routine of cuDSS, cudssExecute(), can detect and report some of the device-side errors if the user calls cudssDataGet() with the parameter CUDSS_DATA_INFO.

    Note: calling cudssDataGet() performs a stream synchronization internally, which comes at a cost. Hence, for performance reasons, it is suggested that users insert as few calls to cudssDataGet() as possible when not debugging.

Also, when debugging issues which may be related to cuDSS, the following general tips apply:

  • Enable logging. Specifically, re-running with CUDSS_LOG_LEVEL=5 set in the environment often provides useful information regarding the cause of the issue.

  • Since cuDSS supports asynchronous execution, to pinpoint the exact location of a failing cuDSS call, one can insert synchronization points between cuDSS calls (specifically, for cudssExecute()) or, while debugging, set the environment variable CUDA_LAUNCH_BLOCKING to 1.

    Specifically, one can call cudaStreamSynchronize(stream) followed by cudaGetLastError() to check for potential CUDA API errors inside cuDSS. This can be combined with checking for cuDSS-specific device side errors via cudssDataGet(), as mentioned above.
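    Combining these, a minimal debugging sketch (illustrative names; assuming CUDSS_DATA_INFO is retrieved as an int, and keeping the synchronization cost noted above in mind) might look like:

      cudssExecute(handle, CUDSS_PHASE_FACTORIZATION, config, data, A, x, b);
      cudaStreamSynchronize(stream);          // force completion of device work
      cudaError_t cuerr = cudaGetLastError(); // catch CUDA errors from inside cuDSS
      if (cuerr != cudaSuccess)
          fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(cuerr));

      int info = 0;
      size_t sizeWritten = 0;
      // CUDSS_DATA_INFO reports device-side errors detected by cudssExecute()
      cudssDataGet(handle, data, CUDSS_DATA_INFO, &info, sizeof(info), &sizeWritten);
      if (info != 0)
          fprintf(stderr, "cuDSS reported info = %d\n", info);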

MGMN mode#

While MGMN mode of cuDSS is subject to all debugging tips described above, there are extra tips specific to this mode.

  • Many different errors can occur in multi-GPU/multi-node (MGMN) environments. When debugging cuDSS in MGMN mode, it is advised to first check the health of the system configuration with a simple benchmark of the specific communication backend, using the exact same launch parameters as in the application. As an example, for NCCL one could use the NCCL tests.

  • The communication backend in use (together with the communication layer library linked against it) can have its own limitations. E.g., NCCL allows only one process per GPU (unlike CUDA-aware OpenMPI).

  • In order to enable MGMN mode in cuDSS, the user should set the communication layer via cudssSetCommLayer() and the communicator via cudssDataSet() with CUDSS_DATA_COMM. If there is a mismatch between the communication backend used for building the communication layer library and the communication backend used to create the communicator (which can happen, e.g. if NCCL is used as a communication backend with MPI used to initialize the environment), errors might occur.
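    For illustration, a minimal sketch of enabling MGMN mode with MPI as the communication backend (illustrative handle names; the library path is a placeholder to be filled in) might look like:

      MPI_Comm comm = MPI_COMM_WORLD;
      // the layer library must match the backend the communicator comes from
      cudssSetCommLayer(handle, "<path to the communication layer library>");
      cudssDataSet(handle, data, CUDSS_DATA_COMM, &comm, sizeof(MPI_Comm));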

  • In case there is an issue related to the communication backend or communication layer library, it can manifest itself in different ways, from an incorrect elimination tree (or some other functional issue) to error messages like “address cannot be mapped ...” and segmentation faults.

    To check whether the distributed environment is set up correctly, we suggest performing the following steps (with more than one process):

    • Initialize the distributed environment as in the application and call one of the simplest APIs, like ncclCommUserRank() or MPI_Comm_rank(), depending on which communication backend the application targets.

    • If the previous step works, cudaMalloc() a memory buffer on each of the processes and call a collective, e.g. ncclBcast() or MPI_Bcast(). This checks whether the communication backend is CUDA aware and capable of transferring device buffers.
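      For example, with CUDA-aware MPI this check might look like the following sketch (error handling omitted; assumes MPI_Init() has already been called):

        #include <mpi.h>
        #include <cuda_runtime.h>

        float *d_buf = NULL;
        cudaMalloc((void**)&d_buf, 1024 * sizeof(float));
        // a CUDA-aware backend accepts device pointers directly; a backend
        // which is not CUDA aware typically fails or crashes on this call
        MPI_Bcast(d_buf, 1024, MPI_FLOAT, /*root=*/0, MPI_COMM_WORLD);
        cudaFree(d_buf);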

    • If the previous step works, use a combination of dlopen() and dlsym() to call the same collective operation as in the previous step, using the communication layer library.

      #include <dlfcn.h> // required for dlopen() and dlsym()

      cudssDistributedInterface_t *commIface = NULL;
      void *commIfaceLib = NULL;

      const char *libname = "<path to the communication layer library>";

      commIfaceLib = dlopen(libname, RTLD_NOW);
      if (commIfaceLib == NULL) { /* bad: inspect dlerror() */ }

      // the communication layer library exports its function table under
      // the symbol "cudssDistributedInterface"
      commIface = (cudssDistributedInterface_t*)dlsym(commIfaceLib,
          "cudssDistributedInterface");
      if (commIface == NULL) { /* bad: inspect dlerror() */ }

      commIface->cudssBcast(...); /* same arguments as the direct collective call */

      If this experiment fails, it might be a linking issue (e.g., a mix of different communication backend libraries, or a version mismatch).

    • If the previous step works, separate the code which calls dlopen() and dlsym() into a separate file, build it as a shared library, and try to call the same collective from your main application code. This should not behave differently from the previous step, unless the communication layer (either a pre-built one or a custom one built by the user) was not built correctly.

For the full code showing how to check the communication layer, see the helper sample.

MT mode#

MT (multi-threaded) mode of cuDSS is subject to all general debugging tips from the previous sections.

Extra tips for the MT mode:

  • MT-mode specific issues can be related to the threading layer library. It can be checked with steps similar to those used for checking a communication layer library in the MGMN mode.

    Here, the role of a “simple” API (getting the process rank in the MGMN example) can be taken by omp_get_num_threads(), and the role of a more “complicated” one (where the MGMN tips suggest using a broadcast) by a parallel for, as in the sketch below.
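    A minimal sketch of such a check, calling OpenMP directly (with a threading layer library, the analogous calls would go through its dlopen()'ed interface):

      #include <stdio.h>
      #include <omp.h>

      int main(void) {
          // "simple" API check: is the OpenMP runtime responding?
          #pragma omp parallel
          {
              #pragma omp single
              printf("threads in the region: %d\n", omp_get_num_threads());
          }

          // "complicated" check: does a parallel for actually work?
          long sum = 0;
          #pragma omp parallel for reduction(+:sum)
          for (int i = 0; i < 1000; i++)
              sum += i;
          printf("sum = %ld (expected 499500)\n", sum);
          return 0;
      }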

  • Various sorts of errors (often hard to debug) can occur if different threading backends (including different OpenMP runtimes, e.g. libgomp.so and libomp.so) are mixed within one application.

    If an application already uses a threading backend, it is recommended to use the same backend (via the instructions to build a custom threading layer) for the MT mode of cuDSS.

For the full code showing how to check the threading layer, see the multi-threading test sample.

Common mistakes when using cuDSS#

This is an incomplete list of common mistakes when using cuDSS:

  • Inconsistent matrix data arrays (e.g., incorrectly populated indexing arrays for sparse matrices). Typically, such issues lead to critical failures (segmentation faults or CUDA API failures) or hangs during the reordering.

  • Inconsistent matrix data description (e.g., when non-array parameters passed to the cudssMatrixCreate...() routine are incompatible with the values in the data buffers). For sparse matrices, this includes the matrix type and matrix view. For example, if one passes CSR arrays for an upper-triangular matrix but the matrix view is set to CUDSS_MVIEW_LOWER, then only the diagonal values will be used by cuDSS. Often such issues lead to an incorrect result.

    Note: cudssExecute() has additional limitations on the supported matrix data which need to be satisfied.

  • Missing stream synchronization when accessing host data after factorization or solve phases. This mistake might lead to sporadically incorrect results appearing during checks on the host.

    Note: by default (non-MGMN and non-hybrid execution), factorization and solve phases are asynchronous.

    Note: cudssDataGet() performs synchronizations necessary for the retrieved data.
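    For example (illustrative names; assuming the stream was attached to the cuDSS handle via cudssSetStream()):

      cudssExecute(handle, CUDSS_PHASE_SOLVE, config, data, A, x, b);
      // without this synchronization, host-side checks may read stale data
      cudaStreamSynchronize(stream);
      // now it is safe to copy and inspect the solution values on the host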

  • Missing stream synchronization when estimating performance of factorization or solve phases using host timers. This mistake leads to wrong timings reported for these phases.

    Since the factorization and solve phases can be asynchronous and return control to the host before the kernel execution on the device has completed, one should insert a synchronization when measuring their performance as separate phases in order to capture the correct time with a host timer, as in the sketch below. See this section for more tips about measuring cuDSS performance.
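    A minimal timing sketch (illustrative names; error checks omitted):

      #include <chrono>

      cudaStreamSynchronize(stream); // drain previously queued work first
      auto t0 = std::chrono::steady_clock::now();
      cudssExecute(handle, CUDSS_PHASE_FACTORIZATION, config, data, A, x, b);
      cudaStreamSynchronize(stream); // without this, the timer stops too early
      auto t1 = std::chrono::steady_clock::now();
      double seconds = std::chrono::duration<double>(t1 - t0).count();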

Accuracy issues#

One of the most common approaches to measure accuracy of solving a (non-singular square) linear system \(A x = b\) is checking the relative norm of the residual

\[\frac{|| r ||}{|| A || \: || x || + || b ||},\]

where \(r = A x - b\) is called the residual, and \(|| \cdot ||\) are chosen matrix and vector norms. The choice of norms is not critical: e.g., one can use the maximum norm or the 2-norm for vectors, combined with an easy-to-compute matrix norm (e.g., the Frobenius norm) for the matrix \(A\).

Unless there are application-specific reasons why a different quantity (e.g., the inertia index or some function of the solution) should be used to determine whether the solution is accurate enough, we recommend using this approach.

For an example of how to compute the relative norm of the residual with a mixture of device and host code, see the residual code sample.
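As a quick illustration, a host-side sketch of this computation using the maximum norm for vectors (assuming the residual r and the matrix norm nrm_A, e.g. the Frobenius norm of the nonzero values, have already been computed):

  #include <math.h>

  double nrm_r = 0.0, nrm_x = 0.0, nrm_b = 0.0;
  for (int i = 0; i < n; i++) {
      nrm_r = fmax(nrm_r, fabs(r[i]));
      nrm_x = fmax(nrm_x, fabs(x[i]));
      nrm_b = fmax(nrm_b, fabs(b[i]));
  }
  double rel_res = nrm_r / (nrm_A * nrm_x + nrm_b);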

A known exception to this general recommendation is the situation when the exact solution is known. Then one can look at the solution error \(||x - x_{exact}||\), instead of the relative residual norm.

Note: both of the described accuracy estimators depend on the choice of the right-hand side (or exact solution). If, e.g., the system matrix has a block form and some block rows are scaled differently from the others, then a uniformly scaled right-hand side might not be the best choice and may lead to overly pessimistic accuracy estimates.

Irrespective of the choice of the accuracy estimator, its value is typically compared against a chosen tolerance which depends on the application. One says that there is an accuracy issue if the accuracy estimator value is larger than this tolerance.

There are different reasons why the accuracy may be lower than expected. We list some of them below, together with suggestions for how to fix them:

  • Inconsistent data (as described in the section about the common mistakes).

    Suggestion: inspect the matrix data and its description as it is passed to cuDSS.

  • Impractical tolerance for the solution error (e.g., not taking into account the standard error estimates for solving linear systems).

    A standard linear algebra estimate

    \[\frac{|| x - x_{exact} ||}{|| x_{exact} ||} \leq \kappa(A) \frac{|| b - b_{exact} ||}{|| b_{exact} ||},\]

    where \(\kappa(A) = || A || \: || A^{-1} ||\) is the condition number of \(A\), together with the inevitable presence of numerical errors due to inexact computer arithmetic, implies that even when the right-hand side is exact, it is in general unreasonable to expect the solution error to be smaller than the floating point precision multiplied by the condition number \(\kappa(A)\).

    Suggestion: use an estimate of the condition number \(\kappa(A)\) to set a reasonable error bound.

  • If the accuracy is estimated based on the solution error \(|| x - x_{exact} ||\), one should remember the standard linear algebra estimate:

    \[|| x - x_{exact} || \leq || A^{-1} || \: || r ||\]

    which puts a limit onto what one can guarantee in general for the solution error knowing only the residual \(r\).

    Suggestion: use the relative residual norm as a more robust error indicator.

  • Perturbed pivots appeared during the factorization.

    For the high-level overview of how numerical pivoting works in cuDSS, see the section.

    To check the number of perturbed pivots, one can call cudssDataGet() with CUDSS_DATA_NPIVOTS. The appearance of perturbed pivots is often an indicator that the matrix is not well-conditioned.

    There are multiple possible ways to mitigate the accuracy loss caused by the perturbed pivots (see also the sketch after this list). Suggestions:

    • Increase the number of iterative refinement steps. This can be done via calling cudssConfigSet() with CUDSS_CONFIG_IR_N_STEPS. By default, cuDSS performs zero iterative refinement steps.

    • Change the pivoting parameters within the same pivoting strategy (global or local), particularly, the pivoting epsilon. Often, the default pivoting epsilon might turn out to be inadequate for the magnitude of entries in the matrix.

    • Enable the pivoting epsilon scaling via cudssConfigSet() with CUDSS_CONFIG_PIVOT_EPSILON_ALG. This might help to get the right magnitude of the pivoting epsilon but in case different parts of the matrix are scaled differently or if the matrix is not well-conditioned, it might not help.

    • Enable matching and scaling via cudssConfigSet() with CUDSS_CONFIG_USE_MATCHING and CUDSS_CONFIG_MATCHING_ALG.

    • If local partial pivoting is used (the default), switch to global pivoting (use CUDSS_ALG_1 or CUDSS_ALG_2 for the reordering algorithm). While global pivoting (which affects performance of all phases) is usually significantly slower than local pivoting, it is currently the most robust option.
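    A minimal sketch of the first suggestion (illustrative names; assuming the pivot count is retrieved as an int; error checks omitted):

      int npivots = 0;
      size_t sizeWritten = 0;
      cudssDataGet(handle, data, CUDSS_DATA_NPIVOTS, &npivots,
                   sizeof(npivots), &sizeWritten);
      if (npivots > 0) {
          // perturbed pivots occurred: enable a few iterative refinement steps
          int ir_steps = 2; // tune for the application
          cudssConfigSet(config, CUDSS_CONFIG_IR_N_STEPS,
                         &ir_steps, sizeof(ir_steps));
          // re-run the solve phase with refinement enabled
          cudssExecute(handle, CUDSS_PHASE_SOLVE, config, data, A, x, b);
      }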

Performance measurements#

Getting performance measurements right can be quite hard and heavily depends on the application.

Like most direct sparse solvers, cuDSS has multiple phases (or stages). The three major ones are analysis (reordering combined with the symbolic factorization), numerical factorization, and solve.

The performance of each major phase can be fundamentally different from that of the others:

  • The reordering phase is currently done on the host, as the underlying graph algorithms are known to be extremely hard to accelerate on a GPU.

    The default reordering algorithm in cuDSS is a custom METIS-like (nested dissection) code and, as such, is often comparable in performance to other direct sparse solvers which also use the CPU for reordering.

    There are a limited number of ways to speed up the default reordering in cuDSS:

    • Enabling multi-threaded reordering in MT mode

    • Changing the reordering algorithm parameters for nested dissection, specifically CUDSS_CONFIG_ND_NLEVELS (the number of levels of nested dissection). Note: changing the number of levels can have a profound effect on the performance of the factorization and solve phases.

    • In case there is a batch of systems to solve (either different ones or with the same sparsity structure), one can use the non-uniform / uniform batch features of cuDSS to re-use results of the reordering.

    • In case the application (for any reason) needs to call the reordering phase on the same sparsity pattern multiple times, one can extract the reordering results from the previous call (via the output parameters CUDSS_DATA_PERM_REORDER_ROW, CUDSS_DATA_PERM_REORDER_COL, and CUDSS_DATA_ELIMINATION_TREE) and re-use them in the subsequent calls (via CUDSS_DATA_USER_ELIMINATION_TREE). Note: this is only supported when the reordering algorithm is not set to CUDSS_ALG_1 or CUDSS_ALG_2.

    Other reordering options are typically slower and should be used only when the change is motivated by something other than the performance of the reordering phase.

    Note: for some matrices, the reordering phase can take most of the total time, and thus the acceleration of other phases on a GPU might not be visible if one only compares the total time. In such cases, it is recommended to check whether the measured workflow is a reasonable proxy of the target application and whether the considered matrix is representative of the matrices used in practice.

  • Symbolic factorization is always computed on the device.

    Unless the reordering permutation leads to a very sequential elimination tree structure, symbolic factorization should be quicker than the reordering and thus should not be a bottleneck.

  • By default, numerical factorization is performed on the device. In hybrid execute mode, it can partially be performed on the host, and in hybrid memory mode, it can be done on the device while using a buffer on the host.

    The factorization phase is the most computationally intensive. However, the intensity heavily depends on the structure of the factors. Therefore, the performance of this phase may turn out to be either memory bandwidth bound (e.g., if the factors are sparse, which can be estimated by querying the number of nonzero entries in the factors via cudssDataGet() with CUDSS_DATA_LU_NNZ after the symbolic factorization) or compute bound.
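    For example, a sketch of this query (illustrative names; assuming the nonzero count is retrieved as a 64-bit integer):

      int64_t lu_nnz = 0;
      size_t sizeWritten = 0;
      // query after the analysis (symbolic factorization) phase has completed
      cudssDataGet(handle, data, CUDSS_DATA_LU_NNZ, &lu_nnz,
                   sizeof(lu_nnz), &sizeWritten);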

  • Solve is by default performed on the device (with the same remark about hybrid execute mode and hybrid memory mode as for the factorization phase).

    Typically, the solve phase requires far fewer floating point operations than the factorization phase and thus often exhibits performance behavior similar to a sparse triangular solve. It is much more often memory bandwidth bound (especially in the case of a single right-hand side, mainly due to passing the factors through the kernel) or synchronization bound (as the parallelism in a sparse triangular solve can be limited by the dependencies).

The following tips are valid for measuring cuDSS performance:

  • Combining the phases may increase performance, but then only the total time can be captured. To get the full picture, we recommend measuring the performance of each phase individually in a separate benchmark.

  • If host timers are used and the factorization and/or solve phases are executed on the device, one should account for potentially asynchronous execution of these phases. Specifically (as also described in the common mistakes section), one should add a stream synchronization before stopping the host timer.

  • Depending on the application, either “cold” or “warm” runs should be measured. As a rule of thumb, we recommend doing at least one warm-up run for the factorization and solve phases, unless the matrices are sufficiently large to neglect the cache effects and remove the overhead of allocating memory (if the call to a phase is repeated with the same settings, then cuDSS would re-use most or all of the data buffers).

  • Allocating and moving data to and from the GPU can take a lot of time. While some of the memory operations need to happen every time, a portion of them only need to be done once. Thus, having a warm-up run or using a device memory pool via cudssDeviceMemHandler can help move the memory costs outside the application’s critical path. Smaller matrices in particular can gain a lot of performance this way.

  • In case GPU utilization (which can be checked using performance profilers like Nsight Compute) is lower than expected, there are multiple ways to increase it by changing how the application calls cuDSS.

    • Re-using the analysis (reordering and symbolic factorization) and numerical factorization results by only calling the corresponding phase when the matrix structure or the matrix values change, respectively (in the latter case, one should call cudssExecute() with CUDSS_PHASE_REFACTORIZATION); see the sketch after this list.

    • Re-using reordering and symbolic factorization and / or numerical factorization through uniform batch API when multiple systems with the same sparsity structure need to be solved.

    • Solving multiple systems via non-uniform batch APIs.

    • Solving with multiple right-hand side vectors at the same time.

    • Performing factorization and/or solve on the host (and potentially providing host buffers for the inputs) by using the hybrid execute mode (only works for smaller matrices).
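    A minimal sketch of the refactorization re-use from the first item above (illustrative names; assuming the values of A changed but its sparsity pattern did not):

      // update the matrix values in place, then redo only the numerical work
      cudssMatrixSetValues(A, d_new_values);
      cudssExecute(handle, CUDSS_PHASE_REFACTORIZATION, config, data, A, x, b);
      cudssExecute(handle, CUDSS_PHASE_SOLVE, config, data, A, x, b);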