cuDSS Advanced Features#

Hybrid host/device memory mode#

The main idea of the hybrid host/device memory mode is to overcome a limitation of the default (non-hybrid) mode: the L and U factors (which comprise the largest part of the total device memory consumed by cuDSS) must fit into the device memory. To this end, the hybrid memory mode keeps only a (smaller) portion of L and U in the device memory at any given time, while the entire factors are kept in the host memory.

While the hybrid memory mode comes with the additional cost of extra host-to-device and device-to-host memory transfers (and thus will be slower than the default mode if the factors fit into the device memory), it can help solve larger systems which cannot be processed in the default mode.

Users can enable or disable the hybrid memory mode by calling cudssConfigSet() with the setting CUDSS_CONFIG_HYBRID_MODE. Since using the hybrid mode implies changes during the analysis phase, it must be enabled before the call to cudssExecute() with CUDSS_PHASE_ANALYSIS.
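
For example, enabling the mode might look like the following minimal sketch (assuming a previously created cudssConfig_t object named solverConfig; the variable names are illustrative):

```c
/* Enable hybrid host/device memory mode (must happen before the analysis phase) */
int hybrid_mode = 1;
cudssConfigSet(solverConfig, CUDSS_CONFIG_HYBRID_MODE,
               &hybrid_mode, sizeof(hybrid_mode));
```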

Once the hybrid mode is enabled, there are two ways of using it:

  • The first way is to rely on the internal heuristic of cuDSS, which determines how much device memory the hybrid mode can consume. In this case, cuDSS assumes that it can use the entire GPU memory and sets the device memory limit based on the device properties (which may not be an accurate estimate, as the driver or other applications can reserve space on the GPU).

  • The second way gives users more control over the device memory consumption. Users can set the device memory limit by calling cudssConfigSet() with CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT. Optionally, users can also query the minimal amount of device memory which the hybrid memory mode would need by calling cudssDataGet() with CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN (see the sketch after this list). The device memory limit can be set after the analysis phase but must be set before the factorization phase.

    Note: CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT and CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN account for the total device memory needed for cuDSS, not just the factors.
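
A minimal sketch of the second way (assuming handle, solverData, and solverConfig were created earlier; the int64_t value type is an assumption):

```c
/* Query the minimal device memory needed by the hybrid memory mode
   (valid after the analysis phase) */
int64_t min_device_mem = 0; /* value type assumed to be int64_t */
size_t size_written = 0;
cudssDataGet(handle, solverData, CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN,
             &min_device_mem, sizeof(min_device_mem), &size_written);

/* Set the device memory limit (must happen before the factorization phase) */
int64_t device_mem_limit = min_device_mem; /* or any larger budget */
cudssConfigSet(solverConfig, CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT,
               &device_mem_limit, sizeof(device_mem_limit));
```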

Limitations:

  • Factors L and U (together with all necessary internal arrays) must fit into the host memory.

  • Hybrid memory mode must be enabled before the analysis phase.

  • Currently, hybrid memory mode adds extra synchronization of the CUDA stream, in all phases.

  • By default, hybrid memory mode uses cudaHostRegister()/cudaHostUnregister() (if the device supports it). As this can sometimes be slower than not using host-registered memory, there is a setting CUDSS_CONFIG_USE_CUDA_REGISTER_MEMORY to enable/disable the usage of cudaHostRegister().

  • Currently, hybrid memory mode is not supported when CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering, or, when CUDSS_ALG_1 is used for the factorization.

Hybrid host/device execute mode#

Hybrid execute mode allows cuDSS to perform calculations on both the GPU and the CPU. Currently it is used to speed up parts of the execution with low parallelization capacity. For now, we recommend this feature for the factorization and solve of small matrices.

Users can enable or disable the hybrid execute mode by calling cudssConfigSet() with the setting CUDSS_CONFIG_HYBRID_EXECUTE_MODE. Since using the hybrid execute mode implies changes during the analysis phase, it must be enabled before the call to cudssExecute() with CUDSS_PHASE_ANALYSIS.

Once the hybrid execute mode is enabled, the input matrix, right-hand side, and solution can be host memory pointers. For the input CSR matrix, csr_offsets, csr_columns, and csr_values can be host or device memory pointers independently. For example, csr_offsets and csr_columns can be in device memory while csr_values is in host memory.
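
A minimal sketch of enabling the mode (variable names are illustrative; solverConfig is assumed to be created earlier):

```c
/* Enable hybrid host/device execute mode (must happen before the analysis phase) */
int hybrid_execute_mode = 1;
cudssConfigSet(solverConfig, CUDSS_CONFIG_HYBRID_EXECUTE_MODE,
               &hybrid_execute_mode, sizeof(hybrid_execute_mode));
/* After this, the csr_offsets/csr_columns/csr_values buffers passed to
   cudssMatrixCreateCsr() may independently be host or device pointers */
```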

Limitations:

  • Hybrid execute mode must be enabled before the analysis phase.

  • Currently, hybrid execute mode adds extra synchronization of the CUDA stream in all phases.

  • Currently, hybrid execute mode is not supported when CUDSS_CONFIG_HYBRID_MODE or MGMN mode is used, when batchCount is greater than 1, or when CUDSS_CONFIG_REORDERING_ALG is set to CUDSS_ALG_1 or CUDSS_ALG_2. It is also not supported when nrhs is greater than 1 or when uniform batch options are enabled.

  • The system’s input matrix should not change between the end of the (RE)FACTORIZATION phase and the start of the corresponding SOLVE phase call(s). It might still work, depending on whether the matrix arrays were on the host or device and where the solve phase computations happen, but there are no guarantees.

Multi-GPU multi-node (MGMN) mode#

The idea of the multi-GPU multi-node (MGMN) mode is to allow using multiple GPU devices in cuDSS by means of distributed computations with multiple processes. For flexibility, the MGMN mode is enabled by abstracting away all communication-specific primitives into a small separately built shim communication layer. Users can have their own implementation of the communication layer with the communication backend of their choice (MPI, NCCL, etc.).

Since the shim communication layer abstracts away all communication-specific operations and is only loaded at runtime when MGMN mode is requested, support for MGMN execution in cuDSS does not require any changes in applications which do not make use of the MGMN mode.

MGMN mode supports 1D row-wise distribution (with overlapping) for the input CSR matrix as well as for the dense right-hand side and solution (see cudssMatrixSetDistributionRow1d()).

Enabling MGMN execution in the user application code consists of two steps:

  • First, users should set the communication layer library by calling cudssSetCommLayer(). This can be either one of the prebuilt communication layers from the cuDSS package or a custom user-built library. If cudssSetCommLayer() is called with NULL in place of the communication layer library name, the routine attempts to read the name from the environment variable CUDSS_COMM_LIB.

    Note: the communication layer library is set for the library handle and is used for all execution calls which involve the modified handle.

  • Second, users should set the communicator to be used by the cuDSS MGMN mode by calling cudssDataSet() with the CUDSS_DATA_COMM parameter name. The type of the communicator must match the underlying communication backend; e.g., if OpenMPI is used, the communicator should be an OpenMPI communicator, otherwise a crash will likely occur. A sketch of both steps follows this list.

    Note: since MGMN mode can support different (incl. user-implemented) communication layers, the limitations of using specific underlying communication backend (e.g. MPI or NCCL) apply.

    Note: all processes which participate in solving the system via cudssExecute() must have the same settings in the corresponding cudssConfig_t.
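
A minimal sketch of the two steps with the prebuilt OpenMPI layer (assuming MPI has been initialized and handle/solverData were created earlier; the library path is illustrative):

```c
#include <mpi.h>

/* Step 1: set the communication layer library on the handle */
cudssSetCommLayer(handle, "libcudss_commlayer_openmpi.so");

/* Step 2: pass the MPI communicator to cuDSS */
MPI_Comm comm = MPI_COMM_WORLD;
cudssDataSet(handle, solverData, CUDSS_DATA_COMM, &comm, sizeof(comm));
```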

Limitations:

  • The communication backend underlying the cuDSS communication layer must be GPU-aware in the sense that all distributed interface API implementations must accept device memory buffers and respect the stream ordering (most of the communication layer APIs take an argument of type cudaStream_t; see the definition of cudssDistributedInterface_t).

  • In MGMN mode all processes must have correct matrix shapes.

    In the distributed case (cudssMatrixSetDistributionRow1d()), all processes must have global matrix shapes, while the data buffers hold the local matrices.

  • MGMN mode is not supported when either CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.

  • MGMN mode does not support matrix batches.

  • All phases in MGMN mode are synchronous.

Communication layer library in cuDSS#

The purpose of the communication layer in cuDSS is to abstract away all communication primitives into a small set of only the necessary operations, built into a standalone shared library which is loaded at runtime when MGMN mode is enabled for cuDSS.

On platforms where MGMN mode is supported, distribution packages for cuDSS include the prebuilt communication libraries libcudss_commlayer_openmpi.so and libcudss_commlayer_nccl.so for OpenMPI and NCCL, respectively. Also included are the source code of the implementations, cudss_commlayer_openmpi.cu and cudss_commlayer_nccl.cu, and a script cudss_build_commlayer.sh with an example of how a communication layer library can be built. These source files and the script are provided for demonstration purposes and can be used as guidance for developing custom implementations.

Once a communication layer is implemented and built into a small standalone shared library, the application should enable MGMN mode for cuDSS via the steps mentioned above.

Note: if the communication layer depends on other shared libraries, such dependencies must be available at link time or runtime (depending on the communication layer implementation). For example, the prebuilt OpenMPI layer libcudss_commlayer_openmpi.so requires libmpi.so.40 to be found at link time, so an application should additionally be linked against the OpenMPI shared library if it uses this communication layer for cuDSS.

Communication layer (distributed interface) API in cuDSS#

The communication layer in cuDSS can be thought of as a wrapper around all necessary communication primitives required by the cuDSS MGMN mode. While user-defined implementations are supported, the distributed interface API is fixed. The API is contained in the dedicated header cudss_distributed_interface.h.

Specifically, the distributed interface API is encoded in the definition of the type cudssDistributedInterface_t, and a valid communication layer implementation must define a symbol named cudssDistributedInterface of this type. If such a symbol is not found, calling cudssSetCommLayer() will result in an error.

Multi-GPU (MG) mode#

The idea of the multi-GPU (MG) mode is to allow using multiple GPU devices within the same node in cuDSS without any communication layer.

Enabling MG execution in the user application code requires calling cudssCreateMg() and cudssConfigSet() with CUDSS_CONFIG_DEVICE_COUNT and CUDSS_CONFIG_DEVICE_INDICES.
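
A minimal sketch of the configuration part (the int value types for these parameters are an assumption; see the cuDSS API reference for the exact types, and note that the handle itself is created with cudssCreateMg() instead of cudssCreate()):

```c
/* Select two GPUs for MG mode (value types assumed to be int / int array) */
int device_count = 2;
int device_indices[2] = {0, 1};
cudssConfigSet(solverConfig, CUDSS_CONFIG_DEVICE_COUNT,
               &device_count, sizeof(device_count));
cudssConfigSet(solverConfig, CUDSS_CONFIG_DEVICE_INDICES,
               device_indices, sizeof(device_indices));
```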

Limitations:

  • Using MG mode jointly with MGMN mode is not supported.

  • Distributed input (cudssMatrixSetDistributionRow1d()) is not supported.

  • MG mode is not supported when either CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.

  • MG mode does not support matrix batches.

  • All phases in MG mode are synchronous.

Multi-Threaded (MT) mode#

The idea of the multi-threaded mode is to allow using multiple CPU threads in cuDSS. For flexibility, the MT mode is enabled by abstracting away all threading-specific primitives into a small separately built shim threading layer. Users can have their own implementation of the threading layer with the threading backend of their choice (OpenMP, pthreads, TBB, etc.).

Enabling MT execution in the user application code:

Users should set the threading layer library by calling cudssSetThreadingLayer(). This can be either one of the prebuilt threading layers from the cuDSS package or a custom user-built library. If cudssSetThreadingLayer() is called with NULL in place of the threading layer library name, the routine attempts to read the name from the environment variable CUDSS_THREADING_LIB.

Note: the threading layer library is set for the library handle and is used for all execution calls which involve the modified handle.

Note: since MT mode can support different (incl. user-implemented) threading layers, the limitations of using specific underlying threading backend (e.g. OpenMP or pthreads) apply.
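
A minimal sketch with the prebuilt GNU OpenMP layer (the library path is illustrative; handle is assumed to be created earlier):

```c
/* Set the threading layer on the handle; passing NULL instead of the file name
   would make cuDSS read it from the CUDSS_THREADING_LIB environment variable */
cudssSetThreadingLayer(handle, "libcudss_mtlayer_gomp.so");
```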

Threading layer library in cuDSS#

The purpose of the threading layer in cuDSS is to abstract away all multi-threading primitives into a small set of only the necessary operations, built into a standalone shared library which is loaded at runtime when MT mode is enabled for cuDSS.

On platforms where MT mode is supported, distribution packages for cuDSS include the prebuilt threading library libcudss_mtlayer_gomp.so for GNU OpenMP. Also included are the source code of the implementation, cudss_mtlayer_gomp.cu, and a script cudss_build_mtlayer.sh with an example of how a threading layer library can be built. These source files and the script are provided for demonstration purposes and can be used as guidance for developing custom implementations.

Once a threading layer is implemented and built into a small standalone shared library, the application should enable MT mode for cuDSS via the steps mentioned above.

Note: if the threading layer depends on other shared libraries, such dependencies must be available at runtime (depending on the threading layer implementation).

Threading layer API in cuDSS#

The threading layer in cuDSS can be thought of as a wrapper around all necessary threading primitives required by the cuDSS MT mode. While user-defined implementations are supported, the threading interface API is fixed. The API is contained in the dedicated header cudss_threading_interface.h.

Specifically, the threading interface API is encoded in the definition of the type cudssThreadingInterface_t, and a valid threading layer implementation must define a symbol named cudssThreadingInterface of this type. If such a symbol is not found, calling cudssSetThreadingLayer() will result in an error.

User-provided elimination tree data (saving reordering)#

One way of saving the results of the reordering phase (which can be time-consuming, as it is executed on the host) is to extract the resulting reordering permutation by calling cudssDataGet() with the CUDSS_DATA_PERM_REORDER_ROW parameter name and then supply it to future calls to cudssExecute() with CUDSS_PHASE_REORDERING by calling cudssDataSet() with the CUDSS_DATA_USER_PERM parameter.
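
A minimal sketch of this save/reuse pattern (assuming an n-by-n system with int index arrays; handle, solverData, and a later newSolverData come from the surrounding application):

```c
#include <stdlib.h>

/* Save the reordering permutation after the reordering/analysis phase */
int *perm = (int*) malloc(n * sizeof(int));
size_t size_written = 0;
cudssDataGet(handle, solverData, CUDSS_DATA_PERM_REORDER_ROW,
             perm, n * sizeof(int), &size_written);

/* ... later, reuse it for a new data object before CUDSS_PHASE_REORDERING */
cudssDataSet(handle, newSolverData, CUDSS_DATA_USER_PERM,
             perm, n * sizeof(int));
```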

However, the cuDSS factorization and solve phases in subsequent calls can be significantly slower than in the first call, as the auxiliary elimination tree structure information is lost when only the permutation vector is reused.

To restore the full performance, there is an option to save the aforementioned elimination tree data by calling cudssDataGet() with CUDSS_DATA_ELIMINATION_TREE, which can then be passed to subsequent calls to cudssDataSet() with CUDSS_DATA_USER_ELIMINATION_TREE before the CUDSS_PHASE_REORDERING phase [currently not supported].

While this is the recommended way to use CUDSS_DATA_ELIMINATION_TREE and CUDSS_DATA_USER_ELIMINATION_TREE, advanced users can also provide the elimination tree information computed outside of cuDSS. To this end, we give the formal definition below and an example.

The elimination tree is a binary tree with \(k\) = CUDSS_CONFIG_ND_NLEVELS levels. Thus the elimination tree has \(2^k - 1\) nodes, among them \(2^{k-1}\) leaf nodes and \(2^{k-1} - 1\) non-leaf nodes. The elimination tree integer array contains the so-called separator (and leaf node) sizes (this term is relevant for nested dissection reordering; they are also called sizes in the source code of the METIS library).

Each node of the elimination tree is associated with a contiguous subset of columns (rows) of the matrix, whose dependencies on other subsets of this kind are represented by the tree structure. Dependencies here mean dependencies through the sparsity structure of the matrix as it is processed during the LU (or Cholesky) factorization algorithm. The sizes of these subsets are the values in the elimination tree integer array.

More specifically, the integer array should satisfy the following properties (for these properties, we assume 0-based indexing; a worked example follows the list):

  • The array has \(2 ^ k - 1\) elements.

  • The root node is at index 0 and has a height of \(k-1\).

  • All leaf nodes have a height of \(0\).

  • Both child nodes of a parent with height \(h\) have a height of \(h-1\).

  • The left child of node \(i\) is at index \(i+1\).

  • The right child of node \(i\) with height \(h\) is at index \(i + 2^h\).

  • All entries must be non-negative integers corresponding to the size of the contiguous subset of columns (rows) of the matrix which corresponds to the separator (or leaf node).

  • Dependencies between the subsets should correspond to the tree structure. E.g., subsets for the leaf nodes must be mutually independent.
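
As an illustration (a hypothetical example, not derived from a specific matrix), consider \(k = 3\) and a matrix with \(n = 100\) rows; the array then has \(2^3 - 1 = 7\) entries:

```c
/* Hypothetical elimination tree array for k = 3, n = 100.
 * With 0-based indexing: the left child of node i is at i+1,
 * the right child of node i with height h is at i + 2^h.
 * Index:  0     1     2     3     4     5     6
 * Role:   root  sep   leaf  leaf  sep   leaf  leaf
 * Height: 2     1     0     0     1     0     0
 */
int etree[7] = {10, 5, 20, 20, 5, 20, 20}; /* entries sum to n = 100 */
```

Here the root separator covers 10 columns, the two level-1 separators cover 5 columns each, and each of the four leaf nodes covers 20 columns.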

Note: this feature is not supported for the reordering algorithms CUDSS_ALG_1 and CUDSS_ALG_2.

Numerical pivoting#

Numerical pivoting is one of the most commonly used ways to handle small values encountered on the diagonal during the factorization.

While most of the time the default pivoting strategy is sufficient and does not incur a significant accuracy drop or performance degradation, there are cases where the negative effects are more pronounced and thus understanding the existing non-default options may be useful.

There are two key components of numerical pivoting: the strategy for finding pivot elements and the strategy for handling small values which remain on the diagonal after the pivot elements are found.

The default strategy for finding pivot elements in cuDSS can be classified as local partial pivoting when the reordering algorithm is CUDSS_ALG_DEFAULT or CUDSS_ALG_3, and global pivoting when the reordering algorithm is CUDSS_ALG_1 or CUDSS_ALG_2.

Finding pivot elements#

Default strategy for finding pivot elements in cuDSS:

  • When reordering algorithm is CUDSS_ALG_DEFAULT or CUDSS_ALG_3 (local partial pivoting):

    Local partial pivoting means that cuDSS is searching for the pivot element within the diagonal sub-block of the relatively small set of columns (usually called a supernode).

    For general (non-symmetric) matrices, cuDSS does complete supernode pivoting, which means that it finds the maximum element in the entire diagonal sub-block.

    For symmetric/Hermitian indefinite (i.e., non positive-definite) matrices, cuDSS finds the maximum element within the diagonal sub-block.

    Note: for symmetric/Hermitian positive-definite matrices, cuDSS does not perform numerical pivoting since the matrix type passed by the user suggests that no pivoting is needed.

  • When reordering algorithm is CUDSS_ALG_1 or CUDSS_ALG_2 (global pivoting):

    These reordering algorithms create large diagonal sub-blocks and global pivoting is used to find the pivot element within the entire sub-matrix.

    Global pivoting means that cuDSS is searching for the pivot element within the entire column of the sub-matrix (which is yet to be factorized), independently of the matrix type.

Users can change the strategy for finding pivot elements via cudssConfigSet() with CUDSS_CONFIG_PIVOT_TYPE (a sketch follows this list):

  • Setting the configuration parameter to CUDSS_PIVOT_NONE will disable the search for pivot elements.

  • Setting the configuration parameter to CUDSS_PIVOT_COL (the default) or CUDSS_PIVOT_ROW defines whether the global pivoting is performed on columns or rows.
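
A minimal sketch (assuming solverConfig was created earlier):

```c
/* Disable the search for pivot elements */
cudssPivotType_t pivot_type = CUDSS_PIVOT_NONE;
cudssConfigSet(solverConfig, CUDSS_CONFIG_PIVOT_TYPE,
               &pivot_type, sizeof(pivot_type));
```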

Default strategy for handling small values on the diagonal in cuDSS#

If, after the pivot elements are found, the values on the diagonal of the sub-matrix being factorized are still too small, cuDSS will replace them with the pivoting epsilon.

By default, cuDSS compares the magnitude of the diagonal element with the pivoting epsilon and replaces it with an appropriately signed pivoting epsilon if the magnitude is smaller. In such a case, the pivot is called a perturbed pivot.

The standard ways to mitigate situations when accuracy drops below the accepted tolerance due to the appearance of perturbed pivots are discussed in the section.

Note: the choice of pivoting epsilon heavily depends on the application. The effect of replacing small values with the pivoting epsilon is more control over numerical stability (restricting the growth of round-off errors during the factorization process) at the cost of potentially lower accuracy.

Users can change the way the small pivots are handled by cuDSS, for example by setting a custom value of the pivoting epsilon via cudssConfigSet() with CUDSS_CONFIG_PIVOT_EPSILON.
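
A minimal sketch (assuming a double-precision matrix; the epsilon value below is purely illustrative and application-dependent):

```c
/* Set a custom pivoting epsilon */
double pivot_epsilon = 1e-12; /* illustrative value */
cudssConfigSet(solverConfig, CUDSS_CONFIG_PIVOT_EPSILON,
               &pivot_epsilon, sizeof(pivot_epsilon));
```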

Schur complement#

In this section we describe how the Schur complement matrix can be computed using cuDSS and how to solve the full system with it.

Let us consider the system of linear algebraic equations:

\[Ax = b,\]

where:

  • A is the sparse input matrix,

  • b is the (dense) right-hand side vector (or matrix),

  • x is the (dense) solution vector (or matrix).

Suppose that the system can be partitioned, potentially after some permutation of rows and columns, into a 2x2 block form:

\[\begin{split}\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}\end{split}\]

Then the Schur complement \(S\) for the block \(A_{22}\) is given by:

\[S = A_{22} - A_{21} A_{11}^{-1} A_{12}\]

First, to enable the Schur complement computation, users should call cudssConfigSet() with the CUDSS_CONFIG_SCHUR_MODE parameter name and a value of 1.
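
A minimal sketch (assuming solverConfig was created earlier):

```c
/* Enable Schur complement computation (must happen before the analysis phase) */
int schur_mode = 1;
cudssConfigSet(solverConfig, CUDSS_CONFIG_SCHUR_MODE,
               &schur_mode, sizeof(schur_mode));
```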

Note: Schur complement computation is not supported:

  • when user permutation is set, or matching is enabled;

  • for reordering algorithms CUDSS_ALG_1 or CUDSS_ALG_2 and for factorization algorithm CUDSS_ALG_1;

  • in combination with MGMN or multi-GPU mode;

  • for uniform and non-uniform batches.

When the Schur complement mode is enabled, users should define the indices of the rows and columns which correspond to the place of the Schur complement in the system matrix (in the block form above, it is the block \(A_{22}\)). This can be done via cudssDataSet() with the CUDSS_DATA_USER_SCHUR_INDICES parameter name. The data buffer for this call should be an integer array of size n, where n is the number of rows/columns of the system matrix. The values should be equal to 1 for the rows/columns which are part of the Schur complement, and 0 for the rest.
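
A minimal sketch which marks the last ns rows/columns as the Schur complement block (a host int array is assumed here; n, ns, handle, and solverData come from the surrounding application):

```c
#include <stdlib.h>

/* Mark the last ns rows/columns of the n-by-n system as the Schur block */
int *schur_indices = (int*) calloc(n, sizeof(int));
for (int i = n - ns; i < n; i++)
    schur_indices[i] = 1;
cudssDataSet(handle, solverData, CUDSS_DATA_USER_SCHUR_INDICES,
             schur_indices, n * sizeof(int));
```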

Note: the Schur complement mode must be enabled and the corresponding indices must be set before the analysis phase.

After the symbolic factorization, users can call cudssDataGet() with the CUDSS_DATA_SCHUR_SHAPE parameter name to get the shape of the Schur complement matrix. The shape is returned as an integer array of size 3, where the first two elements are the numbers of rows and columns of the Schur complement matrix, and the third element is the number of nonzero values in the Schur complement matrix (which is only relevant for exporting the Schur complement matrix in CSR format).
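
A minimal sketch of the query (the int64_t element type is an assumption):

```c
/* Query {nrows, ncols, nnz} of the Schur complement after symbolic factorization */
int64_t schur_shape[3] = {0, 0, 0}; /* element type assumed to be int64_t */
size_t size_written = 0;
cudssDataGet(handle, solverData, CUDSS_DATA_SCHUR_SHAPE,
             schur_shape, sizeof(schur_shape), &size_written);
```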

Note: in the case where the Schur complement matrix is exported as a dense matrix, the shape is trivial (the number of rows/columns is the same as the number of nonzero values in the Schur complement index vector) and one can skip querying CUDSS_DATA_SCHUR_SHAPE.

Once the shape of the Schur complement matrix is known, users can allocate memory buffers for the Schur complement matrix and create an object of type cudssMatrix_t with the corresponding shape and allocated buffers.

The target format of the Schur complement matrix is determined by the matrix format of the user-created cudssMatrix_t object.

After the Schur complement indices are set, users can call cudssExecute() to perform reordering, symbolic factorization and numerical factorization.

Note: When the Schur complement mode is enabled, the numerical factorization will not be complete. The part which corresponds to the Schur complement will be only updated but not factorized.

After the factorization, users can call cudssDataGet() with the CUDSS_DATA_SCHUR_MATRIX parameter name to retrieve the Schur complement matrix.

Solving the Schur complement system or the full system#

In case the Schur complement system needs to be solved, one can continue the previously described workflow with the next steps.

In the previously introduced notation, the system \(Ax = b\) is essentially replaced by the equivalent system:

\[\begin{split}\begin{matrix} S x_2 = (A_{22} - A_{21} A_{11}^{-1} A_{12}) x_2 = b_2 - A_{21} A_{11}^{-1} b_1 \\ A_{11} x_1 = b_1 - A_{12} x_2 \end{matrix}\end{split}\]

First, one might need (depending on the application) to form the condensed right-hand side \(b_2 - A_{21} A_{11}^{-1} b_1\), corresponding to the Schur complement, from the original system’s right-hand side.

In order to do that, one needs to call cudssExecute() with the CUDSS_PHASE_SOLVE_FWD_PERM | CUDSS_PHASE_SOLVE_FWD | CUDSS_PHASE_SOLVE_DIAG phase to perform the partial forward (and diagonal, if the matrix type is symmetric or Hermitian, since for these matrix types cuDSS performs an \(LDL^T(H)\) factorization) solve up to the Schur complement (similar to how the factorization is stopped early). Note that, in order to make cuDSS stop at the right place, it is important to have the Schur complement mode enabled in the configuration object.
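
A minimal sketch of the partial forward solve (A, x, and b are cudssMatrix_t objects from the surrounding application):

```c
/* Partial forward (and diagonal) solve, stopping at the Schur complement */
cudssExecute(handle,
             CUDSS_PHASE_SOLVE_FWD_PERM | CUDSS_PHASE_SOLVE_FWD |
             CUDSS_PHASE_SOLVE_DIAG,
             solverConfig, solverData, A, x, b);
```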

After the partial forward solve, the last \(n_{s}\) elements of the right-hand side will comprise the condensed right-hand side for the Schur complement system. Here \(n_{s}\) is the number of rows/columns of the Schur complement matrix.

Once the right-hand side for the Schur complement system is formed, one can solve the Schur complement system with an external solver.

If the Schur complement is sparse, one can use cuDSS again (treating it as a new independent system). If it is dense, one can use a dense solver, e.g., from the cuSOLVER library.

In case the user wants to get the solution of the full (original) system, one needs to perform a partial backward solve. To do this, the solution of the Schur complement system should be put into the last \(n_{s}\) elements of the solution to the partial forward solve from the previous step.

Then the user should call cudssExecute() with the CUDSS_PHASE_SOLVE_BWD | CUDSS_PHASE_SOLVE_BWD_PERM phase to perform the partial backward solve (the step \(A_{11} x_1 = b_1 - A_{12} x_2\)).
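
And a sketch of the partial backward solve (same objects as above, with the Schur system solution already written into the last \(n_{s}\) entries of x):

```c
/* Partial backward solve: recovers x_1 from A_11 x_1 = b_1 - A_12 x_2 */
cudssExecute(handle,
             CUDSS_PHASE_SOLVE_BWD | CUDSS_PHASE_SOLVE_BWD_PERM,
             solverConfig, solverData, A, x, b);
```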

A full code example which demonstrates the described workflow can be found in the Schur complement sample.