cuDSS Advanced Features#

Hybrid host/device memory mode#

The main idea of the hybrid host/device memory mode is to overcome a limitation of the default (non-hybrid) mode: the L and U factors (which comprise the largest part of the total device memory consumed by cuDSS) must fit into the device memory. To this end, the hybrid memory mode keeps only a (smaller) portion of L and U in the device memory at any given time, while the entire factors are kept in the host memory.

While the hybrid memory mode comes with the additional cost of extra host-to-device and device-to-host memory transfers (and thus will be slower than the default mode if the factors fit into the device memory), it can help solve larger systems which cannot be processed in the default mode.

Users can enable or disable the hybrid memory mode by calling cudssConfigSet() with the setting CUDSS_CONFIG_HYBRID_MODE. Since using the hybrid mode implies changes during the analysis phase, it must be enabled before the call to cudssExecute() with CUDSS_PHASE_ANALYSIS.
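
For example, enabling the mode might look like the following minimal sketch (assuming a previously created cudssConfig_t object named solverConfig; the variable names are illustrative):

```c
/* Enable hybrid host/device memory mode (must happen before the analysis phase) */
int hybrid_mode = 1;
cudssConfigSet(solverConfig, CUDSS_CONFIG_HYBRID_MODE,
               &hybrid_mode, sizeof(hybrid_mode));
```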

Once the hybrid mode is enabled, there are two ways of using it:

  • The first way is to rely on the internal heuristic of cuDSS, which determines how much device memory the hybrid mode can consume. In this case, cuDSS assumes that it can use the entire GPU memory and sets the device memory limit based on the device properties (which may not be an accurate estimate, as the driver or other applications can reserve space on the GPU).

  • The second way gives users more control over the device memory consumption. Users can set the device memory limit by calling cudssConfigSet() with CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT. Optionally, users can also query the minimal amount of device memory which the hybrid memory mode would need by calling cudssDataGet() with CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN (see the sketch after this list). The device memory limit can be set after the analysis phase but must be set before the factorization phase.

    Note: CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT and CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN account for the total device memory needed for cuDSS, not just the factors.
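
A minimal sketch of the second way (assuming handle, solverData, and solverConfig were created earlier; the int64_t value type is an assumption):

```c
/* Query the minimal device memory needed by the hybrid memory mode
   (valid after the analysis phase) */
int64_t min_device_mem = 0; /* value type assumed to be int64_t */
size_t size_written = 0;
cudssDataGet(handle, solverData, CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN,
             &min_device_mem, sizeof(min_device_mem), &size_written);

/* Set the device memory limit (must happen before the factorization phase) */
int64_t device_mem_limit = min_device_mem; /* or any larger budget */
cudssConfigSet(solverConfig, CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT,
               &device_mem_limit, sizeof(device_mem_limit));
```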

Limitations:

  • Factors L and U (together with all necessary internal arrays) must fit into the host memory.

  • Hybrid memory mode must be enabled before the analysis phase.

  • Currently, hybrid memory mode adds extra synchronization of the CUDA stream, in all phases.

  • By default, hybrid memory mode uses cudaHostRegister()/cudaHostUnregister() (if the device supports it). As this can sometimes be slower than not using host-registered memory, there is a setting CUDSS_CONFIG_USE_CUDA_REGISTER_MEMORY to enable/disable the usage of cudaHostRegister().

  • Currently, hybrid memory mode is not supported when CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering, or, when CUDSS_ALG_1 is used for the factorization.

Hybrid host/device execute mode#

Hybrid execute mode allows cuDSS to perform calculations on both the GPU and the CPU. Currently it is used to speed up parts of the execution with low parallelization capacity. For now, we recommend this feature for the factorization and solve of small matrices.

Users can enable or disable the hybrid execute mode by calling cudssConfigSet() with the setting CUDSS_CONFIG_HYBRID_EXECUTE_MODE. Since using the hybrid execute mode implies changes during the analysis phase, it must be enabled before the call to cudssExecute() with CUDSS_PHASE_ANALYSIS.

Once the hybrid execute mode is enabled, the input matrix, right-hand side, and solution can be host memory pointers. For the input CSR matrix, csr_offsets, csr_columns, and csr_values can be host or device memory pointers independently. For example, csr_offsets and csr_columns can be in device memory while csr_values is in host memory.
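
A minimal sketch of enabling the mode (variable names are illustrative; solverConfig is assumed to be created earlier):

```c
/* Enable hybrid host/device execute mode (must happen before the analysis phase) */
int hybrid_execute_mode = 1;
cudssConfigSet(solverConfig, CUDSS_CONFIG_HYBRID_EXECUTE_MODE,
               &hybrid_execute_mode, sizeof(hybrid_execute_mode));
/* After this, the csr_offsets/csr_columns/csr_values buffers passed to
   cudssMatrixCreateCsr() may independently be host or device pointers */
```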

Limitations:

  • Hybrid execute mode must be enabled before the analysis phase.

  • Currently, hybrid execute mode adds extra synchronization of the CUDA stream in all phases.

  • Currently, hybrid execute mode is not supported when CUDSS_CONFIG_HYBRID_MODE or MGMN mode is used, when batchCount is greater than 1, or when CUDSS_CONFIG_REORDERING_ALG is set to CUDSS_ALG_1 or CUDSS_ALG_2. It is also not supported when nrhs is greater than 1 or when uniform batch options are enabled.

  • The system’s input matrix should not change between the end of the (RE)FACTORIZATION phase and the start of the corresponding SOLVE phase call(s). It might still work, depending on whether the matrix arrays were on the host or device and where the solve phase computations happen, but there are no guarantees.

Multi-GPU multi-node (MGMN) mode#

The idea of the multi-GPU multi-node (MGMN) mode is to allow using multiple GPU devices in cuDSS by means of distributed computations with multiple processes. For flexibility, the MGMN mode is enabled by abstracting away all communication-specific primitives into a small separately built shim communication layer. Users can have their own implementation of the communication layer with the communication backend of their choice (MPI, NCCL, etc.).

Since the shim communication layer abstracts away all communication-specific operations and is only loaded at runtime when MGMN mode is requested, support for MGMN execution in cuDSS does not require any changes in applications which do not make use of the MGMN mode.

MGMN mode supports 1D row-wise distribution (with overlapping) for the input CSR matrix as well as for the dense right-hand side and solution (see cudssMatrixSetDistributionRow1d()).

Enabling MGMN execution in the user application code consists of two steps:

  • First, users should set the communication layer library by calling cudssSetCommLayer(). This can be either one of the prebuilt communication layers from the cuDSS package or a custom user-built library. If cudssSetCommLayer() is called with NULL in place of the communication layer library name, the routine attempts to read the name from the environment variable CUDSS_COMM_LIB.

    Note: the communication layer library is set for the library handle and is used for all execution calls which involve the modified handle.

  • Second, users should set the communicator to be used by the cuDSS MGMN mode by calling cudssDataSet() with the CUDSS_DATA_COMM parameter name. The type of the communicator must match the underlying communication backend; e.g., if OpenMPI is used, the communicator should be an OpenMPI communicator, otherwise a crash will likely occur. A sketch of both steps follows this list.

    Note: since MGMN mode can support different (incl. user-implemented) communication layers, the limitations of using specific underlying communication backend (e.g. MPI or NCCL) apply.

    Note: all processes which participate in solving the system via cudssExecute() must have the same settings in the corresponding cudssConfig_t.
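
A minimal sketch of the two steps with the prebuilt OpenMPI layer (assuming MPI has been initialized and handle/solverData were created earlier; the library path is illustrative):

```c
#include <mpi.h>

/* Step 1: set the communication layer library on the handle */
cudssSetCommLayer(handle, "libcudss_commlayer_openmpi.so");

/* Step 2: pass the MPI communicator to cuDSS */
MPI_Comm comm = MPI_COMM_WORLD;
cudssDataSet(handle, solverData, CUDSS_DATA_COMM, &comm, sizeof(comm));
```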

Limitations:

  • The communication backend underlying the cuDSS communication layer must be GPU-aware in the sense that all distributed interface API implementations must accept device memory buffers and respect the stream ordering (most of the communication layer APIs take an argument of type cudaStream_t; see the definition of cudssDistributedInterface_t).

  • In MGMN mode all processes must have correct matrix shapes.

    In the distributed case (cudssMatrixSetDistributionRow1d()), all processes must have global matrix shapes, while the data buffers hold the local matrices.

  • MGMN mode is not supported when either CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.

  • MGMN mode does not support matrix batches.

  • All phases in MGMN mode are synchronous.

Communication layer library in cuDSS#

The purpose of the communication layer in cuDSS is to abstract away all communication primitives into a small set of only the necessary operations, built into a standalone shared library which is loaded at runtime when MGMN mode is enabled for cuDSS.

On platforms where MGMN mode is supported, distribution packages for cuDSS include the prebuilt communication libraries libcudss_commlayer_openmpi.so and libcudss_commlayer_nccl.so for OpenMPI and NCCL, respectively. Also included are the source code of the implementations, cudss_commlayer_openmpi.cu and cudss_commlayer_nccl.cu, and a script cudss_build_commlayer.sh with an example of how a communication layer library can be built. These source files and the script are provided for demonstration purposes and can be used as guidance for developing custom implementations.

Once a communication layer is implemented and built into a small standalone shared library, the application should enable MGMN mode for cuDSS via the steps mentioned above.

Note: if the communication layer depends on other shared libraries, such dependencies must be available at link time or runtime (depending on the communication layer implementation). For example, the prebuilt OpenMPI layer libcudss_commlayer_openmpi.so requires libmpi.so.40 to be found at link time, so an application should additionally be linked against the OpenMPI shared library if it uses this communication layer for cuDSS.

Communication layer (distributed interface) API in cuDSS#

The communication layer in cuDSS can be thought of as a wrapper around all necessary communication primitives required by the cuDSS MGMN mode. While user-defined implementations are supported, the distributed interface API is fixed. The API is contained in the dedicated header cudss_distributed_interface.h.

Specifically, the distributed interface API is encoded in the definition of the type cudssDistributedInterface_t, and a valid communication layer implementation must define a symbol named cudssDistributedInterface of this type. If such a symbol is not found, calling cudssSetCommLayer() will result in an error.

Multi-GPU (MG) mode#

The idea of the multi-GPU (MG) mode is to allow using multiple GPU devices within the same node in cuDSS without any communication layer.

Enabling MG execution in the user application code requires calling cudssCreateMg() and cudssConfigSet() with CUDSS_CONFIG_DEVICE_COUNT and CUDSS_CONFIG_DEVICE_INDICES.
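
A minimal sketch of the configuration part (the int value types for these parameters are an assumption; see the cuDSS API reference for the exact types, and note that the handle itself is created with cudssCreateMg() instead of cudssCreate()):

```c
/* Select two GPUs for MG mode (value types assumed to be int / int array) */
int device_count = 2;
int device_indices[2] = {0, 1};
cudssConfigSet(solverConfig, CUDSS_CONFIG_DEVICE_COUNT,
               &device_count, sizeof(device_count));
cudssConfigSet(solverConfig, CUDSS_CONFIG_DEVICE_INDICES,
               device_indices, sizeof(device_indices));
```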

Limitations:

  • Using MG mode jointly with MGMN mode is not supported.

  • Distributed input (cudssMatrixSetDistributionRow1d()) is not supported.

  • MG mode is not supported when either CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.

  • MG mode does not support matrix batches.

  • All phases in MG mode are synchronous.

Multi-Threaded (MT) mode#

The idea of the multi-threaded mode is to allow using multiple CPU threads in cuDSS. For flexibility, the MT mode is enabled by abstracting away all threading-specific primitives into a small separately built shim threading layer. Users can have their own implementation of the threading layer with the threading backend of their choice (OpenMP, pthreads, TBB, etc.).

Enabling MT execution in the user application code:

Users should set the threading layer library by calling cudssSetThreadingLayer(). This can be either one of the prebuilt threading layers from the cuDSS package or a custom user-built library. If cudssSetThreadingLayer() is called with NULL in place of the threading layer library name, the routine attempts to read the name from the environment variable CUDSS_THREADING_LIB.

Note: the threading layer library is set for the library handle and is used for all execution calls which involve the modified handle.

Note: since MT mode can support different (incl. user-implemented) threading layers, the limitations of using specific underlying threading backend (e.g. OpenMP or pthreads) apply.
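
A minimal sketch with the prebuilt GNU OpenMP layer (the library path is illustrative; handle is assumed to be created earlier):

```c
/* Set the threading layer on the handle; passing NULL instead of the file name
   would make cuDSS read it from the CUDSS_THREADING_LIB environment variable */
cudssSetThreadingLayer(handle, "libcudss_mtlayer_gomp.so");
```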

Threading layer library in cuDSS#

The purpose of the threading layer in cuDSS is to abstract away all multi-threading primitives into a small set of only the necessary operations, built into a standalone shared library which is loaded at runtime when MT mode is enabled for cuDSS.

On platforms where MT mode is supported, distribution packages for cuDSS include the prebuilt threading library libcudss_mtlayer_gomp.so for GNU OpenMP. Also included are the source code of the implementation, cudss_mtlayer_gomp.cu, and a script cudss_build_mtlayer.sh with an example of how a threading layer library can be built. These source files and the script are provided for demonstration purposes and can be used as guidance for developing custom implementations.

Once a threading layer is implemented and built into a small standalone shared library, the application should enable MT mode for cuDSS via the steps mentioned above.

Note: if the threading layer depends on other shared libraries, such dependencies must be available at runtime (depending on the threading layer implementation).

Threading layer API in cuDSS#

The threading layer in cuDSS can be thought of as a wrapper around all necessary threading primitives required by the cuDSS MT mode. While user-defined implementations are supported, the threading interface API is fixed. The API is contained in the dedicated header cudss_threading_interface.h.

Specifically, the threading interface API is encoded in the definition of the type cudssThreadingInterface_t, and a valid threading layer implementation must define a symbol named cudssThreadingInterface of this type. If such a symbol is not found, calling cudssSetThreadingLayer() will result in an error.

User-provided elimination tree data (saving reordering)#

One way of saving the results of the reordering phase (which can be time-consuming, as it is executed on the host) is to extract the resulting reordering permutation by calling cudssDataGet() with the CUDSS_DATA_PERM_REORDER_ROW parameter name and then supply it to future calls to cudssExecute() with CUDSS_PHASE_REORDERING by calling cudssDataSet() with the CUDSS_DATA_USER_PERM parameter.
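
A minimal sketch of this save/reuse pattern (assuming an n-by-n system with int index arrays; handle, solverData, and a later newSolverData come from the surrounding application):

```c
#include <stdlib.h>

/* Save the reordering permutation after the reordering/analysis phase */
int *perm = (int*) malloc(n * sizeof(int));
size_t size_written = 0;
cudssDataGet(handle, solverData, CUDSS_DATA_PERM_REORDER_ROW,
             perm, n * sizeof(int), &size_written);

/* ... later, reuse it for a new data object before CUDSS_PHASE_REORDERING */
cudssDataSet(handle, newSolverData, CUDSS_DATA_USER_PERM,
             perm, n * sizeof(int));
```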

However, the cuDSS factorization and solve phases in subsequent calls can be significantly slower than in the first call, as the auxiliary elimination tree structure information is lost when only the permutation vector is reused.

To restore the full performance, there is an option to save the aforementioned elimination tree data by calling cudssDataGet() with CUDSS_DATA_ELIMINATION_TREE, which can then be passed to subsequent calls to cudssDataSet() with CUDSS_DATA_USER_ELIMINATION_TREE before the CUDSS_PHASE_REORDERING phase [currently not supported].

While this is the recommended way to use CUDSS_DATA_ELIMINATION_TREE and CUDSS_DATA_USER_ELIMINATION_TREE, advanced users can also provide the elimination tree information computed outside of cuDSS. To this end, we give the formal definition below and an example.

The elimination tree is a binary tree with \(k\) = CUDSS_CONFIG_ND_NLEVELS levels. Thus the elimination tree has \(2^k - 1\) nodes, among them \(2^{k-1}\) leaf nodes and \(2^{k-1} - 1\) non-leaf nodes. The elimination tree integer array contains the so-called separator (and leaf node) sizes (this term is relevant for nested dissection reordering; they are also called sizes in the source code of the METIS library).

Each node of the elimination tree is associated with a contiguous subset of columns (rows) of the matrix, whose dependencies on other subsets of this kind are represented by the tree structure. Dependencies here mean dependencies through the sparsity structure of the matrix as it is processed during the LU (or Cholesky) factorization algorithm. The sizes of these subsets are the values in the elimination tree integer array.

More specifically, the integer array should satisfy the following properties (for these properties, we assume 0-based indexing; a worked example follows the list):

  • The array has \(2 ^ k - 1\) elements.

  • The root node is at index 0 and has a height of \(k-1\).

  • All leaf nodes have a height of \(0\).

  • Both child nodes of a parent with height \(h\) have a height of \(h-1\).

  • The left child of node \(i\) is at index \(i+1\).

  • The right child of node \(i\) with height \(h\) is at index \(i + 2^h\).

  • All entries must be non-negative integers corresponding to the size of the contiguous subset of columns (rows) of the matrix which corresponds to the separator (or leaf node).

  • Dependencies between the subsets should correspond to the tree structure. E.g., subsets for the leaf nodes must be mutually independent.
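
As an illustration (a hypothetical example, not derived from a specific matrix), consider \(k = 3\) and a matrix with \(n = 100\) rows; the array then has \(2^3 - 1 = 7\) entries:

```c
/* Hypothetical elimination tree array for k = 3, n = 100.
 * With 0-based indexing: the left child of node i is at i+1,
 * the right child of node i with height h is at i + 2^h.
 * Index:  0     1     2     3     4     5     6
 * Role:   root  sep   leaf  leaf  sep   leaf  leaf
 * Height: 2     1     0     0     1     0     0
 */
int etree[7] = {10, 5, 20, 20, 5, 20, 20}; /* entries sum to n = 100 */
```

Here the root separator covers 10 columns, the two level-1 separators cover 5 columns each, and each of the four leaf nodes covers 20 columns.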

Note: this feature is not supported for the reordering algorithms CUDSS_ALG_1 and CUDSS_ALG_2.

Numerical pivoting#

Numerical pivoting is one of the most commonly used ways to handle small values encountered on the diagonal during the factorization.

While most of the time the default pivoting strategy is sufficient and does not incur a significant accuracy drop or performance degradation, there are cases where the negative effects are more pronounced and thus understanding the existing non-default options may be useful.

There are two key components of numerical pivoting: the strategy for finding pivot elements and the strategy for handling small values which remain on the diagonal after the pivot elements are found.

The default strategy for finding pivot elements in cuDSS can be classified as local partial pivoting when the reordering algorithm is CUDSS_ALG_DEFAULT or CUDSS_ALG_3, and global pivoting when the reordering algorithm is CUDSS_ALG_1 or CUDSS_ALG_2.

Finding pivot elements#

Default strategy for finding pivot elements in cuDSS:

  • When reordering algorithm is CUDSS_ALG_DEFAULT or CUDSS_ALG_3 (local partial pivoting):

    Local partial pivoting means that cuDSS is searching for the pivot element within the diagonal sub-block of the relatively small set of columns (usually called a supernode).

    For general (non-symmetric) matrices, cuDSS does complete supernode pivoting, which means that it finds the maximum element in the entire diagonal sub-block.

    For symmetric/Hermitian indefinite (i.e., non positive-definite) matrices, cuDSS finds the maximum element within the diagonal sub-block.

    Note: for symmetric/Hermitian positive-definite matrices, cuDSS does not perform numerical pivoting since the matrix type passed by the user suggests that no pivoting is needed.

  • When reordering algorithm is CUDSS_ALG_1 or CUDSS_ALG_2 (global pivoting):

    These reordering algorithms create large diagonal sub-blocks and global pivoting is used to find the pivot element within the entire sub-matrix.

    Global pivoting means that cuDSS is searching for the pivot element within the entire column of the sub-matrix (which is yet to be factorized), independently of the matrix type.

Users can change the strategy for finding pivot elements via cudssConfigSet() with CUDSS_CONFIG_PIVOT_TYPE (a sketch follows this list):

  • Setting the configuration parameter to CUDSS_PIVOT_NONE will disable the search for pivot elements.

  • Setting the configuration parameter to CUDSS_PIVOT_COL (the default) or CUDSS_PIVOT_ROW defines whether the global pivoting is performed on columns or rows.
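
A minimal sketch (assuming solverConfig was created earlier):

```c
/* Disable the search for pivot elements */
cudssPivotType_t pivot_type = CUDSS_PIVOT_NONE;
cudssConfigSet(solverConfig, CUDSS_CONFIG_PIVOT_TYPE,
               &pivot_type, sizeof(pivot_type));
```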

Default strategy for handling small values on the diagonal in cuDSS#

If, after the pivot elements are found, the values on the diagonal of the sub-matrix being factorized are still too small, cuDSS will replace them with the pivoting epsilon.

By default, cuDSS compares the magnitude of the diagonal element with the pivoting epsilon and replaces it with an appropriately signed pivoting epsilon if the magnitude is smaller. In such a case, the pivot is called a perturbed pivot.

The standard ways to mitigate situations when accuracy drops below the accepted tolerance due to the appearance of perturbed pivots are discussed in the section.

Note: the choice of pivoting epsilon heavily depends on the application. The effect of replacing small values with the pivoting epsilon is more control over numerical stability (restricting the growth of round-off errors during the factorization process) at the cost of potentially lower accuracy.

Users can change the way the small pivots are handled by cuDSS, for example by setting a custom value of the pivoting epsilon via cudssConfigSet() with CUDSS_CONFIG_PIVOT_EPSILON.
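
A minimal sketch (assuming a double-precision matrix; the epsilon value below is purely illustrative and application-dependent):

```c
/* Set a custom pivoting epsilon */
double pivot_epsilon = 1e-12; /* illustrative value */
cudssConfigSet(solverConfig, CUDSS_CONFIG_PIVOT_EPSILON,
               &pivot_epsilon, sizeof(pivot_epsilon));
```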

Schur complement#

In this section we describe how the Schur complement matrix can be computed using cuDSS and how to solve the full system with it.

Let us consider the system of linear algebraic equations:

\[Ax = b,\]

where:

  • A is the sparse input matrix,

  • b is the (dense) right-hand side vector (or matrix),

  • x is the (dense) solution vector (or matrix).

Suppose that the system can be partitioned, potentially after some permutation of rows and columns, into a 2x2 block form:

\[\begin{split}\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}\end{split}\]

Then the Schur complement \(S\) for the block \(A_{22}\) is given by:

\[S = A_{22} - A_{21} A_{11}^{-1} A_{12}\]

First, to enable the Schur complement computation, users should call cudssConfigSet() with the CUDSS_CONFIG_SCHUR_MODE parameter name and a value of 1.
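
A minimal sketch (assuming solverConfig was created earlier):

```c
/* Enable Schur complement computation (must happen before the analysis phase) */
int schur_mode = 1;
cudssConfigSet(solverConfig, CUDSS_CONFIG_SCHUR_MODE,
               &schur_mode, sizeof(schur_mode));
```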

Note: Schur complement computation is not supported:

  • when user permutation is set, or matching is enabled;

  • for reordering algorithms CUDSS_ALG_1 or CUDSS_ALG_2 and for factorization algorithm CUDSS_ALG_1;

  • in combination with MGMN or multi-GPU mode;

  • for uniform and non-uniform batches.

When the Schur complement mode is enabled, users should define the indices of the rows and columns which correspond to the place of the Schur complement in the system matrix (in the block form above, it is the block \(A_{22}\)). This can be done via cudssDataSet() with the CUDSS_DATA_USER_SCHUR_INDICES parameter name. The data buffer for this call should be an integer array of size n, where n is the number of rows/columns of the system matrix. The values should be equal to 1 for the rows/columns which are part of the Schur complement, and 0 for the rest.
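
A minimal sketch which marks the last ns rows/columns as the Schur complement block (a host int array is assumed here; n, ns, handle, and solverData come from the surrounding application):

```c
#include <stdlib.h>

/* Mark the last ns rows/columns of the n-by-n system as the Schur block */
int *schur_indices = (int*) calloc(n, sizeof(int));
for (int i = n - ns; i < n; i++)
    schur_indices[i] = 1;
cudssDataSet(handle, solverData, CUDSS_DATA_USER_SCHUR_INDICES,
             schur_indices, n * sizeof(int));
```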

Note: the Schur complement mode must be enabled and the corresponding indices must be set before the analysis phase.

After the symbolic factorization, users can call cudssDataGet() with the CUDSS_DATA_SCHUR_SHAPE parameter name to get the shape of the Schur complement matrix. The shape is returned as an integer array of size 3, where the first two elements are the numbers of rows and columns of the Schur complement matrix, and the third element is the number of nonzero values in the Schur complement matrix (which is only relevant for exporting the Schur complement matrix in CSR format).
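
A minimal sketch of the query (the int64_t element type is an assumption):

```c
/* Query {nrows, ncols, nnz} of the Schur complement after symbolic factorization */
int64_t schur_shape[3] = {0, 0, 0}; /* element type assumed to be int64_t */
size_t size_written = 0;
cudssDataGet(handle, solverData, CUDSS_DATA_SCHUR_SHAPE,
             schur_shape, sizeof(schur_shape), &size_written);
```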

Note: in the case where the Schur complement matrix is exported as a dense matrix, the shape is trivial (the number of rows/columns is the same as the number of nonzero values in the Schur complement index vector) and one can skip querying CUDSS_DATA_SCHUR_SHAPE.

Once the shape of the Schur complement matrix is known, users can allocate memory buffers for the Schur complement matrix and create an object of type cudssMatrix_t with the corresponding shape and allocated buffers.

The target format of the Schur complement matrix is determined by the matrix format of the user-created cudssMatrix_t object.

After the Schur complement indices are set, users can call cudssExecute() to perform reordering, symbolic factorization and numerical factorization.

Note: When the Schur complement mode is enabled, the numerical factorization will not be complete. The part which corresponds to the Schur complement will be only updated but not factorized.

After the factorization, users can call cudssDataGet() with the CUDSS_DATA_SCHUR_MATRIX parameter name to retrieve the Schur complement matrix.

Solving the Schur complement system or the full system#

In case the Schur complement system needs to be solved, one can continue the previously described workflow with the next steps.

In the previously introduced notation, the system \(Ax = b\) is essentially replaced by the equivalent system:

\[\begin{split}\begin{matrix} S x_2 = (A_{22} - A_{21} A_{11}^{-1} A_{12}) x_2 = b_2 - A_{21} A_{11}^{-1} b_1 \\ A_{11} x_1 = b_1 - A_{12} x_2 \end{matrix}\end{split}\]

First, one might need (depending on the application) to form the condensed right-hand side \(b_2 - A_{21} A_{11}^{-1} b_1\), corresponding to the Schur complement, from the original system’s right-hand side.

In order to do that, one needs to call cudssExecute() with the CUDSS_PHASE_SOLVE_FWD_PERM | CUDSS_PHASE_SOLVE_FWD | CUDSS_PHASE_SOLVE_DIAG phase to perform the partial forward (and diagonal, if the matrix type is symmetric or Hermitian, since for these matrix types cuDSS performs an \(LDL^T(H)\) factorization) solve up to the Schur complement (similar to how the factorization is stopped early). Note that, in order to make cuDSS stop at the right place, it is important to have the Schur complement mode enabled in the configuration object.
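
A minimal sketch of the partial forward solve (A, x, and b are cudssMatrix_t objects from the surrounding application):

```c
/* Partial forward (and diagonal) solve, stopping at the Schur complement */
cudssExecute(handle,
             CUDSS_PHASE_SOLVE_FWD_PERM | CUDSS_PHASE_SOLVE_FWD |
             CUDSS_PHASE_SOLVE_DIAG,
             solverConfig, solverData, A, x, b);
```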

After the partial forward solve, the last \(n_{s}\) elements of the right-hand side will comprise the condensed right-hand side for the Schur complement system. Here \(n_{s}\) is the number of rows/columns of the Schur complement matrix.

Once the right-hand side for the Schur complement system is formed, one can solve the Schur complement system with an external solver.

If the Schur complement is sparse, one can use cuDSS again (treating it as a new independent system). If it is dense, one can use a dense solver, e.g., from the cuSOLVER library.

In case the user wants to get the solution of the full (original) system, one needs to perform a partial backward solve. To do this, the solution of the Schur complement system should be put into the last \(n_{s}\) elements of the solution to the partial forward solve from the previous step.

Then the user should call cudssExecute() with the CUDSS_PHASE_SOLVE_BWD | CUDSS_PHASE_SOLVE_BWD_PERM phase to perform the partial backward solve (the step \(A_{11} x_1 = b_1 - A_{12} x_2\)).
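
And a sketch of the partial backward solve (same objects as above, with the Schur system solution already written into the last \(n_{s}\) entries of x):

```c
/* Partial backward solve: recovers x_1 from A_11 x_1 = b_1 - A_12 x_2 */
cudssExecute(handle,
             CUDSS_PHASE_SOLVE_BWD | CUDSS_PHASE_SOLVE_BWD_PERM,
             solverConfig, solverData, A, x, b);
```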

A full code example which demonstrates the described workflow can be found in the Schur complement sample.