cuDSS Advanced Features

Hybrid host/device memory mode

The main idea of the hybrid host/device memory mode is to overcome a limitation of the default (non-hybrid) mode: the L and U factors (which make up the largest part of the total device memory consumed by cuDSS) must fit into device memory. For this purpose, the hybrid memory mode keeps only a (smaller) portion of L and U in device memory at any given time, while the full factors are kept in host memory.

While the hybrid memory mode comes with the additional cost of extra host-to-device and device-to-host memory transfers (and is therefore slower than the default mode when the factors fit into device memory), it can help solve larger systems which cannot be processed in the default mode.

Users can enable or disable the hybrid memory mode by calling cudssConfigSet() with the setting CUDSS_CONFIG_HYBRID_MODE from the enum cudssConfigParam_t. Since the hybrid mode changes what is done during the analysis phase, it must be enabled before the call to cudssExecute() with CUDSS_PHASE_ANALYSIS.
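
For illustration, a minimal sketch of enabling the hybrid memory mode before the analysis phase could look as follows (matrix/data creation and error checking are omitted; the int value type and the 0/1 encoding of the setting are assumptions):

    #include <cudss.h>

    /* Sketch: enable hybrid host/device memory mode on an existing solver
       configuration, then run the analysis phase. */
    void run_analysis_with_hybrid_mode(cudssHandle_t handle, cudssConfig_t config,
                                       cudssData_t data, cudssMatrix_t A,
                                       cudssMatrix_t x, cudssMatrix_t b)
    {
        int hybrid_mode = 1; /* assumed encoding: 1 = enable, 0 = disable */
        cudssConfigSet(config, CUDSS_CONFIG_HYBRID_MODE,
                       &hybrid_mode, sizeof(hybrid_mode));

        /* Hybrid mode must be enabled before this call. */
        cudssExecute(handle, CUDSS_PHASE_ANALYSIS, config, data, A, x, b);
    }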

Once the hybrid mode is enabled, there are two ways of using it:

  • The first way is to rely on the internal heuristic of cuDSS, which determines how much device memory the hybrid mode may consume. In this case cuDSS assumes that it can use the entire GPU memory and sets the device memory limit based on the device properties (which may not be an accurate estimate, since the driver or other applications can reserve space on the GPU).

  • The second way gives users more control over the device memory consumption. Users can set the device memory limit by calling cudssConfigSet() with CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT from the enum cudssConfigParam_t. Optionally, users can also query the minimal amount of device memory which the hybrid memory mode needs by calling cudssDataGet() with CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN from the enum cudssDataParam_t. The device memory limit can be set after the analysis phase but must be set before the factorization phase (see the sketch after this list).

    Note: CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT and CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN account for the total device memory needed for cuDSS, not just the factors.
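
A minimal sketch of the second way, assuming the memory amounts are expressed in bytes and passed as int64_t values (both assumptions), and called between the analysis and factorization phases:

    #include <stdint.h>
    #include <cudss.h>

    /* Sketch: after the analysis phase, query the minimal device memory the
       hybrid mode needs and set the limit before the factorization phase. */
    void set_hybrid_device_memory_limit(cudssHandle_t handle,
                                        cudssConfig_t config, cudssData_t data)
    {
        /* Optional query of the minimal required device memory. */
        int64_t min_bytes = 0;
        size_t size_written = 0;
        cudssDataGet(handle, data, CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN,
                     &min_bytes, sizeof(min_bytes), &size_written);

        /* Give cuDSS some headroom above the minimum (the headroom size
           here is arbitrary). */
        int64_t limit_bytes = min_bytes + (int64_t)256 * 1024 * 1024;
        cudssConfigSet(config, CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT,
                       &limit_bytes, sizeof(limit_bytes));
    }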

Limitations:

  • Factors L and U (together with all necessary internal arrays) must fit into the host memory.

  • Hybrid memory mode must be enabled before the analysis phase.

  • Currently, hybrid memory mode adds extra synchronization of the CUDA stream in all phases.

  • By default, hybrid memory mode uses cudaHostRegister()/cudaHostUnregister() (if the device supports it). As this can sometimes be slower than not using host-registered memory, the setting CUDSS_CONFIG_USE_CUDA_REGISTER_MEMORY can be used to enable or disable the use of cudaHostRegister() (a sketch follows after this list).

  • Currently, hybrid memory mode is not supported when CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering, or when CUDSS_ALG_1 is used for the factorization.
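
As a sketch of disabling cudaHostRegister() usage (assuming the setting takes an int-valued 0/1 flag, which is an assumption):

    #include <cudss.h>

    /* Sketch: turn off host memory registration in hybrid memory mode. */
    void disable_cuda_register_memory(cudssConfig_t config)
    {
        int use_register = 0; /* assumed encoding: 0 = do not use cudaHostRegister() */
        cudssConfigSet(config, CUDSS_CONFIG_USE_CUDA_REGISTER_MEMORY,
                       &use_register, sizeof(use_register));
    }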

Multi-GPU multi-node (MGMN) mode

The idea of the multi-GPU multi-node (MGMN) mode is to allow cuDSS to use multiple GPU devices by means of distributed computations with multiple processes. For flexibility, the MGMN mode is built around a small, separately built shim communication layer which abstracts away all communication-specific primitives. Users can provide their own implementation of the communication layer with the communication backend of their choice (MPI, NCCL, etc.).

Since the shim communication layer abstracts away all communication-specific operations and is only loaded at runtime when MGMN mode is requested, enabling MGMN execution in cuDSS does not require any changes in applications which do not make use of the MGMN mode.

Enabling MGMN execution in the user application code consists of two steps:

  • First, users should set the communication layer library by calling cudssSetCommLayer(). This can be either one of the prebuilt communication layers from the cuDSS package or a custom user-built library. If cudssSetCommLayer() is called with NULL in place of the communication layer library name, the routine attempts to read the name from the environment variable CUDSS_COMM_LIB.

    Note: the communication layer library is set for the library handle and is used for all execution calls which involve the modified handle.

  • Second, users should set the communicator to be used by the cuDSS MGMN mode by calling cudssDataSet() with the CUDSS_DATA_COMM parameter. The type of the communicator must match the underlying communication backend; e.g., if OpenMPI is used, the communicator must be an OpenMPI communicator, otherwise a crash will likely occur. A sketch of both steps follows after this list.

    Note: since MGMN mode can support different (including user-implemented) communication layers, the limitations of the specific underlying communication backend (e.g., MPI or NCCL) apply.
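
A minimal sketch of both steps with the prebuilt OpenMPI communication layer might look as follows (the library name without a full path, the use of MPI_COMM_WORLD, and the omission of error checking are assumptions for illustration):

    #include <mpi.h>
    #include <cudss.h>

    /* Sketch: enable MGMN mode on a cuDSS handle using the prebuilt
       OpenMPI communication layer and MPI_COMM_WORLD. */
    void enable_mgmn(cudssHandle_t handle, cudssData_t data)
    {
        /* Step 1: set the communication layer library for this handle.
           Passing NULL instead would make cuDSS read CUDSS_COMM_LIB. */
        cudssSetCommLayer(handle, "libcudss_commlayer_openmpi.so");

        /* Step 2: pass the communicator; its type must match the backend
           wrapped by the communication layer (here: an MPI communicator). */
        MPI_Comm comm = MPI_COMM_WORLD;
        cudssDataSet(handle, data, CUDSS_DATA_COMM, &comm, sizeof(MPI_Comm));
    }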

Limitations:

  • The communication backend underlying the cuDSS communication layer must be GPU-aware, in the sense that the implementations of all distributed interface APIs must accept device memory buffers and respect stream ordering (most of the communication layer APIs take an argument of type cudaStream_t; see the definition of cudssDistributedInterface_t).

  • MGMN mode currently only supports the case where the input matrix, right-hand side, and solution vectors are not distributed and full-sized arrays are allocated on the root process (which is expected to have rank 0). Additionally, the non-root processes must pass correct matrix shapes (but, as follows from the above, the data buffer pointers passed to the matrix creation routines are ignored on those processes).

  • MGMN mode does not work together with hybrid mode.

  • MGMN mode is not supported when either CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.

Communication layer library in cuDSS

The purpose of the communication layer in cuDSS is to abstract away all communication primitives behind a small set of necessary operations built into a standalone shared library, which is loaded at runtime when MGMN mode is enabled for cuDSS.

On platforms where MGMN mode is supported, distribution packages for cuDSS include prebuilt communication libraries libcudss_commlayer_openmpi.so and libcudss_commlayer_nccl.so for OpenMPI and NCCL, respectively. Also included are the source code of the implementations, cudss_commlayer_openmpi.cu and cudss_commlayer_nccl.cu, and a script cudss_build_commlayer.sh with an example of how a communication layer library can be built. These source files and the script are provided for demonstration purposes and can serve as guidance for developing custom implementations.

Once a communication layer is implemented and built into a small standalone shared library, the application should enable MGMN mode for cuDSS via the steps mentioned above.

Note: if the communication layer depends on other shared libraries, such dependencies must be available at link time or at runtime (depending on the communication layer implementation). For example, the prebuilt OpenMPI layer libcudss_commlayer_openmpi.so requires libopenmpi.so to be found at link time, so an application using this communication layer for cuDSS should additionally be linked against the OpenMPI shared library.

Communication layer (distributed interface) API in cuDSS

The communication layer in cuDSS can be thought of as a wrapper around all communication primitives required by the cuDSS MGMN mode. While user-defined implementations are supported, the distributed interface API is fixed; it is defined in the separate header cudss_distributed_interface.h.

Specifically, the distributed interface API is encoded in the definition of the type cudssDistributedInterface_t, and a valid communication layer implementation must define a symbol named cudssDistributedInterface of this type. If such a symbol is not found, calling cudssSetCommLayer() will result in an error.
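
As a rough sketch, a custom communication layer source file exports the required symbol along the following lines. The function-pointer members are only indicated by comments here; their exact names and signatures are given in cudss_distributed_interface.h and in the shipped cudss_commlayer_openmpi.cu / cudss_commlayer_nccl.cu sources, which a real implementation should follow.

    #include "cudss_distributed_interface.h"

    /* Backend-specific wrappers (e.g. around MPI or NCCL calls that accept
       device buffers and a cudaStream_t) would be defined here. */

    /* The exported symbol that cuDSS looks up when the layer is loaded at
       runtime. A real implementation must fill every member with a valid
       function pointer; zero-initialization is shown only to keep this
       sketch compilable. */
    cudssDistributedInterface_t cudssDistributedInterface = { 0 };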