cuDSS General Description#

This section describes general aspects of using the cuDSS API.

Error status#

All cuDSS API calls return an error status of type cudssStatus_t.

Library handle#

In order to use cuDSS, the calling application must always first create a cuDSS library handle by calling the cudssCreate() function. Once the application finishes using the library, it must call cudssDestroy() to release the resources associated with the cuDSS library handle.
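A minimal sketch of the handle lifecycle (error handling abbreviated; assumes cuDSS is installed and a CUDA-capable device is available):

```c
#include <stdio.h>
#include <cudss.h>

int main(void) {
    cudssHandle_t handle;

    // Create the library handle before any other cuDSS call.
    cudssStatus_t status = cudssCreate(&handle);
    if (status != CUDSS_STATUS_SUCCESS) {
        fprintf(stderr, "cudssCreate failed: %d\n", (int)status);
        return 1;
    }

    /* ... create configs/data/matrices and call cudssExecute() here ... */

    // Release the resources associated with the handle once all work is done.
    status = cudssDestroy(handle);
    return (status == CUDSS_STATUS_SUCCESS) ? 0 : 1;
}
```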

Memory ownership and lifetime#

cuDSS does not take ownership of data buffers provided by the user. This includes the buffers used when creating cudssMatrix_t objects, user-provided permutations, and any other buffer passed via the cudssDataSet() function. In other words, a buffer allocated outside of cuDSS will not be deallocated by cuDSS; it is the responsibility of the user to do so. It is also the responsibility of the user to ensure that user-provided buffers are not destroyed before cuDSS stops operating on them. Since some cuDSS APIs (e.g. the factorization and solve phases) are asynchronous, a stream synchronization via cudaStreamSynchronize() may be necessary to ensure the buffers are no longer in use before destroying them.
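For instance, after an asynchronous solve submitted on `stream`, the user-owned device buffers must outlive the computation (a sketch; the identifier `d_values` is illustrative):

```c
/* ... cudssExecute(handle, CUDSS_PHASE_SOLVE, ...) was submitted on `stream` ... */

// The solve phase may still be running asynchronously on the GPU, so
// wait for the stream before releasing any buffer cuDSS operates on.
cudaStreamSynchronize(stream);

// Only now is it safe to free user-owned device memory, e.g. the value
// array backing a cudssMatrix_t object.
cudaFree(d_values);
```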

cuDSS allocates host and device buffers for internal use. These buffers are kept inside the corresponding API objects, such as cudssHandle_t or cudssData_t, and are deallocated when the containing object is destroyed during the call to cudssDestroy() or cudssDataDestroy().

cuDSS allocates host buffers when opaque objects are created via the Create() functions. To deallocate those buffers, the corresponding Destroy() functions must be called. Create() and Destroy() calls must always be paired with each other; otherwise memory issues may occur.

cudssMatrix_t should be thought of as a very thin wrapper around the user-provided buffers.
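This pairing can be sketched for a dense matrix object as follows (`d_x` and `n` are illustrative; note that destroying the wrapper does not free the user's data):

```c
// A cudssMatrix_t wraps, but does not own, a user-provided buffer.
// `d_x` is a device array of n doubles allocated by the application.
cudssMatrix_t x;
cudssMatrixCreateDn(&x, n, 1, n, d_x, CUDA_R_64F, CUDSS_LAYOUT_COL_MAJOR);

/* ... use x, e.g. as the solution matrix in cudssExecute() ... */

cudssMatrixDestroy(x);  // frees only the thin wrapper
cudaFree(d_x);          // the user still owns, and must free, the data
```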

The device buffers allocated internally by cuDSS can be controlled via cudssDeviceMemHandler and the associated APIs.

Memory alignment#

For user-provided buffers, cuDSS has no additional memory alignment requirements; only the default data-type-dependent alignment must be satisfied.

Thread Safety#

Thread safety is not guaranteed when calling cuDSS from multiple host threads. For example, the library handle performs internal bookkeeping for device buffers allocated by the library and must not be operated on by multiple host threads while device buffers are being allocated. Likewise, updates to cudssData_t objects, which happen during the analysis and factorization phases, are not thread-safe.

The only phase of cudssExecute() which allocates no device memory, and can therefore be executed concurrently from several host threads, is the solve phase. However, in this case the host threads must operate on different cudssMatrix_t objects for the solution.
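A sketch of such concurrent solves after a shared factorization (the handle, config, data, and system matrix A are assumed to have been prepared beforehand; each thread gets its own solution and right-hand-side cudssMatrix_t objects):

```c
#include <pthread.h>
#include <cudss.h>

typedef struct {
    cudssHandle_t handle;   // shared across threads
    cudssConfig_t config;
    cudssData_t   data;
    cudssMatrix_t A;        // shared, already factorized
    cudssMatrix_t x, b;     // distinct per thread
} SolveTask;

static void *solve_thread(void *arg) {
    SolveTask *t = (SolveTask *)arg;
    // Solve phase only: it allocates no device memory, so concurrent
    // calls are allowed as long as each thread uses its own solution
    // (and right-hand-side) matrix objects.
    cudssExecute(t->handle, CUDSS_PHASE_SOLVE, t->config, t->data,
                 t->A, t->x, t->b);
    return NULL;
}
```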

Results Reproducibility#

Currently, cuDSS makes use of atomic operations; thus bit-wise reproducibility is not guaranteed, even across repeated runs in a fixed environment. Usually such run-to-run variations lead only to small differences in the conventional residual norms.

Parallelism with Streams#

Typically, for solving a linear system cuDSS uses both host and device compute resources. Specifically, when the hybrid host/device execution mode is disabled, reordering (a major part of the analysis phase) is executed on the host, while symbolic factorization (another part of the analysis phase), numerical factorization, and solve are executed on the GPU. Thus, quite naturally, the analysis phase is currently always synchronous. The factorization and solve phases are asynchronous if neither the :ref:`hybrid memory <hybrid-mode-label>` nor the hybrid execution mode is enabled.

It follows that, in order to call cuDSS with multiple streams, the application should first run the analysis phase on all streams (these calls will not execute concurrently), and only then call the (re-)factorization and solve phases on each of the streams.
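This ordering can be sketched for several streams as follows (a hypothetical sketch assuming per-stream arrays of config, data, and matrix objects have been prepared; cudssSetStream() attaches a stream to the handle before each call):

```c
// Phase 1: run analysis for every system first. These calls are
// synchronous and will not execute concurrently.
for (int i = 0; i < NSTREAMS; i++) {
    cudssSetStream(handle, stream[i]);
    cudssExecute(handle, CUDSS_PHASE_ANALYSIS, config[i], data[i],
                 A[i], x[i], b[i]);
}

// Phase 2: factorization and solve, which are asynchronous (when no
// hybrid mode is enabled) and can overlap across the streams.
for (int i = 0; i < NSTREAMS; i++) {
    cudssSetStream(handle, stream[i]);
    cudssExecute(handle, CUDSS_PHASE_FACTORIZATION, config[i], data[i],
                 A[i], x[i], b[i]);
    cudssExecute(handle, CUDSS_PHASE_SOLVE, config[i], data[i],
                 A[i], x[i], b[i]);
}
```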

Environment variables#

The following environment variables are cuDSS-specific:

Environment variable    Supported values    Description

CUDSS_LOG_LEVEL         0 - 5               Controls the :ref:`logging <logging-label>` level
CUDSS_COMM_LIB          string              Location for the communication layer library
CUDSS_THREADING_LIB     string              Location for the threading layer library

Relation to cuSolverSp and cuSolverRf components of cuSolver library#

The sparse direct solver functionality provided by cuDSS is also partially available in the (deprecated) routines of the cuSolverSp and cuSolverRf components of the cuSolver library. Users of cuSolverSp and cuSolverRf are strongly recommended to switch to cuDSS. For technical details, please refer to the cuSolverSp and cuSolverRf documentation and to the transition samples.