cuDSS General Description#
This section describes general aspects of using the cuDSS API.
Error status#
All cuDSS calls return an error status of type cudssStatus_t.
Library handle#
In order to use cuDSS, the calling application must always first create a cuDSS library handle by calling the cudssCreate() function. Once the application finishes using the library, it must call cudssDestroy() to release the resources associated with the cuDSS library handle.
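The handle lifecycle described above can be sketched as follows (a minimal sketch; a real application should check the status of every cuDSS call, and building it requires the cuDSS headers and a CUDA-capable system):

```c
#include <stdio.h>
#include <cudss.h>

int main(void) {
    cudssHandle_t handle;
    /* The handle must be created before any other cuDSS call. */
    cudssStatus_t status = cudssCreate(&handle);
    if (status != CUDSS_STATUS_SUCCESS) {
        fprintf(stderr, "cudssCreate failed with status %d\n", (int)status);
        return 1;
    }

    /* ... create config/data objects, set up matrices, call cudssExecute() ... */

    /* Releases the resources associated with the handle. */
    cudssDestroy(handle);
    return 0;
}
```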
Memory ownership and lifetime#
cuDSS does not take ownership of data buffers provided by the user. This includes buffers used when creating cudssMatrix_t objects, user permutations, and any other buffer passed via the cudssDataSet() function. This means that a buffer allocated outside cuDSS will not be deallocated by cuDSS; it is the responsibility of the user to do so. It is likewise the user's responsibility to ensure that user-provided buffers are not destroyed before cuDSS stops operating on them. Since some cuDSS APIs (e.g., the factorization and solve phases) are asynchronous, a stream synchronization via cudaStreamSynchronize() may be necessary to ensure the buffers are no longer in use before destroying them.
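For example, since the solve phase of cudssExecute() is asynchronous with respect to the host, user-provided buffers must not be freed until the stream has been synchronized (a minimal sketch; the variable names are illustrative and assume the usual setup calls have already been made):

```c
/* Asynchronous solve phase operating on user-provided buffers: */
cudssExecute(handle, CUDSS_PHASE_SOLVE, config, data, A, x, b);

/* Wait until cuDSS has finished operating on the buffers ... */
cudaStreamSynchronize(stream);

/* ... before releasing the user-owned device memory. */
cudaFree(b_values);
cudaFree(x_values);
```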
cuDSS allocates host and device buffers for internal use. These buffers are kept inside the corresponding API objects, such as cudssHandle_t or cudssData_t, and are deallocated when the containing object (cudssHandle_t or cudssData_t) is destroyed during the call to cudssDestroy() or cudssDataDestroy().
cuDSS allocates host buffers when opaque objects are created via the Create() functions. To deallocate those buffers, the corresponding Destroy() functions must be called. Create() and Destroy() calls must always be paired with each other; otherwise, memory issues may occur.
A cudssMatrix_t object should be thought of as a very thin wrapper around the user-provided buffers.
The device buffers that cuDSS allocates internally can be controlled via device memory handlers (cudssDeviceMemHandler) and the associated APIs.
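As a rough sketch, a user-defined device allocator can be registered on the handle via cudssSetDeviceMemHandler(). The exact callback signatures and field names are declared in cudss.h; the names below (ctx, device_alloc, device_free, name) follow recent cuDSS releases and should be treated as an assumption to verify against your header:

```c
cudssDeviceMemHandler_t handler;
handler.ctx          = my_pool;        /* my_pool: hypothetical user allocator context  */
handler.device_alloc = my_pool_alloc;  /* hypothetical callbacks whose signatures must  */
handler.device_free  = my_pool_free;   /* match the declarations in cudss.h             */
strncpy(handler.name, "my_pool", sizeof(handler.name));

/* Subsequent internal device allocations go through the user's callbacks. */
cudssSetDeviceMemHandler(handle, &handler);
```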
Memory alignment#
For user-provided buffers, cuDSS does not impose any additional memory alignment requirements; only the default data-type-dependent alignment is required.
Thread Safety#
Thread safety is not guaranteed when cuDSS is called from multiple host threads. For example, the library handle does internal book-keeping for device buffers allocated by the library and cannot be operated on by multiple host threads while device buffers are being allocated. Also, the updates to cudssData_t objects which happen during the analysis and factorization phases are not thread-safe.
The only phase of cudssExecute() which allocates no device memory and can therefore be executed concurrently from several host threads is the solve phase. Even in this case, the host threads must operate on different cudssMatrix_t objects for the solution.
Results Reproducibility#
Currently, cuDSS makes use of atomic operations, so bit-wise reproducibility is not guaranteed even across repeated runs in a fixed environment. Such run-to-run variations should typically lead only to small differences in the conventional residual norms.
Parallelism with Streams#
Typically, for solving a linear system cuDSS uses both host and device compute resources. Specifically, when the hybrid host/device execution mode is disabled, reordering (a major part of the analysis phase) is executed on the host, while symbolic factorization (another part of the analysis phase), numerical factorization, and solve are executed on the GPU. Thus, the analysis phase is currently always synchronous. The factorization and solve phases are asynchronous if neither the hybrid memory mode nor the hybrid execution mode is enabled.
It follows that, to use cuDSS with multiple streams, the application should first call the analysis phase on all streams (these calls will not execute concurrently), and only then call the (re-)factorization and solve phases on each stream.
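The multi-stream pattern above can be sketched as follows (illustrative only; it assumes one handle, config, data object, and matrix set per stream, and omits all error checking). cudssSetStream() attaches a CUDA stream to a handle:

```c
/* Phase 1: analysis on every stream first (these calls do not overlap,
 * since the analysis phase is currently synchronous). */
for (int i = 0; i < NUM_STREAMS; ++i) {
    cudssSetStream(handle[i], stream[i]);
    cudssExecute(handle[i], CUDSS_PHASE_ANALYSIS,
                 config[i], data[i], A[i], x[i], b[i]);
}

/* Phase 2: factorization and solve, which are asynchronous and
 * may execute concurrently across the streams. */
for (int i = 0; i < NUM_STREAMS; ++i) {
    cudssExecute(handle[i], CUDSS_PHASE_FACTORIZATION,
                 config[i], data[i], A[i], x[i], b[i]);
    cudssExecute(handle[i], CUDSS_PHASE_SOLVE,
                 config[i], data[i], A[i], x[i], b[i]);
}
```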
Environment variables#
The following environment variables are cuDSS-specific:
| Environment variable | Supported values | Description |
| --- | --- | --- |
| CUDSS_LOG_LEVEL | 0 - 5 | Controls the logging level |
| CUDSS_COMM_LIB | string | Location of the communication layer library |
| CUDSS_THREADING_LIB | string | Location of the threading layer library |
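For example, the variables can be set in the shell before launching an application that uses cuDSS (the library paths below are illustrative placeholders; the actual layer library names depend on your cuDSS installation):

```shell
# Enable the most verbose logging level (0 disables logging, 5 is most verbose)
export CUDSS_LOG_LEVEL=5

# Point cuDSS at the communication and threading layer libraries (illustrative paths)
export CUDSS_COMM_LIB=/opt/cudss/lib/libcudss_commlayer_nccl.so
export CUDSS_THREADING_LIB=/opt/cudss/lib/libcudss_mtlayer_gomp.so

echo "$CUDSS_LOG_LEVEL"
```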
Relation to cuSolverSp and cuSolverRf components of cuSolver library#
The sparse direct solver functionality provided by cuDSS is also partially available in the (deprecated) routines of the cuSolverSp and cuSolverRf components of the cuSolver library. Users of cuSolverSp and cuSolverRf are strongly encouraged to switch to cuDSS. For technical details, please refer to the cuSolverSp and cuSolverRf documentation and to the transition samples.