cuDSS General Description#
This section describes general aspects of using the cuDSS API.
Error status#
All cuDSS calls return the error status cudssStatus_t.
Library handle#
In order to use cuDSS, a calling application must first create a cuDSS
library handle by calling the cudssCreate() function, or
cudssCreateMg() in the case of multiple devices.
Once the application finishes using the library, it must call the function
cudssDestroy() to release the resources
associated with the cuDSS library handle.
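A minimal sketch of the handle lifecycle (error handling abbreviated; this assumes the cuDSS headers and a CUDA-capable device are available, so it is illustrative rather than a complete program):

```c
#include <cudss.h>
#include <stdio.h>

int main(void) {
    cudssHandle_t handle;
    cudssStatus_t status = cudssCreate(&handle);
    if (status != CUDSS_STATUS_SUCCESS) {
        fprintf(stderr, "cudssCreate failed: %d\n", (int)status);
        return 1;
    }

    /* ... create config/data objects, set up matrices, call cudssExecute() ... */

    cudssDestroy(handle);  /* releases resources held by the library handle */
    return 0;
}
```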
Memory ownership and lifetime#
cuDSS does not take ownership of data buffers provided by the user: a buffer
allocated by the user will never be freed by cuDSS. This includes the buffers
used when creating cudssMatrix_t objects, the user permutation, and any other
buffer passed via the cudssDataSet() function.
Any user-provided buffer passed via cudssDataSet(), such as the user
permutation, is copied into an internally managed memory buffer. After the
call to cudssDataSet() returns, cuDSS does not require access to the provided
buffer and always operates on the internal copy.
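As a sketch of this copy semantics, the host permutation array below can be freed immediately after cudssDataSet() returns (this assumes the CUDSS_DATA_USER_PERM parameter and an int-typed permutation of length n; it requires the cuDSS library to run):

```c
#include <cudss.h>
#include <stdlib.h>

int main(void) {
    cudssHandle_t handle;
    cudssData_t data;
    cudssCreate(&handle);
    cudssDataCreate(handle, &data);

    const int n = 4;
    int *perm = (int *)malloc(n * sizeof(int));
    for (int i = 0; i < n; ++i) perm[i] = n - 1 - i;  /* example permutation */

    /* cuDSS copies the permutation into internally managed memory */
    cudssDataSet(handle, data, CUDSS_DATA_USER_PERM, perm, n * sizeof(int));
    free(perm);  /* safe: cuDSS now operates on its internal copy */

    cudssDataDestroy(handle, data);
    cudssDestroy(handle);
    return 0;
}
```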
In general, if a buffer has been allocated outside of cuDSS, it will not be
deallocated by cuDSS, and it is the responsibility of the user to do so. It is
also the user's responsibility to ensure that user-provided buffers, such as
matrix and vector data, are not destroyed before cuDSS has stopped operating
on them.
Since some cuDSS APIs (e.g., the factorization and solve phases) are
asynchronous, a stream synchronization via cudaStreamSynchronize() may be
necessary to ensure the buffers are no longer in use before destroying them.
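The following sketch shows the resulting lifetime discipline for a dense matrix wrapper: the user's device buffer must outlive any asynchronous work, and destroying the wrapper does not free it (the matrix dimensions and the use of the default stream are illustrative; this requires cuDSS and a GPU to run):

```c
#include <cuda_runtime.h>
#include <cudss.h>

int main(void) {
    const int64_t n = 4;
    double *d_b;
    cudaMalloc((void **)&d_b, n * sizeof(double));

    cudssMatrix_t b;
    /* the matrix object only records the pointer; ownership stays with us */
    cudssMatrixCreateDn(&b, n, 1, n, d_b, CUDA_R_64F, CUDSS_LAYOUT_COL_MAJOR);

    /* ... asynchronous cudssExecute() calls may read/write d_b here ... */

    cudaStreamSynchronize(0);  /* ensure no cuDSS work still touches d_b */
    cudssMatrixDestroy(b);     /* destroys the thin wrapper only, not d_b */
    cudaFree(d_b);             /* freeing the data is our responsibility  */
    return 0;
}
```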
cuDSS allocates host and device buffers for internal use. These buffers are kept
inside the corresponding API objects like cudssHandle_t,
or cudssData_t. These internal buffers are deallocated
when the containing object (cudssHandle_t or cudssData_t) is destroyed during the call
to cudssDestroy() or cudssDataDestroy().
cuDSS allocates host buffers when opaque objects are created via the Create()
functions. To deallocate those buffers, the corresponding Destroy() functions
must be called: every Create() call must be paired with a Destroy() call to
prevent memory leaks.
A cudssMatrix_t object should be thought of as a very thin wrapper
around the user-provided buffers.
Controlling the device buffers allocated by cuDSS internally can be done via
cudssDeviceMemHandler and associated APIs.
Memory alignment#
For user-provided buffers, cuDSS does not have any additional memory alignment requirements, i.e., only the default data-type dependent alignment must be present.
Thread Safety#
Thread safety is not guaranteed for calling cuDSS from multiple host threads.
For example, the cudssHandle_t does internal
book-keeping for device buffers allocated by the library and cannot be operated
upon by multiple host threads when device buffers are being allocated. Also,
updates to cudssData_t objects which happen during analysis and factorization
phases are not thread-safe.
The only phase of cudssExecute() which allocates no device memory, and thus
can be executed concurrently from several host threads, is the solve phase.
Even in this case, however, the host threads must operate on different
cudssMatrix_t objects for the solution.
Results Reproducibility#
By default, cuDSS makes use of atomic operations, which implies that bit-wise reproducibility is not guaranteed even across repeated runs in a fixed environment. Usually such run-to-run variations lead only to small differences in the conventional residual norms.
To enable bit-wise reproducibility in cuDSS at every run when executed on GPUs with the same architecture
and the same number of SMs (assuming the input data and solver settings are also bit-wise identical),
one can use the setting CUDSS_CONFIG_DETERMINISTIC_MODE.
Note: deterministic mode uses a different set of kernels which often might be slower than the kernels used in the default mode.
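As a sketch, deterministic mode can be enabled on the solver configuration object via cudssConfigSet() (this assumes CUDSS_CONFIG_DETERMINISTIC_MODE takes an int-valued flag, and requires the cuDSS library to run):

```c
#include <cudss.h>

int main(void) {
    cudssConfig_t config;
    cudssConfigCreate(&config);

    /* request bit-wise reproducible (often slower) kernels */
    int deterministic = 1;
    cudssConfigSet(config, CUDSS_CONFIG_DETERMINISTIC_MODE,
                   &deterministic, sizeof(deterministic));

    /* ... run analysis/factorization/solve phases with this config ... */

    cudssConfigDestroy(config);
    return 0;
}
```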
Parallelism with Streams#
Typically, for solving a linear system, cuDSS uses both host and device compute resources. Specifically, when the hybrid host/device execution mode is disabled, reordering (a major part of the analysis phase) is executed on the host, while symbolic factorization (another part of the analysis phase), numerical factorization and solve are executed on the GPU. Thus, quite naturally, the analysis phase is currently always synchronous. The factorization and solve phases are asynchronous if neither the hybrid memory mode nor the hybrid execution mode is enabled.
It follows that, in order to use multiple streams with cuDSS, the application should first call the analysis phase on all streams (these calls will not be executed concurrently), and only then call the (re-)factorization and solve phases in each of the streams.
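The calling pattern above can be sketched as follows for two independent systems on two streams (matrix creation is elided, so this is a structural skeleton rather than a runnable program; the per-system arrays are illustrative):

```c
#include <cuda_runtime.h>
#include <cudss.h>

#define NSYS 2  /* number of independent linear systems / streams */

int main(void) {
    cudaStream_t streams[NSYS];
    cudssHandle_t handles[NSYS];
    cudssConfig_t configs[NSYS];
    cudssData_t   datas[NSYS];
    cudssMatrix_t A[NSYS], x[NSYS], b[NSYS];

    for (int i = 0; i < NSYS; ++i) {
        cudaStreamCreate(&streams[i]);
        cudssCreate(&handles[i]);
        cudssSetStream(handles[i], streams[i]);
        cudssConfigCreate(&configs[i]);
        cudssDataCreate(handles[i], &datas[i]);
        /* ... create A[i], x[i], b[i] from user buffers ... */
    }

    /* Step 1: analysis for all systems first; these calls are synchronous
       and will not overlap. */
    for (int i = 0; i < NSYS; ++i)
        cudssExecute(handles[i], CUDSS_PHASE_ANALYSIS,
                     configs[i], datas[i], A[i], x[i], b[i]);

    /* Step 2: factorization and solve, issued stream by stream; with the
       default settings these phases are asynchronous and may overlap. */
    for (int i = 0; i < NSYS; ++i) {
        cudssExecute(handles[i], CUDSS_PHASE_FACTORIZATION,
                     configs[i], datas[i], A[i], x[i], b[i]);
        cudssExecute(handles[i], CUDSS_PHASE_SOLVE,
                     configs[i], datas[i], A[i], x[i], b[i]);
    }

    for (int i = 0; i < NSYS; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudssDataDestroy(handles[i], datas[i]);
        cudssConfigDestroy(configs[i]);
        cudssDestroy(handles[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```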
CUDA Graphs support#
There are restrictions on using CUDA Graphs with cuDSS.
By default, cuDSS uses cudaMalloc/cudaFree for device memory allocations, which is not compatible with CUDA Graphs. To remove this limitation, one should use cudssDeviceMemHandler and the associated APIs so that cuDSS uses cudaMallocAsync/cudaFreeAsync for device memory allocations.
The analysis phase is always synchronous.
When certain features are enabled, cuDSS factorization and solve phases may also become synchronous. For example, this happens when hybrid memory mode, hybrid execution mode or MGMN mode are enabled.
By default, however, factorization and solve phases are asynchronous and thus can be captured in a CUDA Graph.
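A sketch of routing cuDSS device allocations through cudaMallocAsync/cudaFreeAsync for CUDA Graph compatibility is shown below. The exact field layout of cudssDeviceMemHandler_t (ctx, device_alloc, device_free, name) and the cudssSetDeviceMemHandler() call should be checked against the API reference; this assumes callback signatures returning 0 on success:

```c
#include <cuda_runtime.h>
#include <cudss.h>
#include <string.h>

/* allocation callbacks backed by the stream-ordered memory allocator */
static int my_alloc(void *ctx, void **ptr, size_t size, cudaStream_t stream) {
    (void)ctx;
    return cudaMallocAsync(ptr, size, stream) == cudaSuccess ? 0 : 1;
}
static int my_free(void *ctx, void *ptr, size_t size, cudaStream_t stream) {
    (void)ctx; (void)size;
    return cudaFreeAsync(ptr, stream) == cudaSuccess ? 0 : 1;
}

int main(void) {
    cudssHandle_t handle;
    cudssCreate(&handle);

    cudssDeviceMemHandler_t handler;
    memset(&handler, 0, sizeof(handler));
    handler.ctx          = NULL;
    handler.device_alloc = my_alloc;
    handler.device_free  = my_free;
    strcpy(handler.name, "async-pool-handler");

    cudssSetDeviceMemHandler(handle, &handler);

    /* ... factorization/solve phases may now be captured in a CUDA Graph ... */

    cudssDestroy(handle);
    return 0;
}
```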
Environment variables#
The following environment variables are cuDSS-specific:
Environment variable | Supported values | Description
---------------------|------------------|---------------------------------------------
CUDSS_LOG_LEVEL      | 0 - 5            | Controls the logging level
CUDSS_COMM_LIB       | string           | Location of the communication layer library
CUDSS_THREADING_LIB  | string           | Location of the threading layer library
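For example, the variables can be set in the shell before launching the application (the library path below is illustrative):

```shell
# enable verbose cuDSS logging
export CUDSS_LOG_LEVEL=3
# point cuDSS at a threading layer library (path is illustrative)
export CUDSS_THREADING_LIB=/usr/local/lib/libcudss_mtlayer_gomp.so
# a subsequently launched application inherits these settings
env | grep '^CUDSS_'
```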
Relation to cuSolverSp and cuSolverRf components of cuSolver library#
Sparse direct solver functionality provided by cuDSS is also partially available in the (deprecated) routines of the cuSolverSp and cuSolverRf components of the cuSolver library. Users of cuSolverSp and cuSolverRf are strongly encouraged to switch to cuDSS. For technical details, please refer to the documentation for cuSolverSp and cuSolverRf and to the transition samples.