NVIDIA cuDSS (Preview): A high-performance CUDA Library for Direct Sparse Solvers#

NVIDIA cuDSS (Preview) is a library of GPU-accelerated direct solvers for linear systems with sparse matrices. It provides algorithms for solving linear systems of the form:

\[A X = B\]

with a sparse matrix \(A\), a right-hand side \(B\), and an unknown solution \(X\) (each of \(B\) and \(X\) may be a matrix or a vector).

cuDSS offers flexible control over matrix properties, solver configuration, and execution parameters such as CUDA streams.

Note: Since the library is released as a preview, the API is subject to change in later releases.

Download: developer.nvidia.com/cudss-downloads

Provide Feedback: cuDSS-EXTERNAL-Group@nvidia.com

Examples: cuDSS Example 1, cuDSS Example 2, cuDSS Example 3, cuDSS Example 4

Key Features and Properties#

  • Real/complex general/symmetric/positive-definite (incl. complex symmetric) sparse matrices

  • Non-uniform batching (solving multiple different systems of different sizes)

  • Uniform batching (solving multiple systems with the same sparsity pattern)

  • Single and double precision datatypes for values and int and int64_t datatypes for indices

  • Single and multiple right-hand sides

  • Multi-stage execution with three main phases: analysis (reordering and symbolic factorization), numerical factorization, and solving; optionally, refactorization and individual solve sub-phases (forward and backward substitution with the corresponding permutations, and iterative refinement)

  • Different algorithms for reordering and factorization phases

  • Numerical pivoting controls

  • User-defined device memory handlers and memory pools

  • Memory estimates after the analysis phase

  • Schur complement mode

  • Hybrid host/device memory mode

  • Hybrid host/device execution mode

  • Multi-GPU multi-node (MGMN) execution with a user-definable communication layer

  • Multi-GPU (single-node) execution (MG) without a distributed communication backend

  • Multi-Threaded (MT) execution with a user-definable threading layer

  • Multi-threaded reordering for the default reordering algorithm

  • Partially asynchronous API (asynchronous for factorization and solve when host memory/execution modes are not enabled)

  • Optionally deterministic computations (bit-wise reproducibility on the same hardware, underlying software stack, and input data)
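The multi-stage execution flow listed above (analysis, numerical factorization, solve) can be sketched as below. This is a minimal, error-handling-free illustration based on the cuDSS C API; it assumes an SPD double-precision matrix whose CSR arrays (`d_offsets`, `d_columns`, `d_values`) and dense buffers (`d_b`, `d_x`) are placeholder names for data already allocated in device memory:

```c
#include <cuda_runtime.h>
#include "cudss.h"

/* Sketch of the three-phase cuDSS flow for solving A x = b.
 * Status checks on every call are omitted for brevity. */
void solve_sketch(int n, int nnz,
                  int *d_offsets, int *d_columns, double *d_values,
                  double *d_b, double *d_x)
{
    cudssHandle_t handle;
    cudssConfig_t config;
    cudssData_t   data;
    cudssCreate(&handle);
    cudssConfigCreate(&config);
    cudssDataCreate(handle, &data);

    /* Wrap the user buffers in cuDSS matrix objects. */
    cudssMatrix_t A, x, b;
    cudssMatrixCreateCsr(&A, n, n, nnz, d_offsets, NULL, d_columns, d_values,
                         CUDA_R_32I, CUDA_R_64F, CUDSS_MTYPE_SPD,
                         CUDSS_MVIEW_UPPER, CUDSS_BASE_ZERO);
    cudssMatrixCreateDn(&b, n, 1, n, d_b, CUDA_R_64F, CUDSS_LAYOUT_COL_MAJOR);
    cudssMatrixCreateDn(&x, n, 1, n, d_x, CUDA_R_64F, CUDSS_LAYOUT_COL_MAJOR);

    /* Phase 1: reordering + symbolic factorization */
    cudssExecute(handle, CUDSS_PHASE_ANALYSIS, config, data, A, x, b);
    /* Phase 2: numerical factorization */
    cudssExecute(handle, CUDSS_PHASE_FACTORIZATION, config, data, A, x, b);
    /* Phase 3: forward/backward substitution (+ optional refinement) */
    cudssExecute(handle, CUDSS_PHASE_SOLVE, config, data, A, x, b);
    cudaStreamSynchronize(0);  /* the API is asynchronous by default */

    cudssMatrixDestroy(A);
    cudssMatrixDestroy(b);
    cudssMatrixDestroy(x);
    cudssDataDestroy(handle, data);
    cudssConfigDestroy(config);
    cudssDestroy(handle);
}
```

When only the numerical values of \(A\) change between solves (with the sparsity pattern fixed), the analysis phase can be reused and only the factorization and solve phases repeated.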

Support#

  • Supported configurations: single GPU, multi-GPU multi-node (MGMN), multi-GPU (single-node) (MG)

  • Supported SM Architectures: all SM architectures from Pascal onward (as supported by the corresponding CUDA Toolkit)

  • Supported OS: Linux, Windows

  • Supported CPU Architectures: x86_64, ARM (SBSA), ARM (aarch64/Jetson) for Orin and Thor devices

  • Supported communication backends (for MGMN mode): pre-built backends for Open MPI 4.x and NCCL 2.x, plus any user-defined GPU-aware, stream-aware backend

  • Supported threading backends (for MT mode): pre-built backends for GNU OpenMP (Linux) and VCOMP (Windows), plus any user-defined threading backend

Index#