Release Notes#
cuDSS v0.7.1#
New Features:
- Enabled CUDSS_DATA_USER_ELIMINATION_TREE: it can be used in combination with CUDSS_DATA_USER_PERM to set both the user permutation and the elimination tree with values from a previous run, maintaining the factorization and solve performance while improving the reordering time.
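A minimal sketch of how the two parameters could be used together, assuming the generic cudssDataGet()/cudssDataSet() signatures (parameter, buffer, size in bytes) and host-side int buffers; the n-element layouts of the permutation and of the elimination tree are assumptions, so consult the cuDSS documentation for the authoritative formats:

```
#include <cudss.h>
#include <stdlib.h>

/* Hypothetical helper: after a first analysis run on a matrix with n rows,
 * extract the permutation and elimination tree, then seed a second run with them. */
static void reuse_reordering(cudssHandle_t handle,
                             cudssData_t prevData, cudssData_t newData, int n)
{
    size_t written = 0;
    /* Host buffers; the elimination tree is assumed to be n integers
     * (one parent index per node) -- check the docs for the real layout. */
    int* perm  = (int*)malloc(n * sizeof(int));
    int* etree = (int*)malloc(n * sizeof(int));

    /* Extract the results of the previous analysis. */
    cudssDataGet(handle, prevData, CUDSS_DATA_PERM_REORDER_ROW, perm,
                 n * sizeof(int), &written);
    cudssDataGet(handle, prevData, CUDSS_DATA_ELIMINATION_TREE, etree,
                 n * sizeof(int), &written);

    /* Seed the new analysis with both the user permutation and the tree. */
    cudssDataSet(handle, newData, CUDSS_DATA_USER_PERM, perm, n * sizeof(int));
    cudssDataSet(handle, newData, CUDSS_DATA_USER_ELIMINATION_TREE, etree,
                 n * sizeof(int));

    free(perm);
    free(etree);
}
```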
Important bug fixes:
Fixed performance regression in the reordering phase with the default reordering algorithm for matrices with block structure (also known as BSR) (CUDSS-1035).
Known issues:
- Changing the data type of the matrix while reusing an existing cudssHandle_t will not lead to correct results and might crash (CUDSS-1019).
- Matrices with more than \(2^{31}\) rows or columns are not supported.
- compute-sanitizer might report a spurious uninitialized read on Windows in multi-GPU (MG) mode (CUDSS-1026). Please report it if observed.
- On Linux, depending on the version of the GNU OpenMP runtime, valgrind and sanitizer tools may crash or report a memory leak during the call to cudssDestroy() if multi-threaded mode is used (CUDSS-1079). This happens when cuDSS calls dlclose() for the multi-threading layer library and is caused by the GNU OpenMP runtime being unloaded. As a workaround, set GOMP_SPINCOUNT to 0 in the environment, or add a call to dlopen() for the libgomp.so shared library in the application (before calling cudssDestroy()); see the sketch after this list.
- Reordering algorithm CUDSS_ALG_3 may crash on very small matrices (CUDSS-1030).
- Separate calls to CUDSS_PHASE_REORDERING and CUDSS_PHASE_SYMBOLIC_FACTORIZATION with matrix types other than CUDSS_MTYPE_GENERAL and with matching enabled produce incorrect results (CUDSS-1070). As a workaround, use CUDSS_PHASE_ANALYSIS.
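For the libgomp unloading issue above, a minimal sketch of the dlopen() workaround; the "libgomp.so.1" soname used here is an assumption and may differ on a given system:

```
#include <dlfcn.h>
#include <cudss.h>

/* Keep an extra reference to GNU OpenMP so it is not unloaded when cuDSS
 * dlclose()-es its multi-threading layer inside cudssDestroy(). */
void destroy_with_workaround(cudssHandle_t handle)
{
    void* gomp = dlopen("libgomp.so.1", RTLD_NOW | RTLD_GLOBAL);
    cudssDestroy(handle);
    /* Intentionally no dlclose(gomp): the extra reference keeps libgomp loaded. */
    (void)gomp;
}
```

Setting GOMP_SPINCOUNT=0 in the environment avoids the issue without code changes.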
cuDSS v0.7.0#
New Features:
- Enabled multi-GPU single-node (MG) mode via cudssCreateMg(), CUDSS_CONFIG_DEVICE_COUNT, and CUDSS_CONFIG_DEVICE_INDICES
- Re-introduced factorization algorithm CUDSS_ALG_1 with improved performance for matrices with very sparse factors
- Added support for int64_t indexing for sparse matrices
- Enabled turning the superpanel optimization on and off (CUDSS_CONFIG_USE_SUPERPANELS)
- Enabled extracting the auxiliary elimination tree structure (CUDSS_DATA_ELIMINATION_TREE)
- Added support for solve sub-phases (CUDSS_PHASE_SOLVE_FWD, CUDSS_PHASE_SOLVE_FWD_PERM, CUDSS_PHASE_SOLVE_BWD, CUDSS_PHASE_SOLVE_BWD_PERM)
- Added support for Schur complement mode (CUDSS_CONFIG_SCHUR_MODE, CUDSS_DATA_USER_SCHUR_INDICES, CUDSS_DATA_SCHUR_SHAPE, CUDSS_DATA_SCHUR_MATRIX)
- Added support for deterministic mode (CUDSS_CONFIG_DETERMINISTIC_MODE)
- Enabled user interruption on the host (CUDSS_DATA_USER_HOST_INTERRUPT)
- Improved performance for the symbolic factorization phase with the CUDSS_ALG_DEFAULT and CUDSS_ALG_3 reordering algorithms
- Enabled support for CUDA 13 and new Blackwell GPU architectures (incl. GB300 and DGX Spark)
Breaking changes:
- All user-provided data (provided with the prefix CUDSS_DATA_USER_) is now copied to an internal buffer by a cudssDataSet operation and copied back by a cudssDataGet operation. This breaks calls to cudssDataGet with CUDSS_DATA_USER_PERM, as it now copies the data to the provided pointer instead of copying the pointer.
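A minimal sketch of the new copy semantics, assuming the generic cudssDataGet() signature (destination buffer plus size in bytes); the n-element int layout of the permutation is an assumption:

```
#include <cudss.h>
#include <stdlib.h>

/* Since 0.7.0, cudssDataGet with CUDSS_DATA_USER_PERM copies the permutation
 * values into a caller-provided buffer instead of returning the stored pointer. */
static int* get_user_perm_copy(cudssHandle_t handle, cudssData_t data, int n)
{
    int* perm = (int*)malloc(n * sizeof(int));     /* caller-owned destination */
    size_t written = 0;
    cudssDataGet(handle, data, CUDSS_DATA_USER_PERM,
                 perm, n * sizeof(int), &written); /* values are copied here */
    return perm;                                   /* not the library's pointer */
}
```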
Important bug fixes:
- Fixed incorrect results for the default matching algorithm for hybrid memory mode and hybrid execute mode when the scaling vectors are non-constant (CUDSS-882).
- Fixed a crash in MGMN mode when some of the ranks have empty local parts of the matrix (CUDSS-913).
- Fixed out-of-bounds access during reordering for very large matrices (CUDSS-948).
- Fixed incorrect results for symmetric matrices with the superpanel and pivoting options enabled (CUDSS-1003).
Known issues:
- Setting CUDSS_DATA_USER_ELIMINATION_TREE is currently not supported and leads to undefined behavior (CUDSS-1020).
- Changing the data type of the matrix while reusing an existing cudssHandle_t will not lead to correct results and might crash (CUDSS-1019).
- Matrices with more than \(2^{31}\) rows or columns are not supported.
- compute-sanitizer might report a spurious uninitialized read on Windows in multi-GPU (MG) mode (CUDSS-1026).
- Performance regression in the reordering phase with the default reordering algorithm for matrices with block structure (also known as BSR) (CUDSS-1035).
cuDSS v0.6.0#
New Features:
- Enabled support for distributed matrices with a row-wise, potentially overlapping distribution via cudssMatrixSetDistributionRow1d()
- Enabled support for matching and scaling with CUDSS_CONFIG_USE_MATCHING and CUDSS_CONFIG_MATCHING_ALG
- Enabled support for solving uniform batches (same sparsity pattern) of linear systems via CUDSS_CONFIG_UBATCH_SIZE and CUDSS_CONFIG_UBATCH_INDEX
- Extended support for the hybrid memory mode from the default (single GPU) mode to the MGMN mode
- Enabled support for combining the phases for cudssExecute() in a single call, and added separate phases CUDSS_PHASE_REORDERING and CUDSS_PHASE_SYMBOLIC_FACTORIZATION for reordering and symbolic factorization; see the sketch after this list
- Enabled a new configuration parameter CUDSS_CONFIG_ND_NLEVELS which controls the nested dissection depth for reordering with CUDSS_ALG_DEFAULT
- Enabled support for symmetric matrices in the CUDSS_ALG_1 and CUDSS_ALG_2 reordering algorithms
- Re-enabled passing a different matrix on the solve phase (was disabled in 0.5.0), with the restrictions specified in cudssExecute()
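A minimal sketch of the split and combined phase calls, assuming phase values can be combined with a bitwise OR now that the phase argument of cudssExecute() is an int (see the breaking changes below); the handle, config, data, and matrix objects are assumed to be created elsewhere with the usual cuDSS setup calls:

```
#include <cudss.h>

void analyze_factor_solve(cudssHandle_t handle, cudssConfig_t config, cudssData_t data,
                          cudssMatrix_t A, cudssMatrix_t x, cudssMatrix_t b)
{
    /* Analysis split into its two sub-phases (new in 0.6.0)... */
    cudssExecute(handle, CUDSS_PHASE_REORDERING, config, data, A, x, b);
    cudssExecute(handle, CUDSS_PHASE_SYMBOLIC_FACTORIZATION, config, data, A, x, b);

    /* ...and the remaining phases combined into a single call. */
    cudssExecute(handle, CUDSS_PHASE_FACTORIZATION | CUDSS_PHASE_SOLVE,
                 config, data, A, x, b);
}
```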
Breaking changes:
- Disabled support for batched matrices in the MGMN mode
- Changed the enum values for cudssPhase_t and changed the type of the phase parameter of cudssExecute from cudssPhase_t to int
Important bug fixes:
- Fixed a sporadic accuracy failure for non-symmetric matrices with CUDSS_ALG_1 or CUDSS_ALG_2 for reordering (CUDSS-732)
- Fixed incorrect (perturbed) pivot scaling (CUDSS_ALG_1 for CUDSS_CONFIG_PIVOT_EPSILON_ALG) for symmetric matrices (CUDSS-706)
- Fixed incorrect reordering results with CUDSS_ALG_DEFAULT when the nested dissection depth is large compared to the number of rows (CUDSS-873)
Known issues:
- The default matching algorithm (same as CUDSS_ALG_5) produces incorrect results for hybrid execute and hybrid memory modes when the scaling vectors are non-constant (CUDSS-882). As a workaround, use CUDSS_ALG_4 or any matching algorithm value other than the default and the last option; see the sketch after this list.
- If the input matrix is distributed and some processes have empty local parts of the matrix, an out-of-bounds device access can happen (CUDSS-913). As a workaround, make sure all processes have non-empty local parts.
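A minimal sketch of the CUDSS_ALG_4 workaround via cudssConfigSet(); the generic (parameter, pointer, size) signature and the int type of the matching flag are assumptions:

```
#include <cudss.h>

/* Workaround for CUDSS-882: select a non-default matching algorithm
 * before the analysis phase when matching is enabled. */
void use_alg4_matching(cudssConfig_t config)
{
    int use_matching = 1;
    cudssAlgType_t matching_alg = CUDSS_ALG_4;

    cudssConfigSet(config, CUDSS_CONFIG_USE_MATCHING,
                   &use_matching, sizeof(use_matching));
    cudssConfigSet(config, CUDSS_CONFIG_MATCHING_ALG,
                   &matching_alg, sizeof(matching_alg));
}
```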
cuDSS v0.5.0#
New Features:
- Multi-Threaded (MT) mode with a prebuilt standalone threading layer for GNU OpenMP on Linux and VCOMP (VCOMP140.dll) on Windows, as well as with custom user-defined threading backends; see the sketch after this list
- Multi-threaded reordering for the CUDSS_ALG_DEFAULT reordering algorithm
- Hybrid host/device execute mode to enable using the host for the factorization and solve phases (improved performance for small matrices)
- Improved performance and memory requirements for the hybrid memory mode
- Blackwell GPU architecture support (sm_100, sm_120, sm_101)
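A minimal sketch of enabling MT mode by pointing cuDSS at a threading layer library; the cudssSetThreadingLayer() setter and the file name of the prebuilt GNU OpenMP layer used below are assumptions, so check the cuDSS documentation for the exact call and name:

```
#include <cudss.h>

/* Select a threading layer library before any multi-threaded work is submitted.
 * The file name below is an assumed example of the prebuilt GNU OpenMP layer. */
void enable_mt_mode(cudssHandle_t handle)
{
    cudssSetThreadingLayer(handle, "libcudss_mtlayer_gomp.so.0");
}
```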
Breaking changes:
- Changed the function signatures of cudssMatrixCreateBatchDn() and cudssMatrixGetBatchDn() to have an extra argument for the integer type of the arrays with scalar parameters of the batch (like nrows, ncols, or ld)
- Changed the definition of cudssMatrixFormat_t and cudssMatrixGetFormat() so that the matrix format enum values become bit flags (which can be combined), and added a flag CUDSS_MFORMAT_BATCH for querying whether a cudssMatrix_t is a batch
- Dropped support for SLES 15.x for x <= 5 (upgraded the minimal supported version of SLES to 15.6) and upgraded the minimal supported version of GLIBC to 2.28
Important bug fixes:
- Fixed a bug when the internally symmetrized matrix pattern has a number of nonzeros overflowing the 32-bit integer maximum value (CUDSS-417)
- Fixed a bug where the MGMN mode produced incorrect results for some larger matrices (CUDSS-601)
- Fixed incorrect memory estimates returned for the hybrid memory mode (CUDSS-669)
cuDSS v0.4.0#
New Features:
- Significant performance improvements for all reordering algorithms other than CUDSS_ALG_1 and CUDSS_ALG_2
- Added support for non-uniform batching (solving multiple systems with different matrices and right-hand sides) via new APIs like cudssMatrixCreateBatch<Dn|Csr>() and others similar to the existing APIs for non-batch matrix objects, and extended cudssExecute() to support batches
- Added support for querying memory estimates via cudssDataGet() with CUDSS_DATA_MEMORY_ESTIMATES; see the sketch after this list
- Added installer support for Ubuntu 24.04
- Added support for pip wheels on pypi.org and pypi.nvidia.com and for conda packaging
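A minimal sketch of querying the memory estimates after the analysis phase, assuming they are returned as an array of int64_t values; the array length of 16 and the meaning of the individual entries used here are assumptions, so see the cuDSS documentation for the actual layout:

```
#include <cudss.h>
#include <stdint.h>
#include <stdio.h>

void print_memory_estimates(cudssHandle_t handle, cudssData_t data)
{
    int64_t estimates[16] = {0};   /* assumed capacity; see the docs */
    size_t written = 0;
    cudssDataGet(handle, data, CUDSS_DATA_MEMORY_ESTIMATES,
                 estimates, sizeof(estimates), &written);
    /* Treating entry [0] as a permanent and [1] as a peak device memory
     * estimate is an assumption for illustration purposes only. */
    printf("estimate[0] = %lld bytes, estimate[1] = %lld bytes\n",
           (long long)estimates[0], (long long)estimates[1]);
}
```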
Breaking changes:
- Added a new value CUDSS_DATA_MEMORY_ESTIMATES to the cudssDataParam_t enum
- Starting with this release, cuDSS has a runtime dependency on cuBLAS (usually available as a part of the CUDA Toolkit)
Important bug fixes:
- Added a CMake config version file and fixed the CMake config for system-wide installation
Known issues:
- For a correct system-wide installation, cudss-config.cmake makes use of REAL_PATH, which is only available since CMake 3.19 (newer than, e.g., the default CMake version on Ubuntu 20.04)
- The MGMN mode of cuDSS with the OpenMPI communication layer might run out of GPU memory due to a bug in OpenMPI/UCX. If you encounter this issue, consider setting export UCX_MEMTYPE_CACHE=n or export UCX_TLS=^cuda_ipc, or switching to the NCCL communication backend as a workaround. These workarounds might lead to performance degradation; in such a case, please report it.
- PIP wheels for Linux + x86_64 with version 0.4.0.2 might not work on some older OSes due to a bug in patchelf 0.18.0. On affected systems, ldd libcudss.so.0 will return the error "not a dynamic executable", and using the shared library for linking with an application will produce the error "ELF load command address/offset not properly aligned". The suggested workaround is to use the installation command pip install nvidia-cudss-cu12, which will install the patched wheels nvidia-cudss-cu12 0.4.0.2.post1.
cuDSS v0.3.0#
New Features:
- Multi-GPU multi-node (MGMN) mode with prebuilt standalone communication layers for NCCL and OpenMPI, as well as with custom user-defined GPU-aware communication backends; see the sketch after this list
- Hybrid host/device memory mode which enables keeping the factor values in host memory (RAM) and uses only a smaller device buffer as temporary storage
- Extended support to Linux ARM (aarch64) (Ubuntu 22.04, only on Orin (SM 8.7) devices)
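A minimal sketch of selecting the prebuilt NCCL communication layer for the MGMN mode; the cudssSetCommLayer() setter, the CUDSS_DATA_COMM data parameter, and the exact way the communicator is passed are assumptions here (only the library file name matches the one mentioned in the known issue below):

```
#include <cudss.h>
#include <nccl.h>

/* Illustration only: pick the NCCL communication layer and hand cuDSS an
 * already-initialized NCCL communicator for the MGMN mode. */
void setup_mgmn(cudssHandle_t handle, cudssData_t data, ncclComm_t* comm)
{
    /* Assumed setter for the prebuilt communication layer library. */
    cudssSetCommLayer(handle, "libcudss_commlayer_nccl.so");

    /* Assumed data parameter for passing the communicator to cuDSS. */
    cudssDataSet(handle, data, CUDSS_DATA_COMM, comm, sizeof(*comm));
}
```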
Breaking changes:
- Removed the values CUDSS_STATUS_ARCH_MISMATCH and CUDSS_STATUS_ZERO_PIVOT from the enum cudssStatus_t, as these values will not be used
- Renamed the main header cuDSS.h to cudss.h to better align with other CUDA math libraries
Important bug fixes:
- Fixed execution failures when multiple tiny matrices (less than 16x16) were solved reusing the same cudssData_t
- Fixed incorrect results when the factorization phase followed a re-factorization phase with the CUDSS_ALG_1 and CUDSS_ALG_2 reordering algorithms
Known issue:
Error messages are seen during cuDSS installation on RPM-based systems (RHEL, Rocky, SLES):
failed to link /usr/lib/#INSTALL_TRIPLET#/libcudss_commlayer_nccl.so -> /etc/alternatives/libcudss_commlayer_nccl.so: No such file or directory
failed to link /usr/lib/#INSTALL_TRIPLET#/libcudss_commlayer_openmpi.so -> /etc/alternatives/libcudss_commlayer_openmpi.so: No such file or directory
The installation completes despite the failure to create a couple of symlinks for cuDSS.
To fix the issue, apply the workaround below ONLY after encountering the above issue. The issue, and thus the workaround, is only applicable to RPM-based systems (RHEL, Rocky, SLES). The workaround drops and recreates all symlinks intended for the cudss alternatives system:
update-alternatives --remove cudss /usr/lib64/libcudss/12/libcudss.so.0
/sbin/ldconfig
update-alternatives --install /usr/lib64/libcudss.so.0 cudss /usr/lib64/libcudss/12/libcudss.so.0 120 \
--slave /usr/lib64/libcudss.so libcudss.so /usr/lib64/libcudss/12/libcudss.so \
--slave /usr/lib64/libcudss_commlayer_nccl.so libcudss_commlayer_nccl.so /usr/lib64/libcudss/12/libcudss_commlayer_nccl.so \
--slave /usr/lib64/libcudss_commlayer_openmpi.so libcudss_commlayer_openmpi.so /usr/lib64/libcudss/12/libcudss_commlayer_openmpi.so \
--slave /usr/lib64/libcudss_static.a libcudss_static.a /usr/lib64/libcudss/12/libcudss_static.a \
--slave /usr/lib64/cmake/cudss cudss_cmake /usr/lib64/libcudss/12/cmake/cudss \
--slave /usr/include/cudss.h cudss.h /usr/include/libcudss/12/cudss.h \
--slave /usr/include/cudss_distributed_interface.h cudss_distributed_interface.h /usr/include/libcudss/12/cudss_distributed_interface.h
/sbin/ldconfig
After the steps are completed, confirm that all the symlinks exist. Expectation:
# ls -l /usr/lib64/cmake/cudss
... /usr/lib64/cmake/cudss -> /etc/alternatives/cudss_cmake
# ls -l /usr/include/cudss*
... /usr/include/cudss.h -> /etc/alternatives/cudss.h
... /usr/include/cudss_distributed_interface.h -> /etc/alternatives/cudss_distributed_interface.h
# ls -l /usr/lib64/*cudss*
... /usr/lib64/libcudss.so -> /etc/alternatives/libcudss.so
... /usr/lib64/libcudss.so.0 -> /etc/alternatives/cudss
... /usr/lib64/libcudss_commlayer_nccl.so -> /etc/alternatives/libcudss_commlayer_nccl.so
... /usr/lib64/libcudss_commlayer_openmpi.so -> /etc/alternatives/libcudss_commlayer_openmpi.so
... /usr/lib64/libcudss_static.a -> /etc/alternatives/libcudss_static.a
# ls -l /etc/alternatives/*cudss*
... /etc/alternatives/cudss -> /usr/lib64/libcudss/12/libcudss.so.0
... /etc/alternatives/cudss.h -> /usr/include/libcudss/12/cudss.h
... /etc/alternatives/cudss_cmake -> /usr/lib64/libcudss/12/cmake/cudss
... /etc/alternatives/cudss_distributed_interface.h -> /usr/include/libcudss/12/cudss_distributed_interface.h
... /etc/alternatives/libcudss.so -> /usr/lib64/libcudss/12/libcudss.so
... /etc/alternatives/libcudss_commlayer_nccl.so -> /usr/lib64/libcudss/12/libcudss_commlayer_nccl.so
... /etc/alternatives/libcudss_commlayer_openmpi.so -> /usr/lib64/libcudss/12/libcudss_commlayer_openmpi.so
... /etc/alternatives/libcudss_static.a -> /usr/lib64/libcudss/12/libcudss_static.a
cuDSS v0.2.1#
Important bug fixes:
- Fixed host memory leaks
- Fixed device memory bookkeeping which could cause read violation errors and segmentation faults when cuDSS is called repeatedly with the same cudssHandle_t and cudssData_t objects
- Fixed incorrect results of iterative refinement for 1-based input matrices
- Fixed an incorrect internal temporary buffer size which could cause invalid memory accesses for small matrices
cuDSS v0.2.0#
New Features:
- Performance improvements for non-symmetric and non-Hermitian matrices for the reordering algorithm CUDSS_ALG_1
- Support for user-defined device memory allocators/memory pools
- Support for extracting permutations which account for both reordering and pivoting (via the new values CUDSS_DATA_PERM_ROW and CUDSS_DATA_PERM_COL in the enum cudssDataParam_t) for the reordering algorithms CUDSS_ALG_1 and CUDSS_ALG_2
- Extended support to all SM architectures starting with Pascal (SM 5.0)
- Extended support to Linux ARM (SBSA) (Ubuntu 20.04, Ubuntu 22.04, RHEL 8, RHEL 9, SLES 15)
Breaking changes:
- Replaced the value CUDSS_DATA_PERM_REORDER in the enum cudssDataParam_t with CUDSS_DATA_PERM_REORDER_ROW and CUDSS_DATA_PERM_REORDER_COL to separate the row and column reordering permutations, which can be different for the non-symmetric reordering algorithm CUDSS_ALG_1
Important bug fixes:
- Fixed incorrect solutions for Hermitian matrices with non-disabled pivoting
- Fixed sporadically incorrect solutions on H100 due to shared memory allocation size
- Fixed incorrect propagation of the pivoting tolerance and epsilon from cudssConfig_t during cudssExecute()
- Fixed sporadic hangs on GPUs with a small number of SMs
cuDSS v0.1.0#
New Features:
- Initial release
- Support for single GPU; SM architectures SM 7.0 and newer
- Support for Linux x86-64 (Ubuntu 20.04, Ubuntu 22.04, RHEL 8, RHEL 9, SLES 15)
- Support for Windows x86-64 (Windows 10, 11)
- Support for single/double real/complex data types for values and the int data type for indices
- Synchronous API
Compatibility notes:
cuDSS requires CUDA 12.0 or above