Release Notes

cuDSS v0.3.0

New Features:

  • Multi-GPU multi-node (MGMN) mode with prebuilt standalone communication layers for NCCL and OpenMPI, as well as with custom user-defined GPU-aware communication backends

  • Hybrid host/device memory mode, which keeps the factor values in host memory (RAM) and uses only a smaller device buffer as temporary storage

  • Extended support to Linux ARM (aarch64) (Ubuntu 22.04, only on Orin (SM 8.7) devices)

Breaking changes:

  • Removed values CUDSS_STATUS_ARCH_MISMATCH and CUDSS_STATUS_ZERO_PIVOT from the enum cudssStatus_t as these values will not be used

  • Renamed the main header from cuDSS.h to cudss.h to better align with other CUDA math libraries
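
Because Linux file systems are case-sensitive, sources that still include cuDSS.h stop compiling after the rename. The following is a minimal sketch of one way to update them in bulk. The helper name rename_cudss_include and the list of file extensions are assumptions made for illustration, and the sketch assumes GNU grep and GNU sed (for the -i option).

```shell
#!/bin/sh
# Hypothetical helper, not part of cuDSS.
# rename_cudss_include DIR: rewrite '#include <cuDSS.h>' and '#include "cuDSS.h"'
# to use cudss.h in C/C++/CUDA sources under DIR.
rename_cudss_include() {
    # List only files that contain an include directive for the old header name.
    grep -rlE \
        --include='*.c' --include='*.cpp' --include='*.cu' \
        --include='*.h' --include='*.hpp' --include='*.cuh' \
        '^[[:space:]]*#[[:space:]]*include[[:space:]]*[<"]cuDSS\.h[>"]' "$1" |
    while IFS= read -r file; do
        # Replace the header name inside the include directive only;
        # other mentions of cuDSS.h (for example in comments) are left alone.
        sed -i -E \
            's/^([[:space:]]*#[[:space:]]*include[[:space:]]*[<"])cuDSS\.h([>"])/\1cudss.h\2/' \
            "$file"
        echo "updated $file"
    done
}
```

For example, rename_cudss_include ./src prints one "updated" line per modified file. Running it on a clean checkout makes the change easy to review with the version control diff before committing.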

Important bug fixes:

  • Fixed execution failures when multiple tiny matrices (smaller than 16x16) were solved while reusing the same cudssData_t object

  • Fixed incorrect results when a factorization phase followed a re-factorization phase with the CUDSS_ALG_1 and CUDSS_ALG_2 reordering algorithms

Known issue:

The following error messages appear during cuDSS installation on RPM-based systems (RHEL, Rocky, SLES):

failed to link /usr/lib/#INSTALL_TRIPLET#/libcudss_commlayer_nccl.so -> /etc/alternatives/libcudss_commlayer_nccl.so: No such file or directory
failed to link /usr/lib/#INSTALL_TRIPLET#/libcudss_commlayer_openmpi.so -> /etc/alternatives/libcudss_commlayer_openmpi.so: No such file or directory

The installation still completes, but these symlinks for cuDSS are not created.

Apply the workaround below ONLY if you encounter the errors above. Both the issue and the workaround apply only to RPM-based systems (RHEL, Rocky, SLES). The workaround removes and then recreates all symlinks that the alternatives system manages for cudss:

update-alternatives --remove cudss /usr/lib64/libcudss/12/libcudss.so.0

/sbin/ldconfig

update-alternatives --install /usr/lib64/libcudss.so.0 cudss /usr/lib64/libcudss/12/libcudss.so.0 120 \
--slave /usr/lib64/libcudss.so libcudss.so /usr/lib64/libcudss/12/libcudss.so \
--slave /usr/lib64/libcudss_commlayer_nccl.so  libcudss_commlayer_nccl.so /usr/lib64/libcudss/12/libcudss_commlayer_nccl.so \
--slave /usr/lib64/libcudss_commlayer_openmpi.so  libcudss_commlayer_openmpi.so /usr/lib64/libcudss/12/libcudss_commlayer_openmpi.so \
--slave /usr/lib64/libcudss_static.a libcudss_static.a /usr/lib64/libcudss/12/libcudss_static.a \
--slave /usr/lib64/cmake/cudss cudss_cmake /usr/lib64/libcudss/12/cmake/cudss \
--slave /usr/include/cudss.h cudss.h /usr/include/libcudss/12/cudss.h \
--slave /usr/include/cudss_distributed_interface.h cudss_distributed_interface.h /usr/include/libcudss/12/cudss_distributed_interface.h

/sbin/ldconfig

After completing these steps, confirm that all the symlinks exist. Expected output:

# ls -l /usr/lib64/cmake/cudss
... /usr/lib64/cmake/cudss -> /etc/alternatives/cudss_cmake

# ls -l /usr/include/cudss*
... /usr/include/cudss.h -> /etc/alternatives/cudss.h
... /usr/include/cudss_distributed_interface.h -> /etc/alternatives/cudss_distributed_interface.h

# ls -l /usr/lib64/*cudss*
... /usr/lib64/libcudss.so -> /etc/alternatives/libcudss.so
... /usr/lib64/libcudss.so.0 -> /etc/alternatives/cudss
... /usr/lib64/libcudss_commlayer_nccl.so -> /etc/alternatives/libcudss_commlayer_nccl.so
... /usr/lib64/libcudss_commlayer_openmpi.so -> /etc/alternatives/libcudss_commlayer_openmpi.so
... /usr/lib64/libcudss_static.a -> /etc/alternatives/libcudss_static.a

# ls -l /etc/alternatives/*cudss*
... /etc/alternatives/cudss -> /usr/lib64/libcudss/12/libcudss.so.0
... /etc/alternatives/cudss.h -> /usr/include/libcudss/12/cudss.h
... /etc/alternatives/cudss_cmake -> /usr/lib64/libcudss/12/cmake/cudss
... /etc/alternatives/cudss_distributed_interface.h -> /usr/include/libcudss/12/cudss_distributed_interface.h
... /etc/alternatives/libcudss.so -> /usr/lib64/libcudss/12/libcudss.so
... /etc/alternatives/libcudss_commlayer_nccl.so -> /usr/lib64/libcudss/12/libcudss_commlayer_nccl.so
... /etc/alternatives/libcudss_commlayer_openmpi.so -> /usr/lib64/libcudss/12/libcudss_commlayer_openmpi.so
... /etc/alternatives/libcudss_static.a -> /usr/lib64/libcudss/12/libcudss_static.a
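
The manual checks above can also be scripted. The following is a minimal sketch that compares each symlink's immediate target against the expected listing. The helper names check_link and check_cudss_links are hypothetical (they are not shipped with cuDSS), and the paths are copied from the expected output above, so they assume the same /usr/lib64 layout.

```shell
#!/bin/sh
# Hypothetical helpers, not part of cuDSS.

# check_link LINK TARGET: succeed only if LINK is a symlink whose
# immediate target (as reported by readlink) is exactly TARGET.
check_link() {
    if [ ! -L "$1" ]; then
        echo "MISSING  $1"
        return 1
    fi
    actual=$(readlink "$1")
    if [ "$actual" != "$2" ]; then
        echo "WRONG    $1 -> $actual (expected $2)"
        return 1
    fi
    echo "OK       $1 -> $2"
}

# check_cudss_links: check every link from the expected listing and
# return non-zero if at least one is missing or points elsewhere.
check_cudss_links() {
    failures=0
    while read -r link target; do
        check_link "$link" "$target" || failures=$((failures + 1))
    done <<'EOF'
/usr/lib64/cmake/cudss /etc/alternatives/cudss_cmake
/usr/include/cudss.h /etc/alternatives/cudss.h
/usr/include/cudss_distributed_interface.h /etc/alternatives/cudss_distributed_interface.h
/usr/lib64/libcudss.so /etc/alternatives/libcudss.so
/usr/lib64/libcudss.so.0 /etc/alternatives/cudss
/usr/lib64/libcudss_commlayer_nccl.so /etc/alternatives/libcudss_commlayer_nccl.so
/usr/lib64/libcudss_commlayer_openmpi.so /etc/alternatives/libcudss_commlayer_openmpi.so
/usr/lib64/libcudss_static.a /etc/alternatives/libcudss_static.a
/etc/alternatives/cudss /usr/lib64/libcudss/12/libcudss.so.0
/etc/alternatives/cudss.h /usr/include/libcudss/12/cudss.h
/etc/alternatives/cudss_cmake /usr/lib64/libcudss/12/cmake/cudss
/etc/alternatives/cudss_distributed_interface.h /usr/include/libcudss/12/cudss_distributed_interface.h
/etc/alternatives/libcudss.so /usr/lib64/libcudss/12/libcudss.so
/etc/alternatives/libcudss_commlayer_nccl.so /usr/lib64/libcudss/12/libcudss_commlayer_nccl.so
/etc/alternatives/libcudss_commlayer_openmpi.so /usr/lib64/libcudss/12/libcudss_commlayer_openmpi.so
/etc/alternatives/libcudss_static.a /usr/lib64/libcudss/12/libcudss_static.a
EOF
    [ "$failures" -eq 0 ]
}
```

Calling check_cudss_links after the workaround prints one status line per symlink and returns a non-zero exit status if any link is missing or has a different target.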

cuDSS v0.2.1

Important bug fixes:

  • Fixed host memory leaks

  • Fixed device memory bookkeeping which could cause read violation errors and segmentation faults when cuDSS is called repeatedly with the same cudssHandle_t and cudssData_t objects

  • Fixed incorrect results of iterative refinement for 1-based input matrices

  • Fixed incorrect internal temporary buffer size which could cause invalid memory accesses for small matrices

cuDSS v0.2.0

New Features:

  • Performance improvements for non-symmetric and non-Hermitian matrices for the reordering algorithm CUDSS_ALG_1

  • Support for user-defined device memory allocators/memory pools

  • Support for extracting permutations which account for both reordering and pivoting (via new values CUDSS_DATA_PERM_ROW and CUDSS_DATA_PERM_COL in the enum cudssDataParam_t) for reordering algorithms CUDSS_ALG_1 and CUDSS_ALG_2

  • Extended support to all SM architectures starting with Pascal (SM 5.0)

  • Extended support to Linux ARM (SBSA) (Ubuntu 20.04, Ubuntu 22.04, RHEL 8, RHEL 9, SLES 15)

Breaking changes:

  • Replaced the value CUDSS_DATA_PERM_REORDER in the enum cudssDataParam_t with the new values CUDSS_DATA_PERM_REORDER_ROW and CUDSS_DATA_PERM_REORDER_COL to separate the row and column reordering permutations, which can differ for the non-symmetric reordering algorithm CUDSS_ALG_1
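
Code that still uses the old enum value no longer compiles, so it can help to locate every remaining use before upgrading. The following is a minimal sketch; the helper name find_old_perm_reorder is hypothetical, and the sketch assumes a grep that supports -r and -w (such as GNU grep).

```shell
#!/bin/sh
# Hypothetical helper, not part of cuDSS.
# find_old_perm_reorder DIR: print file:line:text for every use of the
# old value CUDSS_DATA_PERM_REORDER under DIR.
find_old_perm_reorder() {
    # -w matches whole words only. Underscores count as word characters,
    # so CUDSS_DATA_PERM_REORDER_ROW and CUDSS_DATA_PERM_REORDER_COL
    # are not reported.
    grep -rnw 'CUDSS_DATA_PERM_REORDER' "$1"
}
```

Each reported line then needs a manual choice between the row and the column variant, since the two permutations can differ for CUDSS_ALG_1.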

Important bug fixes:

  • Fixed incorrect solutions for Hermitian matrices when pivoting was enabled

  • Fixed sporadically incorrect solutions on H100 caused by the shared memory allocation size

  • Fixed incorrect propagation of pivoting tolerance and epsilon from cudssConfig_t during cudssExecute()

  • Fixed sporadic hangs on GPUs with a small number of SMs

cuDSS v0.1.0

New Features:

  • Initial release

  • Support for single GPU, SM architectures: SM 7.0 and newer

  • Support for Linux x86-64 (Ubuntu 20.04, Ubuntu 22.04, RHEL 8, RHEL 9, SLES 15)

  • Support for Windows x86-64 (Windows 10, 11)

  • Support for single/double real/complex datatypes for values and the int datatype for indices

  • Synchronous API

Compatibility notes:

  • cuDSS requires CUDA 12.0 or above