Release Notes
cuDSS v0.5.0
New Features:
Multi-Threaded (MT) mode with a prebuilt standalone threading layer for GNU OpenMP on Linux and VCOMP (VCOMP140.dll) on Windows, as well as with custom user-defined threading backends
Multi-threaded reordering for the CUDSS_ALG_DEFAULT reordering algorithm
Hybrid host/device execute mode, which enables using the host for factorization and solve (improved performance for small matrices)
Improved performance and memory requirements for the hybrid memory mode
New configuration option CUDSS_CONFIG_PIVOT_EPSILON_ALG to enable scaling of the pivoting epsilon (static local pivoting only)
Blackwell GPU architecture support (sm_100, sm_120, sm_101)
Breaking changes:
Changed the function signatures of cudssMatrixCreateBatchDn() and cudssMatrixGetBatchDn() to take an extra argument for the integer type of the arrays holding the scalar parameters of the batch (such as nrows, ncols or ld)
Changed the definitions of cudssMatrixFormat_t and cudssMatrixGetFormat() so that the matrix format enum values are bit flags (which can be combined), and added a flag CUDSS_MFORMAT_BATCH for querying whether a cudssMatrix_t is a batch
Dropped support for SLES 15.x for x <= 5 (raised the minimum supported SLES version to 15.6) and raised the minimum supported GLIBC version to 2.28
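Since this release raises the minimum supported GLIBC version to 2.28, it can be useful to check a target system before upgrading. The following is a minimal sketch, not part of cuDSS: the helper names version_ge and glibc_version are made up for illustration, and it assumes a glibc-based Linux system where getconf GNU_LIBC_VERSION prints a line of the form "glibc X.Y" and where GNU sort with the -V option is available.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A >= version B.
# Relies on GNU sort -V (version sort); hypothetical helper, not part of cuDSS.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# glibc_version: prints the glibc version (e.g. "2.31"), or nothing if unknown.
# Assumes getconf reports "glibc X.Y" (true on glibc-based systems).
glibc_version() {
    getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}'
}

required="2.28"   # minimum GLIBC version stated in these release notes
found="$(glibc_version)"

if [ -n "$found" ] && version_ge "$found" "$required"; then
    echo "glibc $found satisfies the $required minimum"
else
    echo "glibc ${found:-unknown} is older than $required (or could not be detected)"
fi
```

The comparison sorts the two versions and checks that the required one comes first, which avoids the pitfalls of comparing "2.9" and "2.28" as plain strings or decimals.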
Important bug fixes:
Fixed a bug where the number of nonzeros in the internally symmetrized matrix pattern overflowed the 32-bit integer maximum value (CUDSS-417)
Fixed a bug where the MGMN mode produced incorrect results for some larger matrices (CUDSS-601)
Fixed incorrect memory estimates returned for the hybrid memory mode (CUDSS-669)
cuDSS v0.4.0
New Features:
Significant performance improvements for all reordering algorithms other than CUDSS_ALG_1 and CUDSS_ALG_2
Added support for non-uniform batching (solving multiple systems with different matrices and right-hand sides) via new APIs such as cudssMatrixCreateBatch<Dn|Csr>(), along with others similar to the existing APIs for non-batch matrix objects, and by extending cudssExecute() to support batches
Added support for querying memory estimates via cudssDataGet() with CUDSS_DATA_MEMORY_ESTIMATES
Added installer support for Ubuntu 24.04
Added support for pip wheels on pypi.org and pypi.nvidia.com and for conda packaging
Breaking changes:
Added a new value CUDSS_DATA_MEMORY_ESTIMATES to the cudssDataParam_t enum
Starting with this release, cuDSS has a runtime dependency on cuBLAS (usually available as part of the CUDA Toolkit)
Important bug fixes:
Added CMake config version file and fixed CMake config for system-wide installation
Known issues:
For correct system-wide installation, cudss-config.cmake makes use of REAL_PATH, which is only available since CMake 3.19 (newer than, for example, the default CMake version for Ubuntu 20.04)
The MGMN mode of cuDSS with the OpenMPI communication layer might run out of GPU memory due to a bug in OpenMPI/UCX. If you encounter this issue, consider setting export UCX_MEMTYPE_CACHE=n or export UCX_TLS=^cuda_ipc, or switching to the NCCL communication backend as a workaround. These workarounds might lead to performance degradation. In such a case, please report it.
PIP wheels for Linux + x86_64 with version 0.4.0.2 might not work on some older OSes due to a bug in patchelf 0.18.0. On affected systems, ldd libcudss.so.0 will return the error "not a dynamic executable", and using the shared library for linking with an application will produce the error "ELF load command address/offset not properly aligned". The suggested workaround is to use the installation command pip install nvidia-cudss-cu12, which will install the patched wheels nvidia-cudss-cu12 0.4.0.2.post1.
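To tell whether an installed wheel is affected by the patchelf issue described above, the ldd symptom can be checked from a script. This is a rough sketch under stated assumptions: lib_is_loadable is a hypothetical helper (not shipped with cuDSS), the location of libcudss.so.0 inside the Python environment is not specified here and must be supplied by the user, and the check only looks for the "not a dynamic executable" message or a failing ldd exit status.

```shell
#!/usr/bin/env bash
# lib_is_loadable PATH: succeeds when ldd accepts PATH as a dynamic object.
# Fails when ldd exits with an error or reports "not a dynamic executable",
# which is the symptom described for the affected 0.4.0.2 wheels.
# Hypothetical helper for illustration only.
lib_is_loadable() {
    ldd_out="$(ldd "$1" 2>&1)" || return 1
    case "$ldd_out" in
        *"not a dynamic executable"*) return 1 ;;
    esac
    return 0
}

# Usage (the path is a placeholder; substitute the real library location):
#   if ! lib_is_loadable /path/to/libcudss.so.0; then
#       echo "libcudss.so.0 looks affected; consider reinstalling the wheel"
#   fi
```

A passing check does not prove that linking will succeed; it only rules out the specific ldd failure mentioned in the known issue.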
cuDSS v0.3.0
New Features:
Multi-GPU multi-node (MGMN) mode with prebuilt standalone communication layers for NCCL and OpenMPI, as well as with custom user-defined GPU-aware communication backends
Hybrid host/device memory mode, which keeps the factor values in host memory (RAM) and uses only a smaller device buffer as temporary storage
Extended support to Linux ARM(aarch64) (Ubuntu 22.04, only on Orin (SM 8.7) devices)
Breaking changes:
Removed values CUDSS_STATUS_ARCH_MISMATCH and CUDSS_STATUS_ZERO_PIVOT from the enum cudssStatus_t, as these values will not be used
Renamed the main header cuDSS.h to cudss.h to better align with other CUDA math libraries
Important bug fixes:
Fixed execution failures when multiple tiny matrices (smaller than 16x16) were solved re-using the same cudssData_t
Fixed incorrect results when a factorization phase followed a re-factorization phase with the CUDSS_ALG_1 and CUDSS_ALG_2 reordering algorithms
Known issue:
The following error messages may appear during cuDSS installation on RPM-based systems (RHEL, Rocky, SLES):
failed to link /usr/lib/#INSTALL_TRIPLET#/libcudss_commlayer_nccl.so -> /etc/alternatives/libcudss_commlayer_nccl.so: No such file or directory
failed to link /usr/lib/#INSTALL_TRIPLET#/libcudss_commlayer_openmpi.so -> /etc/alternatives/libcudss_commlayer_openmpi.so: No such file or directory
The installation completes despite failing to create these symlinks for cuDSS.
Apply the following workaround only after encountering the errors above. Both the issue and the workaround apply only to RPM-based systems (RHEL, Rocky, SLES). The workaround removes and recreates all symlinks managed by the cudss alternatives group.
update-alternatives --remove cudss /usr/lib64/libcudss/12/libcudss.so.0
/sbin/ldconfig
update-alternatives --install /usr/lib64/libcudss.so.0 cudss /usr/lib64/libcudss/12/libcudss.so.0 120 \
--slave /usr/lib64/libcudss.so libcudss.so /usr/lib64/libcudss/12/libcudss.so \
--slave /usr/lib64/libcudss_commlayer_nccl.so libcudss_commlayer_nccl.so /usr/lib64/libcudss/12/libcudss_commlayer_nccl.so \
--slave /usr/lib64/libcudss_commlayer_openmpi.so libcudss_commlayer_openmpi.so /usr/lib64/libcudss/12/libcudss_commlayer_openmpi.so \
--slave /usr/lib64/libcudss_static.a libcudss_static.a /usr/lib64/libcudss/12/libcudss_static.a \
--slave /usr/lib64/cmake/cudss cudss_cmake /usr/lib64/libcudss/12/cmake/cudss \
--slave /usr/include/cudss.h cudss.h /usr/include/libcudss/12/cudss.h \
--slave /usr/include/cudss_distributed_interface.h cudss_distributed_interface.h /usr/include/libcudss/12/cudss_distributed_interface.h
/sbin/ldconfig
After completing these steps, confirm that all the symlinks exist. Expected output:
# ls -l /usr/lib64/cmake/cudss
... /usr/lib64/cmake/cudss -> /etc/alternatives/cudss_cmake
# ls -l /usr/include/cudss*
... /usr/include/cudss.h -> /etc/alternatives/cudss.h
... /usr/include/cudss_distributed_interface.h -> /etc/alternatives/cudss_distributed_interface.h
# ls -l /usr/lib64/*cudss*
... /usr/lib64/libcudss.so -> /etc/alternatives/libcudss.so
... /usr/lib64/libcudss.so.0 -> /etc/alternatives/cudss
... /usr/lib64/libcudss_commlayer_nccl.so -> /etc/alternatives/libcudss_commlayer_nccl.so
... /usr/lib64/libcudss_commlayer_openmpi.so -> /etc/alternatives/libcudss_commlayer_openmpi.so
... /usr/lib64/libcudss_static.a -> /etc/alternatives/libcudss_static.a
# ls -l /etc/alternatives/*cudss*
... /etc/alternatives/cudss -> /usr/lib64/libcudss/12/libcudss.so.0
... /etc/alternatives/cudss.h -> /usr/include/libcudss/12/cudss.h
... /etc/alternatives/cudss_cmake -> /usr/lib64/libcudss/12/cmake/cudss
... /etc/alternatives/cudss_distributed_interface.h -> /usr/include/libcudss/12/cudss_distributed_interface.h
... /etc/alternatives/libcudss.so -> /usr/lib64/libcudss/12/libcudss.so
... /etc/alternatives/libcudss_commlayer_nccl.so -> /usr/lib64/libcudss/12/libcudss_commlayer_nccl.so
... /etc/alternatives/libcudss_commlayer_openmpi.so -> /usr/lib64/libcudss/12/libcudss_commlayer_openmpi.so
... /etc/alternatives/libcudss_static.a -> /usr/lib64/libcudss/12/libcudss_static.a
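The first three listings above can also be checked with a small script instead of by eye. The sketch below is only an illustration: check_cudss_links is a hypothetical helper name, the optional root argument exists purely so the function can be tried against a scratch directory, and the list of paths is copied from the expected output above. It verifies that each path is a symlink but does not validate the link targets.

```shell
#!/usr/bin/env bash
# check_cudss_links [ROOT]: reports which of the expected cuDSS symlinks exist.
# ROOT is an optional path prefix (hypothetical, for dry runs); default is "/".
# Returns success only when every expected symlink is present.
check_cudss_links() {
    root="${1:-}"
    missing=0
    for link in \
        /usr/lib64/cmake/cudss \
        /usr/include/cudss.h \
        /usr/include/cudss_distributed_interface.h \
        /usr/lib64/libcudss.so \
        /usr/lib64/libcudss.so.0 \
        /usr/lib64/libcudss_commlayer_nccl.so \
        /usr/lib64/libcudss_commlayer_openmpi.so \
        /usr/lib64/libcudss_static.a
    do
        if [ -L "$root$link" ]; then
            echo "ok       $link -> $(readlink "$root$link")"
        else
            echo "MISSING  $link"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Usage on an installed system:
#   check_cudss_links || echo "some cuDSS symlinks are missing"
```

If any entry is reported as MISSING, re-running the update-alternatives commands above is the first thing to try.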
cuDSS v0.2.1
Important bug fixes:
Fixed host memory leaks
Fixed device memory bookkeeping which could cause read violation errors and segmentation faults when cuDSS is called repeatedly with the same cudssHandle_t and cudssData_t objects
Fixed incorrect results of iterative refinement for 1-based input matrices
Fixed incorrect internal temporary buffer size which could cause invalid memory accesses for small matrices
cuDSS v0.2.0
New Features:
Performance improvements for non-symmetric and non-Hermitian matrices for the reordering algorithm CUDSS_ALG_1
Support for user-defined device memory allocators/memory pools
Support for extracting permutations which account for both reordering and pivoting (via new values CUDSS_DATA_PERM_ROW and CUDSS_DATA_PERM_COL in the enum cudssDataParam_t) for reordering algorithms CUDSS_ALG_1 and CUDSS_ALG_2
Extended support to all SM architectures starting with Pascal (SM 5.0)
Extended support to Linux ARM(SBSA) (Ubuntu 20.04, Ubuntu 22.04, RHEL 8, RHEL 9, SLES 15)
Breaking changes:
Replaced value CUDSS_DATA_PERM_REORDER in the enum cudssDataParam_t with CUDSS_DATA_PERM_REORDER_ROW and CUDSS_DATA_PERM_REORDER_COL to separate the row and column reordering permutations, which can differ for the non-symmetric reordering algorithm CUDSS_ALG_1
Important bug fixes:
Fixed incorrect solution for Hermitian matrices with non-disabled pivoting
Fixed sporadically incorrect solution on H100 due to shared memory allocation size
Fixed incorrect propagation of pivoting tolerance and epsilon from cudssConfig_t during cudssExecute()
Fixed sporadic hangs on GPUs with a small number of SMs
cuDSS v0.1.0
New Features:
Initial release
Support for single GPU, SM architectures: SM 7.0 and newer
Support for Linux x86-64 (Ubuntu 20.04, Ubuntu 22.04, RHEL 8, RHEL 9, SLES 15)
Support for Windows x86-64 (Windows 10, 11)
Support for single/double real/complex datatype for values and int datatype for indices
Synchronous API
Compatibility notes:
cuDSS requires CUDA 12.0 or above