Release Notes#
cuDSS v0.7.1#
New Features:
- Enabled CUDSS_DATA_USER_ELIMINATION_TREE: it can be used in combination with CUDSS_DATA_USER_PERM to set both the user permutation and the elimination tree with values from a previous run, maintaining the factorization and solve performance while improving the reordering time.
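A minimal sketch of how the two parameters could be used together, assuming the generic cudssDataGet()/cudssDataSet() signatures (parameter, buffer, size in bytes) and host-side int buffers; the n-element layouts of the permutation and of the elimination tree are assumptions, so consult the cuDSS documentation for the authoritative formats:

```
#include <cudss.h>
#include <stdlib.h>

/* Hypothetical helper: after a first analysis run on a matrix with n rows,
 * extract the permutation and elimination tree, then seed a second run with them. */
static void reuse_reordering(cudssHandle_t handle,
                             cudssData_t prevData, cudssData_t newData, int n)
{
    size_t written = 0;
    /* Host buffers; the elimination tree is assumed to be n integers
     * (one parent index per node) -- check the docs for the real layout. */
    int* perm  = (int*)malloc(n * sizeof(int));
    int* etree = (int*)malloc(n * sizeof(int));

    /* Extract the results of the previous analysis. */
    cudssDataGet(handle, prevData, CUDSS_DATA_PERM_REORDER_ROW, perm,
                 n * sizeof(int), &written);
    cudssDataGet(handle, prevData, CUDSS_DATA_ELIMINATION_TREE, etree,
                 n * sizeof(int), &written);

    /* Seed the new analysis with both the user permutation and the tree. */
    cudssDataSet(handle, newData, CUDSS_DATA_USER_PERM, perm, n * sizeof(int));
    cudssDataSet(handle, newData, CUDSS_DATA_USER_ELIMINATION_TREE, etree,
                 n * sizeof(int));

    free(perm);
    free(etree);
}
```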
Important bug fixes:
Fixed performance regression in the reordering phase with the default reordering algorithm for matrices with block structure (also known as BSR) (CUDSS-1035).
Known issues:
- Changing the data type of the matrix while reusing an existing cudssHandle_t will not lead to correct results and might crash (CUDSS-1019).
- Matrices with more than \(2^{31}\) rows or columns are not supported.
- compute-sanitizer might report a spurious uninitialized read on Windows in multi-GPU (MG) mode (CUDSS-1026). Please report it if observed.
- On Linux, depending on the version of the GNU OpenMP runtime, valgrind and sanitizer tools may crash or report a memory leak during the call to cudssDestroy() if multi-threaded mode is used (CUDSS-1079). This happens when cuDSS calls dlclose() for the multi-threading layer library and is caused by the GNU OpenMP runtime being unloaded. As a workaround, set GOMP_SPINCOUNT to 0 in the environment, or add a call to dlopen() for the libgomp.so shared library in the application (before calling cudssDestroy()); see the sketch after this list.
- Reordering algorithm CUDSS_ALG_3 may crash on very small matrices (CUDSS-1030).
- Separate calls to CUDSS_PHASE_REORDERING and CUDSS_PHASE_SYMBOLIC_FACTORIZATION with matrix types other than CUDSS_MTYPE_GENERAL and with matching enabled produce incorrect results (CUDSS-1070). As a workaround, use CUDSS_PHASE_ANALYSIS.
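For the libgomp unloading issue above, a minimal sketch of the dlopen() workaround; the "libgomp.so.1" soname used here is an assumption and may differ on a given system:

```
#include <dlfcn.h>
#include <cudss.h>

/* Keep an extra reference to GNU OpenMP so it is not unloaded when cuDSS
 * dlclose()-es its multi-threading layer inside cudssDestroy(). */
void destroy_with_workaround(cudssHandle_t handle)
{
    void* gomp = dlopen("libgomp.so.1", RTLD_NOW | RTLD_GLOBAL);
    cudssDestroy(handle);
    /* Intentionally no dlclose(gomp): the extra reference keeps libgomp loaded. */
    (void)gomp;
}
```

Setting GOMP_SPINCOUNT=0 in the environment avoids the issue without code changes.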
cuDSS v0.7.0#
New Features:
- Enabled multi-GPU single-node (MG) mode via cudssCreateMg(), CUDSS_CONFIG_DEVICE_COUNT, and CUDSS_CONFIG_DEVICE_INDICES
- Re-introduced factorization algorithm CUDSS_ALG_1 with improved performance for matrices with very sparse factors
- Added support for int64_t indexing for sparse matrices
- Enabled turning the superpanel optimization on and off (CUDSS_CONFIG_USE_SUPERPANELS)
- Enabled extracting the auxiliary elimination tree structure (CUDSS_DATA_ELIMINATION_TREE)
- Added support for solve sub-phases (CUDSS_PHASE_SOLVE_FWD, CUDSS_PHASE_SOLVE_FWD_PERM, CUDSS_PHASE_SOLVE_BWD, CUDSS_PHASE_SOLVE_BWD_PERM)
- Added support for Schur complement mode (CUDSS_CONFIG_SCHUR_MODE, CUDSS_DATA_USER_SCHUR_INDICES, CUDSS_DATA_SCHUR_SHAPE, CUDSS_DATA_SCHUR_MATRIX)
- Added support for deterministic mode (CUDSS_CONFIG_DETERMINISTIC_MODE)
- Enabled user interruption on the host (CUDSS_DATA_USER_HOST_INTERRUPT)
- Improved performance for the symbolic factorization phase with the CUDSS_ALG_DEFAULT and CUDSS_ALG_3 reordering algorithms
- Enabled support for CUDA 13 and new Blackwell GPU architectures (incl. GB300 and DGX Spark)
Breaking changes:
- All user-provided data (provided with the prefix CUDSS_DATA_USER_) is now copied to an internal buffer by a cudssDataSet operation and copied back by a cudssDataGet operation. This breaks calls to cudssDataGet with CUDSS_DATA_USER_PERM, as it now copies the data to the provided pointer instead of copying the pointer.
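A minimal sketch of the new copy semantics, assuming the generic cudssDataGet() signature (destination buffer plus size in bytes); the n-element int layout of the permutation is an assumption:

```
#include <cudss.h>
#include <stdlib.h>

/* Since 0.7.0, cudssDataGet with CUDSS_DATA_USER_PERM copies the permutation
 * values into a caller-provided buffer instead of returning the stored pointer. */
static int* get_user_perm_copy(cudssHandle_t handle, cudssData_t data, int n)
{
    int* perm = (int*)malloc(n * sizeof(int));     /* caller-owned destination */
    size_t written = 0;
    cudssDataGet(handle, data, CUDSS_DATA_USER_PERM,
                 perm, n * sizeof(int), &written); /* values are copied here */
    return perm;                                   /* not the library's pointer */
}
```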
Important bug fixes:
- Fixed incorrect results for the default matching algorithm for hybrid memory mode and hybrid execute mode when the scaling vectors are non-constant (CUDSS-882).
- Fixed a crash in MGMN mode when some of the ranks have empty local parts of the matrix (CUDSS-913).
- Fixed out-of-bounds access during reordering for very large matrices (CUDSS-948).
- Fixed incorrect results for symmetric matrices with the superpanel and pivoting options enabled (CUDSS-1003).
Known issues:
- Setting CUDSS_DATA_USER_ELIMINATION_TREE is currently not supported and leads to undefined behavior (CUDSS-1020).
- Changing the data type of the matrix while reusing an existing cudssHandle_t will not lead to correct results and might crash (CUDSS-1019).
- Matrices with more than \(2^{31}\) rows or columns are not supported.
- compute-sanitizer might report a spurious uninitialized read on Windows in multi-GPU (MG) mode (CUDSS-1026).
- Performance regression in the reordering phase with the default reordering algorithm for matrices with block structure (also known as BSR) (CUDSS-1035).
cuDSS v0.6.0#
New Features:
- Enabled support for distributed matrices with a row-wise, potentially overlapping distribution via cudssMatrixSetDistributionRow1d()
- Enabled support for matching and scaling with CUDSS_CONFIG_USE_MATCHING and CUDSS_CONFIG_MATCHING_ALG
- Enabled support for solving uniform batches (same sparsity pattern) of linear systems via CUDSS_CONFIG_UBATCH_SIZE and CUDSS_CONFIG_UBATCH_INDEX
- Extended support for the hybrid memory mode from the default (single GPU) mode to the MGMN mode
- Enabled support for combining the phases for cudssExecute() in a single call, and added separate phases CUDSS_PHASE_REORDERING and CUDSS_PHASE_SYMBOLIC_FACTORIZATION for reordering and symbolic factorization; see the sketch after this list
- Enabled a new configuration parameter CUDSS_CONFIG_ND_NLEVELS which controls the nested dissection depth for reordering with CUDSS_ALG_DEFAULT
- Enabled support for symmetric matrices in the CUDSS_ALG_1 and CUDSS_ALG_2 reordering algorithms
- Re-enabled passing a different matrix on the solve phase (was disabled in 0.5.0), with the restrictions specified in cudssExecute()
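A minimal sketch of the split and combined phase calls, assuming phase values can be combined with a bitwise OR now that the phase argument of cudssExecute() is an int (see the breaking changes below); the handle, config, data, and matrix objects are assumed to be created elsewhere with the usual cuDSS setup calls:

```
#include <cudss.h>

void analyze_factor_solve(cudssHandle_t handle, cudssConfig_t config, cudssData_t data,
                          cudssMatrix_t A, cudssMatrix_t x, cudssMatrix_t b)
{
    /* Analysis split into its two sub-phases (new in 0.6.0)... */
    cudssExecute(handle, CUDSS_PHASE_REORDERING, config, data, A, x, b);
    cudssExecute(handle, CUDSS_PHASE_SYMBOLIC_FACTORIZATION, config, data, A, x, b);

    /* ...and the remaining phases combined into a single call. */
    cudssExecute(handle, CUDSS_PHASE_FACTORIZATION | CUDSS_PHASE_SOLVE,
                 config, data, A, x, b);
}
```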
Breaking changes:
- Disabled support for batched matrices in the MGMN mode
- Changed the enum values for cudssPhase_t and changed the type of the phase parameter of cudssExecute from cudssPhase_t to int
Important bug fixes:
- Fixed a sporadic accuracy failure for non-symmetric matrices with CUDSS_ALG_1 or CUDSS_ALG_2 for reordering (CUDSS-732)
- Fixed incorrect (perturbed) pivot scaling (CUDSS_ALG_1 for CUDSS_CONFIG_PIVOT_EPSILON_ALG) for symmetric matrices (CUDSS-706)
- Fixed incorrect reordering results with CUDSS_ALG_DEFAULT when the nested dissection depth is large compared to the number of rows (CUDSS-873)
Known issues:
- The default matching algorithm (same as CUDSS_ALG_5) produces incorrect results for hybrid execute and hybrid memory modes when the scaling vectors are non-constant (CUDSS-882). As a workaround, use CUDSS_ALG_4 or any matching algorithm value other than the default and the last option; see the sketch after this list.
- If the input matrix is distributed and some processes have empty local parts of the matrix, an out-of-bounds device access can happen (CUDSS-913). As a workaround, make sure all processes have non-empty local parts.
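A minimal sketch of the CUDSS_ALG_4 workaround via cudssConfigSet(); the generic (parameter, pointer, size) signature and the int type of the matching flag are assumptions:

```
#include <cudss.h>

/* Workaround for CUDSS-882: select a non-default matching algorithm
 * before the analysis phase when matching is enabled. */
void use_alg4_matching(cudssConfig_t config)
{
    int use_matching = 1;
    cudssAlgType_t matching_alg = CUDSS_ALG_4;

    cudssConfigSet(config, CUDSS_CONFIG_USE_MATCHING,
                   &use_matching, sizeof(use_matching));
    cudssConfigSet(config, CUDSS_CONFIG_MATCHING_ALG,
                   &matching_alg, sizeof(matching_alg));
}
```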
cuDSS v0.5.0#
New Features:
- Multi-Threaded (MT) mode with a prebuilt standalone threading layer for GNU OpenMP on Linux and VCOMP (VCOMP140.dll) on Windows, as well as with custom user-defined threading backends; see the sketch after this list
- Multi-threaded reordering for the CUDSS_ALG_DEFAULT reordering algorithm
- Hybrid host/device execute mode to enable using the host for the factorization and solve phases (improved performance for small matrices)
- Improved performance and memory requirements for the hybrid memory mode
- Blackwell GPU architecture support (sm_100, sm_120, sm_101)
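A minimal sketch of enabling MT mode by pointing cuDSS at a threading layer library; the cudssSetThreadingLayer() setter and the file name of the prebuilt GNU OpenMP layer used below are assumptions, so check the cuDSS documentation for the exact call and name:

```
#include <cudss.h>

/* Select a threading layer library before any multi-threaded work is submitted.
 * The file name below is an assumed example of the prebuilt GNU OpenMP layer. */
void enable_mt_mode(cudssHandle_t handle)
{
    cudssSetThreadingLayer(handle, "libcudss_mtlayer_gomp.so.0");
}
```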
Breaking changes:
- Changed the function signatures of cudssMatrixCreateBatchDn() and cudssMatrixGetBatchDn() to have an extra argument for the integer type of the arrays with scalar parameters of the batch (like nrows, ncols, or ld)
- Changed the definition of cudssMatrixFormat_t and cudssMatrixGetFormat() so that the matrix format enum values become bit flags (which can be combined), and added a flag CUDSS_MFORMAT_BATCH for querying whether a cudssMatrix_t is a batch
- Dropped support for SLES 15.x for x <= 5 (upgraded the minimal supported version of SLES to 15.6) and upgraded the minimal supported version of GLIBC to 2.28
Important bug fixes:
- Fixed a bug when the internally symmetrized matrix pattern has a number of nonzeros overflowing the 32-bit integer maximum value (CUDSS-417)
- Fixed a bug where the MGMN mode produced incorrect results for some larger matrices (CUDSS-601)
- Fixed incorrect memory estimates returned for the hybrid memory mode (CUDSS-669)
cuDSS v0.4.0#
New Features:
- Significant performance improvements for all reordering algorithms other than CUDSS_ALG_1 and CUDSS_ALG_2
- Added support for non-uniform batching (solving multiple systems with different matrices and right-hand sides) via new APIs like cudssMatrixCreateBatch<Dn|Csr>() and others similar to the existing APIs for non-batch matrix objects, and extended cudssExecute() to support batches
- Added support for querying memory estimates via cudssDataGet() with CUDSS_DATA_MEMORY_ESTIMATES; see the sketch after this list
- Added installer support for Ubuntu 24.04
- Added support for pip wheels on pypi.org and pypi.nvidia.com and for conda packaging
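A minimal sketch of querying the memory estimates after the analysis phase, assuming they are returned as an array of int64_t values; the array length of 16 and the meaning of the individual entries used here are assumptions, so see the cuDSS documentation for the actual layout:

```
#include <cudss.h>
#include <stdint.h>
#include <stdio.h>

void print_memory_estimates(cudssHandle_t handle, cudssData_t data)
{
    int64_t estimates[16] = {0};   /* assumed capacity; see the docs */
    size_t written = 0;
    cudssDataGet(handle, data, CUDSS_DATA_MEMORY_ESTIMATES,
                 estimates, sizeof(estimates), &written);
    /* Treating entry [0] as a permanent and [1] as a peak device memory
     * estimate is an assumption for illustration purposes only. */
    printf("estimate[0] = %lld bytes, estimate[1] = %lld bytes\n",
           (long long)estimates[0], (long long)estimates[1]);
}
```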
Breaking changes:
- Added a new value CUDSS_DATA_MEMORY_ESTIMATES to the cudssDataParam_t enum
- Starting with this release, cuDSS has a runtime dependency on cuBLAS (usually available as a part of the CUDA Toolkit)
Important bug fixes:
- Added a CMake config version file and fixed the CMake config for system-wide installation
Known issues:
- For a correct system-wide installation, cudss-config.cmake makes use of REAL_PATH, which is only available since CMake 3.19 (newer than, e.g., the default CMake version on Ubuntu 20.04)
- The MGMN mode of cuDSS with the OpenMPI communication layer might run out of GPU memory due to a bug in OpenMPI/UCX. If you encounter this issue, consider setting export UCX_MEMTYPE_CACHE=n or export UCX_TLS=^cuda_ipc, or switching to the NCCL communication backend as a workaround. These workarounds might lead to performance degradation; in such a case, please report it.
- PIP wheels for Linux + x86_64 with version 0.4.0.2 might not work on some older OSes due to a bug in patchelf 0.18.0. On affected systems, ldd libcudss.so.0 will return the error "not a dynamic executable", and using the shared library for linking with an application will produce the error "ELF load command address/offset not properly aligned". The suggested workaround is to use the installation command pip install nvidia-cudss-cu12, which will install the patched wheels nvidia-cudss-cu12 0.4.0.2.post1.
cuDSS v0.3.0#
New Features:
- Multi-GPU multi-node (MGMN) mode with prebuilt standalone communication layers for NCCL and OpenMPI, as well as with custom user-defined GPU-aware communication backends; see the sketch after this list
- Hybrid host/device memory mode which enables keeping the factor values in host memory (RAM) and uses only a smaller device buffer as temporary storage
- Extended support to Linux ARM (aarch64) (Ubuntu 22.04, only on Orin (SM 8.7) devices)
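A minimal sketch of selecting the prebuilt NCCL communication layer for the MGMN mode; the cudssSetCommLayer() setter, the CUDSS_DATA_COMM data parameter, and the exact way the communicator is passed are assumptions here (only the library file name matches the one mentioned in the known issue below):

```
#include <cudss.h>
#include <nccl.h>

/* Illustration only: pick the NCCL communication layer and hand cuDSS an
 * already-initialized NCCL communicator for the MGMN mode. */
void setup_mgmn(cudssHandle_t handle, cudssData_t data, ncclComm_t* comm)
{
    /* Assumed setter for the prebuilt communication layer library. */
    cudssSetCommLayer(handle, "libcudss_commlayer_nccl.so");

    /* Assumed data parameter for passing the communicator to cuDSS. */
    cudssDataSet(handle, data, CUDSS_DATA_COMM, comm, sizeof(*comm));
}
```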
Breaking changes:
- Removed the values CUDSS_STATUS_ARCH_MISMATCH and CUDSS_STATUS_ZERO_PIVOT from the enum cudssStatus_t, as these values will not be used
- Renamed the main header cuDSS.h to cudss.h to better align with other CUDA math libraries
Important bug fixes:
- Fixed execution failures when multiple tiny matrices (less than 16x16) were solved reusing the same cudssData_t
- Fixed incorrect results when the factorization phase followed a re-factorization phase with the CUDSS_ALG_1 and CUDSS_ALG_2 reordering algorithms
Known issue:
Error messages are seen during cuDSS installation on RPM-based systems (RHEL, Rocky, SLES):
failed to link /usr/lib/#INSTALL_TRIPLET#/libcudss_commlayer_nccl.so -> /etc/alternatives/libcudss_commlayer_nccl.so: No such file or directory
failed to link /usr/lib/#INSTALL_TRIPLET#/libcudss_commlayer_openmpi.so -> /etc/alternatives/libcudss_commlayer_openmpi.so: No such file or directory
The installation completes despite the failure to create a couple of symlinks for cuDSS.
To fix the issue, apply the workaround below ONLY after encountering the above issue. The issue, and thus the workaround, is only applicable to RPM-based systems (RHEL, Rocky, SLES). The workaround drops and recreates all symlinks intended for the cudss alternatives system:
update-alternatives --remove cudss /usr/lib64/libcudss/12/libcudss.so.0
/sbin/ldconfig
update-alternatives --install /usr/lib64/libcudss.so.0 cudss /usr/lib64/libcudss/12/libcudss.so.0 120 \
--slave /usr/lib64/libcudss.so libcudss.so /usr/lib64/libcudss/12/libcudss.so \
--slave /usr/lib64/libcudss_commlayer_nccl.so libcudss_commlayer_nccl.so /usr/lib64/libcudss/12/libcudss_commlayer_nccl.so \
--slave /usr/lib64/libcudss_commlayer_openmpi.so libcudss_commlayer_openmpi.so /usr/lib64/libcudss/12/libcudss_commlayer_openmpi.so \
--slave /usr/lib64/libcudss_static.a libcudss_static.a /usr/lib64/libcudss/12/libcudss_static.a \
--slave /usr/lib64/cmake/cudss cudss_cmake /usr/lib64/libcudss/12/cmake/cudss \
--slave /usr/include/cudss.h cudss.h /usr/include/libcudss/12/cudss.h \
--slave /usr/include/cudss_distributed_interface.h cudss_distributed_interface.h /usr/include/libcudss/12/cudss_distributed_interface.h
/sbin/ldconfig
After the steps are completed, confirm that all the symlinks exist. Expectation:
# ls -l /usr/lib64/cmake/cudss
... /usr/lib64/cmake/cudss -> /etc/alternatives/cudss_cmake
# ls -l /usr/include/cudss*
... /usr/include/cudss.h -> /etc/alternatives/cudss.h
... /usr/include/cudss_distributed_interface.h -> /etc/alternatives/cudss_distributed_interface.h
# ls -l /usr/lib64/*cudss*
... /usr/lib64/libcudss.so -> /etc/alternatives/libcudss.so
... /usr/lib64/libcudss.so.0 -> /etc/alternatives/cudss
... /usr/lib64/libcudss_commlayer_nccl.so -> /etc/alternatives/libcudss_commlayer_nccl.so
... /usr/lib64/libcudss_commlayer_openmpi.so -> /etc/alternatives/libcudss_commlayer_openmpi.so
... /usr/lib64/libcudss_static.a -> /etc/alternatives/libcudss_static.a
# ls -l /etc/alternatives/*cudss*
... /etc/alternatives/cudss -> /usr/lib64/libcudss/12/libcudss.so.0
... /etc/alternatives/cudss.h -> /usr/include/libcudss/12/cudss.h
... /etc/alternatives/cudss_cmake -> /usr/lib64/libcudss/12/cmake/cudss
... /etc/alternatives/cudss_distributed_interface.h -> /usr/include/libcudss/12/cudss_distributed_interface.h
... /etc/alternatives/libcudss.so -> /usr/lib64/libcudss/12/libcudss.so
... /etc/alternatives/libcudss_commlayer_nccl.so -> /usr/lib64/libcudss/12/libcudss_commlayer_nccl.so
... /etc/alternatives/libcudss_commlayer_openmpi.so -> /usr/lib64/libcudss/12/libcudss_commlayer_openmpi.so
... /etc/alternatives/libcudss_static.a -> /usr/lib64/libcudss/12/libcudss_static.a
cuDSS v0.2.1#
Important bug fixes:
- Fixed host memory leaks
- Fixed device memory bookkeeping which could cause read violation errors and segmentation faults when cuDSS is called repeatedly with the same cudssHandle_t and cudssData_t objects
- Fixed incorrect results of iterative refinement for 1-based input matrices
- Fixed an incorrect internal temporary buffer size which could cause invalid memory accesses for small matrices
cuDSS v0.2.0#
New Features:
- Performance improvements for non-symmetric and non-Hermitian matrices for the reordering algorithm CUDSS_ALG_1
- Support for user-defined device memory allocators/memory pools
- Support for extracting permutations which account for both reordering and pivoting (via the new values CUDSS_DATA_PERM_ROW and CUDSS_DATA_PERM_COL in the enum cudssDataParam_t) for the reordering algorithms CUDSS_ALG_1 and CUDSS_ALG_2
- Extended support to all SM architectures starting with Pascal (SM 5.0)
- Extended support to Linux ARM (SBSA) (Ubuntu 20.04, Ubuntu 22.04, RHEL 8, RHEL 9, SLES 15)
Breaking changes:
- Replaced the value CUDSS_DATA_PERM_REORDER in the enum cudssDataParam_t with CUDSS_DATA_PERM_REORDER_ROW and CUDSS_DATA_PERM_REORDER_COL to separate the row and column reordering permutations, which can be different for the non-symmetric reordering algorithm CUDSS_ALG_1
Important bug fixes:
- Fixed incorrect solutions for Hermitian matrices with non-disabled pivoting
- Fixed sporadically incorrect solutions on H100 due to shared memory allocation size
- Fixed incorrect propagation of the pivoting tolerance and epsilon from cudssConfig_t during cudssExecute()
- Fixed sporadic hangs on GPUs with a small number of SMs
cuDSS v0.1.0#
New Features:
- Initial release
- Support for single GPU; SM architectures SM 7.0 and newer
- Support for Linux x86-64 (Ubuntu 20.04, Ubuntu 22.04, RHEL 8, RHEL 9, SLES 15)
- Support for Windows x86-64 (Windows 10, 11)
- Support for single/double real/complex data types for values and the int data type for indices
- Synchronous API
Compatibility notes:
cuDSS requires CUDA 12.0 or above