Release Notes
cuDSS v0.8.0
New Features:
Introduced double-double type: new higher precision data type CUDSS_R_64F_64F with supporting value struct cudss_fp64mp2_t
Expanded hybrid execute mode (CUDSS_CONFIG_HYBRID_EXECUTE_MODE) to now cover the analysis phase (in addition to factorization and solve), which requires much less device memory. Also, hybrid execute mode now supports multi-GPU (MG) and multi-GPU multi-node (MGMN) modes
Added support for host memory pointers in CSR input matrix arrays (csr_offsets, csr_columns, csr_values) in all execution modes, not just hybrid mode.
Introduced offsetType to enable int64_t row offsets with int32_t column indices. See Adding offsetType parameter in CSR matrix functions in the migration guide
Extended communication layer definition to include APIs for a separate backend for communication on host memory buffers and introduced CUDSS_DATA_COMM_DEVICE and CUDSS_DATA_COMM_HOST to separate host and device communicators
Enabled CUDSS_CONFIG_IR_TOL and introduced CUDSS_DATA_IR_N_STEPS to report the number of iterative refinement steps actually performed
Added support for selectively processing matrices within a uniform batch via CUDSS_DATA_UBATCH_MASK
Introduced cudssSetMgStreams() to assign per-device CUDA streams in multi-GPU mode
Enabled use of CUDSS_DATA_USER_PERM and CUDSS_DATA_USER_ND_PARTITION_TREE in Schur complement mode
Added support for programmatic control over cuDSS logging without environment variables. See cuDSS Logging Features for details
Added NVTX ranges to support profiling with Nsight Systems.
Introduced parameter CUDSS_REORDERING_ALG_NONE for reordering to use natural ordering, thus bypassing all internal reordering
Added more values to cudssPivotType_t to clarify pivot strategies, including new default CUDSS_PIVOT_AUTO
Added new read-only data parameter CUDSS_DATA_FLOPS returning the floating-point operation count for factorization
Added parameter CUDSS_CONFIG_ND_UBFACTOR to control the balance of the nested dissection partition tree. Setting this parameter can help improve performance in MGMN mode and MG mode
Improved performance for solve phase with multiple right-hand sides
Improved performance for factorization with uniform batches
Breaking changes:
All breaking changes are described in detail in the migration guide.
Replaced cudaDataType_t with cudssDataType_t in all matrix creation and query function signatures. The new enum values mirror those in cudaDataType_t. See Replacing cudaDataType_t with cudssDataType_t
Introduced offsetType to enable int64 row offsets with int32 column indices for CSR matrices. See Adding offsetType parameter in CSR matrix functions
Removed cudssAlgType_t and introduced five new per-category enum types. See Restructuring algorithm enums
Redesigned cudssDistributedInterface_t to facilitate independent host/device communicators
  CUDSS_DATA_COMM has been removed. CUDSS_DATA_COMM_DEVICE and CUDSS_DATA_COMM_HOST have been added for setting communicators. See Replacing CUDSS_DATA_COMM.
  Existing function pointers in cudssDistributedInterface_t have been renamed with Host suffix and new Device variants have been added; cudaDataType_t replaced by cudssDataType_t in all data-type parameters. Custom communication layer implementations must be updated. See Updating custom communication layers
  Added new cudssDistributedGetProperty function pointer for communication layer version information. Custom communication layer implementations must be updated. See Adding GetProperty functions to custom communication/threading layers
Added new cudssThreadingGetProperty function pointer to cudssThreadingInterface_t for threading layer version information. Custom threading-layer implementations must be updated. See Adding GetProperty functions to custom communication/threading layers
Added const qualifiers across the entire API (config, data, execute, matrix create/set/get functions). See Reacting to const additions in APIs
Renamed parameters:
Old name (see the migration guide for the new 0.8.0 name)
CUDSS_CONFIG_HYBRID_MODE
CUDSS_DATA_USER_ELIMINATION_TREE
CUDSS_DATA_ELIMINATION_TREE
CUDSS_PIVOT_COL
CUDSS_PIVOT_ROW
Important bug fixes:
Fixed issue with pivoting for symmetric indefinite systems (CUDSS-363)
Fixed issue where permutation returned by cuDSS (CUDSS_DATA_PERM_REORDER_ROW) led to wrong results when used with CUDSS_DATA_USER_PERM and CUDSS_DATA_USER_ND_PARTITION_TREE (CUDSS-1020)
Fixed crash when using AMD reordering with diagonal matrices (CUDSS-1030)
Fixed incorrect behavior when making separate calls to CUDSS_PHASE_REORDERING and CUDSS_PHASE_SYMBOLIC_FACTORIZATION when matching is enabled with symmetric matrices (CUDSS-1078)
Fixed issue causing spurious MPI processes in MGMN mode (CUDSS-1082)
Added graceful exits when memory allocation fails to fix potential hangs (CUDSS-1163)
Fixed incorrect OpenMP flags used for building the threading layer for Windows (CUDSS-1199)
CUDSS_DATA_DIAG now returns diagonal values in the input order rather than in an internal order (CUDSS-1290)
Fixed integer overflow affecting large matrices with superpanels (CUDSS-1292, CUDSS-1293)
Known issues:
Matrices with more than \(2^{31}\) rows or columns are not supported
compute-sanitizer may report spurious uninitialized reads on Windows in multi-GPU (MG) mode (CUDSS-1026)
A user interrupt during the reordering phase with a matching algorithm (other than the default CUDSS_MATCHING_ALG_NONE) can lead to a crash. Interrupts without matching or in other phases are not affected (CUDSS-1377)
compute-sanitizer with --tool synccheck might fail on the Jetson Orin hardware (Compute Capability 8.7) with CUDA error 719. Running with a different sanitizer (or none) won’t cause issues (CUDSS-1379)
cuDSS v0.7.1
New Features:
Enabled CUDSS_DATA_USER_ELIMINATION_TREE: it can be used in combination with CUDSS_DATA_USER_PERM to set both the user permutation and elimination tree with values from a previous run, maintaining the factorization and solve performance while improving the reordering time.
Important bug fixes:
Fixed performance regression in the reordering phase with the default reordering algorithm for matrices with block structure (also known as BSR) (CUDSS-1035).
Known issues:
Changing the data type of the matrix while reusing an existing cudssHandle_t will not lead to correct results and might crash (CUDSS-1019).
Matrices with more than \(2^{31}\) rows or columns are not supported.
compute-sanitizer might report spurious uninitialized reads on Windows in multi-GPU (MG) mode (CUDSS-1026). Please report if observed.
On Linux, depending on the version of the GNU OpenMP runtime, valgrind and sanitizer tools may crash or report a memory leak during the call to cudssDestroy() if multi-threaded mode is used (CUDSS-1079). This happens when cuDSS calls dlclose() for the multi-threading layer library and is caused by GNU OpenMP runtime unloading. As a workaround, set GOMP_SPINCOUNT to 0 in the environment or add a call to dlopen() for the libgomp.so shared library in the application (before calling cudssDestroy()).
Reordering algorithm CUDSS_ALG_3 may crash on very small matrices (CUDSS-1030)
Separate calls to CUDSS_PHASE_REORDERING and CUDSS_PHASE_SYMBOLIC_FACTORIZATION with matrix types other than CUDSS_MTYPE_GENERAL with matching enabled produce incorrect results (CUDSS-1078). As a workaround, use CUDSS_PHASE_ANALYSIS.
cuDSS v0.7.0
New Features:
Enabled multi-GPU single-node (MG) mode via cudssCreateMg(), CUDSS_CONFIG_DEVICE_COUNT and CUDSS_CONFIG_DEVICE_INDICES
Re-introduced factorization algorithm CUDSS_ALG_1 with improved performance for matrices with very sparse factors
Added support for int64_t indexing for sparse matrices
Enabled turning on and off superpanel optimization (CUDSS_CONFIG_USE_SUPERPANELS)
Enabled extracting the auxiliary elimination tree structure (CUDSS_DATA_ELIMINATION_TREE)
Added support for solve sub-phases (CUDSS_PHASE_SOLVE_FWD, CUDSS_PHASE_SOLVE_FWD_PERM, CUDSS_PHASE_SOLVE_BWD, CUDSS_PHASE_SOLVE_BWD_PERM)
Added support for Schur complement mode (CUDSS_CONFIG_SCHUR_MODE, CUDSS_DATA_USER_SCHUR_INDICES, CUDSS_DATA_SCHUR_SHAPE, CUDSS_DATA_SCHUR_MATRIX)
Added support for deterministic mode (CUDSS_CONFIG_DETERMINISTIC_MODE)
Enabled user interruption on the host (CUDSS_DATA_USER_HOST_INTERRUPT)
Improved performance for the symbolic factorization phase with CUDSS_ALG_DEFAULT and CUDSS_ALG_3 reordering algorithms
Enabled support for CUDA 13 and new Blackwell GPU architectures (incl. GB300 and DGX Spark)
Breaking changes:
All user-provided data (parameters with the prefix CUDSS_DATA_USER_) is now copied to an internal buffer by a cudssDataSet operation and copied back by a cudssDataGet operation. This breaks existing calls to cudssDataGet with CUDSS_DATA_USER_PERM, as it now copies the data to the provided pointer instead of copying the pointer.
Important bug fixes:
Fixed incorrect results for the default matching algorithm for hybrid memory mode and hybrid execute mode, when the scaling vectors are non-constant (CUDSS-882).
Fixed a crash in MGMN mode when some of the ranks have empty local parts of the matrix (CUDSS-913).
Fixed out-of-bounds access during reordering for very large matrices (CUDSS-948).
Fixed incorrect results for symmetric matrices with superpanel and pivoting options enabled (CUDSS-1003).
Known issues:
Setting CUDSS_DATA_USER_ELIMINATION_TREE is currently not supported and leads to undefined behavior (CUDSS-1020).
Changing the data type of the matrix while reusing an existing cudssHandle_t will not lead to correct results and might crash (CUDSS-1019)
Matrices with more than \(2^{31}\) rows or columns are not supported.
compute-sanitizer might report spurious uninitialized reads on Windows in multi-GPU (MG) mode (CUDSS-1026)
Performance regression in the reordering phase with the default reordering algorithm for matrices with block structure (also known as BSR) (CUDSS-1035)
cuDSS v0.6.0
New Features:
Enabled support for distributed matrices with a row-wise potentially overlapping distribution via cudssMatrixSetDistributionRow1d()
Enabled support for matching and scaling with CUDSS_CONFIG_USE_MATCHING and CUDSS_CONFIG_MATCHING_ALG
Enabled support for solving uniform batches (same sparsity pattern) of linear systems via CUDSS_CONFIG_UBATCH_SIZE and CUDSS_CONFIG_UBATCH_INDEX
Extended support for the hybrid memory mode from default (single GPU) to MGMN mode
Enabled support for combining the phases for cudssExecute() in a single call and added separate phases CUDSS_PHASE_REORDERING and CUDSS_PHASE_SYMBOLIC_FACTORIZATION for reordering and symbolic factorization
Enabled a new configuration parameter CUDSS_CONFIG_ND_NLEVELS which can control the nested dissection depth for reordering with CUDSS_ALG_DEFAULT
Enabled support for symmetric matrices in CUDSS_ALG_1 and CUDSS_ALG_2 reordering algorithms
Re-enabled passing a different matrix on the solve phase (was disabled in 0.5.0), with the restrictions specified in cudssExecute()
Breaking changes:
Disabled support for batched matrices in the MGMN mode
Changed the enum values for cudssPhase_t and the parameter type for phase in cudssExecute from cudssPhase_t to int
Important bug fixes:
Fixed a sporadic accuracy failure for non-symmetric matrices with CUDSS_ALG_1 or CUDSS_ALG_2 for reordering (CUDSS-732).
Fixed incorrect (perturbed) pivot scaling (CUDSS_ALG_1 for CUDSS_CONFIG_PIVOT_EPSILON_ALG) for symmetric matrices (CUDSS-706)
Fixed incorrect reordering result with CUDSS_ALG_DEFAULT for nested dissection depth large enough compared to the number of rows (CUDSS-873)
Known issues:
Default matching algorithm (same as CUDSS_ALG_5) produces incorrect results for hybrid execute and hybrid memory modes when the scaling vectors are non-constant (CUDSS-882). As a workaround, use CUDSS_ALG_4 or any matching algorithm value other than the default and the last option.
If the input matrix is distributed and some processes have empty local parts of the matrix, an out-of-bounds device access can happen (CUDSS-913). As a workaround, make sure all processes have non-empty local parts.
cuDSS v0.5.0
New Features:
Multi-Threaded (MT) mode with prebuilt standalone threading layer for GNU OpenMP on Linux and VCOMP (VCOMP140.dll) on Windows, as well as with custom user-defined threading backends
Multi-threaded reordering for CUDSS_ALG_DEFAULT reordering algorithm
Hybrid host/device execute mode to enable using the host on factorization and solve (improved performance for small matrices)
Improved performance and memory requirements for the hybrid memory mode
Blackwell GPU architecture support (sm_100, sm_120, sm_101)
Breaking changes:
Changed the function signature for cudssMatrixCreateBatchDn() and cudssMatrixGetBatchDn() to have an extra argument for the integer types of the arrays with scalar parameters of the batch (like nrows, ncols or ld)
Changed the definition of cudssMatrixFormat_t and cudssMatrixGetFormat() to make the matrix format enum values become bit flags (which can be combined) and added a flag CUDSS_MFORMAT_BATCH for querying whether a cudssMatrix_t is a batch
Dropped support for SLES 15.x for x <= 5 (upgraded minimal version of SLES to 15.6) and upgraded the minimal version of GLIBC supported to 2.28
Important bug fixes:
Fixed a bug that occurred when the number of nonzeros in the internally symmetrized matrix pattern overflowed the 32-bit integer maximum value (CUDSS-417).
Fixed a bug where the MGMN mode produced incorrect results for some of the larger matrices (CUDSS-601)
Fixed incorrect memory estimates returned for the hybrid memory mode (CUDSS-669)
cuDSS v0.4.0
New Features:
Significant performance improvements for all reordering algorithms other than CUDSS_ALG_1 and CUDSS_ALG_2
Added support for non-uniform batching (solving multiple systems with different matrices and right-hand sides) via new APIs like cudssMatrixCreateBatch<Dn|Csr>() and others similar to existing APIs for non-batch matrix objects, and by extending cudssExecute() to support batches
Added support for querying memory estimates via cudssDataGet() with CUDSS_DATA_MEMORY_ESTIMATES
Added installer support for Ubuntu 24.04
Added support for pip wheels on pypi.org and pypi.nvidia.com and for conda packaging
Breaking changes:
Added a new value CUDSS_DATA_MEMORY_ESTIMATES into the cudssDataParam_t enum
Starting with this release, cuDSS has a runtime dependency on cuBLAS (usually available as a part of CUDA Toolkit)
Important bug fixes:
Added CMake config version file and fixed CMake config for system-wide installation
Known issues:
For correct system-wide installation, cudss-config.cmake makes use of REAL_PATH, which is only available since CMake 3.19 (newer than, for example, the default CMake version on Ubuntu 20.04)
The MGMN mode of cuDSS with OpenMPI communication layer might run out of GPU memory due to a bug in OpenMPI/UCX. If you encounter this issue, consider setting export UCX_MEMTYPE_CACHE=n or export UCX_TLS=^cuda_ipc or switching to the NCCL communication backend as a workaround. These workarounds might lead to performance degradation. In such a case, please report it.
PIP wheels for Linux + x86_64 with version 0.4.0.2 might not work on some older OSes due to a bug in patchelf 0.18.0. On affected systems, ldd libcudss.so.0 will return the error "not a dynamic executable" and using the shared library for linking with an application will produce the error "ELF load command address/offset not properly aligned". The suggested workaround is to use the installation command pip install nvidia-cudss-cu12, which will install the patched wheels nvidia-cudss-cu12 0.4.0.2.post1.
cuDSS v0.3.0
New Features:
Multi-GPU multi-node (MGMN) mode with prebuilt standalone communication layers for NCCL and OpenMPI, as well as with custom user-defined GPU-aware communication backends
Hybrid host/device memory mode which enables keeping the factor values in the host memory (RAM) and uses only a smaller device buffer as a temporary
Extended support to Linux ARM (aarch64) (Ubuntu 22.04, only on Orin (SM 8.7) devices)
Breaking changes:
Removed values CUDSS_STATUS_ARCH_MISMATCH and CUDSS_STATUS_ZERO_PIVOT from the enum cudssStatus_t as these values will not be used
Renamed the main header cuDSS.h as cudss.h to better align with other CUDA math libraries
Important bug fixes:
Fixed execution failures when multiple tiny matrices (smaller than 16x16) were solved reusing the same cudssData_t
Fixed incorrect result when factorization phase followed a re-factorization phase with CUDSS_ALG_1 and CUDSS_ALG_2 reordering algorithms
Known issue:
Error messages like the following are seen during cuDSS installation on RPM-based systems (RHEL, Rocky, SLES):
failed to link /usr/lib/#INSTALL_TRIPLET#/libcudss_commlayer_nccl.so -> /etc/alternatives/libcudss_commlayer_nccl.so: No such file or directory
failed to link /usr/lib/#INSTALL_TRIPLET#/libcudss_commlayer_openmpi.so -> /etc/alternatives/libcudss_commlayer_openmpi.so: No such file or directory
The installation completes despite the failure to create a couple of symlinks for cuDSS.
Apply the following workaround only after encountering the errors above. Both the issue and the workaround apply only to RPM-based systems (RHEL, Rocky, SLES). The workaround drops and recreates all symlinks intended for the cudss alternatives system.
update-alternatives --remove cudss /usr/lib64/libcudss/12/libcudss.so.0
/sbin/ldconfig
update-alternatives --install /usr/lib64/libcudss.so.0 cudss /usr/lib64/libcudss/12/libcudss.so.0 120 \
--slave /usr/lib64/libcudss.so libcudss.so /usr/lib64/libcudss/12/libcudss.so \
--slave /usr/lib64/libcudss_commlayer_nccl.so libcudss_commlayer_nccl.so /usr/lib64/libcudss/12/libcudss_commlayer_nccl.so \
--slave /usr/lib64/libcudss_commlayer_openmpi.so libcudss_commlayer_openmpi.so /usr/lib64/libcudss/12/libcudss_commlayer_openmpi.so \
--slave /usr/lib64/libcudss_static.a libcudss_static.a /usr/lib64/libcudss/12/libcudss_static.a \
--slave /usr/lib64/cmake/cudss cudss_cmake /usr/lib64/libcudss/12/cmake/cudss \
--slave /usr/include/cudss.h cudss.h /usr/include/libcudss/12/cudss.h \
--slave /usr/include/cudss_distributed_interface.h cudss_distributed_interface.h /usr/include/libcudss/12/cudss_distributed_interface.h
/sbin/ldconfig
After the steps are completed, confirm that all the symlinks exist. Expectation:
# ls -l /usr/lib64/cmake/cudss
... /usr/lib64/cmake/cudss -> /etc/alternatives/cudss_cmake
# ls -l /usr/include/cudss*
... /usr/include/cudss.h -> /etc/alternatives/cudss.h
... /usr/include/cudss_distributed_interface.h -> /etc/alternatives/cudss_distributed_interface.h
# ls -l /usr/lib64/*cudss*
... /usr/lib64/libcudss.so -> /etc/alternatives/libcudss.so
... /usr/lib64/libcudss.so.0 -> /etc/alternatives/cudss
... /usr/lib64/libcudss_commlayer_nccl.so -> /etc/alternatives/libcudss_commlayer_nccl.so
... /usr/lib64/libcudss_commlayer_openmpi.so -> /etc/alternatives/libcudss_commlayer_openmpi.so
... /usr/lib64/libcudss_static.a -> /etc/alternatives/libcudss_static.a
# ls -l /etc/alternatives/*cudss*
... /etc/alternatives/cudss -> /usr/lib64/libcudss/12/libcudss.so.0
... /etc/alternatives/cudss.h -> /usr/include/libcudss/12/cudss.h
... /etc/alternatives/cudss_cmake -> /usr/lib64/libcudss/12/cmake/cudss
... /etc/alternatives/cudss_distributed_interface.h -> /usr/include/libcudss/12/cudss_distributed_interface.h
... /etc/alternatives/libcudss.so -> /usr/lib64/libcudss/12/libcudss.so
... /etc/alternatives/libcudss_commlayer_nccl.so -> /usr/lib64/libcudss/12/libcudss_commlayer_nccl.so
... /etc/alternatives/libcudss_commlayer_openmpi.so -> /usr/lib64/libcudss/12/libcudss_commlayer_openmpi.so
... /etc/alternatives/libcudss_static.a -> /usr/lib64/libcudss/12/libcudss_static.a
cuDSS v0.2.1
Important bug fixes:
Fixed host memory leaks
Fixed device memory bookkeeping which could cause read violation errors and segmentation faults when cuDSS is called repeatedly with the same cudssHandle_t and cudssData_t objects
Fixed incorrect results of iterative refinement for 1-based input matrices
Fixed incorrect internal temporary buffer size which could cause invalid memory accesses for small matrices
cuDSS v0.2.0
New Features:
Performance improvements for non-symmetric and non-Hermitian matrices for the reordering algorithm CUDSS_ALG_1
Support for user-defined device memory allocators/memory pools
Support for extracting permutations which account for both reordering and pivoting (via new values CUDSS_DATA_PERM_ROW and CUDSS_DATA_PERM_COL in the enum cudssDataParam_t) for reordering algorithms CUDSS_ALG_1 and CUDSS_ALG_2
Extended support to all SM architectures starting with Pascal (SM 5.0)
Extended support to Linux ARM (SBSA) (Ubuntu 20.04, Ubuntu 22.04, RHEL 8, RHEL 9, SLES 15)
Breaking changes:
Replaced value CUDSS_DATA_PERM_REORDER in the enum cudssDataParam_t with CUDSS_DATA_PERM_REORDER_ROW and CUDSS_DATA_PERM_REORDER_COL to separate row and column reordering permutations, which can be different for the non-symmetric reordering algorithm CUDSS_ALG_1
Important bug fixes:
Fixed incorrect solution for Hermitian matrices with non-disabled pivoting
Fixed sporadically incorrect solution on H100 due to shared memory allocation size
Fixed incorrect propagation of pivoting tolerance and epsilon from cudssConfig_t during cudssExecute()
Fixed sporadic hangs on GPUs with a small number of SMs
cuDSS v0.1.0
New Features:
Initial release
Support for single GPU, SM architectures: SM 7.0 and newer
Support Linux x86-64 (Ubuntu 20.04, Ubuntu 22.04, RHEL 8, RHEL 9, SLES 15)
Support Windows x86-64 (Windows 10, 11)
Support for single/double real/complex datatype for values and int datatype for indices
Synchronous API
Compatibility notes:
cuDSS requires CUDA 12.0 or above