Release Notes#
This section lists significant changes, new features, performance improvements, and known issues for each release of cuSolverDx. Unless noted, listed issues should not impact functionality. When functionality is impacted, we offer a workaround (if one is available).
0.4.0#
New Features#
BDSVD and GESVD functions now support computing left and/or right singular vectors in addition to singular values. See bdsvd for the bidiagonal SVD device API and its job options, and gesvd for the general matrix SVD device API and its job options (e.g. U/VT storage).
Thread operators: cuSolverDx introduces the Thread operator, which runs the solver in a thread context so that each thread executes the operation independently. This is intended for small problem sizes that fit in registers; see Execution Operators and the Thread operator for usage and restrictions.
Support for batched and non-batched Modified LU factorization of unitary matrices, useful for constructing the explicit tall-skinny Householder reflector after Tall-Skinny QR (TSQR).
Performance improvements for the heev function on Hopper.
More examples, including examples using the new thread operator, are included.
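As a rough illustration of the new Thread operator, the sketch below composes a solver descriptor in the same operator style as the library's existing snippets (Size, Precision, Type, Function, SM). The kernel body, the per-thread data layout, and the exact `execute` signature are assumptions; consult the Thread operator documentation for the real API.

```cuda
#include <cusolverdx.hpp>
using namespace cusolverdx;

// Hypothetical sketch: a 4x4 per-thread Cholesky factorization.
// Thread() replaces Block(), so each thread runs its own independent
// solve on a small matrix held in registers.
using Solver = decltype(Size<4>() + Precision<float>() + Type<type::real>()
                        + Function<function::potrf>() + SM<800>() + Thread());

__global__ void per_thread_potrf_kernel(float* data) {
    // Each thread loads its own 4x4 matrix into registers (layout and
    // execute() arguments are assumptions, not the confirmed API).
    float a[4 * 4];
    const unsigned offset = threadIdx.x * 4 * 4;
    for (int i = 0; i < 4 * 4; ++i) a[i] = data[offset + i];

    Solver().execute(a); // per-thread execution; no cooperation across threads

    for (int i = 0; i < 4 * 4; ++i) data[offset + i] = a[i];
}
```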
Breaking Changes#
cuSolverDx 0.4.0 requires CUDA Toolkit 13.0 or newer.
Support for SM70 and SM72 CUDA architectures is removed in cuSolverDx 0.4.0.
The `suggested_leading_dimension` trait has been removed. With thread operators and the expanded set of functions, a single recommended leading dimension is no longer meaningful, and the confusion it caused outweighed its usefulness. For matrices in shared memory, use the `LeadingDimension` operator to specify padding when necessary, and tune for the best performance in your use case.
Known Issues#
Compiler / static assertions (block execution). cuSolverDx 0.4.0 triggers static assertions at compile time for configurations known to be affected by specific compiler bugs.
NVBUG 5972531: CTK 13.2+, SM100+, `htev` with vectors; a few specific sizes and block dimensions. May produce wrong results. Define `CUSOLVERDX_IGNORE_NVBUG_5972531_ASSERT` to suppress the assertion, or use CTK 13.0 or 13.1 to avoid the bug.
NVBUG 5986343: CTK 13.0/13.1, SM80/86/89/90, `bdsvd` with vectors; a few specific sizes and block dimensions. May produce wrong results. Define `CUSOLVERDX_IGNORE_NVBUG_5986343_ASSERT` to suppress the assertion, or use CTK 13.2+ to avoid the bug.
NVBUG 5908003: CTK 13.2+, SM100+, `gesvd` with vectors, complex; a few specific sizes (and block dimensions for V1). May produce wrong results. Define `CUSOLVERDX_IGNORE_NVBUG_5908003_ASSERT` to suppress the assertion, or use CTK 13.0 or 13.1 to avoid the bug.
NVBUG 5288270: CTK 13.x, SM120, `gesv_no_pivot`, real; a few specific matrix sizes and block dimensions. May produce wrong results (release builds). Define `CUSOLVERDX_IGNORE_NVBUG_5288270_ASSERT` to suppress the assertion, or use a CUDA Toolkit build where the issue is fixed.
If your build fails with one of these assertions, you can:
Prefer a CUDA Toolkit version where the bug is fixed (see each bullet above).
Define the corresponding `CUSOLVERDX_IGNORE_NVBUG_<id>_ASSERT` macro to suppress the assertion, and verify correctness of the results yourself.
Add `-Xptxas -O0` to the compile command; this limits the PTX optimization phase and produces a correct binary, although potentially slower.
0.3.0#
New Features#
Support for batched or non-batched Singular Value Decomposition for bidiagonal matrices.
Support for batched or non-batched Singular Value Decomposition for general matrices.
Support for batched or non-batched Eigenvalue Solver for symmetric or Hermitian tridiagonal matrices.
Support for batched or non-batched Eigenvalue Solver for symmetric or Hermitian matrices.
Support for batched or non-batched General tridiagonal linear system solver.
Support for batched or non-batched Matrix Q generation from QR factorization.
Support for batched or non-batched Matrix Q generation from LQ factorization.
Performance improvements for trsm, unmqr, unmlq, geqrf, gelqf, and gels functions on Hopper.
Reorganized the examples and more examples included.
Breaking Changes#
cuSolverDx 0.3.0 updates the dimensions for function trsm to be consistent with LAPACK trsm and `cublasXtrsm` notation. Specifically, `M` is the number of rows in matrix `B`, and `N` is the number of columns in matrix `B`, with the triangular matrix `A` sized accordingly. `K`, if specified, is ignored.
To use the IO-related copy_2d methods or shared memory management tools, the header file `cusolverdx_io.hpp` needs to be included in your code.
0.2.1#
New Features#
Support for CUDA 13.0.
Support for Thor SM renaming from `sm_101` to `sm_110` starting from CUDA 13.0.
Note: For CUDA 12.9 and older releases, Thor stays labeled as `sm_101`.
Known Issues#
CUDA 12.8, 12.9 and 13.0 could miscompile kernels using the `gesv_no_pivot` function with high register pressure when all of the following conditions are met:
SM is set to 1200 (`SM120`), and
Type is set to `type::real`.
The miscompilation may manifest as intermittent incorrect results.
To surface this issue, cuSolverDx 0.2.1 conservatively hard-fails with a static assertion if these conditions are met.
If this happens to your code, you can:
Define the `CUSOLVERDX_IGNORE_NVBUG_5288270_ASSERT` macro to ignore the assertion and verify correctness of the results yourself.
If the case is indeed affected by the bug, add the `-Xptxas -O0` flag to the compilation command; this limits the PTX optimization phase and produces a correct binary, although potentially slower.
0.2.0#
New Features#
Support for batched or non-batched LU decomposition with partial pivoting.
Support for batched or non-batched linear system solves with multiple right-hand sides using LU factors with partial pivoting.
Support for batched or non-batched QR factorization.
Support for batched or non-batched LQ factorization.
Support for batched or non-batched Multiplication of Q from QR factorization.
Support for batched or non-batched Multiplication of Q from LQ factorization.
Support for batched or non-batched Least squares solves.
Support for batched or non-batched Triangular solves.
Support for Blackwell architectures `sm_100`, `sm_101`, `sm_120`; experimental support for `sm_103`, `sm_121`.
Deprecation of support for NVIDIA Xavier Tegra SoC (`SM<720>` or `sm_72`).
Breaking Changes#
cuSolverDx 0.2.0 requires CUDA Toolkit 12.6.3 or newer.
cuSolverDx 0.2.0 requires FillMode operator to be specified explicitly for Cholesky functions (potrf, potrs, posv), instead of using the default value if not specified.
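To illustrate this breaking change, the sketch below adds an explicit FillMode operator to a potrf descriptor composed in the same operator style as the library's own snippets. The enumerator spelling (`fill_mode::lower`) is an assumption; check the FillMode operator documentation for the exact name.

```cuda
#include <cusolverdx.hpp>
using namespace cusolverdx;

// Before 0.2.0 a default fill mode was assumed for Cholesky functions;
// from 0.2.0 it must be spelled out explicitly.
// fill_mode::lower is an assumed enumerator name, not a confirmed one.
using POTRF = decltype(Size<32>() + Precision<double>() + Type<type::real>()
                       + Function<function::potrf>()
                       + FillMode<fill_mode::lower>() // now mandatory
                       + SM<800>() + Block());
```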
Known Issues#
CUDA 12.8.0 and 12.9.0 could miscompile kernels using the `gesv_no_pivot` function with high register pressure when all of the following conditions are met:
SM is set to 1200 (`SM120`), and
Type is set to `type::real`.
The miscompilation may manifest as intermittent incorrect results.
To surface this issue, cuSolverDx 0.2.0 conservatively hard-fails with a static assertion if these conditions are met.
If this happens to your code, you can:
Define the `CUSOLVERDX_IGNORE_NVBUG_5288270_ASSERT` macro to ignore the assertion and verify correctness of the results yourself.
If the case is indeed affected by the bug, add the `-Xptxas -O0` flag to the compilation command; this limits the PTX optimization phase and produces a correct binary, although potentially slower.
0.1.0#
The first early access (EA) release of cuSolverDx library.
New Features#
Support for batched or non-batched Cholesky decomposition.
Support for batched or non-batched linear system solves with multiple right-hand sides using Cholesky factors.
Support for batched or non-batched LU decomposition without pivoting.
Support for batched or non-batched linear system solves with multiple right-hand sides using LU factors without pivoting.
Support for SM70 - SM90 CUDA architectures.
Multiple examples included.
Known Issues#
For CUDA Toolkits older than 12.1 Update 1, when using NVRTC and nvJitLink to perform runtime compilation of kernels that use cuSolverDx, it is required to link against the fatbin file `libcusolverdx.fatbin` instead of `libcusolverdx.a`. The path to the fatbin library file is defined in the `cusolverdx_FATBIN` CMake variable.
The NVCC compiler in CUDA Toolkit 12.2 to 12.4 reports an incorrect compilation error when the `value_type` type of a solver description type is used. The affected code and workarounds are presented below. The issue is fixed in CTK 12.5 and newer.

```cpp
using Solver = decltype(Size<32>() + Precision<double>() + Type<type::complex>()
                        + Function<function::potrf>() + SM<800>() + Block());

using type = typename Solver::value_type; // compilation error

// Workaround #1
using type = typename decltype(Solver())::value_type;

// Workaround #2 (used in cuSolverDx examples)
template <typename T>
using value_type_t = typename T::value_type;
using type = value_type_t<Solver>;
```