Release Notes
cuSOLVERMp v0.8.0
Released: April 13, 2026
New features
Added cusolverMpNewtonSchulz() for distributed Newton-Schulz orthogonalization of tall or square (M >= N) matrices on supported Px1 process grids. This iterative method approximates the orthogonal polar factor of a matrix (i.e., X^T * X → I), useful for optimizer orthogonalization in deep learning. Supports CUDA_R_16BF (bfloat16) and CUDA_R_32F (float32) value types with CUDA_R_32F compute type. The iteration coefficients are user-supplied (3 floats per iteration); example coefficients optimized for quintic convergence in 5 iterations are provided in the Newton-Schulz sample. 2D block-cyclic grids and wide rectangular matrices are not yet supported.
Added cusolverMpBufferRegister() and cusolverMpBufferDeregister() for registering NCCL symmetric memory buffers on the grid communicator. Pre-registering workspace buffers allocated with ncclMemAlloc can improve performance of NCCL collective operations. Currently expected to yield performance benefits when used with cusolverMpNewtonSchulz().
Added cusolverMpNewtonSchulzDescriptorCreate(), cusolverMpNewtonSchulzDescriptorDestroy(), cusolverMpNewtonSchulzDescriptorSetAttribute(), and cusolverMpNewtonSchulzDescriptorGetAttribute() for configuring Newton-Schulz behavior (e.g., input normalization, reduction precision).
Added cusolverMpOrgqr() to explicitly generate the matrix Q with orthonormal columns from a QR factorization computed by cusolverMpGeqrf().
Added cusolverMpLaset() to initialize distributed matrices with scalar values.
Added cusolverMpSetStream() API to allow setting a custom CUDA stream on the cuSOLVERMp handle after creation.
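To illustrate what a Newton-Schulz orthogonalization with three coefficients per iteration can look like, here is a minimal single-process sketch in plain Python. It does not call cuSOLVERMp. The update form X <- a*X + b*X*G + c*X*G*G with G = X^T X, the Frobenius-norm scaling, and the (a, b, c) layout are assumptions made for illustration, and the classic cubic coefficients (1.5, -0.5, 0) stand in for the tuned values shipped with the Newton-Schulz sample.

```python
import math

def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Return the transpose of a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def newton_schulz(x, coeffs):
    """Orthogonalize a tall or square matrix x (M >= N).

    coeffs holds one (a, b, c) triple per iteration (assumed layout); each
    step applies X <- a*X + b*X*G + c*X*G*G with G = X^T X.
    """
    # Scale by the Frobenius norm so every singular value is at most 1
    # (an assumed form of input normalization).
    norm = math.sqrt(sum(v * v for row in x for v in row))
    x = [[v / norm for v in row] for row in x]
    for a, b, c in coeffs:
        g = matmul(transpose(x), x)   # N x N Gram matrix
        xg = matmul(x, g)             # X * G
        xgg = matmul(xg, g)           # X * G * G
        x = [[a * x[i][j] + b * xg[i][j] + c * xgg[i][j]
              for j in range(len(x[0]))] for i in range(len(x))]
    return x

# Classic cubic Newton-Schulz coefficients, not the tuned values from the sample.
cubic = [(1.5, -0.5, 0.0)] * 30
q = newton_schulz([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], cubic)
gram = matmul(transpose(q), q)        # should be close to the 2 x 2 identity
```

The cubic coefficients need many more iterations than a tuned quintic schedule, which is why 30 steps are used here instead of 5.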
Performance
Improved performance of cusolverMpSyevd() and cusolverMpSygvd().
Resolved Issues
Removed a restriction where the default (NULL or 0) CUDA stream could not be passed to cusolverMpCreate() (bug 4337214). This limitation existed in all prior versions of cuSOLVERMp.
Fixed a bug in cusolverMpPotrf() / cusolverMpPotrs() that could produce incorrect results or hangs on non-square process grids (e.g., 3x4, 4x3) when the matrix spanned more tiles than one cycle of the 2D block-cyclic distribution. This bug existed in versions 0.3.0 through 0.7.2.
Fixed cusolverMpOrmqr() communicator-root selection for distributed submatrix broadcasts. Affected versions: 0.3.0 through 0.7.2. In particular, versions 0.4.3 through 0.7.2 could produce incorrect results on row-major grids when K < N.
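For readers unfamiliar with the phrase "more tiles than one cycle", the sketch below shows how tiles map to processes in a 2D block-cyclic layout and when a matrix wraps around the process grid. It is a plain Python illustration, not library code; the function names are hypothetical and the distribution is assumed to start at process (0, 0).

```python
def tile_owner(tile_row, tile_col, nprow, npcol):
    """Process-grid coordinates owning a tile in a 2D block-cyclic layout
    (assuming the distribution starts at process (0, 0))."""
    return (tile_row % nprow, tile_col % npcol)

def spans_more_than_one_cycle(m, n, mb, nb, nprow, npcol):
    """True when the tile grid wraps around the process grid in either dimension."""
    tile_rows = -(-m // mb)   # ceiling division
    tile_cols = -(-n // nb)
    return tile_rows > nprow or tile_cols > npcol

# A 3x4 process grid with 64x64 tiles: a 256x256 matrix has 4x4 tiles,
# so the fourth tile row wraps back to process row 0.
owners = [[tile_owner(i, j, 3, 4) for j in range(4)] for i in range(4)]
wraps = spans_more_than_one_cycle(256, 256, 64, 64, 3, 4)
```

On a non-square grid such as 3x4, the row and column cycles have different lengths, which is the situation the fixed issue describes.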
cuSOLVERMp v0.7.2
Released: November 10, 2025
Resolved Issues
Fixed a bug that could cause a race condition (deadlock or incorrect results) in cusolverMpGetrf() / cusolverMpGetrs() in cuSOLVERMp versions 0.7.0 and 0.7.1.
Fixed a bug that could result in a deadlock when grids of different dimensions were used with the same cuSOLVERMp handle.
cuSOLVERMp v0.7.1
Released: September 18, 2025
Resolved Issues
Fixed a bug that could cause a race condition in cusolverMpGeqrf(). The bug was observed in tall-skinny QR problems on V100 GPUs.
Removed a superfluous warning printed when message size exceeded INT32_MAX in cusolverMpStedc().
Fixed a bug in FP32 emulation mode where the emulation strategy set by cusolverMpSetEmulationStrategy() was not propagated to cuSOLVER.
Added support for Compute Capability 7.5 (Turing) in CUDA 13 builds.
cuSOLVERMp v0.7.0
Released: August 12, 2025
Breaking changes
cuSOLVERMp has transitioned from using the Communication Abstraction Library (libcal) to using NCCL directly. This is a breaking change and requires changes to cuSOLVERMp initialization in the user application.
See Migrating from CAL to NCCL for steps to transition the application from libcal to NCCL.
The steps to initialize cuSOLVERMp with NCCL are described in NCCL Initialization.
The libcal documentation page is still available at CAL Initialization (Legacy), but it applies only to cuSOLVERMp versions older than 0.7.0.
New features
Added support for CUDA 13 on devices with Compute Capability 8.0 (Ampere) and above.
cuSOLVERMp leverages techniques for floating point emulation as described in cuBLAS Floating Point Emulation for improved performance (CUDA 13+ and Compute Capability 10+).
Introduced new APIs: cusolverMpSetMathMode(), cusolverMpGetMathMode(), cusolverMpSetEmulationStrategy(), and cusolverMpGetEmulationStrategy().
FP32 emulation can be enabled by setting the math mode to CUSOLVER_FP32_EMULATED_BF16X9_MATH (see cusolverMpSetMathMode()).
The emulation strategy can be further tuned using the cusolverMpSetEmulationStrategy() API.
The math mode and emulation strategy are propagated to the internal cuBLAS and cuSOLVER handles. The workspace sizes returned by *_bufferSize APIs may depend on the math mode.
The defaults are not changed from the previous version, i.e., emulation is disabled and math mode is set to CUSOLVER_DEFAULT_MATH.
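As background on the BF16X9 name, the sketch below illustrates the general idea of representing one FP32 value as three bfloat16 pieces, so that a product of two FP32 values becomes nine bfloat16 products. This is a plain Python illustration under stated assumptions (truncation-based splitting, exact accumulation in Python floats, hypothetical function names); it is not the algorithm as implemented in cuBLAS or cuSOLVERMp, whose internals are not described in these notes.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def truncate_to_bf16(x):
    """Keep the top 16 bits of the float32 encoding (sign, exponent, 7 mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def split_bf16x3(x):
    """Split a float32 value into three bfloat16-representable pieces."""
    x = to_f32(x)
    hi = truncate_to_bf16(x)
    mid = truncate_to_bf16(x - hi)        # next 8 significant bits
    lo = truncate_to_bf16(x - hi - mid)   # remaining bits
    return hi, mid, lo

def emulated_product(a, b):
    """Sum the 3 x 3 = 9 partial products of the bfloat16 pieces."""
    return sum(p * q for p in split_bf16x3(a) for q in split_bf16x3(b))

hi, mid, lo = split_bf16x3(3.14159)
product = emulated_product(3.14159, 2.71828)
```

For normal (non-subnormal) inputs the three pieces add back up to the original FP32 value, which is why nine low-precision products can recover an FP32-quality result.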
Resolved Issues
Fixed a bug that could cause cusolverMpGeqrf() to fail on non-square process grids.
Fixed a bug that could cause cusolverMpGetrf()/cusolverMpGetrs() to fail on non-square process grids.
Fixed a bug that could cause cuSOLVERMp to crash when logging is enabled.
cuSOLVERMp v0.6.0
Released: February 13, 2025
Added support for NVIDIA Blackwell GPU architecture.
Dropped support for CUDA 11.x.
cuSOLVERMp v0.5.1
Released: August 22, 2024
Fixed a bug in cusolverMpSyevd() where the eigenvalues were not broadcast to all processes if the problem fit on a single process.
cuSOLVERMp v0.5.0
Released: May 2, 2024
Improved the performance of cusolverMpStedc().
Introduced a new option to force NCCL usage by setting the CUSOLVERMP_FORCE_NCCL=1 environment flag. This is only applicable in parts of the eigensolver for now.
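One way to set the flag for a launched application is to pass a modified environment to the child process. The sketch below only builds that environment; the helper name and the application path in the comment are hypothetical.

```python
import os

def env_with_forced_nccl(base_env=None):
    """Return a copy of the environment with CUSOLVERMP_FORCE_NCCL=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUSOLVERMP_FORCE_NCCL"] = "1"
    return env

# Hypothetical launch of the process that loads cuSOLVERMp:
#   subprocess.run(["./my_eigensolver_app"], env=env_with_forced_nccl(), check=True)
child_env = env_with_forced_nccl({"PATH": "/usr/bin"})
```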
cuSOLVERMp v0.4.3
Released: February 5, 2024
Added support for CUDA 12.1.1.
Fixed a bug where processes could hang when the problem was small enough to fit on a single process.
Known Issues
CUDA 12.1.1 is compatible with NCCL up to v2.16.x; higher NCCL versions may hang intermittently for certain process grids.
cuSOLVERMp v0.4.2
Released: HPC SDK 23.11
Fixed a bug in cusolverMpSyevd() where an internal error was returned for a matrix filled with zero entries; the correct behavior is to return zero eigenvalues and unit eigenvectors.
Added support for CUDA 12.1.1.
Note that this release is compatible with NCCL up to v2.16.x.
cuSOLVERMp v0.4.1
Released: HPC SDK 23.7
Added support for row-major grids in SYEVD.
cuSOLVERMp v0.4.0
Released: HPC SDK 23.5
Added routines for the symmetric (Hermitian) generalized eigensolver:
cusolverMpSygst() reduces the symmetric (Hermitian) generalized eigenproblem to standard form.
cusolverMpSygvd() computes all eigenvalues and eigenvectors of a symmetric (Hermitian) generalized eigenproblem.
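The reduction to standard form can be sketched on a 2x2 real example. Assuming the problem has the form A x = λ B x with B symmetric positive definite (the form LAPACK calls itype 1; the notes do not say which forms the routines accept), a Cholesky factor B = L L^T gives C = L^{-1} A L^{-T}, and C has the same eigenvalues as the pair (A, B). The code below is plain Python with hypothetical helper names and does not call cuSOLVERMp.

```python
import math

def cholesky_2x2(b):
    """Lower-triangular L with B = L L^T for a symmetric positive definite 2x2 B."""
    l00 = math.sqrt(b[0][0])
    l10 = b[1][0] / l00
    l11 = math.sqrt(b[1][1] - l10 * l10)
    return [[l00, 0.0], [l10, l11]]

def reduce_to_standard(a, b):
    """Return C = L^{-1} A L^{-T}, which has the same eigenvalues as (A, B)."""
    l = cholesky_2x2(b)
    # Inverse of the 2x2 lower-triangular factor.
    inv = [[1.0 / l[0][0], 0.0],
           [-l[1][0] / (l[0][0] * l[1][1]), 1.0 / l[1][1]]]
    # t = L^{-1} A
    t = [[sum(inv[i][k] * a[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    # c = t L^{-T}
    return [[sum(t[i][k] * inv[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

def symmetric_eigenvalues_2x2(c):
    """Closed-form eigenvalues of a symmetric 2x2 matrix, in ascending order."""
    mean = 0.5 * (c[0][0] + c[1][1])
    radius = math.hypot(0.5 * (c[0][0] - c[1][1]), c[0][1])
    return mean - radius, mean + radius

a = [[2.0, 1.0], [1.0, 3.0]]
b = [[4.0, 2.0], [2.0, 5.0]]
eigenvalues = symmetric_eigenvalues_2x2(reduce_to_standard(a, b))  # about (0.5, 0.625)
```

Each computed λ satisfies det(A - λB) = 0, which is a quick way to check the reduction.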
cuSOLVERMp v0.3.1
Released: HPC SDK 23.3
Included minor bug fixes:
cusolverMpPotrf() now sets the imaginary parts of the diagonal entries of the result to zero.
cusolverMpStedc() no longer leaks internal memory.
cuSOLVERMp v0.3.0
Released: HPC SDK 23.1
Removed the dependency on MPI; the UCC library is now the main communication backend.
Provides the following computational APIs:
cusolverMpGeqrf_bufferSize(), cusolverMpGeqrf(), cusolverMpOrmqr_bufferSize(), cusolverMpOrmqr(), cusolverMpGels_bufferSize(), cusolverMpGels(), cusolverMpSytrd_bufferSize(), cusolverMpSytrd(), cusolverMpStedc_bufferSize(), cusolverMpStedc(), cusolverMpOrmtr_bufferSize(), cusolverMpOrmtr(), cusolverMpSyevd_bufferSize(), cusolverMpSyevd().
Note that cusolverMpGels() currently supports least-squares solutions with the no-transpose option only.
Note that cusolverMpSytrd(), cusolverMpOrmtr() and cusolverMpSyevd() currently support a lower triangular input matrix only.
cuSOLVERMp v0.2.0
Released: HPC SDK 22.05
Added support for pp64 + SpectrumMPI, targeting ORNL’s Summit Supercomputer.
Added Cholesky factorization and solve APIs:
Note that cusolverMpGetrs() does not offer support for multiple right-hand sides at this point.
cuSOLVERMp v0.1.0
Released: HPC SDK 21.11
Initial release.
Support Linux x86_64 and Compute Capability 8.0.
Provide the following computational APIs:
Note that cusolverMpGetrs() does not offer support for multiple right-hand sides at this point.