Release Notes

cuBLASMp v0.5.1

Released: August 11, 2025

Added support for CUDA 13 on devices with Compute Capability 8.0 (Ampere) and above.

Released: June 16, 2025

cuBLASMp has transitioned from using the Communication Abstraction Library (libcal) to using NCCL directly. This is a breaking change and requires changes to cuBLASMp initialization in the user application.
- See Migrating from CAL to NCCL for steps to transition the application from libcal to NCCL.
- The steps to initialize cuBLASMp with NCCL are described in NCCL Initialization.
- The libcal documentation page is still available at CAL Initialization (Legacy), but it applies only to cuBLASMp versions older than 0.5.0.

Added support for NT and TT cases in AllGather + GEMM and GEMM + ReduceScatter variants of the cublasMpMatmul API.
Added support for epilogues in the cublasMpMatmul API. See cublasMpMatmulEpilogue_t for more details. Epilogues can be set using cublasMpMatmulDescriptorAttributeSet.
Added support for tensor-wide scaling in the cublasMpMatmul API with FP8 inputs (Compute Capability 9.0+). It is enabled via cublasMpMatmulDescriptorAttributeSet (see cublasMpMatmulDescriptorAttribute_t and cublasMpMatmulMatrixScale_t).

Released: March 10, 2025

Added support for NVIDIA Blackwell GPU architecture.
Added support for GEMM + AllReduce using the cublasMpMatmul API.
Added support for NN cases in AllGather + GEMM and GEMM + ReduceScatter variants of the cublasMpMatmul API.
Bug fixes.
Deprecated functionality: atomic Matmul with multicast communication (i.e., cublasMpMatmul with cublasMpMatmulAlgoType_t = CUBLASMP_MATMUL_ALGO_TYPE_ATOMIC_MULTICAST). The functional implementation remains available, but it is not performant and will be removed in a future release.

Released: December 10, 2024

Added an option to set the number of SMs used for communication (currently relevant only for Atomic GEMM + ReduceScatter).
Decreased the workspace size requirement in TP overlap GEMMs.
Removed extra synchronization in TP overlap GEMMs.
Allowed the C matrix to be null when beta is 0.
Fixed the GEMM implementation for complex types with transA / transB being CUBLAS_OP_T.
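
The note about a null C matrix follows from the usual GEMM definition, D = alpha * op(A) * op(B) + beta * C: when beta is 0, the C term contributes nothing and C does not need to be read. The sketch below is a plain CPU reference written only to illustrate that rule. It is not cuBLASMp code, and the function name reference_gemm, the row-major layout, and the absence of transposition are all assumptions made for brevity.

```cpp
#include <cstddef>

// Hypothetical CPU reference, not part of cuBLASMp.
// Computes D = alpha * A * B + beta * C for row-major matrices:
//   A is m x k, B is k x n, C and D are m x n.
// C is dereferenced only when beta != 0, so it may be nullptr when beta == 0.
void reference_gemm(std::size_t m, std::size_t n, std::size_t k,
                    float alpha, const float* A, const float* B,
                    float beta, const float* C, float* D)
{
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            // Accumulate the dot product of row i of A and column j of B.
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            float out = alpha * acc;
            // Skip the C term entirely when beta is 0; C is never touched.
            if (beta != 0.0f) {
                out += beta * C[i * n + j];
            }
            D[i * n + j] = out;
        }
    }
}
```

Calling reference_gemm with beta set to 0.0f and C set to nullptr is valid in this sketch because the guarded branch is the only place C is read.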

Released: November 4, 2024

Added new cublasMpMatmul API.
Added GEMM/Matmul fast paths required for tensor parallelism (communication-computation-overlapped AllGather+GEMM and GEMM+ReduceScatter).
Added FP8 support.
Added cublasMpStatus_t.
Added cublasMpStreamSet and cublasMpStreamGet.
Added cublasMpMatrixDescriptorInit API to allow reusing matrix descriptors.
Added NVSHMEM dependency.
Added Matmul helper APIs: cublasMpMatmulDescriptorCreate, cublasMpMatmulDescriptorDestroy, cublasMpMatmulDescriptorAttributeSet, cublasMpMatmulDescriptorAttributeGet.
Dropped support for CUDA 11.x.
Bug fixes.

Removed the cublasMpHandle_t parameter from the cublasMpGridCreate, cublasMpGridDestroy, cublasMpMatrixDescriptorCreate, cublasMpMatrixDescriptorDestroy, and cublasMpGetVersion APIs.
Changed the return status of all functions to cublasMpStatus_t.
Removed cublasMpSetMathMode and cublasMpGetMathMode APIs.

Released: May 29, 2024

Released: April 4, 2024

Released: February 22, 2024

Released: January 11, 2024

Released: December 11, 2023