Release Notes#
cuBLASMp v0.5.0#
Released: June 16, 2025
Breaking changes#
- cuBLASMp has transitioned from the Communication Abstraction Library (libcal) to using NCCL directly. This is a breaking change and requires changes to cuBLASMp initialization in the user application.
See Migrating from CAL to NCCL for the steps to transition an application from libcal to NCCL. The steps to initialize cuBLASMp with NCCL are described in NCCL Initialization; a minimal sketch is also shown below.
The libcal documentation page is still available at CAL Initialization (Legacy), but it applies only to cuBLASMp versions older than 0.5.0.
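The sketch below illustrates the intended flow after this change, assuming a one-process-per-GPU job launched with MPI: each rank creates an NCCL communicator with the standard NCCL calls and hands it to cuBLASMp when creating the process grid. The MPI and NCCL calls are standard; the cuBLASMp argument lists (cublasMpCreate, cublasMpGridCreate) and the CUBLASMP_GRID_LAYOUT_COL_MAJOR enumerator shown here are assumptions and should be verified against NCCL Initialization.

```cpp
// Minimal sketch of NCCL-based cuBLASMp initialization (one process per GPU,
// launched with MPI). The NCCL calls are standard; the cuBLASMp argument
// lists below are assumptions -- consult the NCCL Initialization section.
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>
#include <cublasmp.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int localDevices = 0;
    cudaGetDeviceCount(&localDevices);
    cudaSetDevice(rank % localDevices);

    // Create an NCCL communicator spanning all ranks (replaces cal_comm_t).
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);

    // Hand the NCCL communicator to cuBLASMp when creating the process grid.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cublasMpHandle_t handle;
    cublasMpCreate(&handle, stream);                      // assumed signature
    cublasMpGrid_t grid;
    cublasMpGridCreate(nranks, 1, CUBLASMP_GRID_LAYOUT_COL_MAJOR,
                       comm, &grid);                      // assumed signature

    /* ... distributed GEMM / Matmul calls ... */

    cublasMpGridDestroy(grid);
    cublasMpDestroy(handle);
    cudaStreamDestroy(stream);
    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```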
New features#
Added support for NT and TT cases in AllGather + GEMM and GEMM + ReduceScatter variants of the cublasMpMatmul API.
Added support for epilogues in the cublasMpMatmul API. See cublasMpMatmulEpilogue_t for more details. Epilogues can be set using cublasMpMatmulDescriptorAttributeSet.
Added support for tensor-wide scaling for cublasMpMatmul API with FP8 inputs (Compute Capability 9.0+). It can be used via cublasMpMatmulDescriptorAttributeSet (see cublasMpMatmulDescriptorAttribute_t and cublasMpMatmulMatrixScale_t).
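As an illustration of the new descriptor-driven options, the sketch below configures an epilogue on a Matmul descriptor before the cublasMpMatmul call. The helper names come from these notes; the enumerators (CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE, CUBLASMP_MATMUL_EPILOGUE_BIAS) and the cublasLt-style (attribute, pointer, size) calling convention are assumptions, so consult cublasMpMatmulDescriptorAttribute_t and cublasMpMatmulEpilogue_t for the actual values.

```cpp
// Minimal sketch: setting an epilogue on a cublasMpMatmul descriptor.
// The descriptor helpers are named in these release notes; the enumerator
// names and argument order below are assumptions modeled on cublasLt.
#include <cublasmp.h>

void configure_matmul_epilogue()
{
    cublasMpMatmulDescriptor_t desc;
    cublasMpMatmulDescriptorCreate(&desc, CUBLAS_COMPUTE_32F);  // assumed signature

    // Fuse a bias-add epilogue into the distributed Matmul (hypothetical
    // enumerator names; see cublasMpMatmulEpilogue_t for the real ones).
    cublasMpMatmulEpilogue_t epilogue = CUBLASMP_MATMUL_EPILOGUE_BIAS;
    cublasMpMatmulDescriptorAttributeSet(
        desc, CUBLASMP_MATMUL_DESCRIPTOR_ATTRIBUTE_EPILOGUE,
        &epilogue, sizeof(epilogue));

    // For FP8 inputs on Compute Capability 9.0+, tensor-wide scaling is
    // selected the same way through cublasMpMatmulMatrixScale_t attributes.

    /* ... cublasMpMatmul(...) using desc ... */

    cublasMpMatmulDescriptorDestroy(desc);
}
```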
cuBLASMp v0.4.0#
Released: March 10, 2025
Added support for NVIDIA Blackwell GPU architecture.
Added support for GEMM + AllReduce using the cublasMpMatmul API.
Added support for NN cases in AllGather + GEMM and GEMM + ReduceScatter variants of the cublasMpMatmul API.
Bug fixes.
Deprecated functionality: atomic Matmul with multicast communication (i.e., cublasMpMatmul with cublasMpMatmulAlgoType_t = CUBLASMP_MATMUL_ALGO_TYPE_ATOMIC_MULTICAST). The implementation remains functional but is not performant and will be removed in a future release.
cuBLASMp v0.3.1#
Released: December 10, 2024
Added an option to set the number of SMs used for communication (currently relevant only for Atomic GEMM + ReduceScatter).
Decreased the workspace size requirement in TP overlap GEMMs.
Removed extra synchronization in TP overlap GEMMs.
Allowed the C matrix to be null when beta is 0.
Fixed GEMM implementation for complex types with transA/transB being CUBLAS_OP_T.
cuBLASMp v0.3.0#
Released: November 4, 2024
Added new cublasMpMatmul API.
Added GEMM/Matmul fast paths required for tensor parallelism (communication-computation-overlapped AllGather+GEMM and GEMM+ReduceScatter).
Added FP8 support.
Added cublasMpStatus_t.
Added cublasMpStreamSet and cublasMpStreamGet.
Added cublasMpMatrixDescriptorInit API to allow reusing matrix descriptors.
Added NVSHMEM dependency.
Added Matmul helper APIs: cublasMpMatmulDescriptorCreate, cublasMpMatmulDescriptorDestroy, cublasMpMatmulDescriptorAttributeSet, cublasMpMatmulDescriptorAttributeGet.
Dropped support for CUDA 11.x.
Bug fixes.
Breaking changes#
Removed the cublasMpHandle_t parameter from cublasMpGridCreate, cublasMpGridDestroy, cublasMpMatrixDescriptorCreate, cublasMpMatrixDescriptorDestroy, cublasMpGetVersion APIs.
Changed the return status of all functions to cublasMpStatus_t.
Removed the cublasMpSetMathMode and cublasMpGetMathMode APIs.
cuBLASMp v0.2.1#
Released: May 29, 2024
Added mixed and lower precision support.
Bug fixes.
cuBLASMp v0.2.0#
Released: April 4, 2024
Improved performance of cublasMpGemm.
Bug fixes.
cuBLASMp v0.1.2#
Released: February 22, 2024
Added cublasMpGeadd.
Added cublasMpTradd.
Improved performance of cublasMpGemm.
Improved performance of cublasMpTrsm.
cuBLASMp v0.1.1#
Released: January 11, 2024
Added rsrc and csrc support.
Added cublasMpGemr2D.
Added cublasMpTrmr2D.
cuBLASMp v0.1.0#
Released: December 11, 2023
Early access release.
This release focuses on functionality.