Release Notes#
cuBLASMp v0.7.0#
Released: November 24, 2025
New features#
Added FP32 emulation support.
Added FP4 output support.
Added CUBLASMP_MATMUL_ALGO_TYPE_NO_OVERLAP algorithm type for cublasMpMatmul API. These algorithms may perform better on non-NVLINK multi-node systems.
Added CUBLASMP_DISABLE_NVSHMEM environment variable to disable NVSHMEM usage at runtime. When set to 1, cuBLASMp will not initialize or use NVSHMEM, which can be useful for debugging or on platforms where NVSHMEM is not supported.
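The variable is normally exported in the launching shell, but it can also be set from inside the application. The following is a minimal sketch in C, not an official usage pattern: `request_nvshmem_disabled` is a hypothetical helper name, and the sketch assumes the variable is read when cuBLASMp initializes, so the helper would need to run before any cuBLASMp call.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of cuBLASMp): sets
 * CUBLASMP_DISABLE_NVSHMEM=1 in the environment of the current process.
 * Assumption: cuBLASMp reads the variable during its own initialization,
 * so this must be called before any cuBLASMp function.
 * Returns 0 on success, -1 on failure. */
static int request_nvshmem_disabled(void)
{
    /* setenv is POSIX; the last argument (1) overwrites an existing value. */
    if (setenv("CUBLASMP_DISABLE_NVSHMEM", "1", 1) != 0) {
        perror("setenv");
        return -1;
    }
    return 0;
}
```

In a multi-process job, each rank would have to call the helper itself (or inherit the variable from the launcher), since the environment is per process.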
Bug fixes#
Fixed epilogue related issues in AllGather+GEMM and GEMM+ReduceScatter.
Fixed a bug that prevented NVSHMEM from being finalized.
cuBLASMp v0.6.0#
Released: September 9, 2025
Added NVFP4 support (Compute Capability 10.0+).
Added support for block scaling for cublasMpMatmul API with FP4 and FP8 inputs (Compute Capability 9.0+). It can be used via cublasMpMatmulDescriptorAttributeSet (see cublasMpMatmulDescriptorAttribute_t and cublasMpMatmulMatrixScale_t).
Added support for Compute Capability 7.5 (Turing) with CUDA 13 builds.
Bug fixes#
Fixed redistribution functions (cublasMpGemr2D, cublasMpTrmr2D) to work with unified memory.
Fixed Matmul when C matrix is nullptr and D matrix type is BF16 or FP32.
Fixed a bug in multicast-based algorithms when used with communicator different from the one used for initialization.
cuBLASMp v0.5.1#
Released: August 11, 2025
Added support for CUDA 13 on devices with Compute Capability 8.0 (Ampere) and above.
cuBLASMp v0.5.0#
Released: June 16, 2025
Breaking changes#
cuBLASMp has transitioned from using the Communication Abstraction Library (libcal) to using NCCL directly. This is a breaking change and requires changes to cuBLASMp initialization in the user application.
See Migrating from CAL to NCCL for steps to transition the application from libcal to NCCL.
The steps to initialize cuBLASMp with NCCL are described in NCCL Initialization.
The libcal documentation page is still available at CAL Initialization (Legacy), but it applies only to cuBLASMp versions older than 0.5.0.
New features#
Added support for NT and TT cases in AllGather + GEMM and GEMM + ReduceScatter variants of the cublasMpMatmul API.
Added support for epilogues in the cublasMpMatmul API. See cublasMpMatmulEpilogue_t for more details. Epilogues can be set using cublasMpMatmulDescriptorAttributeSet.
Added support for tensor-wide scaling for cublasMpMatmul API with FP8 inputs (Compute Capability 9.0+). It can be used via cublasMpMatmulDescriptorAttributeSet (see cublasMpMatmulDescriptorAttribute_t and cublasMpMatmulMatrixScale_t).
cuBLASMp v0.4.0#
Released: March 10, 2025
Added support for NVIDIA Blackwell GPU architecture.
Added support for GEMM + AllReduce using the cublasMpMatmul API.
Added support for NN cases in AllGather + GEMM and GEMM + ReduceScatter variants of the cublasMpMatmul API.
Bug fixes.
Deprecated functionality: atomic Matmul with multicast communication (i.e., cublasMpMatmul with cublasMpMatmulAlgoType_t = CUBLASMP_MATMUL_ALGO_TYPE_ATOMIC_MULTICAST). The functional implementation remains available but it is not performant and will be removed in a future release.
cuBLASMp v0.3.1#
Released: December 10, 2024
Add option to set the number of SMs to be used for communication (currently relevant only for Atomic GEMM + ReduceScatter).
Decrease workspace size requirement in TP overlap GEMMs.
Remove extra synchronization in TP overlap GEMMs.
Allow C matrix to be null when beta is 0.
Fix GEMM implementation for complex types with transA/transB being CUBLAS_OP_T.
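The note above about allowing a null C matrix when beta is 0 follows the usual GEMM convention D = alpha * A * B + beta * C, where C does not need to be read if beta is zero. The following single-process C sketch illustrates that convention only. It is not cuBLASMp code: `ref_gemm` is a hypothetical reference function, and the column-major layout with tightly packed leading dimensions is an assumption made for the example.

```c
#include <stddef.h>

/* Hypothetical reference GEMM (illustration only, not a cuBLASMp API):
 * D = alpha * A * B + beta * C, column-major storage.
 * A is m x k, B is k x n, C and D are m x n, all tightly packed.
 * C may be NULL when beta == 0; in that case it is never dereferenced.
 * Returns 0 on success, -1 if C is NULL while beta != 0. */
static int ref_gemm(size_t m, size_t n, size_t k,
                    double alpha, const double *A, const double *B,
                    double beta, const double *C, double *D)
{
    if (C == NULL && beta != 0.0) {
        return -1; /* C is required whenever it contributes to the result */
    }
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p) {
                acc += A[i + p * m] * B[p + j * k];
            }
            /* Skip the read of C entirely when beta is zero. */
            double c_term = (beta != 0.0) ? beta * C[i + j * m] : 0.0;
            D[i + j * m] = alpha * acc + c_term;
        }
    }
    return 0;
}
```

Calling `ref_gemm` with `beta = 0.0` and `C = NULL` computes `alpha * A * B` directly into `D`, while a non-zero `beta` with a null `C` is rejected.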
cuBLASMp v0.3.0#
Released: November 4, 2024
Added new cublasMpMatmul API.
Added GEMM/Matmul fast paths required for tensor parallelism (communication-computation-overlapped AllGather+GEMM and GEMM+ReduceScatter).
Added FP8 support.
Added cublasMpStatus_t.
Added cublasMpStreamSet and cublasMpStreamGet.
Added cublasMpMatrixDescriptorInit API to allow reusing matrix descriptors.
Added NVSHMEM dependency.
Added Matmul helper APIs: cublasMpMatmulDescriptorCreate, cublasMpMatmulDescriptorDestroy, cublasMpMatmulDescriptorAttributeSet, cublasMpMatmulDescriptorAttributeGet.
Dropped support for CUDA 11.x.
Bug fixes.
Breaking changes#
Removed the cublasMpHandle_t parameter from cublasMpGridCreate, cublasMpGridDestroy, cublasMpMatrixDescriptorCreate, cublasMpMatrixDescriptorDestroy, cublasMpGetVersion APIs.
Changed the return status of all functions to cublasMpStatus_t.
Removed cublasMpSetMathMode and cublasMpGetMathMode APIs.
cuBLASMp v0.2.1#
Released: May 29, 2024
Added mixed and lower precision support.
Bug fixes.
cuBLASMp v0.2.0#
Released: April 4, 2024
Improved performance of cublasMpGemm.
Bug fixes.
cuBLASMp v0.1.2#
Released: February 22, 2024
Added cublasMpGeadd.
Added cublasMpTradd.
Improved performance of cublasMpGemm.
Improved performance of cublasMpTrsm.
cuBLASMp v0.1.1#
Released: January 11, 2024
Added rsrc and csrc support.
Added cublasMpGemr2D.
Added cublasMpTrmr2D.
cuBLASMp v0.1.0#
Released: December 11, 2023
Early access release.
This release focuses on functionality.