Release Notes#
cuBLASMp v0.4.0#
Added support for NVIDIA Blackwell GPU architecture.
Added support for GEMM + AllReduce using the cublasMpMatmul API.
Added support for NN cases in AllGather + GEMM and GEMM + ReduceScatter variants of the cublasMpMatmul API.
Bug fixes.
Deprecated functionality: atomic Matmul with multicast communication (i.e., cublasMpMatmul with cublasMpMatmulAlgoType_t = CUBLASMP_MATMUL_ALGO_TYPE_ATOMIC_MULTICAST). The functional implementation remains available, but it is not performant and will be removed in a future release.
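As a rough picture of what the GEMM + AllReduce pattern listed above computes, the following pure-Python sketch simulates it on a single process. It assumes one common tensor-parallel layout, in which A and B are split along the shared K dimension. The function names (`matmul`, `gemm_allreduce`) are hypothetical and are not part of cuBLASMp; the library's actual data layout and its overlap of communication with computation are not modeled here.

```python
def matmul(a, b):
    """Multiply two row-major matrices stored as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gemm_allreduce(a_shards, b_shards):
    """Simulate GEMM + AllReduce over a K-split (hypothetical helper).

    a_shards[r] is the column slice of A held by rank r,
    b_shards[r] is the matching row slice of B.
    Returns one full result matrix per simulated rank.
    """
    # Step 1: every rank computes a full-size partial product.
    partials = [matmul(a_r, b_r) for a_r, b_r in zip(a_shards, b_shards)]
    m, n = len(partials[0]), len(partials[0][0])
    # Step 2: AllReduce (sum), after which every rank holds the same matrix.
    total = [[sum(p[i][j] for p in partials) for j in range(n)]
             for i in range(m)]
    return [[row[:] for row in total] for _ in partials]


# A is 2x4 and B is 4x2, split across two simulated ranks along K.
A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [1, 1], [2, 0]]
a_shards = [[row[0:2] for row in A], [row[2:4] for row in A]]
b_shards = [B[0:2], B[2:4]]
per_rank = gemm_allreduce(a_shards, b_shards)
```

Every entry of `per_rank` equals the unsharded product `matmul(A, B)`, which is the property that distinguishes AllReduce from ReduceScatter, where each rank would keep only a slice of the sum.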
cuBLASMp v0.3.1#
Added an option to set the number of SMs used for communication (currently relevant only for Atomic GEMM + ReduceScatter).
Decreased the workspace size requirement in tensor-parallel (TP) overlap GEMMs.
Removed extra synchronization in TP overlap GEMMs.
Allowed the C matrix to be null when beta is 0.
Fixed the GEMM implementation for complex types with transA/transB being CUBLAS_OP_T.
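The null-C change above follows from the usual GEMM definition, D = alpha * A * B + beta * C: when beta is 0, the C term contributes nothing, so C never needs to be read. The sketch below illustrates that reasoning in plain Python. `gemm_reference` is a hypothetical reference routine written for this note, not a cuBLASMp function, and it does not describe how the library handles the case internally.

```python
def gemm_reference(alpha, a, b, beta, c=None):
    """Return alpha * (a @ b) + beta * c for row-major lists of lists.

    Hypothetical reference routine: when beta == 0, c is never read,
    so passing c=None is valid.
    """
    if beta != 0 and c is None:
        raise ValueError("c is required when beta is non-zero")
    m, k, n = len(a), len(b), len(b[0])
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            acc = alpha * sum(a[i][p] * b[p][j] for p in range(k))
            if beta != 0:
                # Only touch c when it actually contributes to the result.
                acc += beta * c[i][j]
            row.append(acc)
        out.append(row)
    return out


no_c = gemm_reference(2, [[1, 2]], [[3], [4]], 0)            # c omitted
with_c = gemm_reference(2, [[1, 2]], [[3], [4]], 1, [[1]])   # c required
```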
cuBLASMp v0.3.0#
Added new cublasMpMatmul API.
Added GEMM/Matmul fast paths required for tensor parallelism (communication-computation-overlapped AllGather+GEMM and GEMM+ReduceScatter).
Added FP8 support.
Added cublasMpStatus_t.
Added cublasMpStreamSet and cublasMpStreamGet.
Added cublasMpMatrixDescriptorInit API to allow reusing matrix descriptors.
Added NVSHMEM dependency.
Added Matmul helper APIs: cublasMpMatmulDescriptorCreate, cublasMpMatmulDescriptorDestroy, cublasMpMatmulDescriptorAttributeSet, cublasMpMatmulDescriptorAttributeGet.
Dropped support for CUDA 11.x.
Bug fixes.
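The two tensor-parallel fast paths named above can be sketched as a single-process simulation. This assumes one common sharding: for AllGather + GEMM, each rank owns a row block of A and a column block of B; for GEMM + ReduceScatter, A and B are split along K and the summed result is scattered by row block. All function names are hypothetical, and the sketch deliberately ignores the communication-computation overlap that the library provides.

```python
def matmul(a, b):
    """Multiply two row-major matrices stored as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def allgather_gemm(a_row_shards, b_col_shards):
    """AllGather the row blocks of A, then multiply by each rank's B columns."""
    gathered_a = [row for shard in a_row_shards for row in shard]  # AllGather
    return [matmul(gathered_a, b_r) for b_r in b_col_shards]


def gemm_reducescatter(a_col_shards, b_row_shards):
    """Compute K-split partial products, sum them, and scatter by row block.

    Assumes the number of result rows is divisible by the number of ranks.
    """
    partials = [matmul(a_r, b_r)
                for a_r, b_r in zip(a_col_shards, b_row_shards)]
    nranks = len(partials)
    m, n = len(partials[0]), len(partials[0][0])
    rows_per_rank = m // nranks
    out = []
    for r in range(nranks):
        rows = range(r * rows_per_rank, (r + 1) * rows_per_rank)
        # Rank r keeps only its own row block of the summed result.
        out.append([[sum(p[i][j] for p in partials) for j in range(n)]
                    for i in rows])
    return out


A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [1, 1], [2, 0]]

ag = allgather_gemm([A[0:1], A[1:2]],
                    [[row[0:1] for row in B], [row[1:2] for row in B]])
rs = gemm_reducescatter([[row[0:2] for row in A], [row[2:4] for row in A]],
                        [B[0:2], B[2:4]])
```

In both cases the per-rank pieces reassemble into the full product: `ag` holds one column block per rank, and `rs` holds one row block per rank.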
Breaking changes#
Removed the cublasMpHandle_t parameter from cublasMpGridCreate, cublasMpGridDestroy, cublasMpMatrixDescriptorCreate, cublasMpMatrixDescriptorDestroy, cublasMpGetVersion APIs.
Changed the return status of all functions to cublasMpStatus_t.
Removed cublasMpSetMathMode and cublasMpGetMathMode APIs.
cuBLASMp v0.2.1#
Added mixed and lower precision support.
Bug fixes.
cuBLASMp v0.2.0#
Improved performance of cublasMpGemm.
Bug fixes.
cuBLASMp v0.1.2#
Added cublasMpGeadd.
Added cublasMpTradd.
Improved performance of cublasMpGemm.
Improved performance of cublasMpTrsm.
cuBLASMp v0.1.1#
Added rsrc and csrc support.
Added cublasMpGemr2D.
Added cublasMpTrmr2D.
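For context on rsrc and csrc: in the ScaLAPACK-style 2D block-cyclic convention, these values name the process row and process column that hold the first block of a distributed matrix. The sketch below assumes that convention with 0-based indices. `owner` is a hypothetical helper for illustration, not a cuBLASMp function.

```python
def owner(i, j, mb, nb, rsrc, csrc, nprow, npcol):
    """Return the (process row, process column) owning global element (i, j).

    Assumes a 2D block-cyclic layout with mb x nb blocks on an
    nprow x npcol process grid; all indices are 0-based.
    """
    return ((rsrc + i // mb) % nprow, (csrc + j // nb) % npcol)


# 2x2 blocks on a 2x2 process grid, distribution starting at process (0, 0).
first = owner(0, 0, 2, 2, 0, 0, 2, 2)
next_block_row = owner(2, 0, 2, 2, 0, 0, 2, 2)

# Same matrix, but with the distribution starting at process (1, 1).
shifted = owner(0, 0, 2, 2, 1, 1, 2, 2)
```

Non-zero rsrc/csrc values shift the whole ownership pattern cyclically, which is what makes it possible to describe a submatrix that starts on a process other than (0, 0).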
cuBLASMp v0.1.0#
Early access release.
This release focuses on functionality.