Release Notes#
This section lists significant changes, new features, performance improvements, and known issues. Unless otherwise noted, listed issues should not affect functionality. When functionality is affected, we offer a workaround to avoid the issue (if available).
0.3.0#
The third early access (EA) release of the cuBLASDx library brings support for selected integer types, matrix multiplication with results stored in registers, and the decoupling of compute types from input/output types.
New Features#
- Register fragment tensor support:
  - Added partitioning, predication, and transformation tools.
  - Added copying and partitioning utilities for moving data to and from register file buffers.
- New GEMM register APIs allowing for better performance (see the sketch after this list):
  - Using registers to store the `C` (result) matrix saves shared memory capacity and transfers by performing accumulation and input/output in registers.
- Integral type support, including MMA instruction support for integral types.
- Compute precision decoupled from input precision:
  - A GEMM function can now accept any input type and convert it in registers, saving on memory usage.
- More robust shared memory management tools have been added, enabling construction of efficient pipelined execution.
- The library no longer statically asserts on whether the GEMM problem will fit in shared memory, because:
  - `A` and `B` can alias or overlap each other,
  - `C` can reside entirely in the register file, and
  - input precision can be arbitrary, and is not defined by the compute precision.
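For illustration, below is a minimal sketch of how the register APIs fit together, assuming a GEMM descriptor built from the operators shown elsewhere in these notes. The partitioner and fragment helper names (`suggest_partitioner`, `make_accumulator_fragment`, `copy_fragment`) are assumptions for illustration only; consult the shipped examples for the exact spellings.

```cpp
#include <cublasdx.hpp>
using namespace cublasdx;

// GEMM descriptor: __half inputs accumulated in float (decoupled precision).
using GEMM = decltype(Size<64, 64, 64>()
                      + Precision<__half, __half, float>()
                      + Type<type::real>()
                      + Function<function::MM>()
                      + SM<800>()
                      + Block());

__global__ void gemm_kernel(const __half* a, const __half* b, float* c) {
    extern __shared__ __align__(16) char smem[];

    // Stage the A and B tiles into shared memory tensors; C never touches
    // shared memory in this flow.
    auto* sa = reinterpret_cast<__half*>(smem);
    auto* sb = sa + 64 * 64;
    auto a_s = make_tensor(sa, GEMM::get_layout_smem_a());
    auto b_s = make_tensor(sb, GEMM::get_layout_smem_b());
    copy<GEMM>(make_tensor(a, GEMM::get_layout_gmem_a()), a_s);
    copy<GEMM>(make_tensor(b, GEMM::get_layout_gmem_b()), b_s);
    __syncthreads();

    // Partition the C (result) matrix across threads and keep it in registers.
    auto partitioner = GEMM::suggest_partitioner();              // name assumed
    auto c_frag      = partitioner.make_accumulator_fragment();  // name assumed
    clear(c_frag);

    GEMM().execute(a_s, b_s, c_frag);  // accumulation happens in registers

    // Write the register fragment back through the partitioner's thread mapping.
    copy_fragment<GEMM>(c_frag, make_tensor(c, GEMM::get_layout_gmem_c()),
                        partitioner);                            // name assumed
}
```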
Breaking Changes#
- Shared memory slicing and shared memory size utilities are no longer available as `BLAS` methods.
- The shared memory size trait has been removed from the `BLAS` type.
- The `is_supported` trait has been removed; `is_supported_smem_restrict` and `is_supported_rmem_restrict` have been added in its place (see the sketch after this list).
- The library no longer asserts on whether a size will fit in shared memory; ensuring that it does is now the user’s responsibility.
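A minimal sketch of migrating from the removed trait to the new ones. The template parameter list (description type plus SM architecture value) is an assumption mirroring the old `is_supported` trait; check the shipped headers for the exact form.

```cpp
#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<64, 64, 64>()
                      + cublasdx::Precision<__half, __half, float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<800>()
                      + cublasdx::Block());

// Before 0.3.0 (removed):
// static_assert(cublasdx::is_supported<GEMM, 800>::value, "unsupported GEMM");

// From 0.3.0 on, the shared-memory and register-fragment paths are checked
// separately (parameter list assumed):
static_assert(cublasdx::is_supported_smem_restrict<GEMM, 800>::value,
              "shared-memory GEMM not supported on SM80");
static_assert(cublasdx::is_supported_rmem_restrict<GEMM, 800>::value,
              "register-fragment GEMM not supported on SM80");
```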
Resolved Issues#
- cuBLASDx internal versioning definitions have been fixed (a minimal check follows this list):
  - `CUBLASDX_VERSION_MAJOR` is now defined as a multiple of 10000 in `CUBLASDX_VERSION` instead of 1000, i.e. `CUBLASDX_VERSION_MAJOR = CUBLASDX_VERSION / 10000`.
  - `CUBLASDX_VERSION_MINOR` can now have a maximum of two digits.
  - Since there was no major release, there is no extra gap between this version and the previous one.
  - All definitions can be checked in the file `cublasdx_version.hpp`.
- Missing traits have been added to follow C++ metaprogramming conventions.
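A minimal compile-time check of the fixed encoding. The minor/patch split below assumes the usual `major * 10000 + minor * 100 + patch` scheme implied by the notes above; `cublasdx_version.hpp` is the authoritative reference.

```cpp
#include <cublasdx.hpp>  // brings in cublasdx_version.hpp

// Major version now occupies the 10000s place of CUBLASDX_VERSION.
static_assert(CUBLASDX_VERSION_MAJOR == CUBLASDX_VERSION / 10000,
              "major version is a multiple of 10000 since 0.3.0");

// Minor version can use up to two digits (split assumed; see above).
static_assert(CUBLASDX_VERSION_MINOR == (CUBLASDX_VERSION % 10000) / 100,
              "minor version occupies the 100s and 1000s places");
```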
Known Issues#
- It’s recommended to use the latest CUDA Toolkit and NVCC compiler.
  - CUDA 12.4 has known edge cases producing incorrect FP8 MMA emulation code on SM90 with the register APIs.
  - CUDA 12.1 has known edge cases of crashing when passing previously named values into 3-value operators, e.g. `Alignment<var_name_1, var_name_2, var_name_3>` may cause a compilation hang, while `Alignment<8, 8, 8>` will always work.
0.2.0#
The second early access (EA) release of the cuBLASDx library brings a tensor API, mixed precision support, and performance improvements.
New Features#
- All Device Extensions libraries are bundled together in a single package named `nvidia-mathdx-24.08.0.tar.gz`.
- Added new tensor-based `execute(...)` API (a usage sketch follows this list):
  - Improved performance and a user-friendly interface for matrices thanks to support for CuTe tensors (`cute::Tensor`).
  - Helper methods for slicing shared memory between matrices.
  - Easy tensor creation thanks to the `get_layout_*()` methods and `cublasdx::make_tensor`.
  - Suggestions for the best layouts for matrices in shared memory to improve the performance of both matrix multiplication and global-shared memory I/O operations.
  - Updated Introduction and Achieving High Performance sections.
  - `cublasdx::copy` for copying shared and global memory tensors.
- Added support for mixed precision in the `Precision` operator, including the long-requested TensorFloat-32: `Precision<tfloat32_t, tfloat32_t, float>`, and `Precision<__half, __half, float>`.
- Support for 8-bit floating-point matrices (`__nv_fp8_e4m3` and `__nv_fp8_e5m2`) in GEMM.
- Added the `Alignment` operator to provide alignment information. It’s recommended to use 16-byte (128-bit) alignment for better performance.
- The `TransposeMode` operator is deprecated and replaced with the `Arrangement` operator.
- The `TransposeMode` operator (or `Arrangement`, which replaces it) is no longer explicitly needed to define a complete BLAS execution.
  - The default arrangement is `row_major` for the A matrix, `col_major` for B, and `col_major` for C.
  - The default transpose mode is now `transposed` for the A matrix, and `non_transposed` for B.
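A minimal sketch combining the 0.2.0 additions above: a descriptor with mixed precision, `Arrangement`, and `Alignment`, plus tensor creation and copies. The shared memory slicing helper and the layout getters follow the `get_layout_*()` pattern named above, but their exact names and signatures are assumptions to be checked against the shipped examples.

```cpp
#include <cublasdx.hpp>
using namespace cublasdx;

// Mixed-precision GEMM with explicit arrangement and the recommended
// 16-byte alignment.
using GEMM = decltype(Size<32, 32, 32>()
                      + Precision<__half, __half, float>()
                      + Type<type::real>()
                      + Arrangement<row_major, col_major, col_major>()
                      + Alignment<16, 16, 16>()
                      + Function<function::MM>()
                      + SM<800>()
                      + Block());

__global__ void gemm_kernel(const __half* a_g, const __half* b_g, float* c_g) {
    extern __shared__ __align__(16) char smem[];

    // Slice shared memory between the three matrices (helper name assumed).
    auto [a_ptr, b_ptr, c_ptr] = slice_shared_memory<GEMM>(smem);

    // Tensor creation via get_layout_*() and cublasdx::make_tensor.
    auto a = make_tensor(a_ptr, GEMM::get_layout_smem_a());
    auto b = make_tensor(b_ptr, GEMM::get_layout_smem_b());
    auto c = make_tensor(c_ptr, GEMM::get_layout_smem_c());

    // cublasdx::copy moves data between global and shared memory tensors.
    copy<GEMM>(make_tensor(a_g, GEMM::get_layout_gmem_a()), a);
    copy<GEMM>(make_tensor(b_g, GEMM::get_layout_gmem_b()), b);
    __syncthreads();

    GEMM().execute(a, b, c);  // tensor-based execute(...)

    __syncthreads();
    copy<GEMM>(c, make_tensor(c_g, GEMM::get_layout_gmem_c()));
}
```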
Known Issues#
It’s recommended to use the latest CUDA Toolkit and NVCC compiler.
0.1.0#
The first early access (EA) release of cuBLASDx library.
New Features#
- Support for general matrix multiply (GEMM); a basic usage sketch follows this list.
- Tensor Cores support for fp16, fp64, and complex fp64 calculations.
- Support for SM70 - SM90 CUDA architectures.
- Multiple examples included.
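A hedged sketch of a basic GEMM from this release, assuming a pointer-based `execute(alpha, a, b, beta, c)` entry point operating on shared memory; the exact signature should be verified against the 0.1.0 examples. The `value_type` access uses the workaround described in Known Issues below.

```cpp
#include <cublasdx.hpp>
using namespace cublasdx;

constexpr auto t_mode = transpose_mode::non_transposed;

using GEMM = decltype(Size<32 /* M */, 32 /* N */, 32 /* K */>()
                      + Precision<double>()
                      + Type<type::real>()
                      + TransposeMode<t_mode /* A */, t_mode /* B */>()
                      + Function<function::MM>()
                      + SM<700>()
                      + Block());

__global__ void gemm_kernel(double* a, double* b, double* c) {
    // Workaround #1 from Known Issues below avoids the NVCC value_type bug.
    using value_type = typename decltype(GEMM())::value_type;

    extern __shared__ value_type smem[];
    value_type* sa = smem;          // 32 x 32 tile of A
    value_type* sb = sa + 32 * 32;  // 32 x 32 tile of B
    value_type* sc = sb + 32 * 32;  // 32 x 32 tile of C

    // ... cooperative loads of a, b, c into sa, sb, sc elided ...
    __syncthreads();

    // C = 1.0 * A * B + 0.0 * C, computed in shared memory (signature assumed).
    GEMM().execute(value_type(1.0), sa, sb, value_type(0.0), sc);

    __syncthreads();
    // ... cooperative store of sc back to c elided ...
}
```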
Known Issues#
Since CUDA Toolkit 12.2, the NVCC compiler in certain situations reports an incorrect compilation error when the `value_type` type of a GEMM description type is used. The problematic code and possible workarounds are presented below:

```cpp
// Any GEMM description type
using GEMM = decltype(Size<32 /* M */, 32 /* N */, 32 /* K */>()
                      + Precision<double>()
                      + Type<type::real>()
                      + TransposeMode<t_mode /* A */, t_mode /* B */>()
                      + Function<function::MM>()
                      + SM<700>()
                      + Block());

using type = typename GEMM::value_type; // compilation error

// Workaround #1
using type = typename decltype(GEMM())::value_type;

// Workaround #2 (used in cuBLASDx examples)
template <typename T>
using value_type_t = typename T::value_type;
using type = value_type_t<GEMM>;
```