Release Notes#
This section lists significant changes, new features, performance improvements, and known issues. Unless otherwise noted, listed issues should not affect functionality. When functionality is affected, we offer a workaround to avoid the issue (if available).
0.3.0#
The third early access (EA) release of the cuBLASDx library brings support for selected integer types, matrix multiplication with results stored in registers, and the decoupling of compute types from input/output types.
New Features#
- Register fragment tensor support:
  - Added partitioning, predication, and transformation tools.
  - Added copying and partitioning utilities for moving data to and from register file buffers.
- New GEMM register APIs allowing for better performance (see the sketch after this list):
  - Using registers to store the `C` (result) matrix saves shared memory capacity and transfers by performing accumulation and input/output in registers.
- Integral type support, including MMA instruction support for integral types.
- Compute precision decoupled from input precision:
  - A GEMM function can now accept any input type and convert it in registers, saving on memory usage.
- More robust shared memory management tools have been added, enabling construction of efficient pipelined execution.
- The library no longer statically asserts on whether the GEMM problem will fit in shared memory, because:
  - `A` and `B` can alias or overlap each other,
  - `C` can reside entirely in the register file, and
  - input precision can be arbitrary, and is not defined by the compute precision.
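For illustration, below is a minimal sketch of how the register APIs fit together, assuming a GEMM descriptor built from the operators shown elsewhere in these notes. The partitioner and fragment helper names (`suggest_partitioner`, `make_accumulator_fragment`, `copy_fragment`) are assumptions for illustration only; consult the shipped examples for the exact spellings.

```cpp
#include <cublasdx.hpp>
using namespace cublasdx;

// GEMM descriptor: __half inputs accumulated in float (decoupled precision).
using GEMM = decltype(Size<64, 64, 64>()
                      + Precision<__half, __half, float>()
                      + Type<type::real>()
                      + Function<function::MM>()
                      + SM<800>()
                      + Block());

__global__ void gemm_kernel(const __half* a, const __half* b, float* c) {
    extern __shared__ __align__(16) char smem[];

    // Stage the A and B tiles into shared memory tensors; C never touches
    // shared memory in this flow.
    auto* sa = reinterpret_cast<__half*>(smem);
    auto* sb = sa + 64 * 64;
    auto a_s = make_tensor(sa, GEMM::get_layout_smem_a());
    auto b_s = make_tensor(sb, GEMM::get_layout_smem_b());
    copy<GEMM>(make_tensor(a, GEMM::get_layout_gmem_a()), a_s);
    copy<GEMM>(make_tensor(b, GEMM::get_layout_gmem_b()), b_s);
    __syncthreads();

    // Partition the C (result) matrix across threads and keep it in registers.
    auto partitioner = GEMM::suggest_partitioner();              // name assumed
    auto c_frag      = partitioner.make_accumulator_fragment();  // name assumed
    clear(c_frag);

    GEMM().execute(a_s, b_s, c_frag);  // accumulation happens in registers

    // Write the register fragment back through the partitioner's thread mapping.
    copy_fragment<GEMM>(c_frag, make_tensor(c, GEMM::get_layout_gmem_c()),
                        partitioner);                            // name assumed
}
```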
Breaking Changes#
- Shared memory slicing and shared memory size utilities are no longer available as `BLAS` methods.
- The shared memory size trait has been removed from the `BLAS` type.
- The `is_supported` trait has been removed; `is_supported_smem_restrict` and `is_supported_rmem_restrict` have been added in its place (see the sketch after this list).
- The library no longer asserts on whether a size will fit in shared memory; ensuring that it does is now the user’s responsibility.
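A minimal sketch of migrating from the removed trait to the new ones. The template parameter list (description type plus SM architecture value) is an assumption mirroring the old `is_supported` trait; check the shipped headers for the exact form.

```cpp
#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<64, 64, 64>()
                      + cublasdx::Precision<__half, __half, float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<800>()
                      + cublasdx::Block());

// Before 0.3.0 (removed):
// static_assert(cublasdx::is_supported<GEMM, 800>::value, "unsupported GEMM");

// From 0.3.0 on, the shared-memory and register-fragment paths are checked
// separately (parameter list assumed):
static_assert(cublasdx::is_supported_smem_restrict<GEMM, 800>::value,
              "shared-memory GEMM not supported on SM80");
static_assert(cublasdx::is_supported_rmem_restrict<GEMM, 800>::value,
              "register-fragment GEMM not supported on SM80");
```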
Resolved Issues#
- cuBLASDx internal versioning definitions have been fixed (a minimal check follows this list):
  - `CUBLASDX_VERSION_MAJOR` is now defined as a multiple of 10000 in `CUBLASDX_VERSION` instead of 1000, i.e. `CUBLASDX_VERSION_MAJOR = CUBLASDX_VERSION / 10000`.
  - `CUBLASDX_VERSION_MINOR` can now have a maximum of two digits.
  - Since there was no major release, there is no extra gap between this version and the previous one.
  - All definitions can be checked in the file `cublasdx_version.hpp`.
- Missing traits have been added to follow C++ metaprogramming conventions.
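A minimal compile-time check of the fixed encoding. The minor/patch split below assumes the usual `major * 10000 + minor * 100 + patch` scheme implied by the notes above; `cublasdx_version.hpp` is the authoritative reference.

```cpp
#include <cublasdx.hpp>  // brings in cublasdx_version.hpp

// Major version now occupies the 10000s place of CUBLASDX_VERSION.
static_assert(CUBLASDX_VERSION_MAJOR == CUBLASDX_VERSION / 10000,
              "major version is a multiple of 10000 since 0.3.0");

// Minor version can use up to two digits (split assumed; see above).
static_assert(CUBLASDX_VERSION_MINOR == (CUBLASDX_VERSION % 10000) / 100,
              "minor version occupies the 100s and 1000s places");
```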
Known Issues#
- It’s recommended to use the latest CUDA Toolkit and NVCC compiler.
  - CUDA 12.4 has known edge cases producing incorrect FP8 MMA emulation code on SM90 with the register APIs.
  - CUDA 12.1 has known edge cases of crashing when passing previously named values into 3-value operators, e.g. `Alignment<var_name_1, var_name_2, var_name_3>` may cause a compilation hang, while `Alignment<8, 8, 8>` will always work.
0.2.0#
The second early access (EA) release of the cuBLASDx library brings a tensor API, mixed precision support, and performance improvements.
New Features#
- All Device Extensions libraries are bundled together in a single package named `nvidia-mathdx-24.08.0.tar.gz`.
- Added new tensor-based `execute(...)` API (a usage sketch follows this list):
  - Improved performance and a user-friendly interface for matrices thanks to support for CuTe tensors (`cute::Tensor`).
  - Helper methods for slicing shared memory between matrices.
  - Easy tensor creation thanks to the `get_layout_*()` methods and `cublasdx::make_tensor`.
  - Suggestions for the best layouts for matrices in shared memory to improve the performance of both matrix multiplication and global-shared memory I/O operations.
  - Updated Introduction and Achieving High Performance sections.
  - `cublasdx::copy` for copying shared and global memory tensors.
- Added support for mixed precision in the `Precision` operator, including the long-requested TensorFloat-32: `Precision<tfloat32_t, tfloat32_t, float>`, and `Precision<__half, __half, float>`.
- Support for 8-bit floating-point matrices (`__nv_fp8_e4m3` and `__nv_fp8_e5m2`) in GEMM.
- Added the `Alignment` operator to provide alignment information. It’s recommended to use 16-byte (128-bit) alignment for better performance.
- The `TransposeMode` operator is deprecated and replaced with the `Arrangement` operator.
- The `TransposeMode` operator (or `Arrangement`, which replaces it) is no longer explicitly needed to define a complete BLAS execution.
  - The default arrangement is `row_major` for the A matrix, `col_major` for B, and `col_major` for C.
  - The default transpose mode is now `transposed` for the A matrix, and `non_transposed` for B.
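A minimal sketch combining the 0.2.0 additions above: a descriptor with mixed precision, `Arrangement`, and `Alignment`, plus tensor creation and copies. The shared memory slicing helper and the layout getters follow the `get_layout_*()` pattern named above, but their exact names and signatures are assumptions to be checked against the shipped examples.

```cpp
#include <cublasdx.hpp>
using namespace cublasdx;

// Mixed-precision GEMM with explicit arrangement and the recommended
// 16-byte alignment.
using GEMM = decltype(Size<32, 32, 32>()
                      + Precision<__half, __half, float>()
                      + Type<type::real>()
                      + Arrangement<row_major, col_major, col_major>()
                      + Alignment<16, 16, 16>()
                      + Function<function::MM>()
                      + SM<800>()
                      + Block());

__global__ void gemm_kernel(const __half* a_g, const __half* b_g, float* c_g) {
    extern __shared__ __align__(16) char smem[];

    // Slice shared memory between the three matrices (helper name assumed).
    auto [a_ptr, b_ptr, c_ptr] = slice_shared_memory<GEMM>(smem);

    // Tensor creation via get_layout_*() and cublasdx::make_tensor.
    auto a = make_tensor(a_ptr, GEMM::get_layout_smem_a());
    auto b = make_tensor(b_ptr, GEMM::get_layout_smem_b());
    auto c = make_tensor(c_ptr, GEMM::get_layout_smem_c());

    // cublasdx::copy moves data between global and shared memory tensors.
    copy<GEMM>(make_tensor(a_g, GEMM::get_layout_gmem_a()), a);
    copy<GEMM>(make_tensor(b_g, GEMM::get_layout_gmem_b()), b);
    __syncthreads();

    GEMM().execute(a, b, c);  // tensor-based execute(...)

    __syncthreads();
    copy<GEMM>(c, make_tensor(c_g, GEMM::get_layout_gmem_c()));
}
```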
Known Issues#
It’s recommended to use the latest CUDA Toolkit and NVCC compiler.
0.1.0#
The first early access (EA) release of cuBLASDx library.
New Features#
- Support for general matrix multiply (GEMM); a basic usage sketch follows this list.
- Tensor Cores support for fp16, fp64, and complex fp64 calculations.
- Support for SM70 - SM90 CUDA architectures.
- Multiple examples included.
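A hedged sketch of a basic GEMM from this release, assuming a pointer-based `execute(alpha, a, b, beta, c)` entry point operating on shared memory; the exact signature should be verified against the 0.1.0 examples. The `value_type` access uses the workaround described in Known Issues below.

```cpp
#include <cublasdx.hpp>
using namespace cublasdx;

constexpr auto t_mode = transpose_mode::non_transposed;

using GEMM = decltype(Size<32 /* M */, 32 /* N */, 32 /* K */>()
                      + Precision<double>()
                      + Type<type::real>()
                      + TransposeMode<t_mode /* A */, t_mode /* B */>()
                      + Function<function::MM>()
                      + SM<700>()
                      + Block());

__global__ void gemm_kernel(double* a, double* b, double* c) {
    // Workaround #1 from Known Issues below avoids the NVCC value_type bug.
    using value_type = typename decltype(GEMM())::value_type;

    extern __shared__ value_type smem[];
    value_type* sa = smem;          // 32 x 32 tile of A
    value_type* sb = sa + 32 * 32;  // 32 x 32 tile of B
    value_type* sc = sb + 32 * 32;  // 32 x 32 tile of C

    // ... cooperative loads of a, b, c into sa, sb, sc elided ...
    __syncthreads();

    // C = 1.0 * A * B + 0.0 * C, computed in shared memory (signature assumed).
    GEMM().execute(value_type(1.0), sa, sb, value_type(0.0), sc);

    __syncthreads();
    // ... cooperative store of sc back to c elided ...
}
```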
Known Issues#
Since CUDA Toolkit 12.2, the NVCC compiler in certain situations reports an incorrect compilation error when the `value_type` type of a GEMM description type is used. The problematic code and possible workarounds are presented below:

```cpp
// Any GEMM description type
using GEMM = decltype(Size<32 /* M */, 32 /* N */, 32 /* K */>()
                      + Precision<double>()
                      + Type<type::real>()
                      + TransposeMode<t_mode /* A */, t_mode /* B */>()
                      + Function<function::MM>()
                      + SM<700>()
                      + Block());

using type = typename GEMM::value_type; // compilation error

// Workaround #1
using type = typename decltype(GEMM())::value_type;

// Workaround #2 (used in cuBLASDx examples)
template <typename T>
using value_type_t = typename T::value_type;
using type = value_type_t<GEMM>;
```