Release Notes¶
This section lists significant changes, new features, performance improvements, and known issues. Unless otherwise noted, listed issues should not impact functionality. When functionality is impacted, a workaround is provided where available.
0.2.0¶
The second early access (EA) release of the cuBLASDx library brings a tensor API, mixed precision support, and performance improvements.
New Features¶
All Device Extensions libraries are bundled together in a single package named nvidia-mathdx-24.08.0.tar.gz.
Added new tensor-based execute(…) API:
Improved performance and a user-friendly interface for matrices thanks to support for CuTe tensors (cute::Tensor).
Helper methods for slicing shared memory between matrices.
Easy tensor creation thanks to get_layout_*() methods and cublasdx::make_tensor.
Suggestions for the best layouts for matrices in shared memory to improve the performance of both matrix multiplication and global-shared memory I/O operations.
Updated the Introduction and Achieving High Performance sections.
cublasdx::copy for copying shared and global memory tensors.
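The pieces listed above can be combined in a single kernel. The following is a hedged sketch of the tensor-based workflow, not an excerpt from the library: the GEMM description (size, precision, SM) is illustrative, and the exact helper names (slice_shared_memory, get_layout_smem_*/get_layout_gmem_*, copy_wait) are assumed from the methods named in this release note.

```cuda
#include <cublasdx.hpp>

// Illustrative GEMM description (values are assumptions, not from the notes).
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                      + cublasdx::Precision<__half, __half, float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<800>()
                      + cublasdx::Block());

__global__ void gemm_kernel(const __half* a, const __half* b, float* c) {
    extern __shared__ __align__(16) char smem[];

    // Slice shared memory between A, B and C, then wrap each slice in a
    // CuTe tensor using the layouts suggested by the description.
    auto [sa, sb, sc] = cublasdx::slice_shared_memory<GEMM>(smem);
    auto ta = cublasdx::make_tensor(sa, GEMM::get_layout_smem_a());
    auto tb = cublasdx::make_tensor(sb, GEMM::get_layout_smem_b());
    auto tc = cublasdx::make_tensor(sc, GEMM::get_layout_smem_c());

    // Global memory tensors and global<->shared copies via cublasdx::copy.
    auto ga = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a());
    auto gb = cublasdx::make_tensor(b, GEMM::get_layout_gmem_b());
    auto gc = cublasdx::make_tensor(c, GEMM::get_layout_gmem_c());
    cublasdx::copy<GEMM>(ga, ta);
    cublasdx::copy<GEMM>(gb, tb);
    cublasdx::copy_wait();

    // New tensor-based execute(...).
    GEMM().execute(ta, tb, tc);
    __syncthreads();
    cublasdx::copy<GEMM>(tc, gc);
}
```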
Added long-requested support for mixed precision in Precision:
TensorFloat-32: Precision<tfloat32_t, tfloat32_t, float>, and Precision<__half, __half, float>.
GEMM support for 8-bit floating-point matrices (__nv_fp8_e4m3 and __nv_fp8_e5m2).
Added the Alignment operator to provide alignment information. It's recommended to use 16-byte (128-bit) alignment for better performance.
The TransposeMode operator is deprecated and replaced by the Arrangement operator.
The TransposeMode operator (or the Arrangement operator that replaces it) is no longer required to define a complete BLAS execution. The default arrangement is row-major for the A matrix and col-major for the B and C matrices; the corresponding default transpose mode is transposed for the A matrix and non-transposed for B.
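The migration from TransposeMode to Arrangement can be sketched as below. The enum spellings (transpose_mode::*, arrangement::*) are assumptions based on the defaults listed above; a transposed A corresponds to row-major storage and a non-transposed B to col-major.

```cuda
#include <cublasdx.hpp>
using namespace cublasdx;

// Deprecated: TransposeMode-based description.
using GEMM_old = decltype(Size<32, 32, 32>()
                          + Precision<double>()
                          + Type<type::real>()
                          + TransposeMode<transpose_mode::transposed,      // A
                                          transpose_mode::non_transposed>() // B
                          + Function<function::MM>()
                          + SM<700>()
                          + Block());

// Equivalent Arrangement-based description.
using GEMM_new = decltype(Size<32, 32, 32>()
                          + Precision<double>()
                          + Type<type::real>()
                          + Arrangement<arrangement::row_major,  // A (was transposed)
                                        arrangement::col_major,  // B (was non-transposed)
                                        arrangement::col_major>() // C
                          + Function<function::MM>()
                          + SM<700>()
                          + Block());
```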
Known Issues¶
It’s recommended to use the latest CUDA Toolkit and NVCC compiler.
0.1.0¶
The first early access (EA) release of the cuBLASDx library.
New Features¶
Support for general matrix multiply.
Tensor core support for fp16, fp64, and complex fp64 calculations.
Support for SM70 - SM90 CUDA architectures.
Multiple examples included.
Known Issues¶
Since CUDA Toolkit 12.2, the NVCC compiler in certain situations reports an incorrect compilation error when the value_type type of a GEMM description type is used. The problematic code and possible workarounds are presented below:

// Any GEMM description type
using GEMM = decltype(Size<32 /* M */, 32 /* N */, 32 /* K */>()
                      + Precision<double>()
                      + Type<type::real>()
                      + TransposeMode<t_mode /* A */, t_mode /* B */>()
                      + Function<function::MM>()
                      + SM<700>()
                      + Block());

using type = typename GEMM::value_type; // compilation error

// Workaround #1
using type = typename decltype(GEMM())::value_type;

// Workaround #2 (used in cuBLASDx examples)
template <typename T>
using value_type_t = typename T::value_type;
using type = value_type_t<GEMM>;