Release Notes#

This section lists significant changes, new features, performance improvements, and known issues. Unless noted otherwise, listed issues should not impact functionality. Where functionality is impacted, a workaround is provided if one is available.

0.3.0#

The third early access (EA) release of the cuBLASDx library brings support for selected integer types, matrix multiplication with results stored in registers, and decoupling of compute types from input/output types.

New Features#

  • Register fragment tensor support:
    • Added partitioning, predication, and transformation tools.

    • Added copying and partitioning utilities for moving data into and out of register-file buffers.

  • New GEMM register APIs allowing for better performance.
    • Using registers to store the C (result) matrix saves shared memory capacity and transfers by performing accumulation and input/output in registers.

  • Support for integral types, including MMA instruction support for integral types.

  • Compute precision decoupled from input precision.
    • A GEMM function can now accept inputs of any supported type and convert them in registers, saving memory.

  • More robust shared memory management tools have been added, enabling construction of efficient pipelined execution.

  • The library no longer statically asserts that the GEMM problem will fit in shared memory, because:
    • A and B can alias or overlap each other,

    • C can reside entirely in the register file, and

    • input precision can be arbitrary and is no longer defined by the compute precision.
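The register-accumulation and precision-decoupling ideas above can be sketched in plain C++. This is a conceptual model only, not the cuBLASDx API: narrow inputs are converted element-wise in registers and accumulated into a wider, register-resident C fragment, so C never needs shared memory and the input type differs from the compute type.

```cpp
#include <array>
#include <cstdint>

// Conceptual sketch: int8 inputs, int32 compute/accumulation precision.
// The accumulator array plays the role of a register-resident C fragment.
template<int M, int N, int K>
std::array<int32_t, M * N> gemm_int8(const std::array<int8_t, M * K>& a,
                                     const std::array<int8_t, K * N>& b) {
    std::array<int32_t, M * N> c{};  // "register fragment", wider than the inputs
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            int32_t acc = 0;  // accumulate in compute precision
            for (int k = 0; k < K; ++k) {
                // Convert each input element in registers, on the fly.
                acc += int32_t(a[m * K + k]) * int32_t(b[k * N + n]);
            }
            c[m * N + n] = acc;
        }
    }
    return c;
}
```

Because the conversion happens per element at the point of use, the inputs can stay in their compact storage type end to end; this is the memory saving the notes above refer to.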

Breaking changes#

  • Shared memory slicing and shared memory size utilities are no longer available as BLAS methods.

  • Shared memory size trait has been removed from BLAS type.

  • The is_supported trait has been removed; is_supported_smem_restrict and is_supported_rmem_restrict have been added in its place.

  • The library no longer asserts that a size will fit in shared memory; this check is now the user’s responsibility.

Resolved Issues#

  • cuBLASDx internal versioning definitions have been fixed:
    • CUBLASDX_VERSION_MAJOR is now encoded as a multiple of 10000 in CUBLASDX_VERSION instead of 1000, i.e. CUBLASDX_VERSION_MAJOR = CUBLASDX_VERSION / 10000.

    • CUBLASDX_VERSION_MINOR can now have a maximum of two digits.

    • Since there was no major release, there is no extra gap between this version and the previous one.

    • All definitions can be checked in the file cublasdx_version.hpp.
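Under the new scheme the components can be decoded as shown below. This is a minimal sketch: the CUBLASDX_VERSION value here is a hypothetical stand-in for the one defined in cublasdx_version.hpp, and placing the minor version in the hundreds digits is an assumption that follows from the two-digit limit noted above.

```cpp
// Hypothetical stand-in; a real build gets this from cublasdx_version.hpp.
#ifndef CUBLASDX_VERSION
#define CUBLASDX_VERSION 300  // would correspond to release 0.3.0
#endif

// CUBLASDX_VERSION_MAJOR is a multiple of 10000 inside CUBLASDX_VERSION.
constexpr int version_major = CUBLASDX_VERSION / 10000;
// Minor occupies up to two digits (assumed to be the hundreds).
constexpr int version_minor = (CUBLASDX_VERSION % 10000) / 100;
constexpr int version_patch = CUBLASDX_VERSION % 100;
```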

  • Missing traits have been added to follow C++ metaprogramming conventions.

Known issues#

  • It’s recommended to use the latest CUDA Toolkit and NVCC compiler.
    • CUDA 12.4 has known edge cases producing incorrect FP8 MMA emulation code on SM90 with register APIs.

    • CUDA 12.1 has known edge cases where compilation crashes or hangs when previously named values are passed into three-value operators.
      • e.g. Alignment<var_name_1, var_name_2, var_name_3> may cause a compilation hang, while Alignment<8, 8, 8> will always work.
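The workaround can be illustrated with a local stand-in for a three-value operator. The Alignment template below is a mock for illustration only, not the cuBLASDx definition:

```cpp
// Local mock of a three-value operator such as cublasdx::Alignment.
template<unsigned A, unsigned B, unsigned C>
struct Alignment {
    static constexpr unsigned a = A, b = B, c = C;
};

constexpr unsigned var_name_1 = 8, var_name_2 = 8, var_name_3 = 8;

// May hang NVCC under CUDA 12.1 (previously named values passed in):
// using AlignBad = Alignment<var_name_1, var_name_2, var_name_3>;

// Always compiles (literal values):
using AlignOk = Alignment<8, 8, 8>;
```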

0.2.0#

The second early access (EA) release of the cuBLASDx library brings a tensor API, mixed precision, and performance improvements.

New Features#

Known Issues#

  • It’s recommended to use the latest CUDA Toolkit and NVCC compiler.

0.1.0#

The first early access (EA) release of the cuBLASDx library.

New Features#

  • Support for general matrix multiply.

    • Tensor core support for fp16, fp64, and complex fp64 calculations.

  • Support for SM70 - SM90 CUDA architectures.

  • Multiple examples included.

Known Issues#

  • Since CUDA Toolkit 12.2, the NVCC compiler in certain situations reports an incorrect compilation error when the value_type type of a GEMM description type is used. The problematic code and possible workarounds are presented below:

    // Any GEMM description type
    using GEMM = decltype(Size<32 /* M */, 32 /* N */, 32 /* K */>()
                + Precision<double>()
                + Type<type::real>()
                + TransposeMode<t_mode /* A */, t_mode /* B */>()
                + Function<function::MM>()
                + SM<700>()
                + Block());
    using type = typename GEMM::value_type; // compilation error
    
    // Workaround #1
    using type = typename decltype(GEMM())::value_type;
    
    // Workaround #2 (used in cuBLASDx examples)
    template <typename T>
    using value_type_t = typename T::value_type;
    
    using type = value_type_t<GEMM>;