Shared Memory Management#

cuBLASDx’s GEMM requires the input matrices to reside in shared memory, with an option for the C matrix to be partitioned between threads. The use of shared memory imposes certain rules (alignment) and limitations (limited capacity). cuBLASDx provides shared memory management tools that make it easier to operate on shared memory.

Warning

Starting with cuBLASDx 0.3.0, the ::shared_memory_size traits and ::shared_memory_size() methods no longer exist; they have been replaced with the new APIs described in this chapter.

Shared Storage Size Utilities#

CUDA requires the user to specify the amount of dynamic shared memory a kernel uses upfront, at launch time. For executing a GEMM, this size depends on:

  1. Problem size (Size<M, N, K>).

  2. Chosen input type (since cuBLASDx 0.3.0 input precision can differ from compute precision, see Precision Operator).

  3. Chosen alignments of matrices (Alignment<A, B, C>).

  4. The API used to execute the GEMM (register or shared memory).

Since this information is contained neither in the BLAS type (input precision is decoupled from compute precision) nor in the tensors (they lack alignment information), helper utilities have been created to streamline the process:

// Shared memory API
template<class BLAS, class AValueType = typename BLAS::a_value_type,
                     class BValueType = typename BLAS::b_value_type,
                     class CValueType = typename BLAS::c_value_type,
                     class ALayout, class BLayout, class CLayout>
constexpr unsigned get_shared_storage_size(ALayout const& a_layout, BLayout const& b_layout, CLayout const& c_layout);

template<class BLAS, class AValueType = typename BLAS::a_value_type,
                     class BValueType = typename BLAS::b_value_type,
                     class CValueType = typename BLAS::c_value_type>
__host__ __device__ __forceinline__ constexpr unsigned
get_shared_storage_size(unsigned lda, unsigned ldb, unsigned ldc);

// Register API
template<class BLAS, class AValueType = typename BLAS::a_value_type,
                     class BValueType = typename BLAS::b_value_type,
                     class ALayout, class BLayout>
constexpr unsigned get_shared_storage_size_ab(ALayout const& a_layout, BLayout const& b_layout);

template<class BLAS, class AValueType = typename BLAS::a_value_type,
                     class BValueType = typename BLAS::b_value_type>
constexpr unsigned get_shared_storage_size_ab(unsigned lda, unsigned ldb);

The resulting value is the size, in bytes, of the shared memory required to store the input and output matrices and perform the computation. It is determined by the sizes of the value types, the matrix sizes, and the alignments.

Note that cublasdx::get_shared_storage_size accepts arbitrary CuTe layouts: ALayout, BLayout and CLayout in the above prototypes can be either cute::Layout or cute::ComposedLayout.

These functions can be used as follows:

// Shared API - Regular execution
auto shared_size = cublasdx::get_shared_storage_size<BLAS>();

// Shared API - Decoupled input precision execution
auto shared_size = cublasdx::get_shared_storage_size<BLAS, InputTypeA, InputTypeB, InputTypeC>();


// Register API - Regular execution
auto shared_size = cublasdx::get_shared_storage_size_ab<BLAS>();

// Register API - Decoupled input precision execution
auto shared_size = cublasdx::get_shared_storage_size_ab<BLAS, InputTypeA, InputTypeB>();

For special cases cuBLASDx offers a shared memory size calculator:

shared_storage_calc make_shared_storage_calc();

It exposes the following API:

template<class Layout>
__host__ __device__ __forceinline__ constexpr
shared_storage_calc& add(unsigned alignment, unsigned elem_size, const Layout& layout);

__host__ __device__ __forceinline__ constexpr
shared_storage_calc& add(unsigned alignment, unsigned matrix_size_bytes);

__host__ __device__ __forceinline__ constexpr
shared_storage_calc& add(unsigned alignment, unsigned elem_size, unsigned num_elements);

It can be used to calculate shared memory requirements for pipelined register API execution:

// 2-stage pipelined register API execution
auto shared_memory_size =
    cublasdx::make_shared_storage_calc()
        .add(cublasdx::alignment_of_v_a<BLAS>, sizeof(AInputType), BLAS::suggest_layout_smem_a())
        .add(cublasdx::alignment_of_v_b<BLAS>, sizeof(BInputType), BLAS::suggest_layout_smem_b())
        .add(cublasdx::alignment_of_v_a<BLAS>, sizeof(AInputType), BLAS::suggest_layout_smem_a())
        .add(cublasdx::alignment_of_v_b<BLAS>, sizeof(BInputType), BLAS::suggest_layout_smem_b())
        .get();

Shared Memory Slicing#

Warning

Starting with cuBLASDx 0.3.0, the ::slice_shared_memory() methods no longer exist; their functionality has moved to the free functions described below.

The shared memory slicing free functions work with a BLAS type only if the is_complete_blas_execution trait is true for it.

// #1 Slice shared memory with default leading dimensions and default matrices layouts
template<class BLAS, class AValueType = typename BLAS::a_value_type,
                     class BValueType = typename BLAS::b_value_type,
                     class CValueType = typename BLAS::c_value_type>
cute::tuple<AValueType*, BValueType*, CValueType*>
cublasdx::slice_shared_memory(void* smem_ptr);

// #2: Slice shared memory with dynamic leading dimensions
template<class BLAS, class AValueType = typename BLAS::a_value_type,
                     class BValueType = typename BLAS::b_value_type,
                     class CValueType = typename BLAS::c_value_type>
cute::tuple<AValueType*, BValueType*, CValueType*>
cublasdx::slice_shared_memory(void* smem_ptr,
                              unsigned int lda,
                              unsigned int ldb,
                              unsigned int ldc);

// #3: Slice shared memory with custom matrices layouts
template<class BLAS, class AValueType = typename BLAS::a_value_type,
                     class BValueType = typename BLAS::b_value_type,
                     class CValueType = typename BLAS::c_value_type,
                     class ALayout, class BLayout, class CLayout>
cute::tuple<AValueType*, BValueType*, CValueType*>
cublasdx::slice_shared_memory(void* smem_ptr,
                              ALayout a_layout,
                              BLayout b_layout,
                              CLayout c_layout);

The cublasdx::slice_shared_memory(...) free functions slice shared memory into chunks, one for each matrix.

The return values are pointers to the first elements of the slices for the A, B and C matrices. They follow the alignments in the BLAS description and, at the same time, are not over-aligned, i.e., the padding between two consecutive slices is smaller than the alignment.

Note that cublasdx::slice_shared_memory accepts arbitrary CuTe layouts: ALayout, BLayout and CLayout in the above prototypes can be either cute::Layout or cute::ComposedLayout.

Example

using BLAS = decltype(...);

extern __shared__ __align__(16) char smem[];

// use structured bindings
auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem);

// or
auto smem_slices = cublasdx::slice_shared_memory<BLAS>(smem);
auto smem_a = cute::get<0>(smem_slices);
auto smem_b = cute::get<1>(smem_slices);
auto smem_c = cute::get<2>(smem_slices);

cuBLASDx also offers an advanced generic slicing API that allows an arbitrary number of input matrices:

// memory_descriptor is just a cute::tuple<unsigned, unsigned> containing:
// 1. Cosize of tensor
// 2. Alignment of tensor

template<class ... PointerTypes, class ... Tuples>
__host__ __device__ __forceinline__ auto
slice_shared_memory_generic(void* smem, Tuples const& ... memory_descriptors);

It can be used as follows to properly slice shared memory for a 2-stage pipelined GEMM execution:

auto [smem_a, smem_b, smem_a_n, smem_b_n] =
    cublasdx::slice_shared_memory_generic<AInputType, BInputType, AInputType, BInputType>(
        smem,
        cute::make_tuple(cublasdx::cosize(BLAS::suggest_layout_smem_a()), cublasdx::alignment_of_v_a<BLAS>),
        cute::make_tuple(cublasdx::cosize(BLAS::suggest_layout_smem_b()), cublasdx::alignment_of_v_b<BLAS>),
        cute::make_tuple(cublasdx::cosize(BLAS::suggest_layout_smem_a()), cublasdx::alignment_of_v_a<BLAS>),
        cute::make_tuple(cublasdx::cosize(BLAS::suggest_layout_smem_b()), cublasdx::alignment_of_v_b<BLAS>)
    );