Other Methods

Shared Memory Slicing

The shared memory slicing methods are available if the is_complete_blas_execution Trait is true.
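For instance, this requirement can be verified at compile time (a minimal sketch; it assumes the trait is exposed as cublasdx::is_complete_blas_execution<BLAS>, as described in the traits documentation):

// A minimal sketch: guard the slicing methods with the execution trait.
// Assumes the trait is exposed as cublasdx::is_complete_blas_execution<BLAS>.
static_assert(cublasdx::is_complete_blas_execution<BLAS>::value,
              "BLAS must be a complete execution description before slicing shared memory");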

// Note: When using NVRTC, std::tuple is replaced with cuda::std::tuple

// #1
std::tuple<BLAS::a_value_type*, BLAS::b_value_type*, BLAS::c_value_type*> BLAS::slice_shared_memory(char* smem_ptr)

// #2: Slice shared memory with dynamic leading dimensions
std::tuple<BLAS::a_value_type*, BLAS::b_value_type*, BLAS::c_value_type*>
BLAS::slice_shared_memory(char* smem_ptr,
                          unsigned int lda,
                          unsigned int ldb,
                          unsigned int ldc)

// #3: Slice shared memory with custom matrix layouts
template<class ALayout, class BLayout, class CLayout>
std::tuple<BLAS::a_value_type*, BLAS::b_value_type*, BLAS::c_value_type*>
BLAS::slice_shared_memory(char* smem_ptr,
                          ALayout a_layout,
                          BLayout b_layout,
                          CLayout c_layout)

The slice_shared_memory(...) method slices shared memory into chunks, one for each matrix.

The return values are pointers to the first elements of the slices for the A, B, and C matrices. The pointers follow the alignments in the BLAS description and, at the same time, are not over-aligned, i.e., the padding between two consecutive slices is smaller than the alignment. As with the Shared Memory Size Trait, the result depends on the value types and matrix sizes.
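For illustration, the alignment guarantee can be checked directly on the returned pointers (a minimal sketch; it assumes the alignments are exposed through the cublasdx::alignment_of<BLAS> trait):

// A minimal sketch verifying the alignment guarantee on the returned pointers.
// Assumes cublasdx::alignment_of<BLAS> exposes the a/b/c alignments in bytes.
auto [smem_a, smem_b, smem_c] = BLAS::slice_shared_memory(smem);
using alignment = cublasdx::alignment_of<BLAS>;
assert(reinterpret_cast<uintptr_t>(smem_a) % alignment::a == 0);
assert(reinterpret_cast<uintptr_t>(smem_b) % alignment::b == 0);
assert(reinterpret_cast<uintptr_t>(smem_c) % alignment::c == 0);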

Note that BLAS::slice_shared_memory accepts arbitrary CuTe layouts. The classes ALayout, BLayout, and CLayout in the function prototype above can be either cute::Layout or cute::ComposedLayout.

Example

using BLAS = decltype(...);

extern __shared__ __align__(16) char smem[];

// use structured bindings
auto [smem_a, smem_b, smem_c] = BLAS::slice_shared_memory(smem);

// or
auto smem_slices = BLAS::slice_shared_memory(smem);
auto smem_a = std::get<0>(smem_slices);
auto smem_b = std::get<1>(smem_slices);
auto smem_c = std::get<2>(smem_slices);
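Overload #3 follows the same pattern with user-provided layouts. The sketch below builds illustrative CuTe layouts with cute::make_layout (it assumes the cublasdx::size_of<BLAS> trait and dynamic leading dimensions lda, ldb, and ldc; any cute::Layout or cute::ComposedLayout matching the matrix shapes would work):

// A minimal sketch of overload #3 with explicit CuTe layouts (values are illustrative).
// Assumes cublasdx::size_of<BLAS> exposes the problem sizes m, n, and k.
constexpr auto m = cublasdx::size_of<BLAS>::m;
constexpr auto n = cublasdx::size_of<BLAS>::n;
constexpr auto k = cublasdx::size_of<BLAS>::k;

auto a_layout = cute::make_layout(cute::make_shape(m, k), cute::make_stride(1, lda)); // column-major A
auto b_layout = cute::make_layout(cute::make_shape(k, n), cute::make_stride(ldb, 1)); // row-major B
auto c_layout = cute::make_layout(cute::make_shape(m, n), cute::make_stride(1, ldc)); // column-major C

auto [smem_a, smem_b, smem_c] = BLAS::slice_shared_memory(smem, a_layout, b_layout, c_layout);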

Warning

When using NVRTC, libcu++ (libcudacxx) is used and std:: is replaced with cuda::std::.

using BLAS = decltype(...);

extern __shared__ __align__(16) char smem[];

// Structured bindings support for cuda::std::tuple was added in version 2.1.0 of libcu++
#if _LIBCUDACXX_CUDA_API_VERSION >= 2001000
auto [smem_a, smem_b, smem_c] = BLAS::slice_shared_memory(smem);
#else
auto smem_slices = BLAS::slice_shared_memory(smem);
auto smem_a = cuda::std::get<0>(smem_slices);
auto smem_b = cuda::std::get<1>(smem_slices);
auto smem_c = cuda::std::get<2>(smem_slices);
#endif

Get Memory Layout

The BLAS::get_layout_<gmem/smem>_<a/b/c>() functions fetch the global or shared memory CuTe layout for matrix A, B, or C, determined by the matrix sizes, arrangement, and leading dimensions. For shared memory layouts, the leading dimensions, if not specified explicitly through a parameter, are inferred from the LeadingDimension operator. For global memory layouts, custom leading dimensions must be passed either as a static or a dynamic integral type; otherwise, they are inferred from the matrix sizes.

__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a();
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b();
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c();

__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_a();
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_b();
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_c();

// Overloads for specifying the leading dimensions statically at compilation time.
// integral_type can be either a signed or an unsigned integer type, and the
// lda/ldb/ldc values follow this specification.
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a(const std::integral_constant<integral_type, lda>);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b(const std::integral_constant<integral_type, ldb>);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c(const std::integral_constant<integral_type, ldc>);

// Overloads for specifying the leading dimensions at execution time.

__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a(const unsigned int lda);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b(const unsigned int ldb);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c(const unsigned int ldc);

__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_a(const unsigned int lda);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_b(const unsigned int ldb);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_c(const unsigned int ldc);

BLAS::get_layout_<gmem/smem>_<a/b/c>() returns a combination of a memory tag (global or shared) and the layout (cute::Layout) for matrix A, B, or C, which can be passed directly to cublasdx::make_tensor to create a tensor.

BLAS::get_layout_<gmem/smem>_<a/b/c>() returns a matrix layout corresponding to the order set via the Arrangement operator. For example, if the order for the A matrix was set to cublasdx::row_major, the returned layout follows the row-major order.

If dynamic leading dimensions are provided by the user at execution time, the function accepts the leading dimension as an argument; see the example below.

Example

using BLAS = decltype(...);

extern __shared__ __align__(16) char smem[];

// a, b, and c are pointers to the global memory of input matrices A and B and output matrix C
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a());
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b());
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c());

auto [smem_a, smem_b, smem_c] = BLAS::slice_shared_memory(smem);
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::get_layout_smem_b());
auto c_shared_tensor = cublasdx::make_tensor(smem_c, BLAS::get_layout_smem_c());

// With leading dimensions specified at compilation time
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a(std::integral_constant<int, lda>{}));
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b(std::integral_constant<int, ldb>{}));
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c(std::integral_constant<int, ldc>{}));

// With leading dimensions specified at execution time
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a(lda));
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b(ldb));
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c(ldc));

auto [smem_a, smem_b, smem_c] = BLAS::slice_shared_memory(smem, lda, ldb, ldc);
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a(lda));
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::get_layout_smem_b(ldb));
auto c_shared_tensor = cublasdx::make_tensor(smem_c, BLAS::get_layout_smem_c(ldc));
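As a follow-up, the tensors created above are typically used to stage data between global and shared memory. The sketch below does this cooperatively (it assumes the cublasdx::copy and cublasdx::copy_wait helpers and the cublasdx::alignment_of<BLAS> trait, described elsewhere in the documentation):

// A minimal sketch: cooperatively copy A and B from global to shared memory.
// Assumes cublasdx::copy, cublasdx::copy_wait, and cublasdx::alignment_of<BLAS> are available.
using alignment = cublasdx::alignment_of<BLAS>;
cublasdx::copy<BLAS, alignment::a>(a_global_tensor, a_shared_tensor);
cublasdx::copy<BLAS, alignment::b>(b_global_tensor, b_shared_tensor);
cublasdx::copy_wait(); // synchronize the thread block before the shared tensors are used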

Suggested Shared Memory Layout

In addition to the get_layout_smem_* functions, cuBLASDx also provides functions that return suggested custom shared memory layouts for matrices A, B, and C, determined by the value types, matrix sizes, arrangements, alignments, block size, and GPU architecture. The suggested layouts are designed to improve the performance of both the matrix multiplication itself and the copy operations between shared and global memory; because of that, they depend on the arrangements of the matrices. Suggested layouts ignore leading dimensions set with the LeadingDimension operator. The best improvements are expected with row-major A and column-major B.

__forceinline__ __host__ __device__ constexpr static auto suggest_layout_smem_a();
__forceinline__ __host__ __device__ constexpr static auto suggest_layout_smem_b();
__forceinline__ __host__ __device__ constexpr static auto suggest_layout_smem_c();

BLAS::suggest_layout_smem_<a/b/c>() returns a combination of a shared memory tag and the layout of A/B/C (cute::Layout), which can be passed directly to cublasdx::make_tensor to create a tensor.

Example

using BLAS = decltype(Size<128, 128, 128>() + Type<type::real>() + Block() + Precision<float, float, double>());

extern __shared__ __align__(16) char smem[];

// Slice shared memory into pointers for the A, B, and C matrices
auto [smem_a, smem_b, smem_c] = BLAS::slice_shared_memory(smem);

// Create tensors with the suggested shared memory layouts for optimal performance
auto suggested_smem_a = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());
auto suggested_smem_b = cublasdx::make_tensor(smem_b, BLAS::suggest_layout_smem_b());
auto suggested_smem_c = cublasdx::make_tensor(smem_c, BLAS::suggest_layout_smem_c());
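The suggested-layout tensors can then be consumed like any other shared memory tensors, for example by the tensor-based execute API (a minimal sketch; it assumes BLAS::execute accepts shared memory tensors together with alpha and beta scaling factors, and that the input data has already been loaded into shared memory):

// A minimal sketch: run the GEMM on the suggested-layout tensors.
// Assumes the tensor-based execute API; the alpha and beta values are illustrative.
BLAS::c_value_type alpha = 1.0;
BLAS::c_value_type beta  = 0.0;
BLAS().execute(alpha, suggested_smem_a, suggested_smem_b, beta, suggested_smem_c);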