Other Methods#

Get Memory Layout#

The BLAS::get_layout_<gmem/smem>_<a/b/c>() functions return the global or shared memory CuTe layout for matrix A, B, or C, determined by the matrix sizes, arrangement, and leading dimensions. For shared memory layouts, leading dimensions that are not specified explicitly through a parameter are inferred from the LeadingDimension operator. For global memory layouts, custom leading dimensions must be passed as either a static or a dynamic integral value; otherwise they are inferred from the matrix sizes.

__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a();
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b();
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c();

__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_a();
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_b();
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_c();

// Overloads for specifying the leading dimensions statically at compile time.
// integral_type can be any signed or unsigned integer type; lda, ldb, and ldc
// must be values of that type.
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a(const std::integral_constant<integral_type, lda>);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b(const std::integral_constant<integral_type, ldb>);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c(const std::integral_constant<integral_type, ldc>);

// Overloads for specifying the leading dimensions at execution time.

__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a(const unsigned int lda);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b(const unsigned int ldb);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c(const unsigned int ldc);

__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_a(const unsigned int lda);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_b(const unsigned int ldb);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_c(const unsigned int ldc);

// TRSM only: overloads with batch stride (for BatchesPerBlock > 1)
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a(const unsigned int lda, const unsigned int batch_stride_a);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b(const unsigned int ldb, const unsigned int batch_stride_b);

__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_a(const unsigned int lda, const unsigned int batch_stride_a);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_b(const unsigned int ldb, const unsigned int batch_stride_b);

BLAS::get_layout_<gmem/smem>_<a/b/c>() returns a combination of a memory tag (global or shared) and the layout (cute::Layout) for matrix A, B, or C, which can be passed directly to cublasdx::make_tensor to create a tensor.

BLAS::get_layout_<gmem/smem>_<a/b/c>() returns a matrix layout corresponding to the order set via the Arrangement operator. For example, if the order for the A matrix was set to cublasdx::row_major, the returned layout follows row-major order.

When dynamic leading dimensions are provided by the user at execution time, the function accepts them as arguments; see the example below.

For TRSM descriptors with BatchesPerBlock > 1, the layout functions return a 3D layout (rows x cols x batches). The two-argument overloads get_layout_<gmem/smem>_<a/b>(ld, batch_stride) are available only for TRSM and allow specifying both the inner leading dimension and the stride between consecutive batches. When only one argument is provided, the batch stride defaults to the cosize of the inner (single-batch) layout. When no arguments are provided, both the leading dimension and the batch stride use their default values.

Example

using BLAS = decltype(...);

extern __shared__ __align__(16) cublasdx::byte smem[];

// a, b, c are pointers to global memory of input matrices A and B and output matrix C
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a());
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b());
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c());

auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem);
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::get_layout_smem_b());
auto c_shared_tensor = cublasdx::make_tensor(smem_c, BLAS::get_layout_smem_c());

// With leading dimensions specified at compile time
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a(std::integral_constant<int, lda>{}));
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b(std::integral_constant<int, ldb>{}));
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c(std::integral_constant<int, ldc>{}));

// With leading dimensions specified at execution time
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a(lda));
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b(ldb));
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c(ldc));

auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem, lda, ldb, ldc);
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a(lda));
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::get_layout_smem_b(ldb));
auto c_shared_tensor = cublasdx::make_tensor(smem_c, BLAS::get_layout_smem_c(ldc));

TRSM batch stride example

// TRSM with BatchesPerBlock > 1: use two-argument overload for custom batch strides
using TRSM = decltype(cublasdx::Size<32, 16>()
                      + cublasdx::Function<cublasdx::function::TRSM>()
                      + cublasdx::BatchesPerBlock<4>()
                      + ...);

// Default batch stride (cosize of single-batch layout)
auto a_layout = TRSM::get_layout_smem_a();       // 3D layout: (dim_a, dim_a, 4)
auto b_layout = TRSM::get_layout_smem_b();       // 3D layout: (M, N, 4)

// Custom leading dimension, default batch stride
auto a_layout = TRSM::get_layout_smem_a(lda);

// Custom leading dimension and custom batch stride (TRSM only)
auto a_layout = TRSM::get_layout_smem_a(lda, batch_stride_a);
auto b_layout = TRSM::get_layout_smem_b(ldb, batch_stride_b);

Suggested Shared Memory Layout#

In addition to the get_layout_smem_* functions, cuBLASDx also provides functions that return suggested custom shared memory layouts for matrix A, B, or C, determined by the value types, matrix sizes, arrangements, alignments, block size, and GPU architecture. These suggested layouts are designed to improve the performance of the matrix multiplication itself as well as of the copy operations between shared and global memory, and for that reason they depend on the arrangements of the matrices. Suggested layouts ignore leading dimensions set with the LeadingDimension operator. The largest improvements are expected with row-major A and column-major B.

__forceinline__ __host__ __device__ constexpr static auto suggest_layout_smem_a();
__forceinline__ __host__ __device__ constexpr static auto suggest_layout_smem_b();
__forceinline__ __host__ __device__ constexpr static auto suggest_layout_smem_c();

BLAS::suggest_layout_smem_<a/b/c>() returns a combination of a shared memory tag and the layout of A/B/C (cute::Layout), which can be directly passed to cublasdx::make_tensor to create a tensor.

Example

using BLAS = decltype(Size<128, 128, 128>() + Type<type::real>() + Block() + Precision<float, float, double>());

extern __shared__ __align__(16) cublasdx::byte smem[];

// Slice shared memory into pointers for the A, B, and C matrices
auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem);

// Create tensors with the suggested shared memory layouts for optimal performance
auto suggested_smem_a = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());
auto suggested_smem_b = cublasdx::make_tensor(smem_b, BLAS::suggest_layout_smem_b());
auto suggested_smem_c = cublasdx::make_tensor(smem_c, BLAS::suggest_layout_smem_c());
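
The suggested layouts pair naturally with cublasdx::copy when staging data from global memory. A minimal sketch, assuming a_global_tensor and b_global_tensor were created as in the earlier examples and that the alignments are taken from the cublasdx::alignment_of trait:

// Stage A and B from global memory into the suggested shared memory layouts
using alignment = cublasdx::alignment_of<BLAS>;
cublasdx::copy<BLAS, alignment::a>(a_global_tensor, suggested_smem_a);
cublasdx::copy<BLAS, alignment::b>(b_global_tensor, suggested_smem_b);
cublasdx::copy_wait();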

Data Accumulator#

The data accumulator is an object aware of the execution context and implementation details of a GEMM, from which it can infer which elements of the C matrix are mapped to each thread. cuBLASDx uses accumulators as helper objects for partitioning global and shared memory tensors, as well as for getting, copying, modifying, and transforming register fragments.

Please refer to Accumulator And Register Fragment Tensors for more information on available accumulator functionality.

Get data accumulator#

The default data accumulator is used with non-suggested execution contexts, i.e., those built from the default get_layout_* layouts. Mixing default and suggested contexts may cause subpar performance.

// Get layouts
auto a_smem_layout = BLAS::get_layout_smem_a();
auto b_smem_layout = BLAS::get_layout_smem_b();

// Get accumulator
auto accumulator = BLAS::get_accumulator();

Suggest data accumulator#

The suggested data accumulator is used with suggested execution contexts, i.e., those built from the suggest_layout_* layouts. Mixing default and suggested contexts may cause subpar performance.

// Suggest layouts
auto a_smem_layout = BLAS::suggest_layout_smem_a();
auto b_smem_layout = BLAS::suggest_layout_smem_b();

// Suggest accumulator
auto accumulator = BLAS::suggest_accumulator();
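
A rough sketch of how the pieces fit together, assuming smem_a and smem_b were obtained via cublasdx::slice_shared_memory<BLAS>; the exact execute(...) overloads that consume the accumulator are described in Accumulator And Register Fragment Tensors:

// Pair the suggested layouts with the suggested accumulator
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::suggest_layout_smem_b());
auto accumulator     = BLAS::suggest_accumulator();
// The shared memory tensors and the accumulator are then passed together
// to the register fragment execute(...) overloads.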

Tensor Views#

cuBLASDx provides lazy tensor view wrappers that transform a tensor’s elements on access without copying data. They are lightweight descriptors backed by the original tensor’s storage and are the recommended way to express transposition and conjugation when calling execute().

Warning

The deprecated TransposeMode operator does not work with TRSM. Use tensor views together with the Arrangement operator instead. See Tensor Views in TRSM for details.

cublasdx::transpose_view#

template<class Tensor>
__host__ __device__ auto cublasdx::transpose_view(Tensor&& tensor);

Returns a lazy view of tensor with its first two modes (rows and columns) swapped. No data is copied — the view simply reinterprets the layout. For 1D tensors the input is returned unchanged.
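
For example, a sketch assuming a_shared_tensor is a 2D shared memory tensor created as in the earlier examples:

// N x M view of an M x N tensor; accesses go through the original storage
auto a_transposed = cublasdx::transpose_view(a_shared_tensor);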

cublasdx::conjugate_view#

template<class Tensor>
__host__ __device__ auto cublasdx::conjugate_view(Tensor&& tensor);

Returns a lazy view that applies complex conjugation (\(a + bi \rightarrow a - bi\)) to each element on access. For real types the operation is a no-op.

cublasdx::conj_transpose_view#

template<class Tensor>
__host__ __device__ auto cublasdx::conj_transpose_view(Tensor&& tensor);

Returns a lazy view that both transposes and conjugates the tensor — the Hermitian adjoint (\(A^H\)). Equivalent to applying conjugate_view after transpose_view.
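
The equivalence can be written out directly; a sketch assuming a complex-valued a_shared_tensor:

// Both views yield the same elements on access
auto a_adjoint   = cublasdx::conj_transpose_view(a_shared_tensor);
auto a_adjoint_2 = cublasdx::conjugate_view(cublasdx::transpose_view(a_shared_tensor));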

This is especially important for TRSM when the triangular matrix A must be treated as its conjugate transpose. See Tensor Views in TRSM.

cublasdx::make_transform_view#

template<class Tensor, class Transform>
__host__ __device__ auto cublasdx::make_transform_view(Tensor&& tensor, Transform transform);

Returns a lazy view that applies an arbitrary stateless element-wise transform functor to each element on access. The built-in cublasdx::conjugate functor can be used here; user-defined stateless functors are also accepted. See cublasdx::conjugate for more information.
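
For instance, with a hypothetical user-defined functor (scale_by_two below is illustrative and not part of cuBLASDx):

// Hypothetical stateless functor: doubles each element on access
struct scale_by_two {
    template<class T>
    __host__ __device__ constexpr T operator()(const T& v) const { return v + v; }
};

auto a_scaled = cublasdx::make_transform_view(a_shared_tensor, scale_by_two{});
// The built-in conjugation functor works through the same mechanism
auto a_conj   = cublasdx::make_transform_view(a_shared_tensor, cublasdx::conjugate{});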

Element-wise Functors#

In addition to the tensor views above, cuBLASDx provides two built-in stateless element-wise functors that can be passed directly as transform arguments to execute().

cublasdx::identity#

struct cublasdx::identity;

Passes each element through unchanged. This is the default transform for every load and store position in execute(...). Using identity explicitly is equivalent to omitting that argument.
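
For example, the following two calls are equivalent; a sketch assuming the shared memory execute(...) overload with alpha/beta scaling and defaulted transform arguments:

// Default transforms...
BLAS().execute(alpha, a_shared_tensor, b_shared_tensor, beta, c_shared_tensor);
// ...are the same as passing identity explicitly for every position
BLAS().execute(alpha, a_shared_tensor, b_shared_tensor, beta, c_shared_tensor,
               cublasdx::identity{}, cublasdx::identity{},
               cublasdx::identity{}, cublasdx::identity{});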

cublasdx::conjugate#

struct cublasdx::conjugate;

Applies complex conjugation to each element. For real types the operation is a no-op (element is returned unchanged). For complex types it negates the imaginary part: \(a + bi \rightarrow a - bi\).
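
For example, to conjugate the elements of A as they are loaded; a sketch under the same assumed overload, with the remaining transforms left at their identity defaults:

// Conjugate A on load; B and C keep the default identity transform
BLAS().execute(alpha, a_shared_tensor, b_shared_tensor, beta, c_shared_tensor,
               cublasdx::conjugate{});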

Note

Both functors are stateless. Only stateless functors are accepted by execute(...).