GEMM Execution Methods#
Execution methods run the BLAS functions defined by the user with cuBLASDx operators.
Note
Currently, cuBLASDx supports execution only at the CUDA thread block level (block execution).
Block Execute Method#
The block execution methods are available if the descriptor has been constructed using the Block Operator and the is_complete_blas_execution trait is true.
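Whether a descriptor is complete and executable can be verified at compile time. A minimal sketch, assuming a fully defined GEMM descriptor like the one shown later in this section:
// Compile-time sanity check: fails to compile if the descriptor is missing
// an operator required for block execution (e.g. Block() or SM<...>()).
static_assert(cublasdx::is_complete_blas_execution<GEMM>::value,
              "GEMM descriptor is not complete enough for block execution");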
Register API#
The execute(...) method runs the calculations defined by the BLAS descriptor. Two overloads are available:
using BLAS = decltype(cublasdx::Size<M, N, K>() + ...);
// #1 - Registers with accumulator API
template<class AEngine, class ALayout,        // Types derived from the pointer and layout used to create tensor_a
         class BEngine, class BLayout,        // Types derived from the pointer and layout used to create tensor_b
         class BlasAccumulator,               // Object created from get_accumulator / suggest_accumulator
         class ALoadOp = cublasdx::identity,  // Transform operation applied when data is loaded from matrix A
         class BLoadOp = cublasdx::identity>  // Transform operation applied when data is loaded from matrix B
inline __device__ void execute(const cublasdx::tensor<AEngine, ALayout>& tensor_a,
                               const cublasdx::tensor<BEngine, BLayout>& tensor_b,
                               BlasAccumulator&& blas_accumulator,
                               const ALoadOp& a_load_op = {},
                               const BLoadOp& b_load_op = {})
// #2 - Registers without accumulator API
template<class AEngine, class ALayout,
         class BEngine, class BLayout,
         class ALoadOp = cublasdx::identity,
         class BLoadOp = cublasdx::identity>
inline __device__ auto execute(const cublasdx::tensor<AEngine, ALayout>& tensor_a,
                               const cublasdx::tensor<BEngine, BLayout>& tensor_b,
                               const ALoadOp& a_load_op = {},
                               const BLoadOp& b_load_op = {})
Method #1 accepts shared memory tensors for the A and B matrices, and an accumulator. It returns nothing; the result of multiplying A by B is accumulated into C:
\(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n}\)
The accumulator must exist beforehand, either from a previous execution of method #2 or created from a BLAS object (see Data Accumulator). It must exactly match the precision and partitioning of the GEMM it is used with.
Method #2 accepts only shared memory tensors for the A and B matrices and returns a newly created accumulator. Its contents correspond to the result of:
\(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n}\)
The code example below shows how the two execute(...) methods can be used.
#include <cublasdx.hpp>
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                      + cublasdx::Precision<cublasdx::tfloat32_t, cublasdx::tfloat32_t, float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Arrangement<cublasdx::row_major, cublasdx::col_major>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::MaxAlignment() // max alignment (16, 16, 16) is the default
                      + cublasdx::SM<800>()
                      + cublasdx::Block());
using a_data_type = typename GEMM::a_value_type;
using b_data_type = typename GEMM::b_value_type;
using c_data_type = typename GEMM::c_value_type;
extern __shared__ __align__(16) cublasdx::byte smem[];
// smem_<a/b> are aligned to cublasdx::alignment_of<GEMM>::<a/b>
auto [smem_a, smem_b] = cublasdx::slice_shared_memory_ab<GEMM>(smem);
//*********** Method #1, register API with accumulator
{
    // Make global memory tensors
    auto a_global_tensor = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a());
    auto b_global_tensor = cublasdx::make_tensor(b, GEMM::get_layout_gmem_b());
    auto c_global_tensor = cublasdx::make_tensor(c, GEMM::get_layout_gmem_c());
    // Make shared memory tensors
    auto a_shared_tensor = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
    auto b_shared_tensor = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());
    // Load data from global to shared memory using the cublasdx::copy API
    using alignment = cublasdx::alignment_of<GEMM>;
    cublasdx::copy<GEMM, alignment::a>(a_global_tensor, a_shared_tensor);
    cublasdx::copy<GEMM, alignment::b>(b_global_tensor, b_shared_tensor);
    cublasdx::copy_wait();
    // Execute with a pre-created accumulator
    auto accumulator = GEMM::get_accumulator();
    GEMM().execute(a_shared_tensor, b_shared_tensor, accumulator);
    // Partition the results and store them back to global memory
    accumulator.partition_and_store(c_global_tensor);
}
//*********** Method #2, register API without accumulator
{
    // Make global memory tensors
    auto a_global_tensor = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a());
    auto b_global_tensor = cublasdx::make_tensor(b, GEMM::get_layout_gmem_b());
    auto c_global_tensor = cublasdx::make_tensor(c, GEMM::get_layout_gmem_c());
    // Make shared memory tensors
    auto a_shared_tensor = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
    auto b_shared_tensor = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());
    // Load data from global to shared memory using the cublasdx::copy API
    using alignment = cublasdx::alignment_of<GEMM>;
    cublasdx::copy<GEMM, alignment::a>(a_global_tensor, a_shared_tensor);
    cublasdx::copy<GEMM, alignment::b>(b_global_tensor, b_shared_tensor);
    cublasdx::copy_wait();
    // Execute; a new accumulator holding the results is returned
    auto accumulator = GEMM().execute(a_shared_tensor, b_shared_tensor);
    // Alternatively:
    // auto accumulator = GEMM::get_accumulator();
    // GEMM().execute(a_shared_tensor, b_shared_tensor, accumulator);
    // Partition the results and store them back to global memory
    accumulator.partition_and_store(c_global_tensor);
}
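For context, a minimal launch sketch for the register API example above. The kernel wrapper gemm_kernel is an assumption, as is the use of cublasdx::get_shared_storage_size_ab<GEMM>() to size the dynamic shared memory; it mirrors the slice_shared_memory_ab call in the kernel, since only the A and B tiles need shared memory (the C results live in registers).
__global__ void gemm_kernel(const a_data_type* a, const b_data_type* b, c_data_type* c) {
    // ... body as in the example above ...
}

// Host side: one thread block with the block dimensions provided by the descriptor.
const auto shared_size = cublasdx::get_shared_storage_size_ab<GEMM>();
cudaFuncSetAttribute(gemm_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, shared_size);
gemm_kernel<<<1, GEMM::block_dim, shared_size>>>(a, b, c);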
Input data properties#
Note
Starting from cuBLASDx 0.3.0, computational precision has been decoupled from data precision, i.e., the input/output data for each matrix can be of arbitrary type (even integral input for a floating-point GEMM), provided that the Alignment Operator is set and at least one of the following conditions is met:
- It is implicitly convertible to the data type chosen with the Precision Operator and Type Operator.
- For inputs: an appropriate converting load operation is provided as one of the arguments. It takes a value of the input type, and its result must be at least implicitly convertible to the compute type.
- For output: an appropriate converting store operation is provided as one of the arguments. It takes the computational result type (usually the C type as defined by the Precision Operator and Type Operator), and its result must be at least implicitly convertible to the output type.
Warning
When the computational precision is decoupled from the input types, the Alignment Operator must be set explicitly.
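As an example of such decoupling, a minimal sketch of a converting load operation, assuming matrix A is stored as int8_t while the GEMM computes in float (the functor name int8_to_float is illustrative):
// Hypothetical converting load op: A is held in shared memory as int8_t,
// but each element is converted to float (the compute type) on load.
struct int8_to_float {
    __device__ float operator()(const int8_t v) const { return static_cast<float>(v); }
};

// Passed as the a_load_op argument of the register API:
// GEMM().execute(a_shared_tensor, b_shared_tensor, accumulator, int8_to_float{});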
By default, the underlying element types of the scalars (alpha and beta) are assumed to be BLAS::c_value_type, but they can be any types as long as:
- the alignment and the size of each of them are the same as those of BLAS::c_value_type, and
- they are convertible to BLAS::c_value_type.
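A minimal sketch of such a user-defined scalar type, assuming BLAS::c_value_type is float (the wrapper name scaled_float is illustrative):
// Hypothetical scalar wrapper: same size and alignment as float
// (the assumed BLAS::c_value_type), and implicitly convertible to it.
struct scaled_float {
    float value;
    __device__ operator float() const { return value; }
};
static_assert(sizeof(scaled_float) == sizeof(float), "size must match c_value_type");
static_assert(alignof(scaled_float) == alignof(float), "alignment must match c_value_type");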
Transform operation inputs#
All GEMM methods accept transform functors. a_load_op, b_load_op, and c_load_op are applied as elements are read from each matrix, and c_store_op is applied before the results of the matrix multiplication are stored in the C matrix. Each functor has to represent an element-wise transform which:
- For load transformations: accepts the respective input type passed to the execute(...) method and returns a value of a type implicitly convertible to BLAS::<a/b/c>_value_type.
- For store transformations: accepts BLAS::<a/b/c>_value_type and returns the respective input type passed to the execute(...) method.
cuBLASDx provides two built-in stateless functors:
- cublasdx::identity: passes the value through unchanged (the default for all positions).
- cublasdx::conjugate: applies complex conjugation; a no-op for real types.
Example
using GEMM = decltype(Size<128, 128, 128>() + Type<type::real>()
                      + Precision<float, float, double>() + Block() + ...);

struct multiply_by_2 {
    template<class T>
    __device__ constexpr T operator()(const T arg) const {
        return arg * static_cast<T>(2.0f);
    }
};

struct negate {
    template<class T>
    __device__ constexpr T operator()(const T arg) const {
        return -arg;
    }
};

GEMM().execute(..., multiply_by_2{}, cublasdx::conjugate{}, cublasdx::identity{}, negate{});
Warning
It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs but with different
- leading dimensions (LeadingDimension),
- CUDA architecture (SM), or
- number of threads (BlockDim)
will produce bit-identical results.
Warning
It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs on GPUs of different CUDA architectures will produce bit-identical results.
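In practice, this means GEMM results should be validated against a reference with a numerical tolerance rather than a bitwise comparison. A minimal host-side sketch (the helper name and tolerance value are illustrative):
#include <cmath>

// Illustrative relative-error check for validating GEMM output against a reference.
bool approximately_equal(float reference, float value, float rel_tol = 1e-5f) {
    return std::fabs(reference - value) <= rel_tol * std::fmax(std::fabs(reference), std::fabs(value));
}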