Execution Methods#
Execution methods are used to run the BLAS functions defined by the user with cuBLASDx operators.
Note
Currently, cuBLASDx supports execution only at the CUDA thread block level (block execution).
Block Execute Method#
The block execution methods are available if the descriptor has been constructed using the Block Operator and the is_complete_blas_execution Trait is true.
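For instance, a minimal compile-time check (a sketch; the exact trait spelling, here cublasdx::is_complete_blas_execution<...>::value, is assumed):

#include <cublasdx.hpp>

using BLAS = decltype(cublasdx::Size<32, 32, 32>()
                      + cublasdx::Precision<float, float, float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<800>()
                      + cublasdx::Block());

// Verify at compile time that the description is complete enough for block execution
static_assert(cublasdx::is_complete_blas_execution<BLAS>::value,
              "BLAS is not a complete block-execution description");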
Register API#
The execute(...) method runs the calculations defined by the BLAS descriptor. Two register API overloads are available: one that accumulates into an existing register fragment for C, and one that creates and returns a new register fragment.
using BLAS = decltype(cublasdx::Size<M, N, K>() + ...);

// #1 - Registers with accumulator API
template<class AEngine, class ALayout,       // Types derived from pointer and layout used to create tensor_a
         class BEngine, class BLayout,       // Types derived from pointer and layout used to create tensor_b
         class CEngine, class CLayout,       // Types derived from pointer and layout used to create tensor_c
         class ALoadOp = cublasdx::identity, // Transform operation applied when data is loaded from matrix A
         class BLoadOp = cublasdx::identity> // Transform operation applied when data is loaded from matrix B
inline __device__ void execute(const cublasdx::tensor<AEngine, ALayout>& tensor_a,
                               const cublasdx::tensor<BEngine, BLayout>& tensor_b,
                               cublasdx::tensor<CEngine, CLayout>&       tensor_c,
                               const ALoadOp& a_load_op = {},
                               const BLoadOp& b_load_op = {});

// #2 - Registers without accumulator API
template<class AEngine, class ALayout,
         class BEngine, class BLayout,
         class ALoadOp = cublasdx::identity,
         class BLoadOp = cublasdx::identity>
inline __device__ auto execute(const cublasdx::tensor<AEngine, ALayout>& tensor_a,
                               const cublasdx::tensor<BEngine, BLayout>& tensor_b,
                               const ALoadOp& a_load_op = {},
                               const BLoadOp& b_load_op = {});
Method #1 accepts shared memory tensors for the A and B matrices, and a register fragment for the C matrix. It returns nothing; instead, it adds the result of multiplying A by B to C, resulting in:
\(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + \mathbf{C}_{m\times n}\)
The register fragment must exist beforehand, created either by a previous execution of method #2 or from a partitioner object (see Data Partitioner). It must exactly match the precision and partitioning of the GEMM it is used for.
Method #2 accepts only shared memory tensors for the A and B matrices. It returns an opaque tuple containing the resulting register fragment together with its corresponding partitioner. The result corresponds to:
\(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n}\)
A C++17 structured binding can be used to cleanly retrieve both values:
auto [c_register_fragment, partitioner] = BLAS().execute(a_shared_tensor, b_shared_tensor, a_load_op, b_load_op);
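The fragment returned by method #2 can subsequently be accumulated into with method #1, for example when the K dimension is processed in several shared memory tiles. A minimal sketch (the tensor names follow the full example below; num_k_tiles and the tile-loading step are hypothetical):

// First K tile: method #2 creates the register accumulator, C = A(0) * B(0)
auto [c_register_fragment, partitioner] = GEMM().execute(a_shared_tensor, b_shared_tensor);

// Remaining K tiles: method #1 accumulates into the existing fragment, C += A(i) * B(i)
for (unsigned int tile = 1; tile < num_k_tiles; ++tile) {
    // ... load the next A and B tiles into shared memory and synchronize the block ...
    GEMM().execute(a_shared_tensor, b_shared_tensor, c_register_fragment);
}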
The code example below shows how the two execute(...) methods can be used.
#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                      + cublasdx::Precision<cublasdx::tfloat32_t, cublasdx::tfloat32_t, float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Arrangement<cublasdx::row_major, cublasdx::col_major>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::MaxAlignment() // max alignment (16, 16, 16) is the default
                      + cublasdx::SM<800>()
                      + cublasdx::Block());

using a_data_type = typename GEMM::a_value_type;
using b_data_type = typename GEMM::b_value_type;
using c_data_type = typename GEMM::c_value_type;

// a, b, c are assumed to be global memory pointers to the A, B, C matrices (e.g. kernel arguments)
extern __shared__ __align__(16) char smem[];

// smem_<a/b> are aligned to cublasdx::alignment_of<GEMM>::<a/b>
auto [smem_a, smem_b] = cublasdx::slice_shared_memory_ab<GEMM>(smem);

//*********** Method #1, register API with accumulator
{
    // Make global memory tensors
    auto a_global_tensor = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a());
    auto b_global_tensor = cublasdx::make_tensor(b, GEMM::get_layout_gmem_b());
    auto c_global_tensor = cublasdx::make_tensor(c, GEMM::get_layout_gmem_c());

    // Make shared memory tensors
    auto a_shared_tensor = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
    auto b_shared_tensor = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());

    // Load data from global to shared memory using cublasdx::copy API
    using alignment = cublasdx::alignment_of<GEMM>;
    cublasdx::copy<GEMM, alignment::a>(a_global_tensor, a_shared_tensor);
    cublasdx::copy<GEMM, alignment::b>(b_global_tensor, b_shared_tensor);
    cublasdx::copy_wait();

    // Create the accumulator fragment, fill it with data from C, and execute
    auto partitioner = GEMM::get_partitioner();
    auto c_register_fragment = partitioner.make_accumulator_fragment();
    cublasdx::copy_fragment<alignment::c>(c_global_tensor, c_register_fragment, partitioner);
    GEMM().execute(a_shared_tensor, b_shared_tensor, c_register_fragment);

    // Store back to global memory using cublasdx::copy_fragment API
    cublasdx::copy_fragment<alignment::c>(c_register_fragment, c_global_tensor, partitioner);
}

//*********** Method #2, cublasdx tensor APIs, without accumulator
{
    // Make global memory tensors
    auto a_global_tensor = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a());
    auto b_global_tensor = cublasdx::make_tensor(b, GEMM::get_layout_gmem_b());
    auto c_global_tensor = cublasdx::make_tensor(c, GEMM::get_layout_gmem_c());

    // Make shared memory tensors
    auto a_shared_tensor = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
    auto b_shared_tensor = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());

    // Load data from global to shared memory using cublasdx::copy API
    using alignment = cublasdx::alignment_of<GEMM>;
    cublasdx::copy<GEMM, alignment::a>(a_global_tensor, a_shared_tensor);
    cublasdx::copy<GEMM, alignment::b>(b_global_tensor, b_shared_tensor);
    cublasdx::copy_wait();

    // Execute: returns the result fragment together with its partitioner
    auto [c_register_fragment, partitioner] = GEMM().execute(a_shared_tensor, b_shared_tensor);

    // Store back to global memory using cublasdx::copy_fragment API
    cublasdx::copy_fragment<alignment::c>(c_register_fragment, c_global_tensor, partitioner);
}
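The snippet above is assumed to run inside a __global__ kernel. A host-side launch sketch (gemm_kernel and the device pointers a, b, c are hypothetical; GEMM::block_dim and GEMM::shared_memory_size are assumed to be the execution descriptor members described in the operator and trait documentation):

// Opt in to the dynamic shared memory required by GEMM (may exceed the default limit)
cudaFuncSetAttribute(gemm_kernel<GEMM>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     GEMM::shared_memory_size);
// Launch with the block dimensions selected for GEMM and the required shared memory size
gemm_kernel<GEMM><<<1, GEMM::block_dim, GEMM::shared_memory_size>>>(a, b, c);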
Input data properties#
Note
Starting from cuBLASDx 0.3.0, computational precision has been decoupled from data precision, i.e. the input/output data for each matrix can be of an arbitrary type (even integral input for a floating-point GEMM), provided that the Alignment Operator is set and at least one of the following conditions is met:
- It is implicitly convertible to the data type chosen with the Precision Operator and Type Operator.
- For inputs: an appropriate converting load operation is provided as one of the arguments. It takes the input type value, and its result must be at least implicitly convertible to the compute type (see the sketch after the warning below).
- For output: an appropriate converting store operation is provided as one of the arguments. It takes the resulting computational type (usually the C type as defined by the Precision Operator and Type Operator), and its result must be at least implicitly convertible to the output type.
Warning
If the computation precision is decoupled from the input/output types, the Alignment Operator must be explicitly set.
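As an illustration of the input-side case, a minimal sketch (hypothetical: a_shared_tensor is assumed to hold int8_t elements while the GEMM computes in float, and the Alignment Operator is assumed to be set):

// Converting load operation: turns the stored int8_t values into the float compute type
struct int8_to_float {
    __device__ float operator()(const int8_t v) const {
        return static_cast<float>(v);
    }
};

// The load op is applied element-wise as A is read; B is passed through unchanged
auto [c_register_fragment, partitioner] =
    GEMM().execute(a_shared_tensor, b_shared_tensor, int8_to_float{}, cublasdx::identity{});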
The underlying element types of the scalars (alpha and beta) are by default assumed to be BLAS::c_value_type, but they can be any types as long as:
- the alignment and the size of each of them are the same as those of BLAS::c_value_type, and
- they are convertible to BLAS::c_value_type.
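A minimal sketch of such a custom scalar type (hypothetical; GEMM is the BLAS description and GEMM::c_value_type is assumed to be float here):

// Same size and alignment as float, and implicitly convertible to it,
// so it satisfies both requirements listed above
struct my_scalar {
    float value;
    __device__ operator float() const { return value; }
};

static_assert(sizeof(my_scalar)  == sizeof(typename GEMM::c_value_type), "size must match");
static_assert(alignof(my_scalar) == alignof(typename GEMM::c_value_type), "alignment must match");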
Transform operation inputs#
All of the methods accept transform functors. a_load_op, b_load_op, and c_load_op are applied as elements are read from each matrix, and c_store_op is applied before the results of the matrix multiplication are stored in the C matrix. Each functor has to represent an element-wise transform which:
- For load transformations: accepts the respective input type passed to the execute(...) method and returns a value of a type implicitly convertible to BLAS::<a/b/c>_value_type.
- For store transformations: accepts BLAS::<a/b/c>_value_type and returns the respective input type passed to the execute(...) method.
Example
using GEMM = decltype(Size<128, 128, 128>() + Type<type::real>() + Precision<float, float, double>() + Block() + ...);

struct multiply_by_2 {
    template<class T>
    __device__ constexpr T operator()(const T arg) const {
        return arg * static_cast<T>(2.0f);
    }
};

struct negate {
    template<class T>
    __device__ constexpr T operator()(const T arg) const {
        return -arg;
    }
};

GEMM().execute(..., multiply_by_2{}, cublasdx::conjugate{}, cublasdx::identity{}, negate{});
Warning
It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs but with different leading dimensions (LeadingDimension), CUDA architecture (SM), or number of threads (BlockDim) will produce bit-identical results.
Warning
It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs on GPUs of different CUDA architectures will produce bit-identical results.
Value Format#
The element types of the A, B, and C matrices are exposed as BLAS::a_value_type, BLAS::b_value_type, and BLAS::c_value_type.
For complex numbers of every precision, the first value is the real part and the second is the imaginary part.
For real numbers, BLAS::<a/b/c>_value_type is the same as the corresponding precision PA/PB/PC in Precision<PA, PB, PC> used to describe the BLAS operation (or the default precision).
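For example, for the real tfloat32/tfloat32/float GEMM used earlier on this page, the following compile-time checks are expected to hold (a sketch; the description is abbreviated with "..."):

#include <type_traits>

using GEMM = decltype(cublasdx::Precision<cublasdx::tfloat32_t, cublasdx::tfloat32_t, float>()
                      + cublasdx::Type<cublasdx::type::real>() + ...);

// For a real GEMM, each value type is the corresponding precision type
static_assert(std::is_same_v<typename GEMM::a_value_type, cublasdx::tfloat32_t>, "");
static_assert(std::is_same_v<typename GEMM::c_value_type, float>, "");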
Input/Output Data Format#
This section describes the input and output data format (layout) required for correct calculations.
GEMM (function::MM)#
The tensor API for general matrix multiplication (execute() method which expects matrices represented using cublasdx::tensor) accepts matrices represented by tensors with arbitrary layouts. Since the tensor object carries all the information about the dimensions, the memory location and layout of a matrix, no other implicit assumptions are needed. The dimensions of the matrices must match the dimensions defined by Size operator. See also Get Memory Layout and Suggested shared memory Layout sections.
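For instance, a global memory tensor describing an M x K sub-matrix embedded in a larger allocation can be created with a runtime leading dimension (a sketch; lda is hypothetical, and the leading-dimension overload of get_layout_gmem_a is assumed to be the one described in the Get Memory Layout section):

// a points to global memory; lda is the leading dimension of the enclosing allocation
auto a_global_tensor = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a(lda));
// The logical extent of the tensor must still match the M x K defined by cublasdx::Size<M, N, K>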
The pointer API for general matrix multiplication (#2 and #3 overloads of execute()) assumes that values in the input matrices matrix_a, matrix_b, and matrix_c are stored as defined by the Arrangement operator added to the description (by default, row-major for matrix_a, column-major for matrix_b, and column-major for matrix_c).
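A sketch of overriding those defaults (the three-argument form of the Arrangement operator is assumed here, and the rest of the description is abbreviated with "..."):

// A stored column-major, B row-major, C column-major
using GEMM = decltype(cublasdx::Size<64, 64, 64>()
                      + cublasdx::Arrangement<cublasdx::col_major,
                                               cublasdx::row_major,
                                               cublasdx::col_major>()
                      + ...);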