Execution Methods¶
Execution methods are used to run the BLAS function defined by the user with cuBLASDx operators.
Note
Currently, cuBLASDx supports execution only at the CUDA thread block level (block execution).
Block Execute Method¶
The block execution methods are available if the descriptor has been constructed using the Block Operator and the is_complete_blas_execution trait is true.
The execute(...) method runs the calculations defined by the BLAS descriptor. It has three overloads, each accepting a different set of arguments.
using BLAS = decltype(cublasdx::Size<M, N, K>() + ...);
// #1 - Tensor API
//
template<
class AEngine, class ALayout,
class BEngine, class BLayout,
class CEngine, class CLayout,
class ALoadTransformOp = cublasdx::identity, // Transform operation applied when data is loaded from matrix A
class BLoadTransformOp = cublasdx::identity, // Transform operation applied when data is loaded from matrix B
class CLoadTransformOp = cublasdx::identity, // Transform operation applied when data is loaded from matrix C
class CStoreTransformOp = cublasdx::identity // Transform operation applied when data is stored to matrix C
>
void BLAS::execute(const typename CEngine::value_type& alpha,
cublasdx::tensor<AEngine, ALayout> matrix_a,
cublasdx::tensor<BEngine, BLayout> matrix_b,
const typename CEngine::value_type& beta,
cublasdx::tensor<CEngine, CLayout> matrix_c,
const ALoadTransformOp& a_load_op = {},
const BLoadTransformOp& b_load_op = {},
const CLoadTransformOp& c_load_op = {},
const CStoreTransformOp& c_store_op = {})
// #2 - Pointer API
template<
class TA, // Value type of matrix A
class TB, // Value type of matrix B
class TC, // Value type of matrix C
class ALoadTransformOp = cublasdx::identity, // Transform operation applied when data is loaded from matrix A
class BLoadTransformOp = cublasdx::identity, // Transform operation applied when data is loaded from matrix B
class CLoadTransformOp = cublasdx::identity, // Transform operation applied when data is loaded from matrix C
class CStoreTransformOp = cublasdx::identity // Transform operation applied when data is stored to matrix C
>
void BLAS::execute(const TC& alpha,
TA* matrix_a,
TB* matrix_b,
const TC& beta,
TC* matrix_c,
const ALoadTransformOp& a_load_op = {},
const BLoadTransformOp& b_load_op = {},
const CLoadTransformOp& c_load_op = {},
const CStoreTransformOp& c_store_op = {})
// #3 - Pointer API, which allows providing runtime/dynamic leading dimensions for matrices A, B, and C
template<
class TA,
class TB,
class TC,
class ALoadTransformOp = cublasdx::identity,
class BLoadTransformOp = cublasdx::identity,
class CLoadTransformOp = cublasdx::identity,
class CStoreTransformOp = cublasdx::identity
>
void BLAS::execute(const TC& alpha,
TA* matrix_a,
const unsigned int lda,
TB* matrix_b,
const unsigned int ldb,
const TC& beta,
TC* matrix_c,
const unsigned int ldc,
const ALoadTransformOp& a_load_op = {},
const BLoadTransformOp& b_load_op = {},
const CLoadTransformOp& c_load_op = {},
const CStoreTransformOp& c_store_op = {})
Method #1 accepts cublasdx::tensor objects as representations of the shared memory storage for matrices A, B, and C.
This is a new feature of cuBLASDx 0.2.0.
cublasdx::tensor is essentially a CuTe tensor (cute::Tensor), a representation of a multi-dimensional array with rich functionality that abstracts away the details of how the array's elements are organized and stored in memory.
The underlying element types of the tensors (cublasdx::tensor<Engine, Layout>::value_type) are assumed by default to be BLAS::<a/b/c>_value_type, but they can be any types as long as their alignment and size are the same as those of the corresponding BLAS::<a/b/c>_value_type type.
See Tensor, Tensor Creation, and Get Memory Layout for how to create a tensor from a raw memory pointer and a CuTe layout, and, if needed, dynamically defined leading dimensions.
If necessary, users can pass tensors with custom layouts.
In Methods #2 and #3, T<A/B/C> is BLAS::<a/b/c>_value_type by default, but it can be any type (such as float2 or cuda::std::complex<double>) as long as its alignment and size are the same as those of BLAS::<a/b/c>_value_type.
Pointers matrix_a, matrix_b, and matrix_c must point to shared memory regions aligned to BLAS::<a/b/c>_alignment.
If the Alignment operator was not used, BLAS::<a/b/c>_alignment is equal to alignof(BLAS::<a/b/c>_value_type).
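The size-and-alignment compatibility rule can be checked at compile time with standard C++ traits. A minimal host-side sketch, using std::complex<float> as a stand-in for a complex value type; the struct float2_like is hypothetical, not part of cuBLASDx:

```cpp
#include <complex>

// Hypothetical user type standing in for float2 or a custom complex type.
struct float2_like { float x, y; };

// A type may replace the descriptor's value type only if both its size
// and its alignment match; these checks mirror that requirement.
static_assert(sizeof(float2_like) == sizeof(std::complex<float>),
              "sizes must match");
static_assert(alignof(float2_like) == alignof(std::complex<float>),
              "alignments must match");
```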
Methods #2 and #3 assume that the layout of each matrix corresponds to the arrangement set for that matrix, i.e. if Arrangement<col-major, row-major, col-major> was used in the BLAS description, the A matrix should be column-major, the B matrix row-major, and the C matrix column-major. The default arrangement corresponds to using Arrangement<row-major, col-major, col-major>.
Method #3 allows the user to provide custom dynamic leading dimensions via the lda, ldb, and ldc arguments.
In this case, leading dimension values set via the LeadingDimension operator are ignored.
The values lda, ldb, and ldc have to follow the same rules as presented in the LeadingDimension operator.
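As a point of reference, the computation performed with explicit leading dimensions can be sketched as a plain host-side GEMM. This assumes column-major storage for all three matrices; gemm_ref is an illustrative helper, not part of cuBLASDx:

```cpp
// Reference GEMM: C = alpha * A * B + beta * C, all matrices column-major.
// lda/ldb/ldc are the strides between consecutive columns of A, B, and C.
void gemm_ref(unsigned M, unsigned N, unsigned K, float alpha,
              const float* A, unsigned lda,
              const float* B, unsigned ldb,
              float beta, float* C, unsigned ldc) {
    for (unsigned j = 0; j < N; ++j) {
        for (unsigned i = 0; i < M; ++i) {
            float acc = 0.0f;
            for (unsigned k = 0; k < K; ++k) {
                // Element (i, k) of A and (k, j) of B in column-major order.
                acc += A[i + k * lda] * B[k + j * ldb];
            }
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```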
After the execution method returns, the user has to perform CUDA thread block synchronization before accessing matrix_a, matrix_b, or matrix_c.
The code example below shows how the three execute(...) methods can be used.
#include <cublasdx.hpp>
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
+ cublasdx::Precision<cublasdx::tfloat32_t, cublasdx::tfloat32_t, float>()
+ cublasdx::Type<cublasdx::type::real>()
+ cublasdx::Arrangement<cublasdx::row_major, cublasdx::col_major>()
+ cublasdx::Function<cublasdx::function::MM>()
+ cublasdx::MaxAlignment() // max alignment (16, 16, 16) is the default
+ cublasdx::SM<800>()
+ cublasdx::Block());
using a_data_type = typename GEMM::a_value_type;
using b_data_type = typename GEMM::b_value_type;
using c_data_type = typename GEMM::c_value_type;
extern __shared__ __align__(16) char smem[];
// smem_<a/b/c> are aligned to cublasdx::alignment_of<GEMM>::<a/b/c>
auto [smem_a, smem_b, smem_c] = GEMM::slice_shared_memory(smem);
//*********** Method #1, using cublasdx tensor APIs
{
// Make global memory tensor
auto a_global_tensor = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a());
auto b_global_tensor = cublasdx::make_tensor(b, GEMM::get_layout_gmem_b());
auto c_global_tensor = cublasdx::make_tensor(c, GEMM::get_layout_gmem_c());
// Make shared memory tensor
auto a_shared_tensor = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b());
auto c_shared_tensor = cublasdx::make_tensor(smem_c, GEMM::get_layout_smem_c());
// Load data from global to shared memory using cublasdx::copy API
using alignment = cublasdx::alignment_of<GEMM>;
cublasdx::copy<GEMM, alignment::a>(a_global_tensor, a_shared_tensor);
cublasdx::copy<GEMM, alignment::b>(b_global_tensor, b_shared_tensor);
cublasdx::copy<GEMM, alignment::c>(c_global_tensor, c_shared_tensor);
cublasdx::copy_wait();
// Execute
GEMM().execute(alpha, a_shared_tensor, b_shared_tensor, beta, c_shared_tensor);
__syncthreads();
// Store back to global memory using cublasdx::copy API
cublasdx::copy<GEMM, alignment::c>(c_shared_tensor, c_global_tensor);
cublasdx::copy_wait(); // Only needed if the shared memory used by c_shared_tensor is reused later in the kernel
}
//*********** Method #1, cublasdx tensor APIs, with dynamic leading dimensions
{
// Make global memory tensor
auto a_global_tensor = cublasdx::make_tensor(a, GEMM::get_layout_gmem_a(lda));
auto b_global_tensor = cublasdx::make_tensor(b, GEMM::get_layout_gmem_b(ldb));
auto c_global_tensor = cublasdx::make_tensor(c, GEMM::get_layout_gmem_c(ldc));
// Make shared memory tensor
auto a_shared_tensor = cublasdx::make_tensor(smem_a, GEMM::get_layout_smem_a(lda));
auto b_shared_tensor = cublasdx::make_tensor(smem_b, GEMM::get_layout_smem_b(ldb));
auto c_shared_tensor = cublasdx::make_tensor(smem_c, GEMM::get_layout_smem_c(ldc));
// Load data from global to shared memory using cublasdx::copy API
using alignment = cublasdx::alignment_of<GEMM>;
cublasdx::copy<GEMM, alignment::a>(a_global_tensor, a_shared_tensor);
cublasdx::copy<GEMM, alignment::b>(b_global_tensor, b_shared_tensor);
cublasdx::copy<GEMM, alignment::c>(c_global_tensor, c_shared_tensor);
cublasdx::copy_wait();
// Execute
GEMM().execute(alpha, a_shared_tensor, b_shared_tensor, beta, c_shared_tensor);
__syncthreads();
// Store back to global memory using cublasdx::copy API
cublasdx::copy<GEMM, alignment::c>(c_shared_tensor, c_global_tensor);
cublasdx::copy_wait(); // Only needed if the shared memory used by c_shared_tensor is reused later in the kernel
}
//*********** Method #2, using raw shared memory pointers
{
// User code to load data from global to shared memory
// smem_a <-- a, smem_b <-- b, smem_c <-- c
// Execute
GEMM().execute(alpha, smem_a, smem_b, beta, smem_c);
__syncthreads();
// User code to store back to global memory
// smem_a --> a, smem_b --> b, smem_c --> c
}
//*********** Method #3, with dynamic leading dimensions
{
// User code to load data from global to shared memory
// smem_a <-- a, smem_b <-- b, smem_c <-- c
// Execute
GEMM().execute(alpha, smem_a, lda, smem_b, ldb, beta, smem_c, ldc);
__syncthreads();
// User code to store back to global memory
// smem_a --> a, smem_b --> b, smem_c --> c
}
All of the methods accept four transform functors: a_load_op, b_load_op, and c_load_op are applied as elements are read from each matrix, and c_store_op is applied before the results of the matrix multiplication are stored in the C matrix. Each functor has to represent an element-wise transform that accepts a value of type BLAS::<a/b/c>_value_type and returns a value of a type convertible to BLAS::<a/b/c>_value_type.
Example
using GEMM = decltype(Size<128, 128, 128>() + Type<type::real>() + Precision<float, float, double>() + Block() + ...);
struct multiply_by_2 {
template<class T>
__device__ constexpr T operator()(const T arg) const {
return arg * static_cast<T>(2.0f);
}
};
struct negate {
template <class T>
__device__ constexpr T operator()(const T arg) const {
return -arg;
}
};
GEMM().execute(..., multiply_by_2{}, cublasdx::conjugate{}, cublasdx::identity{}, negate{});
Warning
It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs but with different
leading dimensions (LeadingDimension),
CUDA architecture (SM), or
number of threads (BlockDim)
will produce bit-identical results.
Warning
It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs on GPUs of different CUDA architectures will produce bit-identical results.
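Since bit-identical results are not guaranteed across these configurations, comparisons against reference results should use a tolerance rather than exact equality. A minimal sketch; approx_equal and its default tolerance are illustrative and should be tuned to the precision of the computation:

```cpp
#include <algorithm>
#include <cmath>

// Compare with a relative tolerance instead of operator==.
// The scale term keeps the check meaningful for values near zero.
bool approx_equal(float a, float b, float rel_tol = 1e-5f) {
    float scale = std::max({1.0f, std::fabs(a), std::fabs(b)});
    return std::fabs(a - b) <= rel_tol * scale;
}
```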
Value Format¶
BLAS::a_value_type
BLAS::b_value_type
BLAS::c_value_type
For complex numbers of every precision, the first value in a complex number is the real part and the second is the imaginary part.
For real numbers, BLAS::<a/b/c>_value_type is the same as the corresponding precision P in Precision<PA, PB, PC> used to describe the BLAS function (or the default precision).
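The interleaved complex layout (real part first, imaginary part second) can be illustrated with std::complex, whose array-compatible representation the C++ standard guarantees to have exactly this form; the helper below is illustrative:

```cpp
#include <complex>

// Returns true if z's in-memory layout is {real, imag}: reinterpreting a
// pointer to std::complex<float> as float* yields the real part at index 0
// and the imaginary part at index 1.
bool interleaved_layout(const std::complex<float>& z, float re, float im) {
    const float* parts = reinterpret_cast<const float*>(&z);
    return parts[0] == re && parts[1] == im;
}
```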
Input/Output Data Format¶
This section describes the input and output data format (layout) required for correct calculations.
GEMM (function::MM
)¶
The tensor API for general matrix multiplication (execute() method which expects matrices represented using cublasdx::tensor) accepts matrices represented by tensors with arbitrary layouts. Since the tensor object carries all the information about the dimensions, the memory location and layout of a matrix, no other implicit assumptions are needed. The dimensions of the matrices must match the dimensions defined by Size operator. See also Get Memory Layout and Suggested Shared Memory Layout sections.
The pointer API for general matrix multiplication (overloads #2 and #3 of execute()) assumes that values in the input matrices matrix_a, matrix_b, and matrix_c are stored as defined by the Arrangement operator added to the description (by default, row-major for matrix_a, column-major for matrix_b, and column-major for matrix_c).
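The arrangement determines how a two-dimensional index maps to a linear memory offset. A sketch of the two index computations, with illustrative helper names:

```cpp
#include <cstddef>

// Column-major: consecutive elements of a column are contiguous in memory;
// ld is the distance between consecutive columns (>= number of rows).
std::size_t col_major_index(std::size_t i, std::size_t j, std::size_t ld) {
    return i + j * ld;
}

// Row-major: consecutive elements of a row are contiguous in memory;
// ld is the distance between consecutive rows (>= number of columns).
std::size_t row_major_index(std::size_t i, std::size_t j, std::size_t ld) {
    return i * ld + j;
}
```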