Execution Methods¶
Execution methods are used to run the BLAS function defined by the user with cuBLASDx operators.
Note
Right now, cuBLASDx supports execution only at the CUDA thread block level (block execution).
A code example:
#include <cublasdx.hpp>
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                      + cublasdx::Precision<double>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::TransposeMode<cublasdx::T, cublasdx::N>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<700>()
                      + cublasdx::Block());
using data_type = typename GEMM::value_type;
__shared__ data_type sa[GEMM::a_size];
__shared__ data_type sb[GEMM::b_size];
__shared__ data_type sc[GEMM::c_size];
// Load data from global to shared memory
// sa <-- a
// sb <-- b
// sc <-- c
// Execute
GEMM().execute(alpha, sa, sb, beta, sc);
// Store results back to global memory
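The snippet above omits the kernel wrapper and the global-to-shared copies. A minimal sketch of a complete kernel might look like the following; the kernel name and the simple strided copy loops are illustrative assumptions, not part of the original example (a real kernel would typically use a coalesced or vectorized copy):

```cuda
#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<32, 32, 32>()
                      + cublasdx::Precision<double>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::TransposeMode<cublasdx::T, cublasdx::N>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<700>()
                      + cublasdx::Block());

using data_type = typename GEMM::value_type;

// Hypothetical kernel: the block cooperatively copies the matrices from
// global to shared memory, runs the GEMM, and copies C back.
__global__ void gemm_kernel(const data_type* a, const data_type* b,
                            data_type* c, data_type alpha, data_type beta) {
    __shared__ data_type sa[GEMM::a_size];
    __shared__ data_type sb[GEMM::b_size];
    __shared__ data_type sc[GEMM::c_size];

    // Element-wise global-to-shared copies (assumes a 1D thread block)
    for (unsigned int i = threadIdx.x; i < GEMM::a_size; i += blockDim.x) sa[i] = a[i];
    for (unsigned int i = threadIdx.x; i < GEMM::b_size; i += blockDim.x) sb[i] = b[i];
    for (unsigned int i = threadIdx.x; i < GEMM::c_size; i += blockDim.x) sc[i] = c[i];
    __syncthreads();

    GEMM().execute(alpha, sa, sb, beta, sc);

    // Results in sc are valid only after block-wide synchronization
    __syncthreads();
    for (unsigned int i = threadIdx.x; i < GEMM::c_size; i += blockDim.x) c[i] = sc[i];
}
```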
Block Execute Method¶
The block execution methods are available if the descriptor has been constructed using the Block Operator and the is_complete_blas_execution trait is true.
// #1
void BLAS().execute<typename T>(const T alpha,
                                T*      matrix_a,
                                T*      matrix_b,
                                const T beta,
                                T*      matrix_c)

// #2: Allows providing runtime/dynamic leading dimensions for matrices A, B, and C
void BLAS().execute<typename T>(const T            alpha,
                                T*                 matrix_a,
                                const unsigned int lda,
                                T*                 matrix_b,
                                const unsigned int ldb,
                                const T            beta,
                                T*                 matrix_c,
                                const unsigned int ldc)
Method execute(...) runs the calculations defined by the BLAS descriptor. By default, T is BLAS::value_type, but it can be any type (such as float2 or cuda::std::complex<double>) as long as its alignment and size are the same as those of BLAS::value_type. The pointers matrix_a, matrix_b, and matrix_c must point to shared memory regions aligned to alignof(BLAS::value_type).
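For instance, because T only has to match BLAS::value_type in size and alignment, shared memory declared with cuda::std::complex<double> can be passed directly to a double-precision complex GEMM. A sketch under that assumption (CGEMM is a hypothetical complex descriptor, distinct from the real-valued GEMM defined earlier):

```cuda
#include <cuda/std/complex>
#include <cublasdx.hpp>

// Hypothetical complex<double> descriptor (illustrative)
using CGEMM = decltype(cublasdx::Size<32, 32, 32>()
                       + cublasdx::Precision<double>()
                       + cublasdx::Type<cublasdx::type::complex>()
                       + cublasdx::Function<cublasdx::function::MM>()
                       + cublasdx::SM<700>()
                       + cublasdx::Block());

using cpx = cuda::std::complex<double>;

// T is accepted as long as its size and alignment match CGEMM::value_type
static_assert(sizeof(cpx)  == sizeof(typename CGEMM::value_type),  "size mismatch");
static_assert(alignof(cpx) == alignof(typename CGEMM::value_type), "alignment mismatch");

__global__ void cgemm_kernel(cpx alpha, cpx beta) {
    __shared__ cpx sa[CGEMM::a_size];
    __shared__ cpx sb[CGEMM::b_size];
    __shared__ cpx sc[CGEMM::c_size];
    // ... fill sa, sb, sc ...
    CGEMM().execute(alpha, sa, sb, beta, sc);  // T deduced as cuda::std::complex<double>
    __syncthreads();
}
```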
Method #2 allows the user to provide custom dynamic leading dimensions via the lda, ldb, and ldc arguments. In this case, the leading dimension values set via the LeadingDimension operator are ignored. The values lda, ldb, and ldc have to follow the same rules as presented in the LeadingDimension operator. After methods #1 and #2, the user has to perform CUDA block synchronization before accessing matrix_a, matrix_b, or matrix_c.
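A sketch of calling method #2 from inside a kernel, assuming padded shared-memory tiles; the padding value of 33 is an illustrative choice (e.g., to avoid shared-memory bank conflicts) and must still satisfy the LeadingDimension operator's rules:

```cuda
// Assumed: a 32x32x32 double-precision GEMM descriptor as defined earlier,
// with each tile padded by one extra row.
constexpr unsigned int lda = 33;  // rows of the A tile (32) plus padding
constexpr unsigned int ldb = 33;
constexpr unsigned int ldc = 33;

__shared__ double sa[lda * 32];
__shared__ double sb[ldb * 32];
__shared__ double sc[ldc * 32];

// ... load sa, sb, sc using the padded column-major layout ...

GEMM().execute(alpha, sa, lda, sb, ldb, beta, sc, ldc);
__syncthreads();  // required before any thread reads sa, sb, or sc
```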
Warning
It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs but with different
leading dimensions (LeadingDimension),
CUDA architecture (SM), or
number of threads (BlockDim)
will produce bit-identical results.
Warning
It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs on GPUs of different CUDA architectures will produce bit-identical results.
Value Format¶
BLAS::value_type

For complex numbers of every precision, the first value in a complex number is the real part and the second is the imaginary part. For real numbers, BLAS::value_type is the same as P in Precision<P> used to describe BLAS (or the default precision).
Input/Output Data Format¶
This section describes the input and output data format required for correct calculations.
GEMM (function::MM)¶
The general matrix multiplication assumes that the values in matrices matrix_a, matrix_b, and matrix_c are in column-major format. Results are stored in matrix_c in column-major format as well.