Operators¶
Operators describe the properties of the problem we want to solve and configure its execution. They are divided into Description Operators and Execution Operators.
Description Operators¶
Operator | Default Value | Description
---|---|---
Size | Not set | Defines the problem size.
Function | Not set | BLAS function to execute. Use `function::MM` for general matrix multiply.
TransposeMode | Not set | Transpose mode of each matrix.
Precision | `float` | Precision of the floating-point values used for data and computation: `half`, `float`, or `double`.
Type | `type::real` | Type of input and output data (`type::real` or `type::complex`).
LeadingDimension | As defined by Size | Leading dimensions for matrices `A`, `B`, and `C`.
SM | Not set | Target CUDA architecture for which the BLAS function should be generated.
Description operators define the problem we want to solve. Combined with Execution Operators, they form a complete function descriptor that can be executed on a GPU.
Operators are added (in arbitrary order) to construct the operation descriptor type. For example, to describe a matrix multiplication for non-transposed matrices A (`m x k`), B (`k x n`), and C (`m x n`) with complex double-precision values, where `m = 8`, `n = 16`, and `k = 32`, for execution on the Volta architecture, one would write:
#include <cublasdx.hpp>
using GEMM = decltype(cublasdx::Size<8, 16, 32>()
+ cublasdx::Precision<double>()
+ cublasdx::Type<cublasdx::type::complex>()
+ cublasdx::TransposeMode<cublasdx::N, cublasdx::N>()
+ cublasdx::Function<cublasdx::function::MM>()
+ cublasdx::SM<700>());
For a function descriptor to be complete, the following is required:
- One, and only one, Size Operator.
- One, and only one, TransposeMode Operator.
- One, and only one, Function Operator.
- One, and only one, SM Operator.
Size Operator¶
cublasdx::Size<unsigned int M, unsigned int N, unsigned int K>()
Sets the problem size of the function to be executed.
For GEMM:
- `M` - logical number of rows in matrices `op(A)` and `C`.
- `N` - logical number of columns in matrices `op(B)` and `C`.
- `K` - logical number of columns in matrix `op(A)` and rows in matrix `op(B)`.
For example, for GEMM, `M`, `N`, and `K` specify that the `A` (`M x K`) matrix is multiplied by the `B` (`K x N`) matrix, which results in the `C` (`M x N`) matrix (assuming `A` and `B` are non-transposed). See TransposeMode and GEMM.
Type Operator¶
cublasdx::Type<cublasdx::type T>()
namespace cublasdx {
enum class type
{
real,
complex
};
}
Sets the type of input and output data used in computation. Use `type::real` for real data, and `type::complex` for complex data.
Precision Operator¶
cublasdx::Precision<P>()
Sets the floating-point precision used in computation.
Type `P` (`half`, `float`, or `double`) is the type of the values used for input and output, as well as the underlying type of the values used to compute.
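The choice of precision affects accumulation accuracy: lower-precision types can silently drop small contributions to a large sum. A small illustrative snippet in standard C++ (unrelated to the cuBLASDx API itself):

```cpp
#include <cassert>

// 2^24 + 1 is not representable in single precision, so adding 1.0f to
// 16777216.0f leaves the value unchanged; in double precision it does not.
bool float_drops_unit_at_2pow24() {
    float  f = 16777216.0f; // 2^24, the limit of exact integers in float
    double d = 16777216.0;
    return (f + 1.0f == f) && (d + 1.0 != d);
}
```

The same effect is far more pronounced for `half`, which stops representing consecutive integers exactly at 2^11.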
TransposeMode Operator¶
cublasdx::TransposeMode<cublasdx::transpose_mode ATransposeMode, cublasdx::transpose_mode BTransposeMode>()
namespace cublasdx {
enum class transpose_mode
{
non_transposed,
transposed,
conj_transposed,
};
inline constexpr auto N = transpose_mode::non_transposed;
inline constexpr auto T = transpose_mode::transposed;
inline constexpr auto C = transpose_mode::conj_transposed;
}
Sets the transpose mode for the `A` and `B` matrices used in the function. For example, `TransposeMode<N, N>()` sets the transpose mode of both `A` and `B` to `non_transposed` for GEMM.
Possible values for transpose mode are:
- `transpose_mode::non_transposed`,
- `transpose_mode::transposed`, and
- `transpose_mode::conj_transposed` (conjugate transposed).
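The effect of each transpose mode on element access can be sketched with standard C++ on column-major data. This is an illustrative sketch of the semantics, not the cuBLASDx implementation; `op_at` is a hypothetical helper:

```cpp
#include <cassert>
#include <complex>
#include <vector>

enum class transpose_mode { non_transposed, transposed, conj_transposed };

// Element (i, j) of op(A) for a column-major matrix A with leading dimension ld.
std::complex<double> op_at(const std::vector<std::complex<double>>& a,
                           unsigned ld, transpose_mode mode,
                           unsigned i, unsigned j) {
    switch (mode) {
        case transpose_mode::non_transposed: return a[i + j * ld];           // A(i, j)
        case transpose_mode::transposed:     return a[j + i * ld];           // A(j, i)
        default: /* conj_transposed */       return std::conj(a[j + i * ld]); // conj(A(j, i))
    }
}
```
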
LeadingDimension Operator¶
cublasdx::LeadingDimension<unsigned int LDA, unsigned int LDB, unsigned int LDC>()
Defines leading dimensions for matrices `A`, `B`, and `C`.
The leading dimension is the stride (in elements) from the beginning of one column to the beginning of the next column in memory.
It always refers to the length of the first dimension of the matrix as stored in memory (see TransposeMode and GEMM), as described below:
- Real dimensions of matrix `A` are `LDA x K` with `LDA >= M` if `A` is non-transposed, and `LDA x M` with `LDA >= K` otherwise.
- Real dimensions of matrix `B` are `LDB x N` with `LDB >= K` if `B` is non-transposed, and `LDB x K` with `LDB >= N` otherwise.
- Real dimensions of matrix `C` are `LDC x N` with `LDC >= M`.
See also, suggested_leading_dimension_of.
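The column-major addressing implied by a leading dimension can be sketched as follows (illustrative standard C++; it assumes column-major storage as described above, and `at` is a hypothetical helper):

```cpp
#include <cassert>
#include <vector>

// Element (row, col) of a column-major M x N matrix stored with leading
// dimension ld >= M: consecutive columns start ld elements apart, so the
// matrix may be embedded in a larger buffer (ld > M adds padding rows).
double at(const std::vector<double>& buf, unsigned ld,
          unsigned row, unsigned col) {
    return buf[row + col * ld];
}
```
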
Function Operator¶
cublasdx::Function<cublasdx::function F>()
namespace cublasdx {
enum class function
{
MM
};
}
Sets the BLAS function to be executed.
General Matrix Multiply¶
Function<function::MM>
sets the operation to general matrix multiply, defined as:
\(C = {\alpha} * op(A) * op(B) + {\beta} * C\)
where \({\alpha}\) and \({\beta}\) are scalars (real or complex), and `A`, `B`, and `C` are column-major matrices with dimensions \(op(A): M x K\), \(op(B): K x N\), and \(C: M x N\), respectively.
\(op(A)\) is defined as \(A\) for `transpose_mode::non_transposed`, \(A^T\) for `transpose_mode::transposed`, and \(A^H\) (conjugate transpose) for `transpose_mode::conj_transposed`. \(op(B)\) is defined in the same way.
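The formula above corresponds to the following reference computation on column-major data. This is a host-side sketch for clarity only (the actual execution is performed on the GPU by cuBLASDx); `gemm_ref` is a hypothetical helper assuming non-transposed operands with minimal leading dimensions:

```cpp
#include <cassert>
#include <vector>

// Reference C = alpha * A * B + beta * C for non-transposed, column-major
// A (M x K, leading dimension M), B (K x N, ld K), and C (M x N, ld M).
void gemm_ref(unsigned M, unsigned N, unsigned K,
              double alpha, const std::vector<double>& A,
              const std::vector<double>& B,
              double beta, std::vector<double>& C) {
    for (unsigned n = 0; n < N; ++n)
        for (unsigned m = 0; m < M; ++m) {
            double acc = 0.0;
            for (unsigned k = 0; k < K; ++k)
                acc += A[m + k * M] * B[k + n * K];
            C[m + n * M] = alpha * acc + beta * C[m + n * M];
        }
}
```
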
SM Operator¶
cublasdx::SM<unsigned int CC>()
Sets the target architecture `CC` for the underlying BLAS function to use. Supported architectures are:
- Volta: `700` and `720` (sm_70, sm_72).
- Turing: `750` (sm_75).
- Ampere: `800`, `860`, and `870` (sm_80, sm_86, sm_87).
- Ada: `890` (sm_89).
- Hopper: `900` (sm_90, sm_90a).
Note
When compiling cuBLASDx for the 9.0a compute capability, use `900` in the SM operator (see also CUDA C++ Programming Guide: Feature Availability).
Warning
It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs on GPUs of different CUDA architectures will produce bit-identical results.
Execution Operators¶
Execution operators configure how the function will run on the GPU. Combined with Description Operators, they form a complete function descriptor that can be executed on a GPU.
Operator | Description
---|---
Block | Creates block execution object. See Block Configuration Operators.
Block Operator¶
cublasdx::Block()
Generates a collective operation that runs in a single CUDA block. Threads cooperate to compute the collective operation. The layout and the number of threads participating in the execution can be configured using Block Configuration Operators.
For example, the following code creates a function descriptor for a GEMM function that runs in a single CUDA block:
#include <cublasdx.hpp>
using GEMM = decltype(cublasdx::Size<32, 32, 64>()
+ cublasdx::Precision<double>()
+ cublasdx::Type<cublasdx::type::real>()
+ cublasdx::TransposeMode<cublasdx::T, cublasdx::N>()
+ cublasdx::Function<cublasdx::function::MM>()
+ cublasdx::SM<700>()
+ cublasdx::Block());
Block Configuration Operators¶
Block configuration operators allow the user to configure the block size of a single CUDA block.
Operator | Default value | Description
---|---|---
BlockDim | Based on heuristics | Number of threads used to perform the BLAS function.
Note
Block configuration operators can only be used with Block Operator.
Warning
It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs, but with different leading dimensions (LeadingDimension), CUDA architecture (SM), or number of threads (BlockDim), will produce bit-identical results.
BlockDim Operator¶
cublasdx::BlockDim<unsigned int X, unsigned int Y, unsigned int Z>()
Sets the CUDA block size to `(X, Y, Z)` to configure the execution, meaning it sets the number of threads participating in the execution and their layout.
Using this operator, the user can run the BLAS function in a 1D, 2D, or 3D block with a different number of threads.
The set block dimensions can be accessed via the `BLAS::block_dim` trait.
Adding `BlockDim<X, Y, Z>` to the description puts the following requirements on the execution of the BLAS function:
- The kernel must be launched with 3D block dimensions `dim3(X1, Y1, Z1)` where `X1 >= X`, `Y1 >= Y`, and `Z1 >= Z`. Additionally:
  - For 1D `BlockDim<X>`, the kernel must be launched with `dim3(X1, Y1, Z1)` where `X1 >= X`.
  - For 2D `BlockDim<X, Y>`, the kernel must be launched with `dim3(X, Y1, Z1)` where `Y1 >= Y`.
  - For 3D `BlockDim<X, Y, Z>`, the kernel must be launched with `dim3(X, Y, Z1)` where `Z1 >= Z`.
- `X * Y * Z` threads must participate in the execution.
- The participating threads must be consecutive (adjacent) threads.
The listed requirements may be lifted or loosened in the future releases of cuBLASDx.
Note
cuBLASDx cannot validate every kernel launch configuration at runtime and check that all requirements are met; it is therefore the user's responsibility to adhere to the rules listed above. Violating these rules is considered undefined behavior and can lead to incorrect results and/or failures.
Examples
- `BlockDim<64>`, kernel launched with block dimensions `dim3(128, 1, 1)` - OK
- `BlockDim<64>`, kernel launched with block dimensions `dim3(64, 4, 1)` - OK
- `BlockDim<64>`, kernel launched with block dimensions `dim3(64, 2, 2)` - OK
- `BlockDim<16, 16>`, kernel launched with block dimensions `dim3(16, 32, 1)` - OK
- `BlockDim<16, 16>`, kernel launched with block dimensions `dim3(16, 16, 2)` - OK
- `BlockDim<8, 8, 8>`, kernel launched with block dimensions `dim3(8, 8, 16)` - OK
- `BlockDim<64>`, kernel launched with block dimensions `dim3(32, 1, 1)` - INCORRECT
- `BlockDim<64>`, kernel launched with block dimensions `dim3(32, 2, 1)` - INCORRECT
- `BlockDim<16, 16>`, kernel launched with block dimensions `dim3(256, 1, 1)` - INCORRECT
- `BlockDim<8, 8, 8>`, kernel launched with block dimensions `dim3(512, 2, 1)` - INCORRECT
The value of `BlockDim` can be accessed from the BLAS description via the `BLAS::block_dim` trait. When `BlockDim` is not set, the default block dimensions are used (the default value is `BLAS::suggested_block_dim`).
Even if the default block dimensions provided by cuBLASDx are smaller than what seems optimal for a kernel, it may still be worth trying the default before increasing the number of threads contributing to the calculations.
Restrictions
`X * Y * Z` must be greater than or equal to 32.
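The launch rules and the size restriction above can be collected into a small validity check. This is an illustrative host-side sketch; `launch_ok` is a hypothetical helper, not part of the cuBLASDx API:

```cpp
#include <cassert>

// Mirrors the BlockDim launch rules described above. rank is the number of
// dimensions given to BlockDim (1, 2, or 3); unused dimensions default to 1.
// (bx, by, bz) are the block dimensions the kernel is launched with.
bool launch_ok(unsigned rank, unsigned X, unsigned Y, unsigned Z,
               unsigned bx, unsigned by, unsigned bz) {
    (void)by; (void)bz;                        // only some rules constrain them
    if (X * Y * Z < 32) return false;          // restriction: at least 32 threads
    if (rank == 1) return bx >= X;             // 1D: only X is constrained
    if (rank == 2) return bx == X && by >= Y;  // 2D: X must match exactly
    return bx == X && by == Y && bz >= Z;      // 3D: X and Y must match exactly
}
```
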
Note
- It’s recommended that `X * Y * Z` be 32, 64, 128, 256, 512, or 1024.
- It’s recommended that `X * Y * Z` be a multiple of 32.