Operators#
Operators are used to describe the properties and to configure the execution of the problem we want to solve. They are divided into Description Operators and Execution Operators.
Description Operators#
Operator | Default Value | Description |
---|---|---|
Size | Not set | Defines the problem size. |
Function | Not set | BLAS function. Use function::MM for general matrix multiply. |
Arrangement | col_major for each matrix | Arrangement of each matrix in global memory and default arrangement in shared memory. |
TransposeMode | N for each matrix | Transpose mode of each matrix (Deprecated since 0.2.0). |
Precision | float | Computation precision of GEMM. |
Type | real | Type of input and output data (type::real or type::complex). |
LeadingDimension | As defined by Size | Leading dimensions for matrices A, B, and C. |
Alignment | Alignment of the data type of each matrix | Alignments (in bytes) of matrices A, B, and C. |
SM | Not set | Target CUDA architecture for which the BLAS function should be generated. |
Description operators define the problem we want to solve. Combined with Execution Operators, they form a complete function descriptor that can be executed on a GPU.
Operators are added (in arbitrary order) to construct the operation descriptor type. For example, to describe a matrix multiplication for non-transposed matrices A (m x k), B (k x n), C (m x n) with complex double-precision values, where m = 8, n = 16, k = 32, for execution on the Volta architecture, one would write:
#include <cublasdx.hpp>
using GEMM = decltype(cublasdx::Size<8, 16, 32>()
+ cublasdx::Precision<double>()
+ cublasdx::Type<cublasdx::type::complex>()
+ cublasdx::Arrangement<cublasdx::col_major, cublasdx::col_major>()
+ cublasdx::Function<cublasdx::function::MM>()
+ cublasdx::SM<700>());
For a function descriptor to be complete, the following is required:
- One, and only one, Size Operator.
- One, and only one, Arrangement Operator or TransposeMode Operator.
- One, and only one, Function Operator.
- One, and only one, SM Operator.
Size Operator#
cublasdx::Size<unsigned int M, unsigned int N, unsigned int K>()
Sets the problem size of the function to be executed.
For GEMM:
- M - logical number of rows in matrices op(A) and C.
- N - logical number of columns in matrices op(B) and C.
- K - logical number of columns in matrix op(A) and rows in matrix op(B).
For example, for GEMM, M, N, and K specify that the A (M x K) matrix is multiplied by the B (K x N) matrix, resulting in the C (M x N) matrix (assuming A and B are non-transposed). See Arrangement, TransposeMode and GEMM.
Type Operator#
cublasdx::Type<cublasdx::type T>;
namespace cublasdx {
enum class type
{
real,
complex
};
}
Sets the type of input and output data used in computation. Use type::real for real data, and type::complex for complex data.
Precision Operator#
cublasdx::Precision<PA, PB=PA, PC=PA>;
Sets the precision of computation for A
, B
, and C
.
The precision type can be either floating-point:
__half
float
double
__nv_fp8_e5m2
__nv_fp8_e4m3
__nv_bfloat16
cublasdx::tfloat32_t
or integral:
int8_t
uint8_t
int16_t
uint16_t
int32_t
uint32_t
int64_t
uint64_t
It describes the computation precision for the values used for input and output, i.e. the type to which the input values are converted just before the multiplication instruction is executed.
Note
Starting from cuBLASDx 0.3.0, computation precision has been decoupled from data precision, i.e. the input/output data for each matrix can be of arbitrary type (even integral input for a floating-point GEMM), provided that the Alignment Operator is set and at least one of these conditions is met:
- It's implicitly convertible to the data type chosen with the Precision Operator and Type Operator.
- For inputs: an appropriate converting load operation is provided as one of the arguments. It takes the input type value, and its result must be at least implicitly convertible to the compute type.
- For outputs: an appropriate converting store operation is provided as one of the arguments. It takes the computational result type (usually the C type as defined by the Precision Operator and Type Operator), and its result must be at least implicitly convertible to the output type.
Warning
If using computation precision decoupled from input types, the Alignment Operator must be explicitly set.
Arrangement Operator#
cublasdx::Arrangement<cublasdx::arrangement A, cublasdx::arrangement B = cublasdx::col_major, cublasdx::arrangement C = cublasdx::col_major>;
namespace cublasdx {
enum class arrangement
{
col_major,
row_major
};
inline constexpr auto col_major = arrangement::col_major;
inline constexpr auto left_layout = arrangement::col_major;
inline constexpr auto row_major = arrangement::row_major;
inline constexpr auto right_layout = arrangement::row_major;
}
Sets the order for the global A, B, and C matrices used in the function. The order can be either column-major or row-major.

The Arrangement operator directly influences the Get Memory Layout and Suggested shared memory Layout methods, by explicitly setting the default Get Memory Layout value and choosing a Suggested shared memory Layout optimized for global-shared transfers.

Selecting a specific order for matrices does not prevent the function from accepting matrices with different layouts, but it may influence the overall performance.
Warning
Arrangement
and TransposeMode operators can’t be defined at the same time.
TransposeMode Operator#
Warning
TransposeMode operator is deprecated since 0.2.0 and may be removed in future versions.
cublasdx::TransposeMode<cublasdx::transpose_mode ATransposeMode, cublasdx::transpose_mode BTransposeMode>;
namespace cublasdx {
enum class transpose_mode
{
non_transposed,
transposed,
conj_transposed,
};
inline constexpr auto N = transpose_mode::non_transposed;
inline constexpr auto T = transpose_mode::transposed;
inline constexpr auto C = transpose_mode::conj_transposed;
}
Sets the transpose mode for the A and B matrices used in the function. For example, TransposeMode<N, N>() sets the transpose mode of the A and B matrices to non_transposed for GEMM.

Possible values for the transpose mode are:
- transpose_mode::non_transposed,
- transpose_mode::transposed, and
- transpose_mode::conj_transposed (conjugate transposed).
Warning
- Arrangement and TransposeMode operators can't be defined at the same time.
- Using transpose_mode::non_transposed for a matrix in the TransposeMode operator corresponds to arrangement::col_major in the Arrangement operator.
- Using transpose_mode::transposed for a matrix in the TransposeMode operator corresponds to arrangement::row_major in the Arrangement operator.
- Using transpose_mode::conj_transposed for a matrix in the TransposeMode operator corresponds to arrangement::row_major in the Arrangement operator, with cublasdx::conjugate passed to the execute() method as the transform operator for that matrix.
LeadingDimension Operator#
cublasdx::LeadingDimension<unsigned int LDA, unsigned int LDB, unsigned int LDC>()
Defines leading dimensions for matrices A
, B
, and C
.
The leading dimension of a matrix is a stride (in elements) to the beginning of the next column for a column-major matrix
or the next row for a row-major matrix.
If Arrangement is used in the description of a general matrix multiplication operation, the A, B, and C matrices can be described in the following way:
- Real dimensions of matrix A are \(LDA\times K\) with LDA >= M if A is column-major, and \(LDA\times M\) with LDA >= K otherwise.
- Real dimensions of matrix B are \(LDB\times N\) with LDB >= K if B is column-major, and \(LDB\times K\) with LDB >= N otherwise.
- Real dimensions of matrix C are \(LDC\times N\) with LDC >= M if C is column-major, and \(LDC\times M\) with LDC >= N otherwise.
A matrix can also be described using a layout (see CuTe: Layout), i.e. a pair of integer tuples: shape and stride (distance between elements). In simple terms, a shape represents the number of elements in each dimension, and stride represents distance between elements in each dimension.
- A - matrix of \(M\times K\) shape, with stride 1 in the 1st dimension and LDA in the 2nd dimension if A is column-major, or with LDA in the 1st and 1 in the 2nd dimension otherwise.
- B - matrix of \(K\times N\) shape, with stride 1 in the 1st dimension and LDB in the 2nd dimension if B is column-major, or with LDB in the 1st and 1 in the 2nd dimension otherwise.
- C - matrix of \(M\times N\) shape, with stride 1 in the 1st dimension and LDC in the 2nd dimension if C is column-major, or with LDC in the 1st and 1 in the 2nd dimension otherwise.
Warning
TransposeMode operator is deprecated since 0.2.0 and may be removed in future versions.
If TransposeMode (deprecated) is used in the description of a general matrix multiplication operation, in BLAS nomenclature the dimensions of the A, B, and C matrices can be described in the following way:
- Real dimensions of matrix A are \(LDA\times K\) with LDA >= M if A is non-transposed, and \(LDA\times M\) with LDA >= K otherwise.
- Real dimensions of matrix B are \(LDB\times N\) with LDB >= K if B is non-transposed, and \(LDB\times K\) with LDB >= N otherwise.
- Real dimensions of matrix C are \(LDC\times N\) with LDC >= M.
See also, suggested_leading_dimension_of.
Alignment Operator#
cublasdx::Alignment<unsigned int AAlignment, unsigned int BAlignment, unsigned int CAlignment>()
cublasdx::MaxAlignment = Alignment<16, 16, 16> // alias of maximum supported alignments
Defines the alignments (in bytes) of the pointers to matrices A, B, and C (either raw pointers or wrapped in CuTe tensors) that are passed to the execute(...) method.

Note that the alignments directly affect how much shared memory is required for a given description. They may also impact performance.
Requirements:
- AAlignment, BAlignment, CAlignment must be powers of two and less than or equal to the maximum supported alignment, i.e., 16 bytes.
- AAlignment, BAlignment, CAlignment must be multiples of the alignment of the chosen compute value types, or of the input value types if decoupled precision is used.
Warning
If using computation precision decoupled from input types, the Alignment Operator must be explicitly set.
See also, Suggested Alignment Trait.
Function Operator#
cublasdx::Function<cublasdx::function F>()
namespace cublasdx {
enum class function
{
MM
};
}
Sets the BLAS function to be executed.
General Matrix Multiply#
Function<function::MM>
sets the operation to general matrix multiply, defined as one of:
\(\mathbf{C}_{m\times n} = {\alpha} \times \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + {\beta} \times \mathbf{C}_{m\times n}\)
\(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + \mathbf{C}_{m\times n}\)
\(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n}\)
where \({\alpha}\) and \({\beta}\) are scalars (real or complex), and A, B, and C are matrices with dimensions \(A: m\times k\), \(B: k\times n\), and \(C: m\times n\), respectively.
The matrices can be column-major, row-major, or have a custom layout. See Execution Methods, Get Memory Layout, Suggested shared memory Layout, and Arrangement.
SM Operator#
cublasdx::SM<unsigned int CC>()
Sets the target architecture CC for the underlying BLAS function to use. Supported architectures are:
- Volta: 700 and 720 (sm_70, sm_72).
- Turing: 750 (sm_75).
- Ampere: 800, 860 and 870 (sm_80, sm_86, sm_87).
- Ada: 890 (sm_89).
- Hopper: 900 (sm_90, sm_90a).
Note
When compiling cuBLASDx for the 9.0a compute capability, use 900 in the SM operator (see also CUDA C++ Programming Guide: Feature Availability).
Warning
It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs on GPUs of different CUDA architectures will produce bit-identical results.
Execution Operators#
Execution operators configure how the function will run on the GPU. Combined with Description Operators, they form a complete function descriptor that can be executed on a GPU.
Operator | Description |
---|---|
Block | Creates a block execution object. See Block Configuration Operators. |
Block Operator#
cublasdx::Block()
Generates a collective operation to run in a single CUDA block. Threads will cooperate to compute the collective operation. The layout and the number of threads participating in the execution can be configured using Block Configuration Operators.
For example, the following code creates a function descriptor for a GEMM function that will run in a single CUDA block:
#include <cublasdx.hpp>
using GEMM = decltype(cublasdx::Size<32, 32, 64>()
+ cublasdx::Precision<double, __half, double>()
+ cublasdx::Type<cublasdx::type::real>()
+ cublasdx::TransposeMode<cublasdx::T, cublasdx::N>()
+ cublasdx::Function<cublasdx::function::MM>()
+ cublasdx::SM<700>()
+ cublasdx::Block());
Block Configuration Operators#
Block-configuration operators allow the user to configure block size of a single CUDA block.
Operators | Default value | Description |
---|---|---|
BlockDim | Based on heuristics | Number of threads used to perform the BLAS function. |
Note
Block configuration operators can only be used with Block Operator.
Warning
It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs but with different
leading dimensions (LeadingDimension),
CUDA architecture (SM), or
number of threads (BlockDim)
will produce bit-identical results.
BlockDim Operator#
cublasdx::BlockDim<unsigned int X, unsigned int Y, unsigned int Z>()
Sets the CUDA block size to (X, Y, Z) to configure the execution, i.e. it sets the number of threads participating in the execution and their layout. Using this operator, the user can run the BLAS function in a 1D, 2D, or 3D block with a different number of threads. The set block dimensions can be accessed via the BLAS::block_dim trait.
Adding BlockDim<X, Y, Z> to the description puts the following requirements on the execution of the BLAS function:
- The kernel must be launched with 3D block dimensions dim3(X1, Y1, Z1) where X1 >= X, Y1 >= Y, and Z1 >= Z. Additionally:
  - For 1D BlockDim<X>, the kernel must be launched with dim3(X1, Y1, Z1) where X1 >= X.
  - For 2D BlockDim<X, Y>, the kernel must be launched with dim3(X, Y1, Z1) where Y1 >= Y.
  - For 3D BlockDim<X, Y, Z>, the kernel must be launched with dim3(X, Y, Z1) where Z1 >= Z.
- X * Y * Z threads must participate in the execution.
- The participating threads must be consecutive (adjacent) threads.

The listed requirements may be lifted or loosened in future releases of cuBLASDx.
Note
cuBLASDx can’t validate the full kernel launch configuration at runtime and check that all requirements are met; it is the user’s responsibility to adhere to the rules listed above. Violating those rules is considered undefined behavior and can lead to incorrect results and/or failures.
Examples
- BlockDim<64>, kernel launched with block dimensions dim3(128, 1, 1) - OK
- BlockDim<64>, kernel launched with block dimensions dim3(64, 4, 1) - OK
- BlockDim<64>, kernel launched with block dimensions dim3(64, 2, 2) - OK
- BlockDim<16, 16>, kernel launched with block dimensions dim3(16, 32, 1) - OK
- BlockDim<16, 16>, kernel launched with block dimensions dim3(16, 16, 2) - OK
- BlockDim<8, 8, 8>, kernel launched with block dimensions dim3(8, 8, 16) - OK
- BlockDim<64>, kernel launched with block dimensions dim3(32, 1, 1) - INCORRECT
- BlockDim<64>, kernel launched with block dimensions dim3(32, 2, 1) - INCORRECT
- BlockDim<16, 16>, kernel launched with block dimensions dim3(256, 1, 1) - INCORRECT
- BlockDim<8, 8, 8>, kernel launched with block dimensions dim3(512, 2, 1) - INCORRECT

The value of BlockDim can be accessed from the BLAS description via the BLAS::block_dim trait.
When BlockDim is not set, the default block dimensions are used (the default value is BLAS::suggested_block_dim). Even if the default block dimensions provided by cuBLASDx are smaller than the ones that seem optimal for a kernel, it may still be worth trying the default before increasing the number of threads contributing to the calculations.
Restrictions
- X * Y * Z must be greater than or equal to 32.

Note
- It’s recommended that X * Y * Z is 32, 64, 128, 256, 512, or 1024.
- It’s recommended that X * Y * Z is a multiple of 32.