Operators#

Operators are used to describe the properties and to configure the execution of the problem we want to solve. They are divided into Description Operators and Execution Operators.


Description Operators#

Operator | Default value | Description
--- | --- | ---
Size<M, N, K> | Not set | Defines the problem size.
Function<function> | Not set | BLAS function; use function::MM for GEMM.
Arrangement<arrangement, arrangement, arrangement> | row_major, col_major, col_major | Arrangement of each matrix in global memory and the default arrangement in shared memory.
TransposeMode<transpose_mode, transpose_mode> | transposed, non_transposed | Transpose mode of each matrix (deprecated since 0.2.0).
Precision<PA, PB, PC> | float, float, float | Computation precision of GEMM A, B, and C; either all floating-point or all integral.
Type<type> | type::real | Type of input and output data (type::real or type::complex).
LeadingDimension<LDA, LDB, LDC> | As defined by the Size, and Arrangement or TransposeMode operators | Leading dimensions of matrices A, B, and C.
Alignment<a_align, b_align, c_align> | alignof(BLAS::a_value_type), alignof(BLAS::b_value_type), alignof(BLAS::c_value_type) | Alignments (in bytes) of matrices A, B, and C.
SM<CC> | Not set | Target CUDA architecture for which the BLAS function should be generated.

Description operators define the problem we want to solve. Combined with Execution Operators, they form a complete function descriptor that can be executed on a GPU.

Operators are added (in arbitrary order) to construct the operation descriptor type. For example, to describe a matrix multiplication for non-transposed matrices A (m x k), B (k x n), C (m x n) with complex double precision values where m = 8, n = 16, k = 32 for execution on Volta architecture, one would write:

#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<8, 16, 32>()
              + cublasdx::Precision<double>()
              + cublasdx::Type<cublasdx::type::complex>()
              + cublasdx::Arrangement<cublasdx::col_major, cublasdx::col_major>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<700>());

For a function descriptor to be complete, every description operator without a default value (Size, Function, and SM in the table above) must be specified, together with an Execution Operator.

Size Operator#

cublasdx::Size<unsigned int M, unsigned int N, unsigned int K>()

Sets the problem size of the function to be executed.

For GEMM:

  • M - logical number of rows in matrices op(A) and C.

  • N - logical number of columns in matrices op(B) and C.

  • K - logical number of columns in matrix op(A) and rows in matrix op(B).

For example, for GEMM, M, N, and K specify that the A (M x K) matrix is multiplied by the B (K x N) matrix, resulting in the C (M x N) matrix (assuming A and B are non-transposed). See Arrangement, TransposeMode and GEMM.

Type Operator#

cublasdx::Type<cublasdx::type T>;

namespace cublasdx {
  enum class type
  {
    real,
    complex
  };
}

Sets the type of input and output data used in computation. Use type::real for real data type, and type::complex for complex data type.

Precision Operator#

cublasdx::Precision<PA, PB=PA, PC=PA>;

Sets the precision of computation for A, B, and C.

The precision type can be either floating-point:

  • __half

  • float

  • double

  • __nv_fp8_e5m2

  • __nv_fp8_e4m3

  • __nv_bfloat16

  • cublasdx::tfloat32_t

or integral:

  • int8_t

  • uint8_t

  • int16_t

  • uint16_t

  • int32_t

  • uint32_t

  • int64_t

  • uint64_t

This is the computation precision for the input and output values, i.e., the type to which the input values are converted just before the multiplication instruction executes.

Note

Starting from cuBLASDx 0.3.0, computational precision has been decoupled from data precision, i.e. the input / output data for each matrix can be of arbitrary type (even integral input for floating point GEMM) provided that Alignment Operator is set and at least one of those conditions is met:

  1. It’s implicitly convertible to the data type chosen with Precision Operator and Type Operator.

  2. For inputs: An appropriate converting loading operation is provided as one of the arguments. It takes the input type value. Its result must be at least implicitly convertible to the compute type.

  3. For output: An appropriate converting storing operation is provided as one of the arguments. It takes the result computational type (usually C type as defined by Precision Operator and Type Operator). Its result must be at least implicitly convertible to the output type.

Warning

If using computation precision decoupled from input types, the Alignment Operator must be explicitly set.
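As an illustration, the converting load/store conditions above can be sketched in plain C++. This is a hypothetical host-side analogue, not the cuBLASDx API; the names load_as_float, store_as_int8, and gemm_decoupled are made up for this example:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A load op maps the stored input type to the compute type (condition 2),
// and a store op maps the compute type back to the output type (condition 3).
struct load_as_float {           // input type -> compute type
    float operator()(std::int8_t v) const { return static_cast<float>(v); }
};
struct store_as_int8 {           // compute type -> output type
    std::int8_t operator()(float v) const { return static_cast<std::int8_t>(v); }
};

// Reference GEMM (C = A * B, all matrices column-major) that converts on load
// and on store, the way a decoupled-precision GEMM conceptually behaves.
std::vector<std::int8_t> gemm_decoupled(const std::vector<std::int8_t>& a,
                                        const std::vector<std::int8_t>& b,
                                        int m, int n, int k) {
    load_as_float load;
    store_as_int8 store;
    std::vector<std::int8_t> c(m * n);
    for (int col = 0; col < n; ++col)
        for (int row = 0; row < m; ++row) {
            float acc = 0.0f;    // accumulation happens in the compute precision
            for (int i = 0; i < k; ++i)
                acc += load(a[row + i * m]) * load(b[i + col * k]);
            c[row + col * m] = store(acc);
        }
    return c;
}
```

In cuBLASDx the conversion is performed per element inside the device function; the sketch only shows where the load and store conversions sit relative to the multiplication.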

Arrangement Operator#

cublasdx::Arrangement<cublasdx::arrangement A, cublasdx::arrangement B = cublasdx::col_major, cublasdx::arrangement C = cublasdx::col_major>;

namespace cublasdx {
  enum class arrangement
  {
    col_major,
    row_major
  };

  inline constexpr auto col_major   = arrangement::col_major;
  inline constexpr auto left_layout = arrangement::col_major;
  inline constexpr auto row_major   = arrangement::row_major;
  inline constexpr auto right_major = arrangement::row_major;
}

Sets the order of the global A, B, and C matrices used in the function. The order can be either column-major or row-major. The Arrangement operator directly influences the Get Memory Layout and Suggested shared memory Layout methods: it explicitly sets the default Get Memory Layout value and selects the Suggested shared memory Layout for optimized global-to-shared transfers.

Selecting a specific order for the matrices does not prevent the function from accepting matrices with different layouts, but it may influence overall performance.
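The two arrangements store the same logical matrix in different memory orders. A minimal host-side illustration in plain C++ (not cuBLASDx code; the helper names are made up):

```cpp
#include <array>
#include <cassert>

// For a 2 x 3 matrix M = [[1, 2, 3],
//                         [4, 5, 6]]
// the arrangement decides the order of elements in memory.
constexpr std::array<int, 6> col_major_storage{1, 4, 2, 5, 3, 6}; // columns contiguous
constexpr std::array<int, 6> row_major_storage{1, 2, 3, 4, 5, 6}; // rows contiguous

// Element (row, col) of an m x n matrix under each arrangement:
constexpr int at_col_major(const std::array<int, 6>& s, int row, int col, int m) {
    return s[row + col * m];
}
constexpr int at_row_major(const std::array<int, 6>& s, int row, int col, int n) {
    return s[row * n + col];
}
```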

Warning

Arrangement and TransposeMode operators can’t be defined at the same time.

TransposeMode Operator#

Warning

TransposeMode operator is deprecated since 0.2.0 and may be removed in future versions.

cublasdx::TransposeMode<cublasdx::transpose_mode ATransposeMode, cublasdx::transpose_mode BTransposeMode>;

namespace cublasdx {
  enum class transpose_mode
  {
    non_transposed,
    transposed,
    conj_transposed,
  };

  inline constexpr auto N = transpose_mode::non_transposed;
  inline constexpr auto T = transpose_mode::transposed;
  inline constexpr auto C = transpose_mode::conj_transposed;
}

Sets the transpose mode for the A and B matrices used in the function. For example, TransposeMode<N, N>() sets the transpose mode of the A and B matrices to non-transposed for GEMM. Possible values for transpose mode are:

  • transpose_mode::non_transposed,

  • transpose_mode::transposed, and

  • transpose_mode::conj_transposed (conjugated transposed).

Warning

  • Arrangement and TransposeMode operators can’t be defined at the same time.

  • Using transpose_mode::non_transposed for a matrix in TransposeMode operator corresponds to arrangement::col_major in Arrangement operator.

  • Using transpose_mode::transposed for a matrix in TransposeMode operator corresponds to arrangement::row_major in Arrangement operator.

  • Using transpose_mode::conj_transposed for a matrix in TransposeMode operator corresponds to arrangement::row_major in Arrangement operator and cublasdx::conjugate passed to the execute() method as the transform operator for that matrix.

LeadingDimension Operator#

cublasdx::LeadingDimension<unsigned int LDA, unsigned int LDB, unsigned int LDC>()

Defines leading dimensions for matrices A, B, and C. The leading dimension of a matrix is a stride (in elements) to the beginning of the next column for a column-major matrix or the next row for a row-major matrix.

If Arrangement is used in the description of a general matrix multiplication operation, the A, B, and C matrices can be described in the following way:

  • Real dimensions of matrix A are \(LDA\times K\) with LDA >= M if A is column-major, and \(LDA\times M\) with LDA >= K otherwise.

  • Real dimensions of matrix B are \(LDB\times N\) with LDB >= K if B is column-major, and \(LDB\times K\) with LDB >= N otherwise.

  • Real dimensions of matrix C are \(LDC\times N\) with LDC >= M if C is column-major, and \(LDC\times M\) with LDC >= N otherwise.

A matrix can also be described using a layout (see CuTe: Layout), i.e. a pair of integer tuples: shape and stride (distance between elements). In simple terms, a shape represents the number of elements in each dimension, and stride represents distance between elements in each dimension.

  • A - matrix of \(M\times K\) shape, with stride 1 in the 1st dimension and LDA in the 2nd if A is column-major, or stride LDA in the 1st and 1 in the 2nd otherwise.

  • B - matrix of \(K\times N\) shape, with stride 1 in the 1st dimension and LDB in the 2nd if B is column-major, or stride LDB in the 1st and 1 in the 2nd otherwise.

  • C - matrix of \(M\times N\) shape, with stride 1 in the 1st dimension and LDC in the 2nd if C is column-major, or stride LDC in the 1st and 1 in the 2nd otherwise.
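The stride descriptions above reduce to a simple offset formula. A small host-side sketch in plain C++ (the helper names are made up for illustration):

```cpp
#include <cassert>

// Offset of logical element (row, col) in a matrix stored with leading
// dimension ld, matching the shape/stride description above: a column-major
// matrix has strides (1, ld), a row-major one has strides (ld, 1).
constexpr int offset_col_major(int row, int col, int ld) { return row * 1 + col * ld; }
constexpr int offset_row_major(int row, int col, int ld) { return row * ld + col * 1; }
```

For example, a 2 x 3 column-major matrix padded to LDA = 4 places logical element (1, 2) at linear offset 1 + 2 * 4 = 9; the two padding rows per column are simply skipped over.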

Warning

TransposeMode operator is deprecated since 0.2.0 and may be removed in future versions.

If TransposeMode (deprecated) is used in the description of a general matrix multiplication operation, in BLAS nomenclature the dimensions of the A, B, and C matrices can be described in the following way:

  • Real dimensions of matrix A are \(LDA\times K\) with LDA >= M if A is non-transposed, and \(LDA\times M\) with LDA >= K otherwise.

  • Real dimensions of matrix B are \(LDB\times N\) with LDB >= K if B is non-transposed, and \(LDB\times K\) with LDB >= N otherwise.

  • Real dimensions of matrix C are \(LDC\times N\) with LDC >= M.

See also, suggested_leading_dimension_of.

Alignment Operator#

cublasdx::Alignment<unsigned int AAlignment, unsigned int BAlignment, unsigned int CAlignment>()

cublasdx::MaxAlignment = Alignment<16, 16, 16> // alias for the maximum supported alignments

Defines the alignments (in bytes) of the pointers to matrices A, B, and C (either raw pointers or wrapped in CuTe tensors) that are passed to the execute(...) method.

Note that the alignments directly affect how much shared memory is required for a given description. Additionally, they may also impact performance.

Requirements:

  • AAlignment, BAlignment, and CAlignment must be powers of two, and less than or equal to the maximum supported alignment, i.e., 16 bytes.

  • AAlignment, BAlignment, and CAlignment must be multiples of the alignment of the chosen compute value types, or of the input value types if decoupled precision is used.

Warning

If using computation precision decoupled from input types, the Alignment Operator must be explicitly set.

See also, Suggested Alignment Trait.
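The requirements above can be expressed as a small host-side check (a sketch for illustration; valid_alignment is not part of the cuBLASDx API):

```cpp
#include <cassert>

// Mirrors the listed requirements: an alignment must be a power of two,
// no larger than the 16-byte maximum, and a multiple of the alignment
// of the value type it refers to.
constexpr bool is_pow2(unsigned v) { return v != 0 && (v & (v - 1)) == 0; }

constexpr bool valid_alignment(unsigned align_bytes, unsigned value_type_align) {
    return is_pow2(align_bytes) && align_bytes <= 16u &&
           align_bytes % value_type_align == 0;
}
```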

Function Operator#

cublasdx::Function<cublasdx::function F>()

namespace cublasdx {
  enum class function
  {
    MM
  };
}

Sets the BLAS function to be executed.

General Matrix Multiply#

Function<function::MM> sets the operation to general matrix multiply, defined as one of:

  1. \(\mathbf{C}_{m\times n} = {\alpha} \times \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + {\beta} \times \mathbf{C}_{m\times n}\)

  2. \(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + \mathbf{C}_{m\times n}\)

  3. \(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n}\)

where \({\alpha}\) and \({\beta}\) are scalars (real or complex), and A, B, and C are matrices with dimensions \(A: m\times k\), \(B: k\times n\), and \(C: m\times n\), respectively.

The matrices can be column-major, row-major, or have a custom layout. See Execution Methods, Get Memory Layout, Suggested shared memory Layout, and Arrangement.
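For reference, definition (1) above corresponds to the following plain C++ host implementation for column-major matrices (a naive sketch for clarity, not how cuBLASDx computes it on the GPU):

```cpp
#include <cassert>
#include <vector>

// C = alpha * A * B + beta * C, all matrices column-major with
// lda = m, ldb = k, ldc = m (no padding).
void gemm_reference(int m, int n, int k, double alpha,
                    const std::vector<double>& a,        // m x k
                    const std::vector<double>& b,        // k x n
                    double beta, std::vector<double>& c) // m x n
{
    for (int col = 0; col < n; ++col)
        for (int row = 0; row < m; ++row) {
            double acc = 0.0;
            for (int i = 0; i < k; ++i)
                acc += a[row + i * m] * b[i + col * k];
            c[row + col * m] = alpha * acc + beta * c[row + col * m];
        }
}
```

Definitions (2) and (3) are the special cases alpha = beta = 1 and alpha = 1, beta = 0, respectively.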

SM Operator#

cublasdx::SM<unsigned int CC>()

Sets the target CUDA architecture CC for which the underlying BLAS function should be generated. Supported architectures are:

  • Volta: 700 and 720 (sm_70, sm_72).

  • Turing: 750 (sm_75).

  • Ampere: 800, 860 and 870 (sm_80, sm_86, sm_87).

  • Ada: 890 (sm_89).

  • Hopper: 900 (sm_90, sm_90a).

Note

When compiling cuBLASDx for 9.0a compute capability use 900 in the SM operator (see also CUDA C++ Programming Guide: Feature Availability).

Warning

It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs on GPUs of different CUDA architectures will produce bit-identical results.


Execution Operators#

Execution operators configure how the function will run on the GPU. Combined with Description Operators, they form a complete function descriptor that can be executed on a GPU.

Operator | Description
--- | ---
Block | Creates a block execution object. See Block Configuration Operators.

Block Operator#

cublasdx::Block()

Generates a collective operation to run in a single CUDA block. Threads cooperate to compute the collective operation. The layout and the number of threads participating in the execution can be configured using Block Configuration Operators.

For example, the following code creates a function descriptor for a GEMM function that will run in a single CUDA block:

#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<32, 32, 64>()
              + cublasdx::Precision<double, __half, double>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::TransposeMode<cublasdx::T, cublasdx::N>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<700>()
              + cublasdx::Block());

Block Configuration Operators#

Block configuration operators allow the user to configure the block size of a single CUDA block.

Operator | Default value | Description
--- | --- | ---
BlockDim<X, Y, Z> | Based on heuristics | Number of threads used to perform the BLAS function.

Note

Block configuration operators can only be used with Block Operator.

Warning

It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs but with different block dimensions will produce bit-identical results.

BlockDim Operator#

struct cublasdx::BlockDim<unsigned int X, unsigned int Y, unsigned int Z>()

Sets the CUDA block size to (X, Y, Z) to configure the execution, i.e., the number of threads participating in the execution and their layout. Using this operator, the user can run the BLAS function in a 1D, 2D, or 3D block with different numbers of threads. The configured block dimensions can be accessed via the BLAS::block_dim trait.

Adding BlockDim<X, Y, Z> to the description puts the following requirements on the execution of the BLAS function:

  • Kernel must be launched with 3D block dimensions dim3(X1, Y1, Z1) where X1 >= X, Y1 >= Y, and Z1 >= Z. More precisely:

    • For 1D BlockDim<X> kernel must be launched with dim3(X1, Y1, Z1) where X1 >= X.

    • For 2D BlockDim<X, Y> kernel must be launched with dim3(X, Y1, Z1) where Y1 >= Y.

    • For 3D BlockDim<X, Y, Z> kernel must be launched with dim3(X, Y, Z1) where Z1 >= Z.

  • X * Y * Z threads must participate in the execution.

  • The participating threads must be consecutive (adjacent) threads.

The listed requirements may be lifted or loosened in future releases of cuBLASDx.

Note

cuBLASDx can’t validate every kernel launch configuration at runtime to check that all requirements are met; it is the user’s responsibility to adhere to the rules listed above. Violating those rules is considered undefined behavior and can lead to incorrect results and/or failures.

Examples

BlockDim<64>, kernel launched with block dimensions dim3(128, 1, 1) - OK
BlockDim<64>, kernel launched with block dimensions dim3(64, 4, 1) - OK
BlockDim<64>, kernel launched with block dimensions dim3(64, 2, 2) - OK
BlockDim<16, 16>, kernel launched with block dimensions dim3(16, 32, 1) - OK
BlockDim<16, 16>, kernel launched with block dimensions dim3(16, 16, 2) - OK
BlockDim<8, 8, 8>, kernel launched with block dimensions dim3(8, 8, 16) - OK

BlockDim<64>, kernel launched with block dimensions dim3(32, 1, 1) - INCORRECT
BlockDim<64>, kernel launched with block dimensions dim3(32, 2, 1) - INCORRECT
BlockDim<16, 16>, kernel launched with block dimensions dim3(256, 1, 1) - INCORRECT
BlockDim<8, 8, 8>, kernel launched with block dimensions dim3(512, 2, 1) - INCORRECT
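The launch rules behind these verdicts can be sketched as a host-side check (dim3_t and launch_ok are hypothetical helpers, not part of the cuBLASDx API):

```cpp
#include <cassert>

struct dim3_t { unsigned x, y, z; };  // stand-in for CUDA's dim3 on the host

// Checks the launch rules listed above for a BlockDim<X, Y, Z> description.
// rank is 1 for BlockDim<X>, 2 for BlockDim<X, Y>, 3 for BlockDim<X, Y, Z>;
// unused trailing dimensions of BlockDim default to 1.
bool launch_ok(unsigned X, unsigned Y, unsigned Z, int rank, dim3_t launch) {
    if (rank >= 2 && launch.x != X) return false; // 2D/3D: x must match exactly
    if (rank >= 3 && launch.y != Y) return false; // 3D: y must match exactly
    return launch.x >= X && launch.y >= Y && launch.z >= Z;
}
```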

The value of BlockDim can be accessed from the BLAS description via the BLAS::block_dim trait. When BlockDim is not set, the default block dimensions are used (the default value is BLAS::suggested_block_dim).

Even if the default block dimensions provided by cuBLASDx are smaller than what appears optimal for a kernel, it may still be worth trying the default before increasing the number of threads contributing to the calculations.

Restrictions

  • X * Y * Z must be greater than or equal to 32.

Note

  • It’s recommended that X * Y * Z is 32, 64, 128, 256, 512, or 1024.

  • It’s recommended that X * Y * Z is a multiple of 32.