Operators

Operators describe the properties of the problem we want to solve and configure how it is executed. They are divided into Description Operators and Execution Operators.


Description Operators

Operator | Default value | Description
Size<M, N, K> | Not set | Defines the problem size.
Function<function> | Not set | BLAS function. Use function::MM for GEMM.
TransposeMode<transpose_mode, transpose_mode> | Not set | Transpose mode of the A and B matrices.
Precision<P> | float | Precision of the floating-point values used for data and computation: half, float, or double.
Type<type> | type::real | Type of input and output data (type::real or type::complex).
LeadingDimension<LDA, LDB, LDC> | As defined by the Size and TransposeMode operators | Leading dimensions for matrices A, B, and C.
SM<CC> | Not set | Target CUDA architecture for which the BLAS function should be generated.

Description operators define the problem we want to solve. Combined with Execution Operators, they form a complete function descriptor that can be executed on a GPU.

Operators are added (in arbitrary order) to construct the operation descriptor type. For example, to describe a matrix multiplication for non-transposed matrices A (m x k), B (k x n), C (m x n) with complex double precision values where m = 8, n = 16, k = 32 for execution on Volta architecture, one would write:

#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<8, 16, 32>()
              + cublasdx::Precision<double>()
              + cublasdx::Type<cublasdx::type::complex>()
              + cublasdx::TransposeMode<cublasdx::N, cublasdx::N>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<700>());

For a function descriptor to be complete, the following is required:

Size Operator

cublasdx::Size<unsigned int M, unsigned int N, unsigned int K>()

Sets the problem size of the function to be executed.

For GEMM:

  • M - logical number of rows in matrices op(A) and C.

  • N - logical number of columns in matrices op(B) and C.

  • K - logical number of columns in matrix op(A) and rows in matrix op(B).

For example, for GEMM, M, N, and K specify that the A (M x K) matrix is multiplied by the B (K x N) matrix, resulting in the C (M x N) matrix (assuming A and B are non-transposed). See TransposeMode and GEMM.
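The sketch below (an illustration; the size and architecture values are arbitrary choices) shows a Size operator combined with the remaining description operators, with comments spelling out the resulting matrix shapes:

#include <cublasdx.hpp>

// M = 64, N = 128, K = 32:
//   op(A) is 64 x 32, op(B) is 32 x 128, and C is 64 x 128.
using GEMM = decltype(cublasdx::Size<64, 128, 32>()
              + cublasdx::Precision<float>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::TransposeMode<cublasdx::N, cublasdx::N>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<800>());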

Type Operator

cublasdx::Type<cublasdx::type T>()

namespace cublasdx {
  enum class type
  {
    real,
    complex
  };
}

Sets the type of input and output data used in computation. Use type::real for real data type, and type::complex for complex data type.
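As a sketch (the problem size mirrors the example at the beginning of this section), switching between real and complex data only requires changing the Type operator; with type::complex every element consists of a real and an imaginary part in the selected precision:

// The same problem described twice; only the data type differs.
using GEMM_REAL    = decltype(cublasdx::Size<8, 16, 32>()
                      + cublasdx::Precision<double>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::TransposeMode<cublasdx::N, cublasdx::N>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<700>());

using GEMM_COMPLEX = decltype(cublasdx::Size<8, 16, 32>()
                      + cublasdx::Precision<double>()
                      + cublasdx::Type<cublasdx::type::complex>()
                      + cublasdx::TransposeMode<cublasdx::N, cublasdx::N>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<700>());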

Precision Operator

cublasdx::Precision<P>()

Sets the floating-point precision used in computation. Type P (half, float, or double) is the type of the values used for input and output, as well as the underlying type of the values used in computation.
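For example, a half-precision description might look like the sketch below; it assumes that half precision is requested by passing the CUDA __half type (from <cuda_fp16.h>) as P:

#include <cuda_fp16.h>
#include <cublasdx.hpp>

// Half-precision, real-valued GEMM description (sketch).
using GEMM = decltype(cublasdx::Size<16, 16, 16>()
              + cublasdx::Precision<__half>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::TransposeMode<cublasdx::N, cublasdx::N>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<800>());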

TransposeMode Operator

cublasdx::TransposeMode<cublasdx::transpose_mode ATransposeMode, cublasdx::transpose_mode BTransposeMode>()

namespace cublasdx {
  enum class transpose_mode
  {
    non_transposed,
    transposed,
    conj_transposed,
  };

  inline constexpr auto N = transpose_mode::non_transposed;
  inline constexpr auto T = transpose_mode::transposed;
  inline constexpr auto C = transpose_mode::conj_transposed;
}

Sets the transpose mode for the A and B matrices used in the function. For example, TransposeMode<N, N>() sets the transpose mode of both the A and B matrices to non-transposed for GEMM. Possible values for the transpose mode are:

  • transpose_mode::non_transposed,

  • transpose_mode::transposed, and

  • transpose_mode::conj_transposed (conjugated transposed).
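As an illustration (a sketch with arbitrary sizes), the shorthand constants can be used directly in a description; the comments spell out how the transpose mode changes the stored shape of each matrix:

// M = 32, N = 32, K = 64 with A transposed and B non-transposed:
//   A is stored as a 64 x 32 matrix (op(A) = A^T is 32 x 64),
//   B is stored as a 64 x 32 matrix (op(B) = B is 64 x 32).
using GEMM = decltype(cublasdx::Size<32, 32, 64>()
              + cublasdx::Precision<float>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::TransposeMode<cublasdx::T, cublasdx::N>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<700>());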

LeadingDimension Operator

cublasdx::LeadingDimension<unsigned int LDA, unsigned int LDB, unsigned int LDC>()

Defines leading dimensions for matrices A, B, and C. The leading dimension is the stride (in elements) between the beginnings of consecutive columns in memory. It always refers to the length of the first dimension of the matrix as it is stored (see TransposeMode and GEMM), as described below:

  • Real dimensions of matrix A are LDA x K with LDA >= M if A is non-transposed, and LDA x M with LDA >= K otherwise.

  • Real dimensions of matrix B are LDB x N with LDB >= K if B is non-transposed, and LDB x K with LDB >= N otherwise.

  • Real dimensions of matrix C are LDC x N with LDC >= M.

See also: suggested_leading_dimension_of.
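As a sketch (the padding values are arbitrary), the descriptor below stores every matrix with a leading dimension larger than its logical size; the chosen values satisfy the constraints above for non-transposed A and B:

// M = 8, N = 16, K = 32, A and B non-transposed:
//   A is stored as LDA x K = 10 x 32 (LDA >= M),
//   B is stored as LDB x N = 34 x 16 (LDB >= K),
//   C is stored as LDC x N = 12 x 16 (LDC >= M).
using GEMM = decltype(cublasdx::Size<8, 16, 32>()
              + cublasdx::Precision<float>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::TransposeMode<cublasdx::N, cublasdx::N>()
              + cublasdx::LeadingDimension<10, 34, 12>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<700>());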

Function Operator

cublasdx::Function<cublasdx::function F>()

namespace cublasdx {
  enum class function
  {
    MM
  };
}

Sets the BLAS function to be executed.

General Matrix Multiply

Function<function::MM> sets the operation to general matrix multiply, defined as:

\(C = {\alpha} * op(A) * op(B) + {\beta} * C\)

where \({\alpha}\) and \({\beta}\) are scalars (real or complex), and A, B, and C are column-major matrices with dimensions \(op(A): M x K\), \(op(B): K x N\), and \(C: M x N\), respectively.

\(op(A)\) is defined as:

\[\begin{split}op(A) = \left\{\begin{matrix} A, \text{if TransposeMode<N, ...> is set} \\ A^T, \text{if TransposeMode<T, ...> is set} \\ A^H, \text{if TransposeMode<C, ...> is set} \end{matrix}\right.\end{split}\]

\(op(B)\) is defined as:

\[\begin{split}op(B) = \left\{\begin{matrix} B, \text{if TransposeMode<..., N> is set} \\ B^T, \text{if TransposeMode<..., T> is set} \\ B^H, \text{if TransposeMode<..., C> is set} \end{matrix}\right.\end{split}\]
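To make the definition concrete, the following host-side sketch computes C = alpha * op(A) * op(B) + beta * C for column-major matrices with both operands non-transposed (TransposeMode<N, N>); it is only an illustrative reference, not the cuBLASDx implementation:

#include <vector>

// Reference GEMM for column-major, non-transposed A (M x K) and B (K x N).
// Element (i, j) of a column-major matrix with leading dimension ld is
// stored at index i + j * ld.
void reference_gemm(unsigned m, unsigned n, unsigned k,
                    float alpha, const std::vector<float>& a, unsigned lda,
                    const std::vector<float>& b, unsigned ldb,
                    float beta, std::vector<float>& c, unsigned ldc) {
    for (unsigned j = 0; j < n; ++j) {
        for (unsigned i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (unsigned l = 0; l < k; ++l) {
                acc += a[i + l * lda] * b[l + j * ldb];
            }
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}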

SM Operator

cublasdx::SM<unsigned int CC>()

Sets the target architecture CC for the underlying BLAS function to use. Supported architectures are:

  • Volta: 700 and 720 (sm_70, sm_72).

  • Turing: 750 (sm_75).

  • Ampere: 800, 860 and 870 (sm_80, sm_86, sm_87).

  • Ada: 890 (sm_89).

  • Hopper: 900 (sm_90, sm_90a).

Note

When compiling cuBLASDx for the 9.0a compute capability, use 900 in the SM operator (see also CUDA C++ Programming Guide: Feature Availability).

Warning

It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs on GPUs of different CUDA architectures will produce bit-identical results.
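As a sketch, the target architecture can also be made a template parameter of the description so that the same code can be instantiated for several of the architectures listed above; the Arch parameter and its 800 default are illustrative assumptions:

#include <cublasdx.hpp>

// The description is parameterized on the target compute capability.
template<unsigned int Arch = 800>
using GEMM = decltype(cublasdx::Size<32, 32, 32>()
              + cublasdx::Precision<float>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::TransposeMode<cublasdx::N, cublasdx::N>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<Arch>());

// GEMM<700> targets Volta, GEMM<900> targets Hopper, and so on.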


Execution Operators

Execution operators configure how the function will run on the GPU. Combined with Description Operators, they form a complete function descriptor that can be executed on a GPU.

Operator | Description
Block | Creates block execution object. See Block Configuration Operators.

Block Operator

cublasdx::Block()

Generates a collective operation to run in a single CUDA block. Threads will cooperate to compute the collective operation. The layout and the number of threads participating in the execution can be configured using Block Configuration Operators.

For example, the following code creates a function descriptor for a GEMM function that will run in a single CUDA block:

#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<32, 32, 64>()
              + cublasdx::Precision<double>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::TransposeMode<cublasdx::T, cublasdx::N>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<700>()
              + cublasdx::Block());

Block Configuration Operators

Block-configuration operators allow the user to configure the block dimensions of a single CUDA block.

Operator | Default value | Description
BlockDim<X, Y, Z> | Based on heuristics | Number of threads used to perform BLAS function.

Note

Block configuration operators can only be used with Block Operator.

Warning

It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs but with different block dimensions (BlockDim) will produce bit-identical results.

BlockDim Operator

cublasdx::BlockDim<unsigned int X, unsigned int Y, unsigned int Z>()

Sets the CUDA block size to (X, Y, Z) to configure the execution, i.e. it sets the number of threads participating in the execution and their layout. Using this operator, the user can run the BLAS function in a 1D, 2D, or 3D block with a different number of threads. The block dimensions set this way can be accessed via the BLAS::block_dim trait.

Adding BlockDim<X, Y, Z> to the description puts the following requirements on the execution of the BLAS function:

  • The kernel must be launched with 3D block dimensions dim3(X1, Y1, Z1) where X1 >= X, Y1 >= Y, and Z1 >= Z; more precisely:

    • For 1D BlockDim<X>, the kernel must be launched with dim3(X1, Y1, Z1) where X1 >= X.

    • For 2D BlockDim<X, Y>, the kernel must be launched with dim3(X, Y1, Z1) where Y1 >= Y.

    • For 3D BlockDim<X, Y, Z>, the kernel must be launched with dim3(X, Y, Z1) where Z1 >= Z.

  • X * Y * Z threads must be participating in the execution.

  • The participating threads must be consecutive (adjacent) threads.

The listed requirements may be lifted or loosened in future releases of cuBLASDx.

Note

cuBLASDx cannot validate every kernel launch configuration at runtime to check that all requirements are met; it is therefore the user's responsibility to adhere to the rules listed above. Violating these rules is considered undefined behavior and can lead to incorrect results and/or failures.

Examples

BlockDim<64>, kernel launched with block dimensions dim3(128, 1, 1) - OK
BlockDim<64>, kernel launched with block dimensions dim3(64, 4, 1) - OK
BlockDim<64>, kernel launched with block dimensions dim3(64, 2, 2) - OK
BlockDim<16, 16>, kernel launched with block dimensions dim3(16, 32, 1) - OK
BlockDim<16, 16>, kernel launched with block dimensions dim3(16, 16, 2) - OK
BlockDim<8, 8, 8>, kernel launched with block dimensions dim3(8, 8, 16) - OK

BlockDim<64>, kernel launched with block dimensions dim3(32, 1, 1) - INCORRECT
BlockDim<64>, kernel launched with block dimensions dim3(32, 2, 1) - INCORRECT
BlockDim<16, 16>, kernel launched with block dimensions dim3(256, 1, 1) - INCORRECT
BlockDim<8, 8, 8>, kernel launched with block dimensions dim3(512, 2, 1) - INCORRECT

The value of BlockDim can be accessed from the BLAS description via the BLAS::block_dim trait. When BlockDim is not set, the default block dimensions are used (the default value is BLAS::suggested_block_dim).

If the default block dimensions provided by cuBLASDx are smaller than the dimensions that would otherwise be optimal for a kernel, it may still be worthwhile to try the default before increasing the number of threads participating in the calculations.
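The sketch below (the kernel name and its empty body are placeholders) adds BlockDim<256> to a block-execution descriptor and launches the kernel with BLAS::block_dim, which satisfies the launch requirements listed above:

#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<32, 32, 32>()
              + cublasdx::Precision<float>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::TransposeMode<cublasdx::N, cublasdx::N>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<800>()
              + cublasdx::Block()
              + cublasdx::BlockDim<256>());

__global__ void gemm_kernel() {
    // The cooperative GEMM work of 256 consecutive threads would go here.
}

void launch() {
    // GEMM::block_dim is dim3(256, 1, 1) here; launching with it (or with a
    // larger X dimension) satisfies the BlockDim requirements.
    gemm_kernel<<<1, GEMM::block_dim>>>();
}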

Restrictions

  • X * Y * Z must be greater than or equal to 32.

Note

  • It’s recommended that X * Y * Z is 32, 64, 128, 256, 512, or 1024.

  • It’s recommended that X * Y * Z is a multiple of 32.