Operators#

Operators are used to describe the properties and to configure the execution of the problem we want to solve. They are divided into Description Operators and Execution Operators.


Description Operators#

Operator

Default Value

Description

Size<M, N, K>

Not set

Defines the problem size.

Function<function>

Not set

BLAS function. Use function::MM for GEMM.

Arrangement<arrangement, arrangement, arrangement>

row_major, col_major, col_major

Arrangement of each matrix in global memory and default arrangement in shared memory.

TransposeMode<transpose_mode, transpose_mode>

transposed, non-transposed

Transpose mode of each matrix (deprecated since 0.2.0).

Precision<PA, PB, PC>

float, float, float

Computation Precision of GEMM A, B, and C - either all floating point or all integral.

Type<type>

type::real

Type of input and output data (type::real or type::complex).

LeadingDimension<LDA, LDB, LDC>

As defined by the Size and Arrangement (or TransposeMode) operators

Leading dimensions for matrices A, B, C.

Alignment<a_align, b_align, c_align>

alignof(BLAS::a_value_type), alignof(BLAS::b_value_type), alignof(BLAS::c_value_type)

Alignments (in bytes) of matrices A, B, C.

SM<CC>

Not set

Target CUDA architecture for which the BLAS function should be generated.

Side<side>

side::left

Side of the equation on which the triangular matrix A appears. See TRSM.

FillMode<fill_mode>

fill_mode::lower

Which triangle of matrix A contains the data. See TRSM.

Diag<diag>

diag::non_unit

Whether the diagonal of A is stored (non_unit) or implicitly ones (unit). See TRSM.

BatchesPerBlock<N>

1

Number of independent TRSM instances solved per CUDA block. See TRSM.

Description operators define the problem we want to solve. Combined with Execution Operators, they form a complete function descriptor that can be executed on a GPU.

Operators are added (in arbitrary order) to construct the operation descriptor type. For example, to describe a matrix multiplication for non-transposed matrices A (m x k), B (k x n), C (m x n) with complex double precision values where m = 8, n = 16, k = 32 for execution on Ada architecture, one would write:

#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<8, 16, 32>()
              + cublasdx::Precision<double>()
              + cublasdx::Type<cublasdx::type::complex>()
              + cublasdx::Arrangement<cublasdx::col_major, cublasdx::col_major>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<890>());

For a function descriptor to be complete, the operators without default values must be specified, namely Size, Function, and SM, together with an Execution Operator; the remaining description operators fall back to their defaults.

Size Operator#

cublasdx::Size<unsigned int M, unsigned int N, unsigned int K>()

Sets the problem size of the function to be executed.

For GEMM:

  • M - logical number of rows in matrices op(A) and C.

  • N - logical number of columns in matrices op(B) and C.

  • K - logical number of columns in matrix op(A) and rows in op(B).

For example, for GEMM, M, N, and K specify that the A (M x K) matrix is multiplied by the B (K x N) matrix, resulting in the C (M x N) matrix (assuming A and B are non-transposed). See Arrangement, TransposeMode and GEMM.

Type Operator#

cublasdx::Type<cublasdx::type T>;

namespace cublasdx {
  enum class type
  {
    real,
    complex
  };
}

Sets the type of input and output data used in computation. Use type::real for real data type, and type::complex for complex data type.

Precision Operator#

cublasdx::Precision<PA, PB=PA, PC=PA>;

Sets the precision of computation for A, B, and C.

The precision type can be either floating-point:

  • __half

  • float

  • double

  • __nv_fp8_e5m2

  • __nv_fp8_e4m3

  • __nv_bfloat16

  • cublasdx::tfloat32_t

or integral:

  • int8_t

  • uint8_t

  • int16_t

  • uint16_t

  • int32_t

  • uint32_t

  • int64_t

  • uint64_t

It describes the computation precision for the input and output values, i.e. the type to which the input values are converted just before the multiplication instruction is executed.

Note

Starting from cuBLASDx 0.3.0, computational precision has been decoupled from data precision, i.e. the input / output data for each matrix can be of arbitrary type (even integral input for floating point GEMM) provided that the Alignment Operator is set and at least one of the following conditions is met:

  1. It’s implicitly convertible to the data type chosen with Precision Operator and Type Operator.

  2. For inputs: An appropriate converting loading operation is provided as one of the arguments. It takes the input type value. Its result must be at least implicitly convertible to the compute type.

  3. For output: An appropriate converting storing operation is provided as one of the arguments. It takes the result computational type (usually C type as defined by Precision Operator and Type Operator). Its result must be at least implicitly convertible to the output type.

Warning

If using computation precision decoupled from input types, the Alignment Operator must be explicitly set.

Arrangement Operator#

cublasdx::Arrangement<cublasdx::arrangement A, cublasdx::arrangement B = cublasdx::col_major, cublasdx::arrangement C = cublasdx::col_major>;

namespace cublasdx {
  enum class arrangement
  {
    col_major,
    row_major
  };

  inline constexpr auto col_major   = arrangement::col_major;
  inline constexpr auto left_layout = arrangement::col_major;
  inline constexpr auto row_major   = arrangement::row_major;
  inline constexpr auto right_layout = arrangement::row_major;
}

Sets the order for global A, B, and C matrices used in the function. The order can be either column-major or row-major. Arrangement operator directly influences Get Memory Layout and Suggested shared memory Layout methods, by explicitly setting the default Get Memory Layout value and choosing Suggested shared memory Layout for optimized global-shared transfers.

Selecting a specific order does not prevent the function from accepting matrices with different layouts, but it may influence the overall performance.

Warning

Arrangement and TransposeMode operators can’t be defined at the same time.

TransposeMode Operator#

Warning

TransposeMode operator is deprecated since 0.2.0 and may be removed in future versions.

cublasdx::TransposeMode<cublasdx::transpose_mode ATransposeMode, cublasdx::transpose_mode BTransposeMode>;

namespace cublasdx {
  enum class transpose_mode
  {
    non_transposed,
    transposed,
    conj_transposed,
  };

  inline constexpr auto N = transpose_mode::non_transposed;
  inline constexpr auto T = transpose_mode::transposed;
  inline constexpr auto C = transpose_mode::conj_transposed;
}

Sets the transpose mode for the A and B matrices used in the function. For example, TransposeMode<N, N>() sets the transpose mode of both the A and B matrices to non-transposed for GEMM. Possible values for transpose mode are:

  • transpose_mode::non_transposed,

  • transpose_mode::transposed, and

  • transpose_mode::conj_transposed (conjugated transposed).

Warning

  • Arrangement and TransposeMode operators can’t be defined at the same time.

  • Using transpose_mode::non_transposed for a matrix in TransposeMode operator corresponds to arrangement::col_major in Arrangement operator.

  • Using transpose_mode::transposed for a matrix in TransposeMode operator corresponds to arrangement::row_major in Arrangement operator.

  • Using transpose_mode::conj_transposed for a matrix in TransposeMode operator corresponds to arrangement::row_major in Arrangement operator and cublasdx::conjugate passed in execute() method as transform operator for that matrix.

LeadingDimension Operator#

cublasdx::LeadingDimension<unsigned int LDA, unsigned int LDB, unsigned int LDC>()

Defines leading dimensions for matrices A, B, and C. The leading dimension of a matrix is a stride (in elements) to the beginning of the next column for a column-major matrix or the next row for a row-major matrix.

If Arrangement is used in the description of a general matrix multiplication operation, the A, B, and C matrices can be described in the following way:

  • Real dimensions of matrix A are \(LDA\times K\) with LDA >= M if A is column-major, and \(LDA\times M\) with LDA >= K otherwise.

  • Real dimensions of matrix B are \(LDB\times N\) with LDB >= K if B is column-major, and \(LDB\times K\) with LDB >= N otherwise.

  • Real dimensions of matrix C are \(LDC\times N\) with LDC >= M if C is column-major, and \(LDC\times M\) with LDC >= N otherwise.

A matrix can also be described using a layout (see CuTe: Layout), i.e. a pair of integer tuples: shape and stride (distance between elements). In simple terms, a shape represents the number of elements in each dimension, and stride represents distance between elements in each dimension.

  • A - matrix of \(M\times K\) shape, with 1 stride in 1st dimension and LDA in 2nd dimension if A is column-major, or with LDA in 1st and 1 in 2nd dimension otherwise.

  • B - matrix of \(K\times N\) shape, with 1 stride in 1st dimension and LDB in 2nd dimension if B is column-major, or with LDB in 1st and 1 in 2nd dimension otherwise.

  • C - matrix of \(M\times N\) shape, with 1 stride in 1st dimension and LDC in 2nd dimension if C is column-major, or with LDC in 1st and 1 in 2nd dimension otherwise.

Warning

TransposeMode operator is deprecated since 0.2.0 and may be removed in future versions.

If TransposeMode (deprecated) is used in the description of a general matrix multiplication operation, in BLAS nomenclature the dimensions of the A, B, C matrices can be described in the following way:

  • Real dimensions of matrix A are \(LDA\times K\) with LDA >= M if A is non-transposed, and \(LDA\times M\) with LDA >= K otherwise.

  • Real dimensions of matrix B are \(LDB\times N\) with LDB >= K if B is non-transposed, and \(LDB\times K\) with LDB >= N otherwise.

  • Real dimensions of matrix C are \(LDC\times N\) with LDC >= M.

See also, suggested_leading_dimension_of.

Alignment Operator#

cublasdx::Alignment<unsigned int AAlignment, unsigned int BAlignment, unsigned int CAlignment>()

cublasdx::MaxAlignment = Alignment<16, 16, 16> // alias of the maximum supported alignments

Defines the alignments (in bytes) of the pointers to matrices A, B, and C (either raw pointers or wrapped in CuTe tensors) that are passed to the execute(...) method.

The Alignment operator is “sticky”, which means that it dictates the alignment wherever one is not specified explicitly. Some examples are:

  • Alignment in bytes of dynamic leading dimensions (e.g. it’s incorrect to set Alignment to 16 and use leading dimension of 7 for fp16 datatype).

  • In Pipelining API it will describe:
    • byte alignment of global memory pointers,

    • byte alignment of global memory strides, and

    • byte alignment of shared memory pointers (may be bumped up in some cases, as pipeline owns the shared memory).

Note that the alignments directly affect how much shared memory is required for a given description. Additionally, they may also impact performance.

Requirements:

  • AAlignment, BAlignment, CAlignment should be powers of two and less than or equal to the maximum supported alignment, i.e., 16 bytes.

  • AAlignment, BAlignment, CAlignment must be multiples of the alignment of the chosen compute value types, or of the input value types if decoupled precision is used.

Warning

If using computation precision decoupled from input types, the Alignment Operator must be explicitly set.

See also, Suggested Alignment Trait.

Function Operator#

cublasdx::Function<cublasdx::function F>()

namespace cublasdx {
  enum class function
  {
    MM,    // General matrix multiply (GEMM)
    TRSM   // Triangular solve
  };
}

Sets the BLAS function to be executed.

General Matrix Multiply#

Function<function::MM> sets the operation to general matrix multiply, defined as one of:

  1. \(\mathbf{C}_{m\times n} = {\alpha} \times \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n} + {\beta} \times \mathbf{C}_{m\times n}\)

  2. \(\mathbf{C}_{m\times n} = \mathbf{A}_{m\times k} \times \mathbf{B}_{k\times n}\)

where \({\alpha}\) and \({\beta}\) are scalars (real or complex), and A, B, and C are matrices with dimensions \(A: m\times k\), \(B: k\times n\), and \(C: m\times n\), respectively.

The matrices can be column-major, row-major, have a custom or suggested layout. See GEMM Execution Methods, Get Memory Layout, Suggested shared memory Layout, and Arrangement.

Triangular Solve (TRSM)#

Function<function::TRSM> sets the operation to triangular solve. It solves one of:

  1. \(A \times X = B\) — left-side: Side<side::left>

  2. \(X \times A = B\) — right-side: Side<side::right>

where A is a triangular matrix and B is overwritten in-place with the solution X. There is no alpha or beta scalar — the solve is always \(B \leftarrow A^{-1} B\) (left) or \(B \leftarrow B A^{-1}\) (right).

A TRSM descriptor requires Size, SM, and an execution operator. The following operators are optional and default to the values shown:

  • Side — default: side::left

  • FillMode — default: fill_mode::lower

  • Diag — default: diag::non_unit

Size for TRSM takes only two dimensions Size<M, N>: M rows and N columns of B. The triangular matrix A is square with dimension M (left-side) or N (right-side).

Note

TRSM requires linking against the pre-built libcublasdx.fatbin device-code library. A compile-time static_assert is emitted if the CMake target is set up incorrectly.

See Quick Installation Guide for more information regarding linking setup in your CMake project or directly with NVCC.

Example descriptor for a left-side, lower-triangular, non-unit TRSM:

#include <cublasdx.hpp>

using TRSM = decltype(cublasdx::Size<32, 16>()
              + cublasdx::Precision<float>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::Function<cublasdx::function::TRSM>()
              + cublasdx::Side<cublasdx::side::left>()
              + cublasdx::FillMode<cublasdx::fill_mode::lower>()
              + cublasdx::Diag<cublasdx::diag::non_unit>()
              + cublasdx::SM<900>()
              + cublasdx::Block());

SM Operator#

cublasdx::SM<unsigned int CC, sm_modifier Modifier = cublasdx::generic>()

Sets the target architecture CC for the underlying BLAS function to use. Supported architectures are:

  • Turing: 750 (sm_75).

  • Ampere: 800, 860 and 870 (sm_80, sm_86, sm_87).

  • Ada: 890 (sm_89).

  • Hopper: 900 (sm_90).

  • Blackwell: 1000, 1030, 1100, 1200, 1210 (sm_100, sm_103, sm_110, sm_120, sm_121).

Each architecture can have a modifier:

  • cublasdx::generic, portable PTX generation, default option.

  • cublasdx::arch_specific, potentially non-portable PTX generation, accelerated mode enabling WGMMA / 1SM UTCMMA.
    • This implies the use of accelerated target in compilation, e.g. sm_100a instead of sm_100.

  • cublasdx::family_specific, potentially non-portable PTX generation, portable only within a single GPU family.
    • This implies the use of family target in compilation, e.g. sm_100f instead of sm_100.

Warning

It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs on GPUs of different CUDA architectures will produce bit-identical results.

EnableInputStreaming Operator#

cublasdx::EnableInputStreaming()

Informs the library that no load transforms will be applied to inputs, thus allowing streaming of input data from shared memory directly to MMA units without passing through registers.

cuBLASDx allows applying non-destructive in-flight per-element transforms to each element of the A and B inputs, just before they are passed to the actual multiplication. This requires all elements to pass through register memory, which prevents the use of WGMMA and 1SM UTCMMA instructions.

The use of this operator limits preprocessing capabilities but unlocks potentially greater data-movement performance by letting the library handle data movement fully.


TRSM-Specific Operators#

The following operators are used exclusively with Function<function::TRSM>.

Side Operator#

cublasdx::Side<cublasdx::side S>()

namespace cublasdx {
  enum class side
  {
    left,   // Solve A * X = B
    right,  // Solve X * A = B
  };
}

Specifies on which side of the equation the triangular matrix A appears.

  • side::left — solves \(A \times X = B\).

  • side::right — solves \(X \times A = B\).

FillMode Operator#

cublasdx::FillMode<cublasdx::fill_mode FM>()

namespace cublasdx {
  enum class fill_mode
  {
    lower,  // Lower triangle of A is stored
    upper,  // Upper triangle of A is stored
  };
}

Specifies which triangular part of matrix A holds the meaningful data. The other triangle is not accessed.

Diag Operator#

cublasdx::Diag<cublasdx::diag D>()

namespace cublasdx {
  enum class diag
  {
    non_unit,  // Diagonal of A is read from the matrix
    unit,      // Diagonal of A is implicitly 1; diagonal elements are not accessed
  };
}

Specifies whether the diagonal elements of A are explicitly stored (non_unit) or treated as ones without reading them (unit).

BatchesPerBlock Operator#

cublasdx::BatchesPerBlock<unsigned int N>()

Controls how many independent TRSM problem instances (batches) a single CUDA block processes. When omitted, defaults to 1. Use BLAS::suggested_batches_per_block to query the library-recommended value for your descriptor and tune accordingly.

When N > 1, the shared memory tensors passed to execute() must carry a batch mode as their third index. Use BLAS::get_layout_smem_a() and BLAS::get_layout_smem_b() to obtain layouts that include the batch stride.

Traits

  • BLAS::batches_per_block — the selected (or default) value of N.

  • BLAS::suggested_batches_per_block — library-recommended value for this descriptor.


Execution Operators#

Execution operators configure how the function will run on the GPU. Combined with Description Operators, they form a complete function descriptor that can be executed on a GPU.

Operator

Description

Block

Creates block execution object. See Block Configuration Operators.

Thread

Creates thread execution object (TRSM only).

Block Operator#

cublasdx::Block()

Generates a collective operation to run in a single CUDA block. Threads cooperate to compute the collective operation. The layout and the number of threads participating in the execution can be configured using Block Configuration Operators.

For example, the following code creates a function descriptor for a GEMM function that will run in a single CUDA block:

#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<32, 32, 64>()
              + cublasdx::Precision<double, __half, double>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::TransposeMode<cublasdx::T, cublasdx::N>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<890>()
              + cublasdx::Block());

Thread Operator#

cublasdx::Thread()

Generates an independent operation to run in a single CUDA thread. Each thread solves its own problem instance independently, without cooperation with other threads. Currently only supported for TRSM.

For example, the following code creates a function descriptor for a TRSM function that will run independently per thread:

#include <cublasdx.hpp>

using TRSM = decltype(cublasdx::Size<32, 16>()
              + cublasdx::Precision<float>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::Function<cublasdx::function::TRSM>()
              + cublasdx::Side<cublasdx::side::left>()
              + cublasdx::FillMode<cublasdx::fill_mode::lower>()
              + cublasdx::Diag<cublasdx::diag::non_unit>()
              + cublasdx::SM<900>()
              + cublasdx::Thread());

Block Configuration Operators#

Block-configuration operators allow the user to configure the block size of a single CUDA block.

Operators

Default value

Description

BlockDim<X, Y, Z>

Based on heuristics

Number of threads used to perform BLAS function.

Note

Block configuration operators can only be used with Block Operator.

Warning

It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs but with different block dimensions will produce bit-identical results.

BlockDim Operator#

struct cublasdx::BlockDim<unsigned int X, unsigned int Y, unsigned int Z>()

Sets the CUDA block size to (X, Y, Z) to configure the execution, i.e. it sets the number of threads participating in the execution and their layout. Using this operator, the user can run the BLAS function in a 1D, 2D or 3D block with different numbers of threads. The set block dimensions can be accessed via the BLAS::block_dim trait.

Adding BlockDim<X, Y, Z> to the description puts the following requirements on the execution of the BLAS function:

For GEMM:

  • Kernel must be launched with 3D block dimensions dim3(X1, Y1, Z1) where X1 >= X, Y1 >= Y, and Z1 >= Z, also:

    • For 1D BlockDim<X> kernel must be launched with dim3(X1, Y1, Z1) where X1 >= X.

    • For 2D BlockDim<X, Y> kernel must be launched with dim3(X, Y1, Z1) where Y1 >= Y.

    • For 3D BlockDim<X, Y, Z> kernel must be launched with dim3(X, Y, Z1) where Z1 >= Z.

  • X * Y * Z threads must be participating in the execution.

  • The participating threads must be consecutive (adjacent) threads.

The listed requirements may be lifted or loosened in the future releases of cuBLASDx.

For TRSM:

  • The kernel must be launched with exactly the block dimensions BLAS::block_dim, which equals BlockDim<X, Y, Z> if the BlockDim operator is specified, or BLAS::suggested_block_dim if not. Unlike GEMM, TRSM does not allow launching with block dimensions other than specified.

Note

cuBLASDx can’t validate every kernel launch configuration at runtime and check that all requirements are met, thus it is the user’s responsibility to adhere to the rules listed above. Violating those rules is considered undefined behavior and can lead to incorrect results and/or failures.

Examples

BlockDim<64>, kernel launched with block dimensions dim3(128, 1, 1) - OK
BlockDim<64>, kernel launched with block dimensions dim3(64, 4, 1) - OK
BlockDim<64>, kernel launched with block dimensions dim3(64, 2, 2) - OK
BlockDim<16, 16>, kernel launched with block dimensions dim3(16, 32, 1) - OK
BlockDim<16, 16>, kernel launched with block dimensions dim3(16, 16, 2) - OK
BlockDim<8, 8, 8>, kernel launched with block dimensions dim3(8, 8, 16) - OK

BlockDim<64>, kernel launched with block dimensions dim3(32, 1, 1) - INCORRECT
BlockDim<64>, kernel launched with block dimensions dim3(32, 2, 1) - INCORRECT
BlockDim<16, 16>, kernel launched with block dimensions dim3(256, 1, 1) - INCORRECT
BlockDim<8, 8, 8>, kernel launched with block dimensions dim3(512, 2, 1) - INCORRECT

The value of BlockDim can be accessed from BLAS description via BLAS::block_dim trait. When BlockDim is not set, the default block dimensions are used (the default value is BLAS::suggested_block_dim).

If the default block dimensions provided by cuBLASDx are smaller than the ones optimal for a kernel, it may still be worth trying the default before increasing the number of threads contributing to the calculations.

Restrictions

  • X * Y * Z must be greater than or equal to 32.

Note

  • It’s recommended that X * Y * Z is 32, 64, 128, 256, 512, or 1024.

  • It’s recommended that X * Y * Z is a multiple of 32.

StaticBlockDim Operator#

struct cublasdx::StaticBlockDim()

Informs the library that the runtime configuration of the CUDA block will be exactly the same as the BlockDim passed to the descriptor.

This may slightly improve performance, as no extra runtime checks are necessary.

WithPipeline Operator#

cublasdx::WithPipeline()

Informs the library that this tile matrix multiplication description will be used only with the cuBLASDx Pipelining extension.

Warning

It is incorrect to use WithPipeline operator and execute GEMM with regular tile interface.

For example, the following code example creates a function descriptor for GEMM function that will run in a pipeline:

#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<32, 32, 64>()
                      + cublasdx::Precision<double, __half, double>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::TransposeMode<cublasdx::T, cublasdx::N>()
                      + cublasdx::Function<cublasdx::function::MM>()
                      + cublasdx::SM<890>()
                      + cublasdx::Block()
                      + cublasdx::WithPipeline());