Traits#

Traits provide the user with information about the function descriptor constructed using Operators. They are divided into Description Traits and Execution Traits.


Description Traits#

| Trait | Description |
|---|---|
| size_of&lt;Description&gt; | Problem size we intend to solve. |
| type_of&lt;Description&gt; | Data type used, either type::real or type::complex. |
| precision_of&lt;Description&gt; | Compute precision of the underlying floating-point values used for A, B, and C. |
| function_of&lt;Description&gt; | Function to be executed. |
| arrangement_of&lt;Description&gt; | Arrangement of the matrices: arrangement::col_major or arrangement::row_major. |
| transpose_mode_of&lt;Description&gt; | Transpose mode of the matrices: transpose_mode::non_transposed, transpose_mode::transposed, or transpose_mode::conj_transposed. |
| alignment_of&lt;Description&gt; | Alignments (in bytes) of matrices A, B, and C. |
| leading_dimension_of&lt;Description&gt; | Leading dimensions as defined by the Size, and Arrangement or TransposeMode operators. |
| sm_of&lt;Description&gt; | Target architecture for the underlying computation. |
| is_blas&lt;Description&gt; | true if Description is a function description, formed with Description Operators. |
| is_blas_execution&lt;Description&gt; | true if Description is a function description configured for execution with Execution Operators. |
| is_complete_blas&lt;Description&gt; | true if Description is a valid and complete function description, formed with Description Operators. |
| is_complete_blas_execution&lt;Description&gt; | true if both is_complete_blas&lt;Description&gt; and is_blas_execution&lt;Description&gt; are true. |

Description traits can be retrieved from a function descriptor using the helper functions provided. For example:

#include <cublasdx.hpp>

using GEMM = decltype(cublasdx::Size<32, 32, 64>()
              + cublasdx::Precision<float, float, double>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::TransposeMode<cublasdx::T, cublasdx::N>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<700>()
              + cublasdx::Block());

if(cublasdx::is_complete_blas<GEMM>::value)
  std::cout << "GEMM (M x N x K): "
    << cublasdx::size_of<GEMM>::m << " x "
    << cublasdx::size_of<GEMM>::n << " x "
    << cublasdx::size_of<GEMM>::k << std::endl;

Size Trait#

// std::tuple<unsigned int, unsigned int, unsigned int>
cublasdx::size_of<BLAS>::value
cublasdx::size_of_v<BLAS>

// unsigned int
cublasdx::size_of<BLAS>::m
cublasdx::size_of_v_m<BLAS>

cublasdx::size_of<BLAS>::n
cublasdx::size_of_v_n<BLAS>

cublasdx::size_of<BLAS>::k
cublasdx::size_of_v_k<BLAS>

The size_of trait gives the size of the problem we want to solve, as set by the Size Operator. If the descriptor was not created using a Size Operator, compilation will fail with an error message.

Type Trait#

// cublasdx::type
cublasdx::type_of<BLAS>::value
cublasdx::type_of_v<BLAS>

Data type (cublasdx::type::real or cublasdx::type::complex) used in the function, as set by Type Operator.

Precision Trait#

// Precision type
cublasdx::precision_of<BLAS>::a_type
cublasdx::precision_of_a_t<BLAS>

cublasdx::precision_of<BLAS>::b_type
cublasdx::precision_of_b_t<BLAS>

cublasdx::precision_of<BLAS>::c_type
cublasdx::precision_of_c_t<BLAS>

Compute precision of the input data A and B and output data C, as set by Precision Operator.

Starting with cuBLASDx 0.3.0, input precision has been decoupled from compute precision. Refer to Precision Operator for more details.

Function Trait#

// cublasdx::function
cublasdx::function_of<BLAS>::value
cublasdx::function_of_v<BLAS>

Function to be executed, as set by Function Operator. If the descriptor was not created using a Function Operator, compilation will fail with an error message.

Arrangement Trait#

// std::tuple<arrangement, arrangement, arrangement>
cublasdx::arrangement_of<BLAS>::value
cublasdx::arrangement_of_v<BLAS>

// cublasdx::arrangement
cublasdx::arrangement_of<BLAS>::a
cublasdx::arrangement_of_v_a<BLAS>

cublasdx::arrangement_of<BLAS>::b
cublasdx::arrangement_of_v_b<BLAS>

cublasdx::arrangement_of<BLAS>::c
cublasdx::arrangement_of_v_c<BLAS>

Arrangement of global A, B, and C matrices.

Transpose Mode Trait#

Warning

transpose_mode_of trait is deprecated since 0.2.0 and may be removed in future versions.

// cublasdx::transpose_mode
cublasdx::transpose_mode_of<BLAS>::a_transpose_mode
cublasdx::transpose_mode_of_a<BLAS>

cublasdx::transpose_mode_of<BLAS>::b_transpose_mode
cublasdx::transpose_mode_of_b<BLAS>

Transpose mode of the A and B matrices.

Alignment Trait#

// std::tuple<unsigned int, unsigned int, unsigned int>
cublasdx::alignment_of<BLAS>::value
cublasdx::alignment_of_v<BLAS>

// unsigned int
cublasdx::alignment_of<BLAS>::a
cublasdx::alignment_of_v_a<BLAS>

cublasdx::alignment_of<BLAS>::b
cublasdx::alignment_of_v_b<BLAS>

cublasdx::alignment_of<BLAS>::c
cublasdx::alignment_of_v_c<BLAS>

Alignment (in bytes) of input matrices A, B, and C.

leading_dimension_of#

// std::tuple<unsigned int, unsigned int, unsigned int>
cublasdx::leading_dimension_of<BLAS>::value
cublasdx::leading_dimension_of_v<BLAS>

// unsigned int
cublasdx::leading_dimension_of_v_a<BLAS>
cublasdx::leading_dimension_of_v_b<BLAS>
cublasdx::leading_dimension_of_v_c<BLAS>

Type cublasdx::leading_dimension_of provides leading dimensions for matrices A, B, and C. Returned values are defined by Size, and Arrangement or TransposeMode operators.


SM Trait#

// unsigned int
cublasdx::sm_of<BLAS>::value
cublasdx::sm_of_v<BLAS>

GPU architecture used to run the function; for example, 700 for Volta (sm_70).

is_blas Trait#

// bool
cublasdx::is_blas<BLAS>::value
cublasdx::is_blas_v<BLAS>

Trait is true if the descriptor is a function description, formed with Description Operators.

is_blas_execution Trait#

// bool
cublasdx::is_blas_execution<BLAS>::value
cublasdx::is_blas_execution_v<BLAS>

Trait is true if the descriptor is a function description configured for execution, formed with Description Operators and Execution Operators.

is_complete_blas Trait#

// bool
cublasdx::is_complete_blas<BLAS>::value
cublasdx::is_complete_blas_v<BLAS>

Trait is true if the descriptor is a complete function description, formed with Description Operators.

Note

Complete in this context means that the descriptor has been formed with all the necessary Description Operators and is only missing an Execution Operator to be able to run.

For a function descriptor to be complete, the following is required:

is_complete_blas_execution Trait#

// bool
cublasdx::is_complete_blas_execution<BLAS>::value
cublasdx::is_complete_blas_execution_v<BLAS>

Trait is true if both is_blas_execution Trait and is_complete_blas Trait are true.

Note

If is_complete_blas_execution Trait is true for a descriptor, then we can use the Execution Methods to execute the function.

Execution Traits#

Execution traits can be retrieved directly from a BLAS descriptor that has been configured with Execution Operators. The available execution traits may depend on the operator used to build the descriptor. Right now, Block Operator is the only available execution operator.

Block Traits#

| Trait | Default value | Description |
|---|---|---|
| Description::&lt;a/b/c&gt;_value_type | float, float, float | Type of compute data A, B, C. |
| Description::&lt;a/b/c&gt;_dim | Determined by the problem size and transpose mode. | Logical dimensions of the A, B, C matrices. |
| Description::ld&lt;a/b/c&gt; | Determined by the problem size, or set via the LeadingDimension Operator. | Leading dimensions of matrices A, B, C. |
| Description::&lt;a/b/c&gt;_alignment | Determined by the type of matrices A, B, C, or set via the Alignment Operator. | Alignments (in bytes) of input matrices A, B, C. |
| Description::&lt;a/b/c&gt;_size | Determined by the problem size and the set leading dimensions. | Number of elements in the A, B, C matrices, including padding determined by the set leading dimensions. |
| Description::block_dim | Based on a heuristic, or set via the BlockDim Operator. | Value of type dim3 representing the set or recommended CUDA block dimensions, i.e. the number of threads computing the BLAS function. |
| Description::suggested_block_dim | Based on a heuristic. | Recommended number of threads for computing the BLAS function, represented as CUDA block dimensions (a dim3 value). |
| Description::max_threads_per_block | X * Y * Z, where (X, Y, Z) = Description::block_dim. | Number of threads in Description::block_dim. |

Block traits can be retrieved from descriptors built with Block Operator.

For example:

#include <cublasdx.hpp>

using BLAS = decltype(... + Block());

template<class BLAS>
__launch_bounds__(BLAS::max_threads_per_block)
__global__ void kernel(typename BLAS::a_value_type* input_a, ...) {
  ...
}

Value Type Trait#

BLAS::a_value_type
BLAS::b_value_type
BLAS::c_value_type

Compute value types for the A, B, and C matrices. See Supported MMA Data Types.

They are determined by the Precision and Type operators.

The default type is float. It can be a real or complex type depending on the Type operator set.

Note

Starting from cuBLASDx 0.3.0, computational precision has been decoupled from data precision, i.e. the input / output data for each matrix can be of arbitrary type (even integral input for a floating-point GEMM) provided that the Alignment Operator is set and at least one of these conditions is met:

  1. It’s implicitly convertible to the data type chosen with Precision Operator and Type Operator.

  2. For inputs: An appropriate converting loading operation is provided as one of the arguments. It takes the input type value. Its result must be at least implicitly convertible to the compute type.

  3. For output: An appropriate converting storing operation is provided as one of the arguments. It takes the result computational type (usually C type as defined by Precision Operator and Type Operator). Its result must be at least implicitly convertible to the output type.

The tables below present the value types for the A, B, and C matrices depending on the precisions set in the Precision operator and the type set via the Type operator. Refer to those operators for a full list of supported precisions.

Columns marked as alternatives present types that can be safely used instead of <a/b/c>_value_type types. Other types can also be used as long as their alignment and size are the same as those of <a/b/c>_value_type.

Type: real

| Precision | &lt;a/b/c&gt;_value_type | Alternative #1 |
|---|---|---|
| __nv_fp8_e5m2 | __nv_fp8_e5m2 | cublasdx::float_e5m2_t |
| __nv_fp8_e4m3 | __nv_fp8_e4m3 | cublasdx::float_e4m3_t |
| __half | __half | cublasdx::half_t |
| __nv_bfloat16 | __nv_bfloat16 | cublasdx::bfloat16_t |
| cublasdx::tfloat32_t | cublasdx::tfloat32_t | cublasdx::tfloat32_t |
| other | other | N/A |

Type: complex

| Precision | &lt;a/b/c&gt;_value_type | Alternative #1 | Alternative #2 |
|---|---|---|---|
| __nv_fp8_e5m2 | cublasdx::complex&lt;__nv_fp8_e5m2&gt; | cublasdx::complex&lt;cublasdx::float_e5m2_t&gt; | __nv_fp8x2_e5m2 |
| __nv_fp8_e4m3 | cublasdx::complex&lt;__nv_fp8_e4m3&gt; | cublasdx::complex&lt;cublasdx::float_e4m3_t&gt; | __nv_fp8x2_e4m3 |
| __half | cublasdx::complex&lt;__half&gt; | cublasdx::complex&lt;cublasdx::half_t&gt; | __half2 |
| __nv_bfloat16 | cublasdx::complex&lt;__nv_bfloat16&gt; | cublasdx::complex&lt;cublasdx::bfloat16_t&gt; | __nv_bfloat162 |
| other | cublasdx::complex&lt;other&gt; | N/A | N/A |

Note

Types cublasdx::float_e5m2_t, cublasdx::float_e4m3_t, cublasdx::half_t, cublasdx::bfloat16_t, and cublasdx::tfloat32_t are aliases for CUTLASS types cutlass::float_e5m2_t, cutlass::float_e4m3_t, cutlass::half_t, cutlass::bfloat16_t, and cutlass::tfloat32_t.

Matrix Dim Trait#

Warning

Matrix dimension traits are deprecated since 0.2.0 and may be removed in future versions.

// tuple<unsigned int, unsigned int>
BLAS::a_dim
BLAS::b_dim
BLAS::c_dim

Logical dimensions of matrices A, B, C in the form of a (rows, columns) tuple. The dimensions are determined by the problem size, arrangement, and transpose mode.

See GEMM, Size Operator, Arrangement Operator and TransposeMode Operator.

Leading Dimension Trait#

// unsigned int
BLAS::lda
BLAS::ldb
BLAS::ldc

Leading dimensions of matrices A, B, C.

See GEMM, Size Operator and LeadingDimension Operator.

Matrix Size Trait#

// unsigned int
BLAS::a_size
BLAS::b_size
BLAS::c_size

Number of elements in A, B, C matrices. It includes padding determined by set leading dimensions.

See GEMM, Size Operator and LeadingDimension Operator.

Alignment Trait#

// unsigned int
BLAS::a_alignment
BLAS::b_alignment
BLAS::c_alignment

Alignments (in bytes) of matrices A, B, C. See Alignment Operator.

Block Dim Trait#

// dim3
BLAS::block_dim

Value of type dim3 representing the number of threads that will be used to perform the requested BLAS function. It is equal to the specified (or default, if not specified) CUDA block dimensions. See BlockDim Operator.

If BlockDim is not used in the BLAS description, the default value of BLAS::block_dim is equal to BLAS::suggested_block_dim.

Suggested Block Dim Trait#

// dim3
BLAS::suggested_block_dim

Recommended number of threads for BLAS description.

Max Threads Per Block Trait#

BLAS::max_threads_per_block

Total number of threads in the recommended or set block dimensions: X * Y * Z, where (X, Y, Z) = Description::block_dim.

Other Traits#

| Trait | Description |
|---|---|
| is_supported_smem_restrict&lt;Description, Arch&gt; | Checks whether the given BLAS operation is supported with the shared memory API, without aliasing matrices, on the provided CUDA architecture (Arch). |
| is_supported_rmem_restrict&lt;Description, Arch&gt; | Checks whether the given BLAS operation is supported with the register API, without aliasing matrices, on the provided CUDA architecture (Arch). |
| suggested_leading_dimension_of&lt;Description, Arch&gt; | Provides leading dimensions that may improve the performance of the BLAS operation. |
| suggested_alignment_of&lt;Description&gt; | Suggested alignments (in bytes) of matrices A, B, and C. |

is_supported_smem_restrict#

namespace cublasdx {
  // BLAS - BLAS description without CUDA architecture defined using SM operator
  // Architecture - unsigned integer representing CUDA architecture (SM)
  template<class BLAS, unsigned int Architecture>
  struct is_supported_smem_restrict : std::bool_constant<...> { };

  // Helper variable template
  template<class BLAS, unsigned int Architecture>
  inline constexpr bool is_supported_smem_restrict_v = is_supported_smem_restrict<BLAS, Architecture>::value;
}

// true if BLAS is supported on the provided CUDA architecture
cublasdx::is_supported_smem_restrict<BLAS, Architecture>::value;

cublasdx::is_supported_smem_restrict checks whether a BLAS operation is supported on the CUDA architecture given by Architecture, assuming that:

  1. The shared memory API is used (all of the A, B, and C matrices have to fit into shared memory),

  2. the input precision is the same as the compute precision, and

  3. the matrices do not alias or overlap each other's ranges,
    • i.e. the first element of matrix B starts after the last element of matrix A, etc.


Requirements:

  • BLAS must have defined size, function, and transpose mode. See Description Operators section.

  • BLAS must include Block operator.

  • BLAS must not have the target CUDA architecture defined via the SM operator.

Example

using namespace cublasdx;

using BLAS1 = decltype(Size<128, 128, 128>() + Type<type::real>() + Block() + Precision<float>());
cublasdx::is_supported_smem_restrict<BLAS1, 900>::value; // true
cublasdx::is_supported_smem_restrict<BLAS1, 800>::value; // false
cublasdx::is_supported_smem_restrict<BLAS1, 700>::value; // false

using BLAS2 = decltype(Size<96, 96, 96>() + Type<type::real>() + Block() + Precision<float>());
cublasdx::is_supported_smem_restrict<BLAS2, 900>::value; // true
cublasdx::is_supported_smem_restrict<BLAS2, 800>::value; // true
cublasdx::is_supported_smem_restrict<BLAS2, 700>::value; // false

is_supported_rmem_restrict#

namespace cublasdx {
  // BLAS - BLAS description without CUDA architecture defined using SM operator
  // Architecture - unsigned integer representing CUDA architecture (SM)
  template<class BLAS, unsigned int Architecture>
  struct is_supported_rmem_restrict : std::bool_constant<...> { };

  // Helper variable template
  template<class BLAS, unsigned int Architecture>
  inline constexpr bool is_supported_rmem_restrict_v = is_supported_rmem_restrict<BLAS, Architecture>::value;
}

// true if BLAS is supported on the provided CUDA architecture
cublasdx::is_supported_rmem_restrict<BLAS, Architecture>::value;

cublasdx::is_supported_rmem_restrict checks whether a BLAS operation is supported on the CUDA architecture given by Architecture, assuming that:

  1. The register API is used (only the A and B matrices have to fit into shared memory),

  2. the input precision is the same as the compute precision, and

  3. the matrices do not alias or overlap each other's ranges,
    • i.e. the first element of matrix B starts after the last element of matrix A.


Requirements:

  • BLAS must have defined size, function, and transpose mode. See Description Operators section.

  • BLAS must include Block operator.

  • BLAS must not have the target CUDA architecture defined via the SM operator.

Example

using namespace cublasdx;

using BLAS = decltype(Size<128, 128, 128>() + Type<type::real>() + Block() + Precision<float>());
cublasdx::is_supported_smem_restrict<BLAS, 800>::value; // false
cublasdx::is_supported_rmem_restrict<BLAS, 800>::value; // true

suggested_leading_dimension_of#

namespace cublasdx {
  // BLAS - BLAS description without CUDA architecture defined using SM operator
  // Architecture - unsigned integer representing CUDA architecture (SM)
  template<class BLAS, unsigned int Architecture>
  struct suggested_leading_dimension_of {
      static constexpr unsigned int lda;
      static constexpr unsigned int ldb;
      static constexpr unsigned int ldc;

      using type = LeadingDimension<lda, ldb, ldc>;

      using value_type                  = COMMONDX_STL_NAMESPACE::tuple<unsigned int, unsigned int, unsigned int>;
      static constexpr value_type value = value_type {lda, ldb, ldc};
      constexpr                   operator value_type() const noexcept { return value; }
  };

  // LeadingDimension operator with suggested leading dimensions
  template<class BLAS, unsigned int Architecture>
  using suggested_leading_dimension_of_t = typename suggested_leading_dimension_of<BLAS, Architecture>::type;

  template<class Description, unsigned int Architecture>
  inline constexpr COMMONDX_STL_NAMESPACE::tuple<unsigned int, unsigned int, unsigned int> suggested_leading_dimension_of_v = suggested_leading_dimension_of<Description, Architecture>::value;

  template<class Description, unsigned int Architecture>
  inline constexpr unsigned int suggested_leading_dimension_of_v_a = suggested_leading_dimension_of<Description, Architecture>::lda;
  template<class Description, unsigned int Architecture>
  inline constexpr unsigned int suggested_leading_dimension_of_v_b = suggested_leading_dimension_of<Description, Architecture>::ldb;
  template<class Description, unsigned int Architecture>
  inline constexpr unsigned int suggested_leading_dimension_of_v_c = suggested_leading_dimension_of<Description, Architecture>::ldc;
}

Type cublasdx::suggested_leading_dimension_of provides suggested leading dimensions for matrices A, B, and C. It is recommended to try them, as in many cases they lead to improved performance of the BLAS operation. This can be especially helpful when M, N, K are powers of two or multiples of 16. You can review the cuBLASDx performance examples.

Requirements:

  • BLAS must have defined size, function, and transpose mode. See Description Operators section.

  • BLAS must include Block operator.

  • Suggested leading dimensions obtained for one architecture (A1) must not be used with a different architecture (A2).

Example

using namespace cublasdx;

using BLAS1 = decltype(Size<128, 128, 128>() + Type<type::real>() + Block() + Precision<float>());
using SuggestedLD = cublasdx::suggested_leading_dimension_of_t<BLAS1, 900>;
using BLAS2 = decltype(BLAS1() + SuggestedLD() + SM<900>());

Suggested Alignment Trait#

// cublasdx::Alignment
cublasdx::suggested_alignment_of_t<BLAS>

// std::tuple<unsigned int, unsigned int, unsigned int>
cublasdx::suggested_alignment_of_v<BLAS>

// unsigned int
cublasdx::suggested_alignment_of<BLAS>::a
cublasdx::suggested_alignment_of_v_a<BLAS>

cublasdx::suggested_alignment_of<BLAS>::b
cublasdx::suggested_alignment_of_v_b<BLAS>

cublasdx::suggested_alignment_of<BLAS>::c
cublasdx::suggested_alignment_of_v_c<BLAS>

Recommended alignments (in bytes) of input matrices A, B, and C.

They are set to the maximum supported alignments, i.e. cublasdx::MaxAlignment. See Alignment Operator. Using them is encouraged, as larger alignments are more likely to result in vectorized shared memory accesses, which in turn speeds up execution.