Traits

Traits provide the user with information about the function descriptor constructed using Operators. They are divided into Description Traits and Execution Traits.


Description Traits

size_of<Description>: Problem size we intend to solve.
type_of<Description>: Data type used, either type::real or type::complex.
precision_of<Description>: Precision of the underlying floating-point values used for A, B, and C.
function_of<Description>: Function to be executed.
arrangement_of<Description>: Arrangement of the matrices - arrangement::col_major or arrangement::row_major.
transpose_mode_of<Description>: Transpose mode of the matrices - transpose_mode::non_transposed, transpose_mode::transposed, or transpose_mode::conj_transposed.
alignment_of<Description>: Alignments (in bytes) of matrices A, B, and C.
suggested_alignment_of<Description>: Suggested alignments (in bytes) of matrices A, B, and C.
sm_of<Description>: Target architecture for the underlying computation.
is_blas<Description>: true if Description is a function description, formed with Description Operators.
is_blas_execution<Description>: true if Description is a function description configured for execution with Execution Operators.
is_complete_blas<Description>: true if Description is a valid and complete function description, formed with Description Operators.
is_complete_blas_execution<Description>: true if both is_complete_blas<Description> and is_blas_execution<Description> are true.

Description traits can be retrieved from a function descriptor using the helper functions provided. For example:

#include <cublasdx.hpp>
#include <iostream>

using GEMM = decltype(cublasdx::Size<32, 32, 64>()
              + cublasdx::Precision<float, float, double>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::TransposeMode<cublasdx::T, cublasdx::N>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<700>()
              + cublasdx::Block());

if(cublasdx::is_complete_blas<GEMM>::value)
  std::cout << "GEMM (M x N x K): "
    << cublasdx::size_of<GEMM>::m << " x "
    << cublasdx::size_of<GEMM>::n << " x "
    << cublasdx::size_of<GEMM>::k << std::endl;

Size Trait

// std::tuple<unsigned int, unsigned int, unsigned int>
cublasdx::size_of<BLAS>::value
cublasdx::size_of_v<BLAS>

// unsigned int
cublasdx::size_of<BLAS>::m
cublasdx::size_of<BLAS>::n
cublasdx::size_of<BLAS>::k

The size_of trait gives the size of the problem we want to solve, as set by Size Operator. If the descriptor was not created using a Size Operator, compilation will fail with an error message.

Type Trait

// cublasdx::type
cublasdx::type_of<BLAS>::value
cublasdx::type_of_v<BLAS>

Data type (cublasdx::type::real or cublasdx::type::complex) used in the function, as set by Type Operator.
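For example, with the GEMM type defined in the example above (which sets Type<cublasdx::type::real>), the trait can be checked at compile time; this is only an illustrative sketch:

static_assert(cublasdx::type_of<GEMM>::value == cublasdx::type::real,
              "GEMM was described as a real-valued operation");
// Equivalent check using the helper variable template
static_assert(cublasdx::type_of_v<GEMM> == cublasdx::type::real, "");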

Precision Trait

// Precision type
cublasdx::precision_of<BLAS>::a_type
cublasdx::precision_of<BLAS>::b_type
cublasdx::precision_of<BLAS>::c_type

Floating-point precision of the input data A and B and output data C, as set by Precision Operator.
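For example, the GEMM type from the example above sets Precision<float, float, double>, which can be verified at compile time (an illustrative sketch):

#include <type_traits>

static_assert(std::is_same_v<cublasdx::precision_of<GEMM>::a_type, float>, "");
static_assert(std::is_same_v<cublasdx::precision_of<GEMM>::b_type, float>, "");
static_assert(std::is_same_v<cublasdx::precision_of<GEMM>::c_type, double>, "");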

Function Trait

// cublasdx::function
cublasdx::function_of<BLAS>::value
cublasdx::function_of_v<BLAS>

Function to be executed, as set by Function Operator. If the descriptor was not created using a Function Operator, compilation will fail with an error message.

Arrangement Trait

// std::tuple<arrangement, arrangement, arrangement>
cublasdx::arrangement_of<BLAS>::value
cublasdx::arrangement_of_v<BLAS>

// cublasdx::arrangement
cublasdx::arrangement_of<BLAS>::a
cublasdx::arrangement_of<BLAS>::b
cublasdx::arrangement_of<BLAS>::c

Arrangement of the A, B, and C matrices, as set by Arrangement Operator.
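For example, the arrangement set with the Arrangement Operator can be read back from the descriptor (a sketch; the GEMM2 name is introduced only for this example):

using GEMM2 = decltype(cublasdx::Size<32, 32, 32>()
              + cublasdx::Precision<float>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::Arrangement<cublasdx::arrangement::col_major,
                                      cublasdx::arrangement::row_major,
                                      cublasdx::arrangement::col_major>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<700>()
              + cublasdx::Block());

static_assert(cublasdx::arrangement_of<GEMM2>::a == cublasdx::arrangement::col_major, "");
static_assert(cublasdx::arrangement_of<GEMM2>::b == cublasdx::arrangement::row_major, "");
static_assert(cublasdx::arrangement_of<GEMM2>::c == cublasdx::arrangement::col_major, "");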

Transpose Mode Trait

Warning

transpose_mode_of trait is deprecated since 0.2.0 and may be removed in future versions.

// cublasdx::transpose_mode
cublasdx::transpose_mode_of<BLAS>::a_transpose_mode
cublasdx::transpose_mode_of<BLAS>::b_transpose_mode

Transpose mode of the A and B matrices, as set by TransposeMode Operator.

Alignment Trait

// std::tuple<unsigned int, unsigned int, unsigned int>
cublasdx::alignment_of<BLAS>::value
cublasdx::alignment_of_v<BLAS>

// unsigned int
cublasdx::alignment_of<BLAS>::a
cublasdx::alignment_of<BLAS>::b
cublasdx::alignment_of<BLAS>::c

Alignment (in bytes) of matrices A, B, and C.

Suggested Alignment Trait

// cublasdx::Alignment
cublasdx::suggested_alignment_of_t<BLAS>

// unsigned int
cublasdx::suggested_alignment_of<BLAS>::a
cublasdx::suggested_alignment_of<BLAS>::b
cublasdx::suggested_alignment_of<BLAS>::c

Recommended alignments (in bytes) of matrices A, B, and C.

They are set to the maximum supported alignments, i.e. cublasdx::MaxAlignment (see Alignment Operator). It is encouraged to use them, as larger alignments are more likely to result in vectorized shared memory accesses, which in turn speed up the execution.
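For example, the suggested alignments can be applied by adding the returned Alignment operator to the description (a sketch; GEMMAligned is only an illustrative name, and GEMM refers to the example above):

using GEMMAligned = decltype(GEMM() + cublasdx::suggested_alignment_of_t<GEMM>());

// The alignments of the new descriptor now match the suggested ones
static_assert(cublasdx::alignment_of<GEMMAligned>::a == cublasdx::suggested_alignment_of<GEMM>::a, "");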

SM Trait

// unsigned int
cublasdx::sm_of<BLAS>::value
cublasdx::sm_of_v<BLAS>

Target GPU architecture used to run the function, as set by SM Operator; e.g., the value is 700 for Volta (sm_70).
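A compile-time check for the GEMM type defined in the example above (built with SM<700>()):

static_assert(cublasdx::sm_of<GEMM>::value == 700, "");
static_assert(cublasdx::sm_of_v<GEMM> == 700, "");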

is_blas Trait

// bool
cublasdx::is_blas<BLAS>::value

Trait is true if the descriptor is a function description, formed with Description Operators.

is_blas_execution Trait

// bool
cublasdx::is_blas_execution<BLAS>::value

Trait is true if the descriptor is a function description configured for execution, formed with Description Operators and Execution Operators.

is_complete_blas Trait

// bool
cublasdx::is_complete_blas<BLAS>::value

Trait is true if the descriptor is a complete function description, formed with Description Operators.

Note

Complete in this context means that the descriptor has been formed with all the necessary Description Operators and is only missing an Execution Operator to be able to run.

For a function descriptor to be complete, it must at least define the problem size and the function to be executed (see Size Operator and Function Operator).

is_complete_blas_execution Trait

// bool
cublasdx::is_complete_blas_execution<BLAS>::value

Trait is true if both is_blas_execution Trait and is_complete_blas Trait are true.

Note

If is_complete_blas_execution Trait is true for a descriptor, then we can use the Execution Methods to execute the function.
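A minimal sketch of using this trait as a compile-time guard inside a kernel (gemm_kernel is only an illustrative name; the kernel body is omitted):

template<class BLAS>
__global__ void gemm_kernel() {
    // Fail early at compile time if BLAS cannot be executed within a CUDA block
    static_assert(cublasdx::is_complete_blas_execution<BLAS>::value,
                  "BLAS must be a complete description combined with an execution operator");
    // ... call the Execution Method of BLAS here ...
}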

Execution Traits

Execution traits can be retrieved directly from a BLAS descriptor that has been configured with Execution Operators. The available execution traits may depend on the operator used to build the descriptor. Right now, Block Operator is the only available execution operator.

Block Traits

Description::<a/b/c>_value_type: Types of the input and output data A, B, and C. Default: float, float, float.
Description::<a/b/c>_dim: Logical dimensions of the A, B, and C matrices, determined by the problem size and transpose mode.
Description::ld<a/b/c>: Leading dimensions of matrices A, B, and C, determined by the problem size or set via LeadingDimension Operator.
Description::<a/b/c>_alignment: Alignments (in bytes) of matrices A, B, and C, determined by their value types or set via Alignment Operator.
Description::<a/b/c>_size: Number of elements in the A, B, and C matrices, including padding determined by the set leading dimensions.
Description::shared_memory_size: Size (in bytes) of the shared memory required to allocate the input and output matrices, determined by the problem size and the set leading dimensions.
Description::block_dim: CUDA block dimensions (a dim3 value) used to compute the BLAS function, based on a heuristic or set via BlockDim Operator.
Description::suggested_block_dim: Recommended CUDA block dimensions (a dim3 value) for computing the BLAS function, based on a heuristic.
Description::max_threads_per_block: Number of threads in Description::block_dim, i.e. X * Y * Z where (X, Y, Z) = Description::block_dim.

Block traits can be retrieved from descriptors built with Block Operator.

For example:

#include <cublasdx.hpp>
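// Illustrative sketch: a complete block-execution descriptor and a few of its block traits.
// The BLAS name and the chosen operators below are only an example.
using BLAS = decltype(cublasdx::Size<32, 32, 32>()
             + cublasdx::Precision<float>()
             + cublasdx::Type<cublasdx::type::real>()
             + cublasdx::Function<cublasdx::function::MM>()
             + cublasdx::SM<700>()
             + cublasdx::Block());

constexpr unsigned int shared_size = BLAS::shared_memory_size; // bytes needed for A, B, and C
constexpr unsigned int a_elements  = BLAS::a_size;             // number of elements in matrix A
const dim3 block_dim               = BLAS::block_dim;          // CUDA block dimensions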

Value Type Trait

BLAS::a_value_type
BLAS::b_value_type
BLAS::c_value_type

Input (A and B matrices) and output (C matrix) types. See Supported GEMM Data Types.

Determined by Precision and Type.

The default type is float. The value type is real or complex depending on the Type operator used.

The tables below list the value types of the A, B, and C matrices for each precision set via the Precision operator and type set via the Type operator. The Alternative(s) column lists types that can be safely used instead of the <a/b/c>_value_type types. Other types can also be used as long as their alignment and size are the same as those of <a/b/c>_value_type.

Real type (Type<cublasdx::type::real>):

Precision              <a/b/c>_value_type     Alternative(s)
__nv_fp8_e5m2          __nv_fp8_e5m2          cublasdx::float_e5m2_t
__nv_fp8_e4m3          __nv_fp8_e4m3          cublasdx::float_e4m3_t
__half                 __half                 cublasdx::half_t
__nv_bfloat16          __nv_bfloat16          cublasdx::bfloat16_t
cublasdx::tfloat32_t   cublasdx::tfloat32_t   cublasdx::tfloat32_t
float                  float                  float
double                 double                 double

Complex type (Type<cublasdx::type::complex>):

Precision              <a/b/c>_value_type                          Alternative(s)
__nv_fp8_e5m2          cublasdx::complex<__nv_fp8_e5m2>            cublasdx::complex<cublasdx::float_e5m2_t>, __nv_fp8x2_e5m2
__nv_fp8_e4m3          cublasdx::complex<__nv_fp8_e4m3>            cublasdx::complex<cublasdx::float_e4m3_t>, __nv_fp8x2_e4m3
__half                 cublasdx::complex<__half>                   cublasdx::complex<cublasdx::half_t>, __half2
__nv_bfloat16          cublasdx::complex<__nv_bfloat16>            cublasdx::complex<cublasdx::bfloat16_t>, __nv_bfloat162
cublasdx::tfloat32_t   cublasdx::complex<cublasdx::tfloat32_t>
float                  cublasdx::complex<float>                    float2
double                 cublasdx::complex<double>                   double2

Note

Types cublasdx::float_e5m2_t, cublasdx::float_e4m3_t, cublasdx::half_t, cublasdx::bfloat16_t, and cublasdx::tfloat32_t are aliases for CUTLASS types cutlass::float_e5m2_t, cutlass::float_e4m3_t, cutlass::half_t, cutlass::bfloat16_t, and cutlass::tfloat32_t.
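For example, the value types can be used together with the matrix size traits to declare shared memory for the matrices inside a kernel (a sketch; gemm_kernel is only an illustrative name, and the buffers ignore the alignment traits for brevity):

template<class BLAS>
__global__ void gemm_kernel() {
    // Statically sized shared memory buffers typed with the value-type traits
    __shared__ typename BLAS::a_value_type smem_a[BLAS::a_size];
    __shared__ typename BLAS::b_value_type smem_b[BLAS::b_size];
    __shared__ typename BLAS::c_value_type smem_c[BLAS::c_size];
    // ... fill smem_a and smem_b, run the BLAS execution method, read back smem_c ...
}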

Matrix Dim Trait

Warning

Matrix dimension traits are deprecated since 0.2.0 and may be removed in future versions.

// tuple<unsigned int, unsigned int>
BLAS::a_dim
BLAS::b_dim
BLAS::c_dim

Logical dimensions of matrices A, B, C in the form of a (rows, columns) tuple. The dimensions are determined by the problem size, arrangement, and transpose mode.

See GEMM, Size Operator, Arrangement Operator and TransposeMode Operator.

Leading Dimension Trait

// unsigned int
BLAS::lda
BLAS::ldb
BLAS::ldc

Leading dimensions of matrices A, B, C.

See GEMM, Size Operator and LeadingDimension Operator.
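For example, leading dimensions set with the LeadingDimension Operator are reported back by these traits (a sketch; GEMMWithLD is only an illustrative name):

using GEMMWithLD = decltype(cublasdx::Size<32, 32, 32>()
                   + cublasdx::Precision<float>()
                   + cublasdx::Type<cublasdx::type::real>()
                   + cublasdx::Function<cublasdx::function::MM>()
                   + cublasdx::LeadingDimension<64, 48, 64>()
                   + cublasdx::SM<700>()
                   + cublasdx::Block());

static_assert(GEMMWithLD::lda == 64, "");
static_assert(GEMMWithLD::ldb == 48, "");
static_assert(GEMMWithLD::ldc == 64, "");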

Matrix Size Trait

// unsigned int
BLAS::a_size
BLAS::b_size
BLAS::c_size

Number of elements in the A, B, and C matrices, including padding determined by the set leading dimensions.

See GEMM, Size Operator and LeadingDimension Operator.

Alignment Trait

// unsigned int
BLAS::a_alignment
BLAS::b_alignment
BLAS::c_alignment

Alignments (in bytes) of matrices A, B, C. See Alignment Operator.

Shared Memory Size Trait

// unsigned int
BLAS::shared_memory_size

// Shared memory size for the leading dimensions set in the BLAS description
unsigned int BLAS::get_shared_memory_size()
// Calculates shared memory size for dynamic leading dimensions
unsigned int BLAS::get_shared_memory_size(unsigned int lda,
                                          unsigned int ldb,
                                          unsigned int ldc)
// Calculates shared memory size for custom matrices layouts
template<class ALayout, class BLayout, class CLayout>
unsigned int BLAS::get_shared_memory_size(ALayout a_layout,
                                          BLayout b_layout,
                                          CLayout c_layout)

The size of the shared memory (in bytes) required to allocate the input and output matrices and perform the computation.

The value is determined by the sizes of the value types, the matrix sizes, and the alignments.

Note that BLAS::get_shared_memory_size accepts arbitrary CuTe layouts: ALayout, BLayout, and CLayout in the function prototypes above can be either cute::Layout or cute::ComposedLayout.
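A sketch of using the trait to launch a kernel with dynamically allocated shared memory (gemm_kernel and the launch configuration are only illustrative; BLAS refers to a complete block descriptor such as the one in the Block Traits example above):

template<class BLAS>
__global__ void gemm_kernel() {
    extern __shared__ char smem[];
    // ... partition smem into A, B, and C using BLAS::<a/b/c>_size and run the BLAS ...
}

// Host side: opt in to the required amount of dynamic shared memory, then launch
cudaFuncSetAttribute(gemm_kernel<BLAS>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     BLAS::shared_memory_size);
gemm_kernel<BLAS><<<1, BLAS::block_dim, BLAS::shared_memory_size>>>();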

Block Dim Trait

// dim3
BLAS::block_dim

Value of type dim3 representing the number of threads that will be used to perform the requested BLAS function. It is equal to the specified CUDA block dimensions (or the default if none were specified). See BlockDim Operator.

If BlockDim Operator is not used in the BLAS description, the default value of BLAS::block_dim is equal to BLAS::suggested_block_dim.
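For example, combining BlockDim Operator with these traits (a sketch; BLAS refers to the descriptor defined in the Block Traits example above, and BLASWithBlockDim is only an illustrative name):

using BLASWithBlockDim = decltype(BLAS() + cublasdx::BlockDim<256>());

// block_dim now reflects the BlockDim operator: 256 x 1 x 1 threads
static_assert(BLASWithBlockDim::max_threads_per_block == 256, "");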

Suggested Block Dim Trait

// dim3
BLAS::suggested_block_dim

Recommended number of threads for computing the BLAS function, represented as CUDA block dimensions (a dim3 value).

Max Threads Per Block Trait

// unsigned int
BLAS::max_threads_per_block

Number of threads in the set (or recommended) block dimensions, i.e. X * Y * Z where (X, Y, Z) = BLAS::block_dim.

Other Traits

is_supported<Description, Arch>: Checks whether the given BLAS operation is supported on the provided CUDA architecture (Arch).
suggested_leading_dimension_of<Description, Arch>: Provides leading dimensions that may improve the performance of the BLAS operation.

is_supported

namespace cublasdx {
  // BLAS - BLAS description without CUDA architecture defined using SM operator
  // Architecture - unsigned integer representing CUDA architecture (SM)
  template<class BLAS, unsigned int Architecture>
  struct is_supported : std::bool_constant<...> { };

  // Helper variable template
  template<class BLAS, unsigned int Architecture>
  inline constexpr bool is_supported_v = is_supported<BLAS, Architecture>::value;
}

// true if BLAS is supported on the provided CUDA architecture
cublasdx::is_supported<BLAS, Architecture>::value;

cublasdx::is_supported checks whether the BLAS operation is supported on the given CUDA architecture.

Requirements:

  • BLAS must have defined size, function, and transpose mode. See Description Operators section.

  • BLAS must include Block operator.

  • BLAS must not have its target CUDA architecture defined via the SM operator.

Example

using namespace cublasdx;

using BLAS1 = decltype(Size<128, 128, 128>() + Type<type::real>() + Block() + Precision<float>());
cublasdx::is_supported<BLAS1, 900>::value; // true
cublasdx::is_supported<BLAS1, 800>::value; // false
cublasdx::is_supported<BLAS1, 700>::value; // false

using BLAS2 = decltype(Size<96, 96, 96>() + Type<type::real>() + Block() + Precision<float>());
cublasdx::is_supported<BLAS2, 900>::value; // true
cublasdx::is_supported<BLAS2, 800>::value; // true
cublasdx::is_supported<BLAS2, 700>::value; // false

suggested_leading_dimension_of

namespace cublasdx {
  // BLAS - BLAS description without CUDA architecture defined using SM operator
  // Architecture - unsigned integer representing CUDA architecture (SM)
  template<class BLAS, unsigned int Architecture>
  struct suggested_leading_dimension_of {
      static constexpr unsigned int lda;
      static constexpr unsigned int ldb;
      static constexpr unsigned int ldc;

      using type = LeadingDimension<lda, ldb, ldc>;
  };

  // LeadingDimension operator with suggested leading dimensions
  template<class BLAS, unsigned int Architecture>
  using suggested_leading_dimension_of_t = typename suggested_leading_dimension_of<BLAS, Architecture>::type;
}

The cublasdx::suggested_leading_dimension_of trait provides suggested leading dimensions for matrices A, B, and C. It is recommended to try them, as in many cases they lead to improved performance of the BLAS operation. This may be especially helpful when M, N, and K are powers of two or multiples of 16. See the cuBLASDx performance examples.

Requirements:

  • BLAS must have defined size, function, and transpose mode. See Description Operators section.

  • BLAS must include Block operator.

  • Suggested leading dimensions obtained for architecture A1 must not be used with a different architecture A2.

Example

using namespace cublasdx;

using BLAS1 = decltype(Size<128, 128, 128>() + Type<type::real>() + Block() + Precision<float>());
using SuggestedLD = cublasdx::suggested_leading_dimension_of_t<BLAS1, 900>;
using BLAS2 = decltype(BLAS1() + SuggestedLD() + SM<900>());