Traits

Traits provide the user with information about the function descriptor constructed using Operators. They are divided into Description Traits and Execution Traits.


Description Traits

size_of<Description>: Problem size we intend to solve.
type_of<Description>: Data type used, either type::real or type::complex.
precision_of<Description>: Precision of the underlying floating-point values used for A, B, and C.
function_of<Description>: Function to be executed.
arrangement_of<Description>: Arrangement of the matrices - arrangement::col_major or arrangement::row_major.
transpose_mode_of<Description>: Transpose mode of the matrices - transpose_mode::non_transposed, transpose_mode::transposed, or transpose_mode::conj_transposed.
alignment_of<Description>: Alignments (in bytes) of matrices A, B, and C.
suggested_alignment_of<Description>: Suggested alignments (in bytes) of matrices A, B, and C.
sm_of<Description>: Target architecture for the underlying computation.
is_blas<Description>: true if Description is a function description, formed with Description Operators.
is_blas_execution<Description>: true if Description is a function description configured for execution with Execution Operators.
is_complete_blas<Description>: true if Description is a valid and complete function description, formed with Description Operators.
is_complete_blas_execution<Description>: true if both is_complete_blas<Description> and is_blas_execution<Description> are true.

Description traits can be retrieved from a function descriptor using the helper functions provided. For example:

#include <cublasdx.hpp>
#include <iostream>

using GEMM = decltype(cublasdx::Size<32, 32, 64>()
              + cublasdx::Precision<float, float, double>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::TransposeMode<cublasdx::T, cublasdx::N>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<700>()
              + cublasdx::Block());

if(cublasdx::is_complete_blas<GEMM>::value)
  std::cout << "GEMM (M x N x K): "
    << cublasdx::size_of<GEMM>::m << " x "
    << cublasdx::size_of<GEMM>::n << " x "
    << cublasdx::size_of<GEMM>::k << std::endl;

Size Trait

// std::tuple<unsigned int, unsigned int, unsigned int>
cublasdx::size_of<BLAS>::value
cublasdx::size_of_v<BLAS>

// unsigned int
cublasdx::size_of<BLAS>::m
cublasdx::size_of<BLAS>::n
cublasdx::size_of<BLAS>::k

The size_of trait gives the size of the problem we want to solve, as set by Size Operator. If the descriptor was not created using a Size Operator, compilation will fail with an error message.

Type Trait

// cublasdx::type
cublasdx::type_of<BLAS>::value
cublasdx::type_of_v<BLAS>

Data type (cublasdx::type::real or cublasdx::type::complex) used in the function, as set by Type Operator.
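For example, with the GEMM type defined in the example above (which sets Type<cublasdx::type::real>), the trait can be checked at compile time; this is only an illustrative sketch:

static_assert(cublasdx::type_of<GEMM>::value == cublasdx::type::real,
              "GEMM was described as a real-valued operation");
// Equivalent check using the helper variable template
static_assert(cublasdx::type_of_v<GEMM> == cublasdx::type::real, "");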

Precision Trait

// Precision type
cublasdx::precision_of<BLAS>::a_type
cublasdx::precision_of<BLAS>::b_type
cublasdx::precision_of<BLAS>::c_type

Floating-point precision of the input data A and B and output data C, as set by Precision Operator.
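For example, the GEMM type from the example above sets Precision<float, float, double>, which can be verified at compile time (an illustrative sketch):

#include <type_traits>

static_assert(std::is_same_v<cublasdx::precision_of<GEMM>::a_type, float>, "");
static_assert(std::is_same_v<cublasdx::precision_of<GEMM>::b_type, float>, "");
static_assert(std::is_same_v<cublasdx::precision_of<GEMM>::c_type, double>, "");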

Function Trait

// cublasdx::function
cublasdx::function_of<BLAS>::value
cublasdx::function_of_v<BLAS>

Function to be executed, as set by Function Operator. If the descriptor was not created using a Function Operator, compilation will fail with an error message.

Arrangement Trait

// std::tuple<arrangement, arrangement, arrangement>
cublasdx::arrangement_of<BLAS>::value
cublasdx::arrangement_of_v<BLAS>

// cublasdx::arrangement
cublasdx::arrangement_of<BLAS>::a
cublasdx::arrangement_of<BLAS>::b
cublasdx::arrangement_of<BLAS>::c

Arrangement of the A, B, and C matrices, as set by Arrangement Operator.
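For example, the arrangement set with the Arrangement Operator can be read back from the descriptor (a sketch; the GEMM2 name is introduced only for this example):

using GEMM2 = decltype(cublasdx::Size<32, 32, 32>()
              + cublasdx::Precision<float>()
              + cublasdx::Type<cublasdx::type::real>()
              + cublasdx::Arrangement<cublasdx::arrangement::col_major,
                                      cublasdx::arrangement::row_major,
                                      cublasdx::arrangement::col_major>()
              + cublasdx::Function<cublasdx::function::MM>()
              + cublasdx::SM<700>()
              + cublasdx::Block());

static_assert(cublasdx::arrangement_of<GEMM2>::a == cublasdx::arrangement::col_major, "");
static_assert(cublasdx::arrangement_of<GEMM2>::b == cublasdx::arrangement::row_major, "");
static_assert(cublasdx::arrangement_of<GEMM2>::c == cublasdx::arrangement::col_major, "");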

Transpose Mode Trait

Warning

transpose_mode_of trait is deprecated since 0.2.0 and may be removed in future versions.

// cublasdx::transpose_mode
cublasdx::transpose_mode_of<BLAS>::a_transpose_mode
cublasdx::transpose_mode_of<BLAS>::b_transpose_mode

Transpose mode of the A and B matrices, as set by TransposeMode Operator.

Alignment Trait

// std::tuple<unsigned int, unsigned int, unsigned int>
cublasdx::alignment_of<BLAS>::value
cublasdx::alignment_of_v<BLAS>

// unsigned int
cublasdx::alignment_of<BLAS>::a
cublasdx::alignment_of<BLAS>::b
cublasdx::alignment_of<BLAS>::c

Alignment (in bytes) of matrices A, B, and C.

Suggested Alignment Trait

// cublasdx::Alignment
cublasdx::suggested_alignment_of_t<BLAS>

// unsigned int
cublasdx::suggested_alignment_of<BLAS>::a
cublasdx::suggested_alignment_of<BLAS>::b
cublasdx::suggested_alignment_of<BLAS>::c

Recommended alignments (in bytes) of matrices A, B, and C.

They are set to the maximum supported alignments, i.e. cublasdx::MaxAlignment (see Alignment Operator). It is encouraged to use them, as larger alignments are more likely to result in vectorized shared memory accesses, which in turn speed up the execution.
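For example, the suggested alignments can be applied by adding the returned Alignment operator to the description (a sketch; GEMMAligned is only an illustrative name, and GEMM refers to the example above):

using GEMMAligned = decltype(GEMM() + cublasdx::suggested_alignment_of_t<GEMM>());

// The alignments of the new descriptor now match the suggested ones
static_assert(cublasdx::alignment_of<GEMMAligned>::a == cublasdx::suggested_alignment_of<GEMM>::a, "");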

SM Trait

// unsigned int
cublasdx::sm_of<BLAS>::value
cublasdx::sm_of_v<BLAS>

Target GPU architecture used to run the function, as set by SM Operator; e.g., the value is 700 for Volta (sm_70).
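A compile-time check for the GEMM type defined in the example above (built with SM<700>()):

static_assert(cublasdx::sm_of<GEMM>::value == 700, "");
static_assert(cublasdx::sm_of_v<GEMM> == 700, "");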

is_blas Trait

// bool
cublasdx::is_blas<BLAS>::value

Trait is true if the descriptor is a function description, formed with Description Operators.

is_blas_execution Trait

// bool
cublasdx::is_blas_execution<BLAS>::value

Trait is true if the descriptor is a function description configured for execution, formed with Description Operators and Execution Operators.

is_complete_blas Trait

// bool
cublasdx::is_complete_blas<BLAS>::value

Trait is true if the descriptor is a complete function description, formed with Description Operators.

Note

Complete in this context means that the descriptor has been formed with all the necessary Description Operators and is only missing an Execution Operator to be able to run.

For a function descriptor to be complete, it must at least define the problem size and the function to be executed (see Size Operator and Function Operator).

is_complete_blas_execution Trait

// bool
cublasdx::is_complete_blas_execution<BLAS>::value

Trait is true if both is_blas_execution Trait and is_complete_blas Trait are true.

Note

If is_complete_blas_execution Trait is true for a descriptor, then we can use the Execution Methods to execute the function.
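A minimal sketch of using this trait as a compile-time guard inside a kernel (gemm_kernel is only an illustrative name; the kernel body is omitted):

template<class BLAS>
__global__ void gemm_kernel() {
    // Fail early at compile time if BLAS cannot be executed within a CUDA block
    static_assert(cublasdx::is_complete_blas_execution<BLAS>::value,
                  "BLAS must be a complete description combined with an execution operator");
    // ... call the Execution Method of BLAS here ...
}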

Execution Traits

Execution traits can be retrieved directly from a BLAS descriptor that has been configured with Execution Operators. The available execution traits may depend on the operator used to build the descriptor. Right now, Block Operator is the only available execution operator.

Block Traits

Description::<a/b/c>_value_type: Types of the input and output data A, B, and C. Default: float, float, float.
Description::<a/b/c>_dim: Logical dimensions of the A, B, and C matrices, determined by the problem size and transpose mode.
Description::ld<a/b/c>: Leading dimensions of matrices A, B, and C, determined by the problem size or set via LeadingDimension Operator.
Description::<a/b/c>_alignment: Alignments (in bytes) of matrices A, B, and C, determined by their value types or set via Alignment Operator.
Description::<a/b/c>_size: Number of elements in the A, B, and C matrices, including padding determined by the set leading dimensions.
Description::shared_memory_size: Size (in bytes) of the shared memory required to allocate the input and output matrices, determined by the problem size and the set leading dimensions.
Description::block_dim: CUDA block dimensions (a dim3 value) used to compute the BLAS function, based on a heuristic or set via BlockDim Operator.
Description::suggested_block_dim: Recommended CUDA block dimensions (a dim3 value) for computing the BLAS function, based on a heuristic.
Description::max_threads_per_block: Number of threads in Description::block_dim, i.e. X * Y * Z where (X, Y, Z) = Description::block_dim.

Block traits can be retrieved from descriptors built with Block Operator.

For example:

#include <cublasdx.hpp>
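// Illustrative sketch: a complete block-execution descriptor and a few of its block traits.
// The BLAS name and the chosen operators below are only an example.
using BLAS = decltype(cublasdx::Size<32, 32, 32>()
             + cublasdx::Precision<float>()
             + cublasdx::Type<cublasdx::type::real>()
             + cublasdx::Function<cublasdx::function::MM>()
             + cublasdx::SM<700>()
             + cublasdx::Block());

constexpr unsigned int shared_size = BLAS::shared_memory_size; // bytes needed for A, B, and C
constexpr unsigned int a_elements  = BLAS::a_size;             // number of elements in matrix A
const dim3 block_dim               = BLAS::block_dim;          // CUDA block dimensions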

Value Type Trait

BLAS::a_value_type
BLAS::b_value_type
BLAS::c_value_type

Input (A and B matrices) and output (C matrix) types. See Supported GEMM Data Types.

Determined by Precision and Type.

The default type is float. The value type is real or complex depending on the Type operator used.

The tables below list the value types of the A, B, and C matrices for each precision set via the Precision operator and type set via the Type operator. The Alternative(s) column lists types that can be safely used instead of the <a/b/c>_value_type types. Other types can also be used as long as their alignment and size are the same as those of <a/b/c>_value_type.

Real type (Type<cublasdx::type::real>):

Precision              <a/b/c>_value_type     Alternative(s)
__nv_fp8_e5m2          __nv_fp8_e5m2          cublasdx::float_e5m2_t
__nv_fp8_e4m3          __nv_fp8_e4m3          cublasdx::float_e4m3_t
__half                 __half                 cublasdx::half_t
__nv_bfloat16          __nv_bfloat16          cublasdx::bfloat16_t
cublasdx::tfloat32_t   cublasdx::tfloat32_t   cublasdx::tfloat32_t
float                  float                  float
double                 double                 double

Complex type (Type<cublasdx::type::complex>):

Precision              <a/b/c>_value_type                          Alternative(s)
__nv_fp8_e5m2          cublasdx::complex<__nv_fp8_e5m2>            cublasdx::complex<cublasdx::float_e5m2_t>, __nv_fp8x2_e5m2
__nv_fp8_e4m3          cublasdx::complex<__nv_fp8_e4m3>            cublasdx::complex<cublasdx::float_e4m3_t>, __nv_fp8x2_e4m3
__half                 cublasdx::complex<__half>                   cublasdx::complex<cublasdx::half_t>, __half2
__nv_bfloat16          cublasdx::complex<__nv_bfloat16>            cublasdx::complex<cublasdx::bfloat16_t>, __nv_bfloat162
cublasdx::tfloat32_t   cublasdx::complex<cublasdx::tfloat32_t>
float                  cublasdx::complex<float>                    float2
double                 cublasdx::complex<double>                   double2

Note

Types cublasdx::float_e5m2_t, cublasdx::float_e4m3_t, cublasdx::half_t, cublasdx::bfloat16_t, and cublasdx::tfloat32_t are aliases for CUTLASS types cutlass::float_e5m2_t, cutlass::float_e4m3_t, cutlass::half_t, cutlass::bfloat16_t, and cutlass::tfloat32_t.
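For example, the value types can be used together with the matrix size traits to declare shared memory for the matrices inside a kernel (a sketch; gemm_kernel is only an illustrative name, and the buffers ignore the alignment traits for brevity):

template<class BLAS>
__global__ void gemm_kernel() {
    // Statically sized shared memory buffers typed with the value-type traits
    __shared__ typename BLAS::a_value_type smem_a[BLAS::a_size];
    __shared__ typename BLAS::b_value_type smem_b[BLAS::b_size];
    __shared__ typename BLAS::c_value_type smem_c[BLAS::c_size];
    // ... fill smem_a and smem_b, run the BLAS execution method, read back smem_c ...
}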

Matrix Dim Trait

Warning

Matrix dimension traits are deprecated since 0.2.0 and may be removed in future versions.

// tuple<unsigned int, unsigned int>
BLAS::a_dim
BLAS::b_dim
BLAS::c_dim

Logical dimensions of matrices A, B, C in the form of a (rows, columns) tuple. The dimensions are determined by the problem size, arrangement, and transpose mode.

See GEMM, Size Operator, Arrangement Operator and TransposeMode Operator.

Leading Dimension Trait

// unsigned int
BLAS::lda
BLAS::ldb
BLAS::ldc

Leading dimensions of matrices A, B, C.

See GEMM, Size Operator and LeadingDimension Operator.
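For example, leading dimensions set with the LeadingDimension Operator are reported back by these traits (a sketch; GEMMWithLD is only an illustrative name):

using GEMMWithLD = decltype(cublasdx::Size<32, 32, 32>()
                   + cublasdx::Precision<float>()
                   + cublasdx::Type<cublasdx::type::real>()
                   + cublasdx::Function<cublasdx::function::MM>()
                   + cublasdx::LeadingDimension<64, 48, 64>()
                   + cublasdx::SM<700>()
                   + cublasdx::Block());

static_assert(GEMMWithLD::lda == 64, "");
static_assert(GEMMWithLD::ldb == 48, "");
static_assert(GEMMWithLD::ldc == 64, "");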

Matrix Size Trait

// unsigned int
BLAS::a_size
BLAS::b_size
BLAS::c_size

Number of elements in the A, B, and C matrices, including padding determined by the set leading dimensions.

See GEMM, Size Operator and LeadingDimension Operator.

Alignment Trait

// unsigned int
BLAS::a_alignment
BLAS::b_alignment
BLAS::c_alignment

Alignments (in bytes) of matrices A, B, C. See Alignment Operator.

Shared Memory Size Trait

// unsigned int
BLAS::shared_memory_size

// Shared memory size for the leading dimensions set in the BLAS description
unsigned int BLAS::get_shared_memory_size()
// Calculates shared memory size for dynamic leading dimensions
unsigned int BLAS::get_shared_memory_size(unsigned int lda,
                                          unsigned int ldb,
                                          unsigned int ldc)
// Calculates shared memory size for custom matrices layouts
template<class ALayout, class BLayout, class CLayout>
unsigned int BLAS::get_shared_memory_size(ALayout a_layout,
                                          BLayout b_layout,
                                          CLayout c_layout)

The size of the shared memory (in bytes) required to allocate the input and output matrices and perform the computation.

The value is determined by the sizes of the value types, the matrix sizes, and the alignments.

Note that BLAS::get_shared_memory_size accepts arbitrary CuTe layouts: ALayout, BLayout, and CLayout in the function prototypes above can be either cute::Layout or cute::ComposedLayout.
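A sketch of using the trait to launch a kernel with dynamically allocated shared memory (gemm_kernel and the launch configuration are only illustrative; BLAS refers to a complete block descriptor such as the one in the Block Traits example above):

template<class BLAS>
__global__ void gemm_kernel() {
    extern __shared__ char smem[];
    // ... partition smem into A, B, and C using BLAS::<a/b/c>_size and run the BLAS ...
}

// Host side: opt in to the required amount of dynamic shared memory, then launch
cudaFuncSetAttribute(gemm_kernel<BLAS>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     BLAS::shared_memory_size);
gemm_kernel<BLAS><<<1, BLAS::block_dim, BLAS::shared_memory_size>>>();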

Block Dim Trait

// dim3
BLAS::block_dim

Value of type dim3 representing the number of threads that will be used to perform the requested BLAS function. It is equal to the specified CUDA block dimensions (or the default if none were specified). See BlockDim Operator.

If BlockDim Operator is not used in the BLAS description, the default value of BLAS::block_dim is equal to BLAS::suggested_block_dim.
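For example, combining BlockDim Operator with these traits (a sketch; BLAS refers to the descriptor defined in the Block Traits example above, and BLASWithBlockDim is only an illustrative name):

using BLASWithBlockDim = decltype(BLAS() + cublasdx::BlockDim<256>());

// block_dim now reflects the BlockDim operator: 256 x 1 x 1 threads
static_assert(BLASWithBlockDim::max_threads_per_block == 256, "");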

Suggested Block Dim Trait

// dim3
BLAS::suggested_block_dim

Recommended number of threads for computing the BLAS function, represented as CUDA block dimensions (a dim3 value).

Max Threads Per Block Trait

// unsigned int
BLAS::max_threads_per_block

Number of threads in the set (or recommended) block dimensions, i.e. X * Y * Z where (X, Y, Z) = BLAS::block_dim.

Other Traits

is_supported<Description, Arch>: Checks whether the given BLAS operation is supported on the provided CUDA architecture (Arch).
suggested_leading_dimension_of<Description, Arch>: Provides leading dimensions that may improve the performance of the BLAS operation.

is_supported

namespace cublasdx {
  // BLAS - BLAS description without CUDA architecture defined using SM operator
  // Architecture - unsigned integer representing CUDA architecture (SM)
  template<class BLAS, unsigned int Architecture>
  struct is_supported : std::bool_constant<...> { };

  // Helper variable template
  template<class BLAS, unsigned int Architecture>
  inline constexpr bool is_supported_v = is_supported<BLAS, Architecture>::value;
}

// true if BLAS is supported on the provided CUDA architecture
cublasdx::is_supported<BLAS, Architecture>::value;

cublasdx::is_supported checks whether the BLAS operation is supported on the given CUDA architecture.

Requirements:

  • BLAS must have defined size, function, and transpose mode. See Description Operators section.

  • BLAS must include Block operator.

  • BLAS must not have its target CUDA architecture defined via the SM operator.

Example

using namespace cublasdx;

using BLAS1 = decltype(Size<128, 128, 128>() + Type<type::real>() + Block() + Precision<float>());
cublasdx::is_supported<BLAS1, 900>::value; // true
cublasdx::is_supported<BLAS1, 800>::value; // false
cublasdx::is_supported<BLAS1, 700>::value; // false

using BLAS2 = decltype(Size<96, 96, 96>() + Type<type::real>() + Block() + Precision<float>());
cublasdx::is_supported<BLAS2, 900>::value; // true
cublasdx::is_supported<BLAS2, 800>::value; // true
cublasdx::is_supported<BLAS2, 700>::value; // false

suggested_leading_dimension_of

namespace cublasdx {
  // BLAS - BLAS description without CUDA architecture defined using SM operator
  // Architecture - unsigned integer representing CUDA architecture (SM)
  template<class BLAS, unsigned int Architecture>
  struct suggested_leading_dimension_of {
      static constexpr unsigned int lda;
      static constexpr unsigned int ldb;
      static constexpr unsigned int ldc;

      using type = LeadingDimension<lda, ldb, ldc>;
  };

  // LeadingDimension operator with suggested leading dimensions
  template<class BLAS, unsigned int Architecture>
  using suggested_leading_dimension_of_t = typename suggested_leading_dimension_of<BLAS, Architecture>::type;
}

The cublasdx::suggested_leading_dimension_of trait provides suggested leading dimensions for matrices A, B, and C. It is recommended to try them, as in many cases they lead to improved performance of the BLAS operation. This may be especially helpful when M, N, and K are powers of two or multiples of 16. See the cuBLASDx performance examples.

Requirements:

  • BLAS must have defined size, function, and transpose mode. See Description Operators section.

  • BLAS must include Block operator.

  • Suggested leading dimensions obtained for architecture A1 must not be used with a different architecture A2.

Example

using namespace cublasdx;

using BLAS1 = decltype(Size<128, 128, 128>() + Type<type::real>() + Block() + Precision<float>());
using SuggestedLD = cublasdx::suggested_leading_dimension_of_t<BLAS1, 900>;
using BLAS2 = decltype(BLAS1() + SuggestedLD() + SM<900>());