Other Methods#

Get Memory Layout#

The BLAS::get_layout_<gmem/smem>_<a/b/c>() functions return the global or shared memory CuTe layout for matrix A, B, or C, determined by the matrix sizes, arrangement, and leading dimensions. For shared memory layouts, leading dimensions that are not specified explicitly through a parameter are inferred from the LeadingDimension operator. For global memory layouts, custom leading dimensions must be passed as either a static or a dynamic integral value; otherwise they are inferred from the matrix sizes.

__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a();
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b();
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c();

__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_a();
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_b();
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_c();

// Overloads for specifying the leading dimensions statically at compile time.
// integral_type can be any signed or unsigned integer type; lda, ldb, and ldc
// must be values of that type.
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a(const std::integral_constant<integral_type, lda>);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b(const std::integral_constant<integral_type, ldb>);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c(const std::integral_constant<integral_type, ldc>);

// Overloads for specifying the leading dimensions at execution time.

__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a(const unsigned int lda);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b(const unsigned int ldb);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c(const unsigned int ldc);

__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_a(const unsigned int lda);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_b(const unsigned int ldb);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_c(const unsigned int ldc);

// TRSM only: overloads with batch stride (for BatchesPerBlock > 1)
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a(const unsigned int lda, const unsigned int batch_stride_a);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b(const unsigned int ldb, const unsigned int batch_stride_b);

__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_a(const unsigned int lda, const unsigned int batch_stride_a);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_b(const unsigned int ldb, const unsigned int batch_stride_b);

BLAS::get_layout_<gmem/smem>_<a/b/c>() returns a combination of a memory tag (global or shared) and the layout (cute::Layout) for matrix A, B, or C, which can be passed directly to cublasdx::make_tensor to create a tensor.

BLAS::get_layout_<gmem/smem>_<a/b/c>() returns a matrix layout corresponding to the order set via the Arrangement operator. For example, if the order for the A matrix was set to cublasdx::row_major, the returned layout follows row-major order.

When dynamic leading dimensions are provided by the user at execution time, the function accepts them as arguments; see the example below.

For TRSM descriptors with BatchesPerBlock > 1, the layout functions return a 3D layout (rows x cols x batches). The two-argument overloads get_layout_<gmem/smem>_<a/b>(ld, batch_stride) are available only for TRSM and allow specifying both the inner leading dimension and the stride between consecutive batches. When only one argument is provided, the batch stride defaults to the cosize of the inner (single-batch) layout. When no arguments are provided, both the leading dimension and the batch stride use their default values.

Example

using BLAS = decltype(...);

extern __shared__ __align__(16) cublasdx::byte smem[];

// a, b, c are pointers to global memory of input matrices A and B and output matrix C
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a());
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b());
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c());

auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem);
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::get_layout_smem_b());
auto c_shared_tensor = cublasdx::make_tensor(smem_c, BLAS::get_layout_smem_c());

// With leading dimensions specified at compile time
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a(std::integral_constant<int, lda>{}));
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b(std::integral_constant<int, ldb>{}));
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c(std::integral_constant<int, ldc>{}));

// With leading dimensions specified at execution time
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a(lda));
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b(ldb));
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c(ldc));

auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem, lda, ldb, ldc);
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a(lda));
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::get_layout_smem_b(ldb));
auto c_shared_tensor = cublasdx::make_tensor(smem_c, BLAS::get_layout_smem_c(ldc));

TRSM batch stride example

// TRSM with BatchesPerBlock > 1: use two-argument overload for custom batch strides
using TRSM = decltype(cublasdx::Size<32, 16>()
                      + cublasdx::Function<cublasdx::function::TRSM>()
                      + cublasdx::BatchesPerBlock<4>()
                      + ...);

// Default batch stride (cosize of single-batch layout)
auto a_layout = TRSM::get_layout_smem_a();       // 3D layout: (dim_a, dim_a, 4)
auto b_layout = TRSM::get_layout_smem_b();       // 3D layout: (M, N, 4)

// Custom leading dimension, default batch stride
auto a_layout = TRSM::get_layout_smem_a(lda);

// Custom leading dimension and custom batch stride (TRSM only)
auto a_layout = TRSM::get_layout_smem_a(lda, batch_stride_a);
auto b_layout = TRSM::get_layout_smem_b(ldb, batch_stride_b);

Suggested Shared Memory Layout#

In addition to the get_layout_smem_* functions, cuBLASDx also provides functions that return suggested custom shared memory layouts for matrix A, B, or C, determined by the value types, matrix sizes, arrangements, alignments, block size, and GPU architecture. These suggested layouts are designed to improve the performance of the matrix multiplication itself as well as of the copy operations between shared and global memory, and for that reason they depend on the arrangements of the matrices. Suggested layouts ignore leading dimensions set with the LeadingDimension operator. The largest improvements are expected with row-major A and column-major B.

__forceinline__ __host__ __device__ constexpr static auto suggest_layout_smem_a();
__forceinline__ __host__ __device__ constexpr static auto suggest_layout_smem_b();
__forceinline__ __host__ __device__ constexpr static auto suggest_layout_smem_c();

BLAS::suggest_layout_smem_<a/b/c>() returns a combination of a shared memory tag and the layout of A/B/C (cute::Layout), which can be directly passed to cublasdx::make_tensor to create a tensor.

Example

using BLAS = decltype(Size<128, 128, 128>() + Type<type::real>() + Block() + Precision<float, float, double>());

extern __shared__ __align__(16) cublasdx::byte smem[];

// Slice shared memory into pointers for the A, B, and C matrices
auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem);

// Create tensors with the suggested shared memory layouts for optimal performance
auto suggested_smem_a = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());
auto suggested_smem_b = cublasdx::make_tensor(smem_b, BLAS::suggest_layout_smem_b());
auto suggested_smem_c = cublasdx::make_tensor(smem_c, BLAS::suggest_layout_smem_c());
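
The suggested layouts pair naturally with cublasdx::copy when staging data from global memory. A minimal sketch, assuming a_global_tensor and b_global_tensor were created as in the earlier examples and that the alignments are taken from the cublasdx::alignment_of trait:

// Stage A and B from global memory into the suggested shared memory layouts
using alignment = cublasdx::alignment_of<BLAS>;
cublasdx::copy<BLAS, alignment::a>(a_global_tensor, suggested_smem_a);
cublasdx::copy<BLAS, alignment::b>(b_global_tensor, suggested_smem_b);
cublasdx::copy_wait();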

Data Accumulator#

The data accumulator is an object aware of the execution context and implementation details of a GEMM, from which it can infer which elements of the C matrix are mapped to each thread. cuBLASDx uses accumulators as helper objects for partitioning global and shared memory tensors, as well as for getting, copying, modifying, and transforming register fragments.

Please refer to Accumulator And Register Fragment Tensors for more information on available accumulator functionality.

Get data accumulator#

The default data accumulator is used with non-suggested execution contexts, i.e., those built from the default get_layout_* layouts. Mixing default and suggested contexts may cause subpar performance.

// Get layouts
auto a_smem_layout = BLAS::get_layout_smem_a();
auto b_smem_layout = BLAS::get_layout_smem_b();

// Get accumulator
auto accumulator = BLAS::get_accumulator();

Suggest data accumulator#

The suggested data accumulator is used with suggested execution contexts, i.e., those built from the suggest_layout_* layouts. Mixing default and suggested contexts may cause subpar performance.

// Suggest layouts
auto a_smem_layout = BLAS::suggest_layout_smem_a();
auto b_smem_layout = BLAS::suggest_layout_smem_b();

// Suggest accumulator
auto accumulator = BLAS::suggest_accumulator();
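
A rough sketch of how the pieces fit together, assuming smem_a and smem_b were obtained via cublasdx::slice_shared_memory<BLAS>; the exact execute(...) overloads that consume the accumulator are described in Accumulator And Register Fragment Tensors:

// Pair the suggested layouts with the suggested accumulator
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::suggest_layout_smem_b());
auto accumulator     = BLAS::suggest_accumulator();
// The shared memory tensors and the accumulator are then passed together
// to the register fragment execute(...) overloads.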

Tensor Views#

cuBLASDx provides lazy tensor view wrappers that transform a tensor’s elements on access without copying data. They are lightweight descriptors backed by the original tensor’s storage and are the recommended way to express transposition and conjugation when calling execute().

Warning

The deprecated TransposeMode operator does not work with TRSM. Use tensor views together with the Arrangement operator instead. See Tensor Views in TRSM for details.

cublasdx::transpose_view#

template<class Tensor>
__host__ __device__ auto cublasdx::transpose_view(Tensor&& tensor);

Returns a lazy view of tensor with its first two modes (rows and columns) swapped. No data is copied — the view simply reinterprets the layout. For 1D tensors the input is returned unchanged.
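
For example, a sketch assuming a_shared_tensor is a 2D shared memory tensor created as in the earlier examples:

// N x M view of an M x N tensor; accesses go through the original storage
auto a_transposed = cublasdx::transpose_view(a_shared_tensor);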

cublasdx::conjugate_view#

template<class Tensor>
__host__ __device__ auto cublasdx::conjugate_view(Tensor&& tensor);

Returns a lazy view that applies complex conjugation (\(a + bi \rightarrow a - bi\)) to each element on access. For real types the operation is a no-op.

cublasdx::conj_transpose_view#

template<class Tensor>
__host__ __device__ auto cublasdx::conj_transpose_view(Tensor&& tensor);

Returns a lazy view that both transposes and conjugates the tensor — the Hermitian adjoint (\(A^H\)). Equivalent to applying conjugate_view after transpose_view.
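
The equivalence can be written out directly; a sketch assuming a complex-valued a_shared_tensor:

// Both views yield the same elements on access
auto a_adjoint   = cublasdx::conj_transpose_view(a_shared_tensor);
auto a_adjoint_2 = cublasdx::conjugate_view(cublasdx::transpose_view(a_shared_tensor));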

This is especially important for TRSM when the triangular matrix A must be treated as its conjugate transpose. See Tensor Views in TRSM.

cublasdx::make_transform_view#

template<class Tensor, class Transform>
__host__ __device__ auto cublasdx::make_transform_view(Tensor&& tensor, Transform transform);

Returns a lazy view that applies an arbitrary stateless element-wise transform functor to each element on access. The built-in cublasdx::conjugate functor can be used here; user-defined stateless functors are also accepted. See cublasdx::conjugate for more information.
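
For instance, with a hypothetical user-defined functor (scale_by_two below is illustrative and not part of cuBLASDx):

// Hypothetical stateless functor: doubles each element on access
struct scale_by_two {
    template<class T>
    __host__ __device__ constexpr T operator()(const T& v) const { return v + v; }
};

auto a_scaled = cublasdx::make_transform_view(a_shared_tensor, scale_by_two{});
// The built-in conjugation functor works through the same mechanism
auto a_conj   = cublasdx::make_transform_view(a_shared_tensor, cublasdx::conjugate{});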

Element-wise Functors#

In addition to the tensor views above, cuBLASDx provides two built-in stateless element-wise functors that can be passed directly as transform arguments to execute().

cublasdx::identity#

struct cublasdx::identity;

Passes each element through unchanged. This is the default transform for every load and store position in execute(...). Using identity explicitly is equivalent to omitting that argument.
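
For example, the following two calls are equivalent; a sketch assuming the shared memory execute(...) overload with alpha/beta scaling and defaulted transform arguments:

// Default transforms...
BLAS().execute(alpha, a_shared_tensor, b_shared_tensor, beta, c_shared_tensor);
// ...are the same as passing identity explicitly for every position
BLAS().execute(alpha, a_shared_tensor, b_shared_tensor, beta, c_shared_tensor,
               cublasdx::identity{}, cublasdx::identity{},
               cublasdx::identity{}, cublasdx::identity{});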

cublasdx::conjugate#

struct cublasdx::conjugate;

Applies complex conjugation to each element. For real types the operation is a no-op (element is returned unchanged). For complex types it negates the imaginary part: \(a + bi \rightarrow a - bi\).
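
For example, to conjugate the elements of A as they are loaded; a sketch under the same assumed overload, with the remaining transforms left at their identity defaults:

// Conjugate A on load; B and C keep the default identity transform
BLAS().execute(alpha, a_shared_tensor, b_shared_tensor, beta, c_shared_tensor,
               cublasdx::conjugate{});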

Note

Both functors are stateless. Only stateless functors are accepted by execute(...).