Other Methods#
Get Memory Layout#
The BLAS::get_layout_<gmem/smem>_<a/b/c>() functions fetch the global or shared memory CuTe layout for matrix A, B, or C,
determined by the matrix sizes, arrangement, and leading dimensions.
For shared memory layouts, the leading dimensions, if not specified
explicitly through a parameter, are inferred from the LeadingDimension operator. For
global memory layouts, custom leading dimensions must be passed through either a static or a dynamic
integral type; otherwise they are inferred from the matrix sizes.
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a();
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b();
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c();
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_a();
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_b();
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_c();
// Overloads for specifying the leading dimensions statically at compile time.
// integral_type can be a signed or unsigned integer type; integral_value must be
// of that type.
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a(const std::integral_constant<integral_type, lda>);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b(const std::integral_constant<integral_type, ldb>);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c(const std::integral_constant<integral_type, ldc>);
// Overloads for specifying the leading dimensions at run time.
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a(const unsigned int lda);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b(const unsigned int ldb);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_c(const unsigned int ldc);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_a(const unsigned int lda);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_b(const unsigned int ldb);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_c(const unsigned int ldc);
// TRSM only: overloads with batch stride (for BatchesPerBlock > 1)
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_a(const unsigned int lda, const unsigned int batch_stride_a);
__forceinline__ __host__ __device__ constexpr static auto get_layout_gmem_b(const unsigned int ldb, const unsigned int batch_stride_b);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_a(const unsigned int lda, const unsigned int batch_stride_a);
__forceinline__ __host__ __device__ constexpr static auto get_layout_smem_b(const unsigned int ldb, const unsigned int batch_stride_b);
BLAS::get_layout_<gmem/smem>_<a/b/c>() returns a combination of a memory tag (global or shared) and the layout
(cute::Layout)
for matrix A, B, or C, which can be passed directly to cublasdx::make_tensor to create a tensor.
BLAS::get_layout_<gmem/smem>_<a/b/c>() returns a matrix layout corresponding to the order set via the
Arrangement operator. For example, if the order for the A matrix was set to cublasdx::row_major,
the returned layout follows row-major order.
If dynamic leading dimensions are provided at run time, the function accepts the leading dimension as an argument; see the example below.
For TRSM descriptors with BatchesPerBlock > 1, the layout functions return a 3D layout
(rows x cols x batches). The two-argument overloads get_layout_<gmem/smem>_<a/b>(ld, batch_stride) are available
only for TRSM and allow specifying both the inner leading dimension and the stride between consecutive batches.
When only one argument is provided, the batch stride defaults to the cosize of the inner (single-batch) layout.
When no arguments are provided, both the leading dimension and the batch stride use their default values.
Example
using BLAS = decltype(...);
extern __shared__ __align__(16) cublasdx::byte smem[];
// a, b, c are pointers to global memory of input matrices A and B and output matrix C
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a());
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b());
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c());
auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem);
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::get_layout_smem_b());
auto c_shared_tensor = cublasdx::make_tensor(smem_c, BLAS::get_layout_smem_c());
// With leading dimensions specified at compile time
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a(std::integral_constant<int, lda>{}));
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b(std::integral_constant<int, ldb>{}));
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c(std::integral_constant<int, ldc>{}));
// With leading dimensions specified at run time
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a(lda));
auto b_global_tensor = cublasdx::make_tensor(b, BLAS::get_layout_gmem_b(ldb));
auto c_global_tensor = cublasdx::make_tensor(c, BLAS::get_layout_gmem_c(ldc));
auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem, lda, ldb, ldc);
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a(lda));
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::get_layout_smem_b(ldb));
auto c_shared_tensor = cublasdx::make_tensor(smem_c, BLAS::get_layout_smem_c(ldc));
TRSM batch stride example
// TRSM with BatchesPerBlock > 1: use two-argument overload for custom batch strides
using TRSM = decltype(cublasdx::Size<32, 16>()
+ cublasdx::Function<cublasdx::function::TRSM>()
+ cublasdx::BatchesPerBlock<4>()
+ ...);
// Default batch stride (cosize of single-batch layout)
auto a_layout = TRSM::get_layout_smem_a(); // 3D layout: (dim_a, dim_a, 4)
auto b_layout = TRSM::get_layout_smem_b(); // 3D layout: (M, N, 4)
// Custom leading dimension, default batch stride
auto a_layout = TRSM::get_layout_smem_a(lda);
// Custom leading dimension and custom batch stride (TRSM only)
auto a_layout = TRSM::get_layout_smem_a(lda, batch_stride_a);
auto b_layout = TRSM::get_layout_smem_b(ldb, batch_stride_b);
Data Accumulator#
The data accumulator is an object aware of the execution context and implementation details of a GEMM, from which it
can infer which elements of the C matrix are mapped to each thread. cuBLASDx uses accumulators as helper objects
for partitioning global and shared memory tensors, as well as for getting, copying, modifying, and transforming register fragments.
Please refer to Accumulator And Register Fragment Tensors for more information on the available accumulator functionality.
Get data accumulator#
The default data accumulator is used for non-suggested execution contexts. Mixing contexts may cause subpar performance.
// Get layouts
auto a_smem_layout = BLAS::get_layout_smem_a();
auto b_smem_layout = BLAS::get_layout_smem_b();
// Get accumulator
auto accumulator = BLAS::get_accumulator();
Suggest data accumulator#
The suggested data accumulator is used for suggested execution contexts. Mixing contexts may cause subpar performance.
// Suggest layouts
auto a_smem_layout = BLAS::suggest_layout_smem_a();
auto b_smem_layout = BLAS::suggest_layout_smem_b();
// Suggest accumulator
auto accumulator = BLAS::suggest_accumulator();
Tensor Views#
cuBLASDx provides lazy tensor view wrappers that transform a tensor’s elements on access without
copying data. They are lightweight descriptors backed by the original tensor’s storage and are
the recommended way to express transposition and conjugation when calling execute().
Warning
The deprecated TransposeMode operator does not work with TRSM. Use tensor views together
with the Arrangement operator instead.
See Tensor Views in TRSM for details.
cublasdx::transpose_view#
template<class Tensor>
__host__ __device__ auto cublasdx::transpose_view(Tensor&& tensor);
Returns a lazy view of tensor with its first two modes (rows and columns) swapped. No data
is copied; the view simply reinterprets the layout. For 1D tensors the input is returned
unchanged.
cublasdx::conjugate_view#
template<class Tensor>
__host__ __device__ auto cublasdx::conjugate_view(Tensor&& tensor);
Returns a lazy view that applies complex conjugation (\(a + bi \rightarrow a - bi\)) to each element on access. For real types the operation is a no-op.
cublasdx::conj_transpose_view#
template<class Tensor>
__host__ __device__ auto cublasdx::conj_transpose_view(Tensor&& tensor);
Returns a lazy view that both transposes and conjugates the tensor, i.e. the Hermitian adjoint
(\(A^H\)). Equivalent to applying conjugate_view after transpose_view.
This is especially important for TRSM when the triangular matrix A must be treated as its
conjugate transpose. See Tensor Views in TRSM.
cublasdx::make_transform_view#
template<class Tensor, class Transform>
__host__ __device__ auto cublasdx::make_transform_view(Tensor&& tensor, Transform transform);
Returns a lazy view that applies an arbitrary stateless element-wise transform functor to each
element on access. The built-in cublasdx::conjugate functor can be used here; user-defined
stateless functors are also accepted. See cublasdx::conjugate for more information.
Element-wise Functors#
In addition to the tensor views above, cuBLASDx provides two built-in stateless element-wise functors that can be passed directly as transform arguments to execute().
cublasdx::identity#
struct cublasdx::identity;
Passes each element through unchanged. This is the default transform for every load and store
position in execute(...). Using identity explicitly is equivalent to omitting that argument.
cublasdx::conjugate#
struct cublasdx::conjugate;
Applies complex conjugation to each element. For real types the operation is a no-op (element is returned unchanged). For complex types it negates the imaginary part: \(a + bi \rightarrow a - bi\).
Note
Both functors are stateless. Only stateless functors are accepted by execute(...).