TRSM Execution Methods#

Execution methods run the BLAS function defined by the user with cuBLASDx operators.

TRSM Block Execute Methods#

The TRSM block execute method is available when the descriptor has been constructed with Function<function::TRSM> and the Block Operator, and the is_complete_blas_execution Trait is true.

Block-level TRSM cooperatively solves the triangular system in-place: B is overwritten with the solution X. There are no alpha / beta scalars.
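In BLAS notation, the system solved depends on the configured Side; a sketch, where op(A) denotes A or its (conjugate) transpose when a transpose mode is configured:

```latex
% Left side:  A is M x M, B is M x N
\mathrm{op}(A)\, X = B
% Right side: A is N x N, B is M x N
X\, \mathrm{op}(A) = B
```

In both cases B is overwritten with the solution X.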

using TRSM = decltype(cublasdx::Size<M, N>()
                      + cublasdx::Function<cublasdx::function::TRSM>()
                      + cublasdx::Side<cublasdx::side::left>()
                      + cublasdx::FillMode<cublasdx::fill_mode::lower>()
                      + cublasdx::Diag<cublasdx::diag::non_unit>()
                      + ...);

// #1 - Tensor API
template<class AEngine, class ALayout,
         class BEngine, class BLayout>
inline __device__ void execute(const cublasdx::tensor<AEngine, ALayout>& tensor_a,
                               cublasdx::tensor<BEngine, BLayout>&       tensor_b) const;

// #2 - Pointer API (default leading dimensions)
template<class TA, class TB>
inline __device__ void execute(TA const* matrix_a,
                               TB*       matrix_b) const;

// #3 - Pointer API (explicit leading dimensions)
template<class TA, class TB>
inline __device__ void execute(TA const* matrix_a, unsigned int lda,
                               TB*       matrix_b, unsigned int ldb) const;

Method #1 accepts cublasdx::tensor objects for A (read-only) and B (overwritten in-place with X). Any CuTe-compatible layout may be used; the tensors must have the correct shape for the configured Size and Side (A is M×M for the left side or N×N for the right side; B is M×N). BLAS::get_layout_smem_a() and BLAS::get_layout_smem_b() provide convenient defaults for contiguous shared-memory allocation.

Methods #2 and #3 accept raw shared memory pointers. Method #3 allows providing explicit leading dimensions at runtime.
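As an illustration of method #3, the sketch below passes padded leading dimensions for the 32×16 left-side float TRSM used in the example below; the one-element padding and the manual shared-memory offsets are illustrative assumptions, not something the API requires:

```cuda
// Sketch: pointer API with explicit leading dimensions (method #3).
extern __shared__ __align__(16) cublasdx::byte smem[];

constexpr unsigned int lda = 32 + 1; // A is 32x32; pad each column by one element
constexpr unsigned int ldb = 32 + 1; // B is 32x16; same padding

auto* smem_a = reinterpret_cast<float*>(smem);
auto* smem_b = smem_a + lda * 32;    // B storage starts after A's 32 padded columns

// ... load A and B into shared memory ...

TRSM{}.execute(smem_a, lda, smem_b, ldb);
__syncthreads();
```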

After execute(...) returns, __syncthreads() is required before reading the result from B.

Example

using TRSM = decltype(cublasdx::Size<32, 16>()
                      + cublasdx::Precision<float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::TRSM>()
                      + cublasdx::Side<cublasdx::side::left>()
                      + cublasdx::FillMode<cublasdx::fill_mode::lower>()
                      + cublasdx::Diag<cublasdx::diag::non_unit>()
                      + cublasdx::SM<900>()
                      + cublasdx::Block());

extern __shared__ __align__(16) cublasdx::byte smem[];

using alignment = cublasdx::alignment_of<TRSM>;

auto batch_a = cublasdx::get_batch(global_a, TRSM::get_layout_gmem_a(), blockIdx.x);
auto batch_b = cublasdx::get_batch(global_b, TRSM::get_layout_gmem_b(), blockIdx.x);

auto [smem_a, smem_b] = cublasdx::shared_memory::slice<float, float>(
    smem,
    alignment::a, TRSM::get_layout_smem_a(),
    alignment::b, TRSM::get_layout_smem_b());

cublasdx::copy<TRSM, alignment::a>(batch_a, smem_a);
cublasdx::copy<TRSM, alignment::b>(batch_b, smem_b);
cublasdx::copy_wait();

TRSM{}.execute(smem_a, smem_b);
__syncthreads();

cublasdx::copy<TRSM, alignment::b>(smem_b, batch_b);

Please refer to example 17 in Examples for a complete example.
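On the host side, the kernel wrapping the snippet above can be launched with the block dimensions and shared-memory size suggested by the descriptor's execution traits; a sketch, in which the kernel name trsm_kernel and the num_batches grid size are assumptions:

```cuda
// Host-side launch sketch using cuBLASDx execution traits.
const auto shared_size = cublasdx::get_shared_storage_size<TRSM>();

// Opt in to more dynamic shared memory than the default limit, if needed.
cudaFuncSetAttribute(trsm_kernel<TRSM>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     shared_size);

trsm_kernel<TRSM><<<num_batches, TRSM::block_dim, shared_size>>>(global_a, global_b);
```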

TRSM Thread Execute Methods#

The TRSM thread execute method is available when the descriptor has been constructed with Function<function::TRSM> and the Thread Operator, and the is_complete_blas_execution Trait is true.

Thread-level TRSM independently solves the triangular system in-place: B is overwritten with the solution X. There are no alpha / beta scalars.

using TRSM = decltype(cublasdx::Size<M, N>()
                      + cublasdx::Function<cublasdx::function::TRSM>()
                      + cublasdx::Side<cublasdx::side::left>()
                      + cublasdx::FillMode<cublasdx::fill_mode::lower>()
                      + cublasdx::Diag<cublasdx::diag::non_unit>()
                      + cublasdx::Thread()
                      + ...);

// #1 - Tensor API
template<class AEngine, class ALayout,
         class BEngine, class BLayout>
inline __device__ void execute(const cublasdx::tensor<AEngine, ALayout>& tensor_a,
                               cublasdx::tensor<BEngine, BLayout>&       tensor_b) const;

// #2 - Pointer API (default leading dimensions)
template<class TA, class TB>
inline __device__ void execute(TA const* matrix_a,
                               TB*       matrix_b) const;

// #3 - Pointer API (explicit leading dimensions)
template<class TA, class TB>
inline __device__ void execute(TA const* matrix_a, unsigned int lda,
                               TB*       matrix_b, unsigned int ldb) const;

As with the TRSM Block Execute Methods, method #1 accepts cublasdx::tensor objects for A (read-only) and B (overwritten in-place with X). Any CuTe-compatible layout may be used; the tensors must have the correct shape for the configured Size and Side (A is M×M for the left side or N×N for the right side; B is M×N).

Methods #2 and #3 accept raw pointers. Method #3 allows providing explicit leading dimensions at runtime.

Example

using TRSM = decltype(cublasdx::Size<32, 16>()
                      + cublasdx::Precision<float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::TRSM>()
                      + cublasdx::Side<cublasdx::side::left>()
                      + cublasdx::FillMode<cublasdx::fill_mode::lower>()
                      + cublasdx::Diag<cublasdx::diag::non_unit>()
                      + cublasdx::SM<900>()
                      + cublasdx::Thread());

auto batch_a = cublasdx::get_batch(global_a, TRSM::get_layout_gmem_a(), threadIdx.x);
auto batch_b = cublasdx::get_batch(global_b, TRSM::get_layout_gmem_b(), threadIdx.x);
TRSM{}.execute(batch_a, batch_b);

Please refer to example 17 in Examples for a complete example.


Warning

It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs but with different block dimensions (BlockDim) will produce bit-identical results.

Warning

It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs on GPUs of different CUDA architectures will produce bit-identical results.