TRSM Execution Methods#

Execution methods run the BLAS function defined by the user with cuBLASDx operators.

TRSM Block Execute Methods#

The TRSM block execute method is available when the descriptor has been constructed with Function<function::TRSM> and the Block Operator, and the is_complete_blas_execution Trait is true.

Block-level TRSM cooperatively solves the triangular system in-place: B is overwritten with the solution X. There are no alpha / beta scalars.
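In BLAS notation, the system solved depends on the configured Side; a sketch, where op(A) denotes A or its (conjugate) transpose when a transpose mode is configured:

```latex
% Left side:  A is M x M, B is M x N
\mathrm{op}(A)\, X = B
% Right side: A is N x N, B is M x N
X\, \mathrm{op}(A) = B
```

In both cases B is overwritten with the solution X.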

using TRSM = decltype(cublasdx::Size<M, N>()
                      + cublasdx::Function<cublasdx::function::TRSM>()
                      + cublasdx::Side<cublasdx::side::left>()
                      + cublasdx::FillMode<cublasdx::fill_mode::lower>()
                      + cublasdx::Diag<cublasdx::diag::non_unit>()
                      + ...);

// #1 - Tensor API
template<class AEngine, class ALayout,
         class BEngine, class BLayout>
inline __device__ void execute(const cublasdx::tensor<AEngine, ALayout>& tensor_a,
                               cublasdx::tensor<BEngine, BLayout>&       tensor_b) const;

// #2 - Pointer API (default leading dimensions)
template<class TA, class TB>
inline __device__ void execute(TA const* matrix_a,
                               TB*       matrix_b) const;

// #3 - Pointer API (explicit leading dimensions)
template<class TA, class TB>
inline __device__ void execute(TA const* matrix_a, unsigned int lda,
                               TB*       matrix_b, unsigned int ldb) const;

Method #1 accepts cublasdx::tensor objects for A (read-only) and B (overwritten in-place with X). Any CuTe-compatible layout may be used; the tensors must have the correct shape for the configured Size and Side (A is M×M for the left side or N×N for the right side; B is M×N). BLAS::get_layout_smem_a() and BLAS::get_layout_smem_b() provide convenient defaults for contiguous shared-memory allocation.

Methods #2 and #3 accept raw shared memory pointers. Method #3 allows providing explicit leading dimensions at runtime.
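As an illustration of method #3, the sketch below passes padded leading dimensions for the 32×16 left-side float TRSM used in the example below; the one-element padding and the manual shared-memory offsets are illustrative assumptions, not something the API requires:

```cuda
// Sketch: pointer API with explicit leading dimensions (method #3).
extern __shared__ __align__(16) cublasdx::byte smem[];

constexpr unsigned int lda = 32 + 1; // A is 32x32; pad each column by one element
constexpr unsigned int ldb = 32 + 1; // B is 32x16; same padding

auto* smem_a = reinterpret_cast<float*>(smem);
auto* smem_b = smem_a + lda * 32;    // B storage starts after A's 32 padded columns

// ... load A and B into shared memory ...

TRSM{}.execute(smem_a, lda, smem_b, ldb);
__syncthreads();
```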

After execute(...) returns, __syncthreads() is required before reading the result from B.

Example

using TRSM = decltype(cublasdx::Size<32, 16>()
                      + cublasdx::Precision<float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::TRSM>()
                      + cublasdx::Side<cublasdx::side::left>()
                      + cublasdx::FillMode<cublasdx::fill_mode::lower>()
                      + cublasdx::Diag<cublasdx::diag::non_unit>()
                      + cublasdx::SM<900>()
                      + cublasdx::Block());

extern __shared__ __align__(16) cublasdx::byte smem[];

using alignment = cublasdx::alignment_of<TRSM>;

auto batch_a = cublasdx::get_batch(global_a, TRSM::get_layout_gmem_a(), blockIdx.x);
auto batch_b = cublasdx::get_batch(global_b, TRSM::get_layout_gmem_b(), blockIdx.x);

auto [smem_a, smem_b] = cublasdx::shared_memory::slice<float, float>(
    smem,
    alignment::a, TRSM::get_layout_smem_a(),
    alignment::b, TRSM::get_layout_smem_b());

cublasdx::copy<TRSM, alignment::a>(batch_a, smem_a);
cublasdx::copy<TRSM, alignment::b>(batch_b, smem_b);
cublasdx::copy_wait();

TRSM{}.execute(smem_a, smem_b);
__syncthreads();

cublasdx::copy<TRSM, alignment::b>(smem_b, batch_b);

Please refer to example 17 in Examples for a complete example.
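On the host side, the kernel wrapping the snippet above can be launched with the block dimensions and shared-memory size suggested by the descriptor's execution traits; a sketch, in which the kernel name trsm_kernel and the num_batches grid size are assumptions:

```cuda
// Host-side launch sketch using cuBLASDx execution traits.
const auto shared_size = cublasdx::get_shared_storage_size<TRSM>();

// Opt in to more dynamic shared memory than the default limit, if needed.
cudaFuncSetAttribute(trsm_kernel<TRSM>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     shared_size);

trsm_kernel<TRSM><<<num_batches, TRSM::block_dim, shared_size>>>(global_a, global_b);
```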

TRSM Thread Execute Methods#

The TRSM thread execute method is available when the descriptor has been constructed with Function<function::TRSM> and the Thread Operator, and the is_complete_blas_execution Trait is true.

Thread-level TRSM independently solves the triangular system in-place: B is overwritten with the solution X. There are no alpha / beta scalars.

using TRSM = decltype(cublasdx::Size<M, N>()
                      + cublasdx::Function<cublasdx::function::TRSM>()
                      + cublasdx::Side<cublasdx::side::left>()
                      + cublasdx::FillMode<cublasdx::fill_mode::lower>()
                      + cublasdx::Diag<cublasdx::diag::non_unit>()
                      + cublasdx::Thread()
                      + ...);

// #1 - Tensor API
template<class AEngine, class ALayout,
         class BEngine, class BLayout>
inline __device__ void execute(const cublasdx::tensor<AEngine, ALayout>& tensor_a,
                               cublasdx::tensor<BEngine, BLayout>&       tensor_b) const;

// #2 - Pointer API (default leading dimensions)
template<class TA, class TB>
inline __device__ void execute(TA const* matrix_a,
                               TB*       matrix_b) const;

// #3 - Pointer API (explicit leading dimensions)
template<class TA, class TB>
inline __device__ void execute(TA const* matrix_a, unsigned int lda,
                               TB*       matrix_b, unsigned int ldb) const;

As with the TRSM Block Execute Methods, method #1 accepts cublasdx::tensor objects for A (read-only) and B (overwritten in-place with X). Any CuTe-compatible layout may be used; the tensors must have the correct shape for the configured Size and Side (A is M×M for the left side or N×N for the right side; B is M×N).

Methods #2 and #3 accept raw pointers. Method #3 allows providing explicit leading dimensions at runtime.

Example

using TRSM = decltype(cublasdx::Size<32, 16>()
                      + cublasdx::Precision<float>()
                      + cublasdx::Type<cublasdx::type::real>()
                      + cublasdx::Function<cublasdx::function::TRSM>()
                      + cublasdx::Side<cublasdx::side::left>()
                      + cublasdx::FillMode<cublasdx::fill_mode::lower>()
                      + cublasdx::Diag<cublasdx::diag::non_unit>()
                      + cublasdx::SM<900>()
                      + cublasdx::Thread());

auto batch_a = cublasdx::get_batch(global_a, TRSM::get_layout_gmem_a(), threadIdx.x);
auto batch_b = cublasdx::get_batch(global_b, TRSM::get_layout_gmem_b(), threadIdx.x);
TRSM{}.execute(batch_a, batch_b);

Please refer to example 17 in Examples for a complete example.


Warning

It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs but with different block dimensions (BlockDim) will produce bit-identical results.

Warning

It is not guaranteed that executions of exactly the same BLAS function with exactly the same inputs on GPUs of different CUDA architectures will produce bit-identical results.