Pipelined GEMM Using cuBLASDx Pipelines#


[Figure: pipelined execution]

With the release of cuBLASDx 0.5.0, a pipelining extension has been added to the library, allowing users to combine multiple per-tile operations (the ones exposed by regular BLAS descriptors) into asynchronous pipelines that expose advanced CUDA features such as TMA, Hopper WGMMA, Blackwell 1SM UTCMMA, Blackwell TMEM, and others.

The pipelining interface emphasizes fusion opportunities by raising the level at which cuBLASDx operates to global memory, allowing users to wrap their device GEMM into a single fusable and optimized function call.

Important

The cuBLASDx Pipelining API is an interface that allows users to apply cuBLASDx routines to efficiently decompose and compute large global-memory GEMM problems.

This section is based on the introduction_pipeline.cu example included with cuBLASDx. See the Examples section for additional cuBLASDx samples.

To measure device GEMM performance and compare it against a reference implementation, see the device_gemm_performance.cu example included with cuBLASDx.

Pipelining#

Regular MathDx operations are performed at the shared memory tile level, assuming the inputs are already safe to use, so the typical flow looks like this:

  1. Load data from shared into registers

  2. Perform computations using shared memory as scratchpad (if needed)

  3. Store results back to shared memory

This allows the libraries to exploit out-of-order execution only at the shared memory --> registers level, leaving the remaining operations to the users.
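
For reference, a minimal sketch of this classical per-tile flow is shown below. It assumes the shared-memory tensors smem_a, smem_b, smem_c and the corresponding global tiles gmem_a_tile, gmem_b_tile, gmem_c_tile were created by the caller, as in the non-pipelined cuBLASDx examples:

// Classical (non-pipelined) per-tile GEMM: the user scripts every
// global <-> shared transfer and synchronization manually.
using alignment = cublasdx::alignment_of<BLAS>;
cublasdx::copy<BLAS, alignment::a>(gmem_a_tile, smem_a);  // global -> shared
cublasdx::copy<BLAS, alignment::b>(gmem_b_tile, smem_b);
cublasdx::copy_wait();                                    // wait for the asynchronous copies

BLAS().execute(alpha, smem_a, smem_b, beta, smem_c);      // steps 1-3 above, on the tile
__syncthreads();                                          // make smem_c visible to all threads

cublasdx::copy<BLAS, alignment::c>(smem_c, gmem_c_tile);  // shared -> global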

The main pillar of pipelining is overlapping N stages of data loading with computation performed on other units, thus maximizing the number of bytes in flight. An advanced extension of pipelining is the producer-consumer model, in which some threads are responsible for loading data into buffers and others for computing results from the loaded elements.
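
Conceptually, an N-stage pipeline behaves like the loop below. This is illustrative pseudocode only, not cuBLASDx API: async_load, wait_for, mma, and tile are hypothetical helpers.

// Keep up to `depth` tile loads in flight while earlier stages are consumed.
for (int s = 0; s < depth; ++s)                    // prologue: start `depth` asynchronous loads
    async_load(stage_buffer[s], tile(s));

for (int k = 0; k < num_stages; ++k) {
    wait_for(stage_buffer[k % depth]);             // block until stage k has arrived
    mma(accumulator, stage_buffer[k % depth]);     // consume it
    if (k + depth < num_stages)                    // refill the slot that was just freed
        async_load(stage_buffer[k % depth], tile(k + depth));
}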

cuBLASDx 0.2.0 introduced cublasdx::copy, an asynchronous copy mechanism allowing limited overlap of per-tile computations with memory loads for the next tile, enabling fuller utilization of the available hardware resources. Although that release provided building blocks for asynchronous execution, it left construction of the entire scaffold to the user and did not leave room for more sophisticated synchronization schemes, such as mbarriers, thus blocking the use of the Tensor Memory Accelerator (TMA).

The cuBLASDx 0.5.0 Pipelining API exposes:

  • the same interface for performing optimized GEMM on all architectures,

  • arbitrary-depth scaffolding for performing pipelined Generalized Matrix Multiplication (GEMM),

  • an auto-vectorizing copy mechanism that chooses between TMA / LDGSTS / LDG+STS based on the provided layouts and architecture,

  • synchronization-dependent MMA instructions, available only in pipelined mode (WGMMA / 1SM UTCMMA), and

  • an mbarrier-based synchronization mechanism that internally chooses between warp-specialized and unified execution paths.

Quick Start Guide#

Pipeline creation#

To use the pipelining API, two steps are necessary:

  1. On host, create your BLAS description for the per-tile operation.
    1. Add the WithPipeline operator to inform the library that this tile will be used in a pipelined context.

    2. (performance) If you won’t be using any per-element input preprocessing, add the EnableInputStreaming operator.

    3. (performance) If you won’t be using more threads than specified in the BlockDim operator, add the StaticBlockDim operator.

#include <cublasdx.hpp>
using namespace cublasdx;

using GEMM = decltype(Size<128 /* m */, 128 /* n */, 32 /* k */>()
                      + Precision<__half, __half, float>()
                      + Type<type::real>()
                      + Function<function::MM>()
                      + Arrangement<cublasdx::row_major /* A */, cublasdx::col_major /* B */>()
                      + Alignment<16, 16, 16>()
                      + SM<890>()
                      + WithPipeline()
                      + EnableInputStreaming()
                      + StaticBlockDim());
  2. On host, use the description from step 1 to create a device pipeline object.
    1. Create global memory tensors for the full A and B matrices.

    2. Pass them as arguments to the suggest_device_pipeline function.

    3. Verify that the std::optional-wrapped return value contains a valid pipeline.

#include <cublasdx.hpp>
using namespace cublasdx;

using BLAS = decltype(... + WithPipeline());

// Create global descriptors for full A and B matrices (not only tiles)
auto global_a = cublasdx::make_gmem_tensor<cublasdx::row_major>(a_device_pointer, m, k, lda);
auto global_b = cublasdx::make_gmem_tensor<cublasdx::col_major>(b_device_pointer, k, n, ldb);

constexpr int pipeline_depth = 4;
auto opt_device_pipeline = cublasdx::suggest_device_pipeline<pipeline_depth, BLAS>(global_a, global_b);

if (not opt_device_pipeline) {
    std::cout << "Incorrect pipeline configuration, please ensure global tensors are divisible by tile"
              << std::endl;
    exit(1);
}

auto device_pipeline = opt_device_pipeline.value();

Requirements on pipeline creation:

  1. Correct problem definition: the A and B matrices must be compatible, i.e. have the same K dimension.

  2. Global problem divisibility: the global problem size must be divisible by the descriptor-specified tile size (this limitation might be removed in the future).

  3. The number of pipeline stages (pipeline depth) must be less than or equal to the total number of stages, where num_stages == global_k / tile_k.
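
For example (illustrative numbers only): with the Size<128, 128, 32> tile above and a global problem of m = 1024, n = 1024, k = 4096, requirement 2 holds (1024 is divisible by 128 and 4096 by 32), and num_stages = 4096 / 32 = 128, so any pipeline depth up to 128 (such as the depth of 4 used earlier) satisfies requirement 3.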


Attention

Why are pipelines created on host?

Some elements of the CUDA SDK require driver calls, one of them being the creation of Tensor Memory Accelerator (TMA) descriptors. To expose TMA as one of the automatically chosen options underneath, cuBLASDx must adhere to this limitation.

Passing pipeline to kernels#

Device pipelines, after being successfully created, expose a series of properties that allow a kernel launch to be configured efficiently.
  • buffer_alignment() describes the alignment of a single shared memory buffer owned by the pipeline.

  • buffer_size() describes the size of a single shared memory buffer owned by the pipeline.

  • get_block_dim() describes the required block configuration to run this pipeline.

Example of a kernel execution configuration:

// ... pipeline creation
auto device_pipeline = opt_device_pipeline.value();

auto shared_memory_size = cublasdx::make_shared_storage_calculator()
                          .add(device_pipeline.buffer_alignment(), device_pipeline.buffer_size())
                          .get();

auto block_dim = device_pipeline.get_block_dim();

kernel<<<grid_dim, block_dim, shared_memory_size>>>(device_pipeline, ...);

Taking pipeline as argument#

A device pipeline (the one created on host) can be passed only to __global__ functions (not __device__ functions), and the parameter must be annotated with __grid_constant__.

Additionally, the ::max_threads_per_block trait can be used to obtain a value for limiting the launch bounds.

template<class BLAS, class Alpha, class Beta, class CTensor, class DevicePipeline>
__launch_bounds__(DevicePipeline::max_threads_per_block, 1) __global__
    void gemm_kernel(Alpha const                            alpha,
                     Beta const                             beta,
                     CTensor                                global_c,
                     __grid_constant__ DevicePipeline const device_pipeline) {
    // kernel body: see the following sections
}

Initializing tile pipeline in kernels#

The device_pipeline object is meant to remain a kernel argument only and must never be copied to on-chip memory. Instead, it needs to be converted into a per-block tile_pipeline object by indexing:

extern __shared__ __align__(device_pipeline.buffer_alignment()) char smem[];
auto tile_pipeline = device_pipeline.get_tile(smem, blockIdx.x, blockIdx.y);

Alternatively, smem can be sliced to make room for something other than the pipeline itself:

extern __shared__ __align__(device_pipeline.buffer_alignment()) char smem[];
auto [smem_pipeline, smem_c_tensor] =
    cublasdx::shared_memory::slice<char, CType>(
        smem,
        device_pipeline.buffer_alignment(),
        device_pipeline.buffer_size(),
        cublasdx::alignment_of_v_c<BLAS>,
        BLAS::suggest_layout_smem_c());

auto tile_pipeline = device_pipeline.get_tile(smem_pipeline, blockIdx.x, blockIdx.y);

How to partition a global tensor into tiles?

A general device GEMM looks like the following, where the total global GEMM dimensions are M x N x K and the per-block tile dimensions are tile_m x tile_n x tile_k:


[Figure: device GEMM tiling]

The C matrix drives the problem partitioning the most, as it is the reduction output tensor: its M x N extent is divided into independent tile_m x tile_n sub-problems, each of which is solved by an independent CUDA block. These sub-problems could be addressed in the following manner:


[Figure: division of the C matrix into tiles]

The simplest way to perform this partitioning is to launch a CUDA grid of size:

dim3 grid_dim = dim3 {m / tile_m, n / tile_n, 1};

This way there is exactly one block per sub-problem, and the block index is the same as the sub-problem (tile) index. There are other, more advanced, software ways of scheduling work, such as threadblock swizzling or persistent stream-K scheduling.
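
For instance (illustrative numbers), with the 128 x 128 output tile used earlier and a 1024 x 1024 C matrix, the launch configuration would be:

// 1024 / 128 = 8 tiles along each output dimension -> an 8 x 8 x 1 grid,
// where block (blockIdx.x, blockIdx.y) owns output tile (blockIdx.x, blockIdx.y).
dim3 grid_dim = dim3 {1024 / 128, 1024 / 128, 1};  // == dim3 {8, 8, 1}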

Resetting a pipeline#

If persistent execution is chosen, the pipeline needs to reset its state and start executing on a new output tile. This is possible with a call to device_pipeline.reset_tile().

device_pipeline.reset_tile(tile_pipeline,
                           new_tile_idx_row,
                           new_tile_idx_col);

Warning

The tile_pipeline object has a destructive constructor and destructor, which can be executed only once per kernel execution. It is important to avoid creating a new tile_pipeline in every loop iteration and to use reset_tile instead.
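
A minimal sketch of such a persistent loop is shown below, assuming a hypothetical next_tile() scheduler that returns the next output-tile coordinates (or an empty optional when the block has no more work); the tile_pipeline is created once and only reset between tiles:

// Create the per-block pipeline once, for the first assigned output tile.
auto tile_pipeline = device_pipeline.get_tile(smem, first_tile_row, first_tile_col);

while (true) {
    tile_pipeline.execute(epilogue_functor);              // compute the current output tile

    auto next = next_tile();                              // hypothetical scheduler
    if (!next) { break; }
    // Reuse the same tile_pipeline instead of constructing a new one.
    device_pipeline.reset_tile(tile_pipeline, next->row, next->col);
}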

Executing Pipelined GEMM#

Depending on the chosen accumulation mode, there are two ways of executing a pipelined GEMM:

  1. Execute with internal accumulation (pipeline owns the accumulator).

auto tile_pipeline = device_pipeline.get_tile(smem, blockIdx.x, blockIdx.y);
auto tile_gmem_c   = cublasdx::get_tile(global_c, BLAS::c_shape, blockIdx.x, blockIdx.y);

// Prepare epilogue functor
auto epilogue_functor = [&](auto& accumulator) {
    auto d_fragment = accumulator.make_partition_and_copy(tile_gmem_c);
    cublasdx::axpby(alpha, accumulator.get_results(), beta, d_fragment);
    accumulator.partition_and_copy(d_fragment, tile_gmem_c);
};

tile_pipeline.execute(epilogue_functor);

In this interface, the results of the GEMM are accessed via an epilogue functor, which receives the accumulator as its only argument.

What is available in the epilogue functor?

The epilogue functor always:

  • operates on the number of threads specified in the BlockDim operator,

  • can be synchronized using tile_pipeline.epilogue_sync().

This is also true when the heuristic_blocksize mode of pipeline execution is used. The synchronization operation has the same memory guarantees as a regular __syncthreads() (bar.sync).
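
A sketch of where epilogue_sync() fits, assuming (purely for illustration) that results are staged through a shared-memory C tile such as smem_c_tensor from the slicing example above:

auto epilogue_functor = [&](auto& accumulator) {
    // Each thread first post-processes its own register fragment of the results.
    auto results = accumulator.get_results();
    // ... write the fragments to smem_c_tensor ...
    tile_pipeline.epilogue_sync();   // same memory guarantees as __syncthreads() (bar.sync)
    // ... cooperative, vectorized store from smem_c_tensor to global memory ...
};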

  2. Execute with external accumulation (user owns the accumulator).

auto tile_pipeline = device_pipeline.get_tile(smem, blockIdx.x, blockIdx.y);
auto tile_gmem_c   = cublasdx::get_tile(global_c, BLAS::c_shape, blockIdx.x, blockIdx.y);

auto accumulator = tile_pipeline.get_accumulator();

tile_pipeline.execute(accumulator);

// Inlined epilogue
auto d_fragment = accumulator.make_partition_and_copy(tile_gmem_c);
cublasdx::axpby(alpha, accumulator.get_results(), beta, d_fragment);
accumulator.partition_and_copy(d_fragment, tile_gmem_c);

In this interface, the results of the GEMM are accessed directly through the user-owned accumulator after execute() returns, so the epilogue can be inlined after the call.

Advanced pipeline configuration options#

Heuristic-based pipeline depth choice#

The pipeline depth is an argument that can either be carefully chosen and specified explicitly, or left to the library, which will try to fill as much shared memory as possible by choosing the architecturally maximal depth.

// Explicit choice version
constexpr int pipeline_depth = 4;
auto opt_device_pipeline = cublasdx::suggest_device_pipeline<pipeline_depth, BLAS>(global_a, global_b);

// Heuristic version
auto opt_device_pipeline = cublasdx::suggest_device_pipeline<BLAS>(global_a, global_b);

Result storage options#

This advanced option controls the ownership of the pipeline results. They can either be owned by the pipeline and exposed only as a lambda argument:

// Pipeline owns the accumulator
auto opt_device_pipeline = cublasdx::suggest_device_pipeline<BLAS, cublasdx::internal_accumulation>(global_a, global_b);

// ... later in device code
auto epilogue_functor = [](auto const& accumulator) {
    // some epilogue
};

tile_pipeline.execute(epilogue_functor);

Or alternatively, if more flexibility is required, they can be owned by the user, which may lead to performance regressions on some architectures:

// User owns the accumulator

// host code
auto opt_device_pipeline = cublasdx::suggest_device_pipeline<BLAS, cublasdx::external_accumulation>(global_a, global_b);

// ... later in device code
auto accumulator = tile_pipeline.get_accumulator();
tile_pipeline.execute(accumulator);
some_epilogue(accumulator.get_results());

Warning

Why can external accumulation cause performance regressions?

cuBLASDx internally applies register trading in warp-specialized kernels, allowing some threads to own more registers than others. This optimization is disabled when the user owns the results.

Blocksize strategy options#

As already mentioned, by default the cuBLASDx pipeline requires the user to use the device_pipeline.get_block_dim() value as the kernel launch configuration. This allows the internal implementation to use optimizations such as warp specialization, adding extra threads that act only as producers.

Alternatively, as an advanced option, the user can specify cublasdx::fixed_blocksize, which forces the pipeline to use the same block-dimension rules as the BlockDim operator. This choice provides more control and flexibility, but disables the warp-specialized kernel paths.

// Default - heuristic
auto opt_device_pipeline = cublasdx::suggest_device_pipeline<BLAS, cublasdx::internal_accumulation, cublasdx::heuristic_blocksize>(global_a, global_b);

// Fixed - explicit user choice
auto opt_device_pipeline = cublasdx::suggest_device_pipeline<BLAS, cublasdx::internal_accumulation, cublasdx::fixed_blocksize>(global_a, global_b);