Tensors#

cuBLASDx Tensor#

cuBLASDx exposes the cublasdx::tensor class, which is an alias of the cute::Tensor class from the CuTe library (see CUTLASS).

Tensor Creation#

// #1: Wrapper of cute::make_tensor
template<class Iterator, class... Args>
__device__ __host__
constexpr cublasdx::tensor make_tensor(const Iterator& iter, Args const&... args);

// #2: With a pointer layout returned by the get_<smem/gmem>_layout_<a/b/c> or suggest_layout_smem_<a/b/c> methods of the BLAS description.
template<class T, class PointerLayout>
__device__ __host__
constexpr cublasdx::tensor make_tensor(T* ptr, const PointerLayout& pl);

cublasdx::make_tensor is a helper function for creating cublasdx::tensor objects.

There are two variants. The first one is simply a wrapper of cute::make_tensor(…), which usually requires manually tagging the raw pointer with its memory space. The second one works together with the Get Memory Layout and the Suggested Shared Memory Layout methods: it creates a global or shared memory tensor with the returned pointer layout. In contrast to the first variant, it picks up the memory space information from the pointer layout and tags the raw pointer accordingly.

Example

using BLAS = decltype(Size<128, 128, 128>() + Type<type::real>() + Precision<float, float, double>() + Block() + ...);

// tensor with global memory data
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a());

// tensor with shared memory data (default layout)
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a());
// tensor with shared memory data (suggested layout)
auto a_suggested_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());
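
For reference, the first variant can be combined with CuTe's own helpers to tag the pointer and build a layout by hand. A minimal sketch, assuming smem_a points to shared memory and A is a 128 × 128 column-major matrix:

// Variant #1 sketch: tag the raw pointer with its memory space and pass a hand-built CuTe layout
auto a_smem_ptr      = cute::make_smem_ptr(smem_a);
auto a_manual_layout = cute::make_layout(cute::make_shape(cute::Int<128>{}, cute::Int<128>{}));
auto a_manual_tensor = cublasdx::make_tensor(a_smem_ptr, a_manual_layout);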

Partitioner And Register Fragment Tensors#

cuBLASDx 0.3.0 introduces 2 new memory objects: register fragment tensors and partitioners.

Since register memory is private to a thread, it is not used to store entire tensors, only small parts of them. A division of a global / shared memory tensor among the threads taking part in a BLAS operation is called a partitioning. In cuBLASDx, the BLAS operation is the source of the partitioning pattern.

In every BLAS execution each thread is assigned some elements of the result matrix C which it computes. This assignment depends on 3 things:

  1. number of threads,

  2. chosen MMA instruction, and

  3. tiling chosen for the MMA instruction.

The user of cuBLASDx has direct control over only 1 of these 3 options, hence the partitioning information must be treated as opaque and used only with the BLAS object which defined it. cuBLASDx exposes this opaque information by means of a partitioner.

Partitioner#

Partitioner objects can be obtained from the BLAS type in 2 ways:

// #1a As pre-requested default partitioner
auto partitioner = BLAS::get_partitioner();

// #1b As pre-requested suggested partitioner
auto partitioner = BLAS::suggest_partitioner();

// #2 As result of register API without accumulator call
auto [c_register_fragment, partitioner] = BLAS().execute(a_shared_tensor, b_shared_tensor);

Note

Options #1a and #1b require the user to choose the version coherent with the layouts used: the suggested partitioner only with suggested layouts, and the default partitioner with any other layouts. Mixing them may result in subpar performance.
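
For example, a minimal sketch of coherent pairings (assuming smem_a and smem_b are shared memory pointers, as in the other examples):

// Suggested partitioner together with the suggested shared memory layouts
auto a_suggested_tensor    = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());
auto b_suggested_tensor    = cublasdx::make_tensor(smem_b, BLAS::suggest_layout_smem_b());
auto suggested_partitioner = BLAS::suggest_partitioner();

// Default partitioner together with the default shared memory layouts
auto a_default_tensor    = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a());
auto b_default_tensor    = cublasdx::make_tensor(smem_b, BLAS::get_layout_smem_b());
auto default_partitioner = BLAS::get_partitioner();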

The partitioner is an opaque object exposing the following API:

__device__ bool is_predicated();
__device__ bool is_thread_active();
__device__ constexpr auto make_accumulator_fragment();

template<class CTensor>
__forceinline__ __device__
auto partition_like_C(CTensor && ctensor) const;

template<class ... Coords>
__forceinline__ __device__
auto map_fragment_index(Coords&& ... coords) const;

template<class ... Coords>
__forceinline__ __device__
bool is_index_in_bounds(Coords&& ... coords) const;

There are 3 methods that can be used to get basic information about the partitioning pattern, or retrieve a register fragment.

// Partitioning properties
__device__ bool is_predicated();
__device__ bool is_thread_active();

// Accumulator creation, creates a register cublasdx::tensor
__device__ constexpr auto make_accumulator_fragment();

As mentioned in Partitioner And Register Fragment Tensors, the division of elements among threads happens by tiling an MMA instruction over the problem size defined by the BLAS description. Each instruction is responsible for computing a certain shape, and it’s possible that the problem shape is not divisible by the primitive instruction shape (for example, a 100 × 100 × 100 problem tiled with a 16 × 8 × 16 MMA instruction leaves a remainder in every dimension). In such cases the “extra” instruction elements are filled with 0s instead of being read from memory, and skipped when storing results. This is called element predication. The is_predicated() member function can be used to retrieve that information.

Because cuBLASDx supports execution within kernels whose CUDA thread block sizes do not match the BlockDim Operator, not all threads will always take part in a GEMM operation. This implies that some threads may not have any elements assigned to them. The is_thread_active() member function can be used to check whether this is the case for the calling thread.

Whether the returned bool value is known at compile time or only at runtime depends on the specific BLAS configuration used.

The final argumentless method, make_accumulator_fragment(), can be used to retrieve an uninitialized register fragment tensor corresponding to the BLAS execution from which the partitioner was obtained.

Please see the code below for an example using all 3 methods.

template<class CTensor>
__forceinline__ __device__
auto partition_like_C(CTensor && ctensor) const;

This method returns a non-owning view of the subtensor of its argument that is assigned to the calling thread, corresponding to its local register fragment.

All these methods can be used to manually copy data for an unpredicated GEMM as follows:

Partitioner example #1#

auto c_global_tensor = ...;

// Get partitioner
auto partitioner = BLAS::get_partitioner();
// Create local register fragment
auto c_register_fragment = partitioner.make_accumulator_fragment();
// Get view of this thread's global memory subtensor
auto c_global_partition = partitioner.partition_like_C(c_global_tensor);

// Ensure that all elements are in-bounds and
// no predication is necessary
static_assert(not partitioner.is_predicated());

// If this thread takes part in GEMM
if(partitioner.is_thread_active()) {
   // For each element of register fragment
   for(int i = 0; i < cublasdx::size(c_register_fragment); ++i) {
      // Copy respective global element into it
      c_register_fragment(i) = c_global_partition(i);
   }
}

// C += A * B
BLAS().execute(a_shared_tensor, b_shared_tensor, c_register_fragment);

If a GEMM is predicated, more information will be necessary:

template<class ... Coords>
__forceinline__ __device__
auto map_fragment_index(Coords&& ... coords) const;

template<class ... Coords>
__forceinline__ __device__
bool is_index_in_bounds(Coords&& ... coords) const;

These 2 functions extend the functionality of is_predicated(), allowing a local register fragment index to be mapped to its source (global or shared) tensor index, and checking whether that index is in bounds.

All these methods can be used to perform a bounds-checked load of elements from a global tensor into a register fragment:

Partitioner example #2#

auto c_global_tensor = ...;

// Get partitioner
auto partitioner = BLAS::get_partitioner();
// Create local register fragment
auto c_register_fragment = partitioner.make_accumulator_fragment();

// If this thread takes part in GEMM
if(partitioner.is_thread_active()) {
   // For each element of register fragment
   for(int i = 0; i < cublasdx::size(c_register_fragment); ++i) {
      auto global_index = partitioner.map_fragment_index(i);
      if((not partitioner.is_predicated()) or partitioner.is_index_in_bounds(i)) {
         // Copy respective global element into it
         c_register_fragment(i) = c_global_tensor(global_index);
      }
   }
}

// C += A * B
BLAS().execute(a_shared_tensor, b_shared_tensor, c_register_fragment);

It would be very tedious to have to do such loading manually every time, hence cuBLASDx offers optimized, auto-vectorizing global/shared ⟷ register fragment copying functions. See Copying Register Tensors for more information.

Imported Tensor Utilities#

Since cuBLASDx integrates the CuTe library, it exposes some of its tensor and layout functionality, sometimes adding compatibility layers to support its own types. Currently the following functions are supported:

// Aliased
using cute::clear;
using cute::transform;
using cute::make_fragment_like;

// With compatibility layers
cublasdx::size;   // (cute::size)
cublasdx::cosize; // (cute::cosize)
cublasdx::axpby;  // (cute::axpby)

cublasdx::clear takes a register fragment as input and sets all its values to 0.

auto partitioner = BLAS::get_partitioner();
auto c_register_fragment = partitioner.make_accumulator_fragment();

cublasdx::clear(c_register_fragment);

cublasdx::transform applies an elementwise functor to all elements of the passed register tensor.

auto c_global_tensor = ...

auto partitioner = BLAS::get_partitioner();
auto c_register_fragment = partitioner.make_accumulator_fragment();

cublasdx::copy_fragment<alignment_of<BLAS>::c>(c_global_tensor, c_register_fragment, partitioner);

// in-place
cublasdx::transform(c_register_fragment, [](auto v) { return 2 * v; });

// out-of-place
auto d_register_fragment = partitioner.make_accumulator_fragment();
cublasdx::transform(c_register_fragment, d_register_fragment, [](auto v) { return 2 * v; });

cublasdx::make_fragment_like can be used to create a register fragment with the same layout and type as its argument tensor.

auto c_global_tensor = ...

auto partitioner = BLAS::get_partitioner();
auto c_global_partition = partitioner.partition_like_C(c_global_tensor);

// Same type
auto c_register_fragment = cublasdx::make_fragment_like(c_global_partition);

// Other type
using new_type = double;
auto c_register_fragment_new_type = cublasdx::make_fragment_like<new_type>(c_global_partition);

Warning

The following functions should be used only in their cuBLASDx form, as it offers compatibility layers and automatic conversions.

cublasdx::axpby performs the operation \(\mathbf{D}_{m\times n} = {\alpha} \times \mathbf{C}_{m\times n} + {\beta} \times \mathbf{D}_{m\times n}\):

auto c_global_tensor = ...

auto a_shared_tensor = ...
auto b_shared_tensor = ...

auto partitioner = BLAS::get_partitioner();
auto c_register_fragment = partitioner.make_accumulator_fragment();
auto d_register_fragment = partitioner.make_accumulator_fragment();

// Load the current values of C into d and clear the accumulator c
cublasdx::copy_fragment<alignment_of<BLAS>::c>(c_global_tensor, d_register_fragment, partitioner);
cublasdx::clear(c_register_fragment);

// These 2 calls combined perform the classic GEMM: C = alpha * A * B + beta * C
BLAS().execute(a_shared_tensor, b_shared_tensor, c_register_fragment);  // c = A * B
cublasdx::axpby(alpha, c_register_fragment, beta, d_register_fragment); // d = alpha * c + beta * d

cublasdx::copy_fragment<alignment_of<BLAS>::c>(d_register_fragment, c_global_tensor, partitioner);

cublasdx::size returns the number of valid elements in a tensor. This is simply the product of all shape dimensions.

auto a_shared_layout = BLAS::get_layout_smem_a();

// Same as size_of<BLAS>::m * size_of<BLAS>::k
constexpr auto layout_size = cublasdx::size(a_shared_layout);

cublasdx::cosize returns the distance from the first element of a tensor to one past its last element, i.e. how many elements the argument layout spans. This is the same as cublasdx::size for layouts which are compact, i.e. have no “holes” (e.g. as a result of extra leading dimension elements).

// Default leading dimensions
using BLAS = decltype(... + Size<M, N, K>() + LeadingDimension<M, K, M>());
auto a_shared_layout = BLAS::get_layout_smem_a();

constexpr auto layout_size = cublasdx::size(a_shared_layout);
constexpr auto layout_cosize = cublasdx::cosize(a_shared_layout);

static_assert(layout_size == layout_cosize);

// Leading dimension with extra padding elements
constexpr auto lda = M + 1; // Add padding to avoid shared memory bank conflicts
using BLAS = decltype(... + Size<M, N, K>() + LeadingDimension<lda, K, M>());
auto a_shared_layout = BLAS::get_layout_smem_a();

constexpr auto layout_size = cublasdx::size(a_shared_layout);
constexpr auto layout_cosize = cublasdx::cosize(a_shared_layout);

static_assert(layout_size != layout_cosize);
static_assert(layout_size < layout_cosize);
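
One practical consequence is that a padded layout occupies cosize, not size, elements of storage. A minimal sketch, assuming BLAS::a_value_type is the value type of matrix A in the description:

// Sketch: the storage footprint of a layout is determined by its cosize
auto a_padded_layout = BLAS::get_layout_smem_a();
auto a_smem_bytes    = cublasdx::cosize(a_padded_layout) * sizeof(typename BLAS::a_value_type);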

Copying Tensors#

Cooperative Global ⟷ Shared Copying#

template<uint32_t NumThreads,       // Number of threads performing copy operation
         uint32_t AlignmentInBytes, // Pointer alignment of src and dst tensor (minimum of them if they are different)
         class SrcEngine,
         class SrcLayout,
         class DstEngine,
         class DstLayout>
__forceinline__ __device__
void copy(const unsigned int                            tid, // Thread index in CUDA block
          const cublasdx::tensor<SrcEngine, SrcLayout>& src,
          cublasdx::tensor<DstEngine, DstLayout>&       dst);

// Assumes pointers in both dst and src tensors are not extra aligned
template<uint32_t NumThreads, // Number of threads performing copy operation
         class SrcEngine,
         class SrcLayout,
         class DstEngine,
         class DstLayout>
__forceinline__ __device__
void copy(const unsigned int                            tid, // Thread index in CUDA block
          const cublasdx::tensor<SrcEngine, SrcLayout>& src,
          cublasdx::tensor<DstEngine, DstLayout>&       dst);

template<class BLAS,                // BLAS description which provides the number of threads
         uint32_t AlignmentInBytes, // Pointer alignment of src and dst tensor (minimum of them if they are different)
         class SrcEngine,
         class SrcLayout,
         class DstEngine,
         class DstLayout>
__forceinline__ __device__
void copy(const cublasdx::tensor<SrcEngine, SrcLayout>& src,
          cublasdx::tensor<DstEngine, DstLayout>&       dst);

cublasdx::copy is a helper function for copying data between tensors that are in either shared or global memory.

The copy is done cooperatively. All threads, indicated either by NumThreads or by BLAS::block_dim, participate in the copy. The function takes the given alignments into account and attempts to vectorize the load and store instructions when possible.

Requirements:

  • Data in tensors has to be in shared or global memory. Copying from or to registers is not supported.

  • Both src and dst tensors must have the same underlying element type (cublasdx::tensor<Engine, Layout>::value_type, i.e. Engine::value_type).

  • Both src and dst tensors must have the same size, i.e. the number of elements.

  • AlignmentInBytes must be a multiple of the alignment of the underlying element type of tensors.

  • AlignmentInBytes must be equal to 1, 2, 4, 8 or 16, or equal to the alignment of the underlying element type of tensors.

  • Underlying pointers in src and dst tensors must be aligned to AlignmentInBytes bytes.

// Synchronization step required after cublasdx::copy and before the use of dst tensor
__forceinline__ __device__ void copy_wait();

cublasdx::copy_wait creates a synchronization point. It has to be called after a cublasdx::copy operation, before any subsequent read of or write to the dst tensor, and before any subsequent write to the src tensor. Otherwise, the result of the copy operation is undefined. It’s important to note that it is not always a 1-to-1 equivalent of __syncthreads(), as it also handles asynchronous data copies (see the cp.async family of instructions).

Example

Example of copying A matrix from global to shared memory and back.

using BLAS = decltype(Size<128, 128, 128>() + Type<type::real>() + Precision<float, float, double>() + Block() + ...);
extern __shared__ __align__(16) char smem[];

// Slice shared memory
auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem);

auto gmem_tensor_a = cublasdx::make_tensor(a_gmem_pointer, BLAS::get_layout_gmem_a());
auto smem_tensor_a = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());

// Copy from global to shared
using alignment = cublasdx::alignment_of<BLAS>;
cublasdx::copy<BLAS, alignment::a>(gmem_tensor_a, smem_tensor_a);
cublasdx::copy_wait();

// Copy from shared to global
cublasdx::copy<BLAS, alignment::a>(smem_tensor_a, gmem_tensor_a);
cublasdx::copy_wait();
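
The same copy can also be performed with the overload that takes an explicit thread count and thread index. A minimal sketch, reusing the tensors and alignment alias from the example above and assuming the kernel is launched with a 1D block of 256 threads:

// Sketch: explicit thread count and thread index (assumes a 1D block of 256 threads)
constexpr unsigned num_threads = 256;
const unsigned     tid         = threadIdx.x;

cublasdx::copy<num_threads, alignment::a>(tid, gmem_tensor_a, smem_tensor_a);
cublasdx::copy_wait();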

Copying Register Tensors#

Register fragments can be copied to and from global or shared memory tensors either manually, using the partitioner API, or with cublasdx::copy_fragment.

The partitioner API allows for 2 distinct approaches to copying:

  1. Create a subtensor view of calling thread’s elements and copy using register fragment index space (see Partitioner example #1).

  2. Map local register fragment index to global index space and address global / shared tensor directly, allowing for location-aware processing (see Partitioner example #2).

The alternative to manual copying is the bidirectional per-thread copying function cublasdx::copy_fragment:

// #1 Store fragment: partition and copy from register fragment to global / shared memory tensor
template<unsigned AlignmentInBytes,    // Alignment of the global / shared memory tensor pointer
         class TRC, class CFragLayout, // Register Memory Fragment Tensor
         class TC, class CLayout,      // Global or Shared Memory tensor
         class Partitioner>
__forceinline__ __device__
void copy_fragment(tensor<TRC, CFragLayout> const& tS, // Calling thread's register fragment tensor
                   tensor<TC, CLayout>           & tD, // Entire non-partitioned global / shared tensor
                   Partitioner              const& p);

// #2 Load fragment: partition and copy from global / shared memory tensor to register fragment
template<unsigned AlignmentInBytes,    // Alignment of the global / shared memory tensor pointer
         class TRC, class CFragLayout, // Register Memory Fragment Tensor
         class TC, class CLayout,      // Global or Shared Memory tensor
         class Partitioner>
__forceinline__ __device__
void copy_fragment(tensor<TC, CLayout>      const& tS, // Entire non-partitioned global / shared tensor
                   tensor<TRC, CFragLayout>      & tD, // Calling thread's register fragment tensor
                   Partitioner              const& p);

Depending on the chosen perspective, these can be thought of as gather and scatter functions:

  1. From global / shared tensor perspective: load fragment is scatter, store fragment is gather.

  2. From thread perspective: load fragment is gather, store fragment is scatter.

It’s important to note that each of these functions takes:

  1. an unsigned integer template argument describing the alignment of the shared / global memory tensor,

  2. this thread’s register fragment tensor,

  3. an entire non-partitioned global / shared memory tensor, and

  4. a partitioner.

They are both safe to use with thread- and value-predicated fragment GEMMs, and will auto-vectorize to the extent allowed by the passed alignment.

Example of use, performing a register API GEMM with an accumulator:

using alignment = cublasdx::alignment_of<GEMM>;

auto partitioner = GEMM::get_partitioner();
auto c_register_fragment = partitioner.make_accumulator_fragment();

// Load fragment
cublasdx::copy_fragment<alignment::c>(c_global_tensor, c_register_fragment, partitioner);

GEMM().execute(a_shared_tensor, b_shared_tensor, c_register_fragment);

// Store fragment
cublasdx::copy_fragment<alignment::c>(c_register_fragment, c_global_tensor, partitioner);