Tensors#
cuBLASDx Tensor#
cuBLASDx exposes the cublasdx::tensor class, which is an alias of the cute::Tensor class from the CuTe library (see CUTLASS).
Tensor Creation#
// #1: Wrapper of cute::make_tensor
template<class Iterator, class... Args>
__device__ __host__
constexpr cublasdx::tensor make_tensor(const Iterator& iter, Args const&... args);
// #2: With a pointer layout returned by the get_layout_<gmem/smem>_<a/b/c>() or suggest_layout_smem_<a/b/c>() methods of the BLAS description.
template<class T, class PointerLayout>
__device__ __host__
constexpr cublasdx::tensor make_tensor(T* ptr, const PointerLayout& pl);
cublasdx::make_tensor
is a helper function for creating cublasdx::tensor objects.
There are two variants. The first is a simple wrapper of cute::make_tensor(…), which usually requires manually tagging the raw pointer with its memory space. The second works together with the Get Memory Layout and Suggested Shared Memory Layout methods: it creates a global or shared memory tensor with the returned pointer layout. In contrast to the first variant, it picks up the memory space information from the pointer layout and tags the raw pointer accordingly.
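For the first variant, the raw pointer must be tagged with its memory space manually, e.g. with CuTe's pointer wrappers. A minimal sketch (the shape and layout below are illustrative assumptions, not values required by cuBLASDx):
// Variant #1: tag the raw shared memory pointer, then build a tensor over an
// explicit 128 x 128 column-major layout
__shared__ float smem_a[128 * 128];
auto a_tensor = cublasdx::make_tensor(
    cute::make_smem_ptr(smem_a),
    cute::make_layout(cute::make_shape(cute::Int<128>{}, cute::Int<128>{})));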
Example
using BLAS = decltype(Size<128, 128, 128>() + Type<type::real>() + Precision<float, float, double>() + Block() + ...);
// tensor with global memory data
auto a_global_tensor = cublasdx::make_tensor(a, BLAS::get_layout_gmem_a());
// tensor with shared memory data (default layout)
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::get_layout_smem_a());
// tensor with shared memory data (suggested layout)
auto a_suggested_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());
Partitioner And Register Fragment Tensors#
cuBLASDx 0.3.0
introduces 2 new objects: register fragment tensors and partitioners.
Since register memory is private to a thread, it is not used to store entire tensors, only small parts of them. The division of a global / shared memory tensor among the threads taking part in a BLAS operation is called a partitioning. In cuBLASDx, the BLAS operation is the source of the partitioning pattern.
In every BLAS execution each thread is assigned the elements of the result matrix C which it calculates. This assignment depends on 3 things:
number of threads,
chosen MMA instruction, and
tiling chosen for the MMA instruction.
The user of cuBLASDx has direct control over only 1 of these 3 factors, hence the partitioning information must be opaque and used only with the BLAS object that defined it. cuBLASDx exposes this opaque information by means of a partitioner.
Partitioner#
Partitioner objects can be obtained from the BLAS type in 2 ways:
// #1a As pre-requested default partitioner
auto partitioner = BLAS::get_partitioner();
// #1b As pre-requested suggested partitioner
auto partitioner = BLAS::suggest_partitioner();
// #2 As result of register API without accumulator call
auto [c_register_fragment, partitioner] = BLAS().execute(a_shared_tensor, b_shared_tensor);
Note
Options #1a and #1b require the user to choose the variant coherent with the layouts used: the suggested partitioner only with suggested layouts, and the default partitioner with any other layouts. Mixing them may result in subpar performance.
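For example, a coherent pairing of the suggested shared memory layouts with the suggested partitioner could look as follows (a sketch; smem_a and smem_b are assumed to be shared memory pointers):
// Suggested layouts paired with the suggested partitioner
auto a_shared_tensor = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());
auto b_shared_tensor = cublasdx::make_tensor(smem_b, BLAS::suggest_layout_smem_b());
auto partitioner = BLAS::suggest_partitioner();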
The partitioner is an opaque object exposing the following API:
__device__ bool is_predicated();
__device__ bool is_thread_active();
__device__ constexpr auto make_accumulator_fragment();
template<class CTensor>
__forceinline__ __device__
auto partition_like_C(CTensor && ctensor) const;
template<class ... Coords>
__forceinline__ __device__
auto map_fragment_index(Coords&& ... coords) const;
template<class ... Coords>
__forceinline__ __device__
bool is_index_in_bounds(Coords&& ... coords) const;
There are 3 methods that can be used to get basic information about the partitioning pattern, or retrieve a register fragment.
// Partitioning properties
__device__ bool is_predicated();
__device__ bool is_thread_active();
// Accumulator creation, creates a register cublasdx::tensor
__device__ constexpr auto make_accumulator_fragment();
As mentioned in Partitioner And Register Fragment Tensors, the division of elements among threads happens by tiling an MMA instruction over the problem size defined by the BLAS. Each instruction is responsible for computing a certain shape, and it is possible that the problem shape is not divisible by the primitive instruction shape. In such cases the “extra” instruction elements are filled with 0s instead of being read from memory, and are skipped when storing results. This is called element predication. The is_predicated() member function can be used to retrieve that information.
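For illustration, a problem size that does not divide evenly by the underlying MMA shape will typically require predication. A hypothetical sketch (whether a given configuration is actually predicated depends on the chosen MMA instruction and its tiling):
// An odd-sized problem is unlikely to be divisible by the MMA instruction shape
using OddBLAS = decltype(Size<65, 64, 64>() + Type<type::real>() + Precision<float, float, float>() + Block() + ...);
auto partitioner = OddBLAS::get_partitioner();
if (partitioner.is_predicated()) {
    // Some fragment elements fall outside the problem shape; manual copies
    // must bounds-check them (see Partitioner example #2)
}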
Because cuBLASDx supports execution in kernels whose CUDA thread block size does not match the BlockDim Operator, not all threads will always take part in a GEMM operation. This implies that some threads may have no elements assigned to them. The is_thread_active() member function can be used to check whether this is the case for the calling thread.
Whether the returned bool value is known at compile time or only at runtime depends on the specific BLAS configuration used.
The final argumentless method, make_accumulator_fragment(), can be used to retrieve an uninitialized register fragment tensor corresponding to the BLAS execution from which the partitioner was obtained.
Please see the code below for an example of the use of all 3 methods.
template<class CTensor>
__forceinline__ __device__
auto partition_like_C(CTensor && ctensor) const;
This method returns a non-owning view of the subtensor of its argument that is assigned to the calling thread, corresponding to its local register fragment.
All these methods can be used for manual copying of an unpredicated GEMM as follows:
Partitioner example #1#
auto c_global_tensor = ...;
// Get partitioner
auto partitioner = BLAS::get_partitioner();
// Create local register fragment
auto c_register_fragment = partitioner.make_accumulator_fragment();
// Get view of this thread's global memory subtensor
auto c_global_partition = partitioner.partition_like_C(c_global_tensor);
// Ensure that all elements are in-bounds and
// no predication is necessary
static_assert(not partitioner.is_predicated());
// If this thread takes part in GEMM
if(partitioner.is_thread_active()) {
    // For each element of the register fragment
    for(int i = 0; i < cublasdx::size(c_register_fragment); ++i) {
        // Copy the respective global element into it
        c_register_fragment(i) = c_global_partition(i);
    }
}
// C += A * B
BLAS().execute(a_shared_tensor, b_shared_tensor, c_register_fragment);
If a GEMM is predicated, more information will be necessary:
template<class ... Coords>
__forceinline__ __device__
auto map_fragment_index(Coords&& ... coords) const;
template<class ... Coords>
__forceinline__ __device__
bool is_index_in_bounds(Coords&& ... coords) const;
These 2 functions extend the functionality of is_predicated(), allowing the user to map a local register fragment index to its source (global or shared) tensor index, as well as to check whether that index is in bounds.
Together, these methods can be used to perform a bounds-checked load of elements from a global tensor into a register fragment:
Partitioner example #2#
auto c_global_tensor = ...;
// Get partitioner
auto partitioner = BLAS::get_partitioner();
// Create local register fragment
auto c_register_fragment = partitioner.make_accumulator_fragment();
// If this thread takes part in GEMM
if(partitioner.is_thread_active()) {
    // For each element of the register fragment
    for(int i = 0; i < cublasdx::size(c_register_fragment); ++i) {
        // Map the fragment index to its index in the source tensor
        auto global_index = partitioner.map_fragment_index(i);
        if((not partitioner.is_predicated()) or partitioner.is_index_in_bounds(global_index)) {
            // Copy the respective global element, optionally transforming it on load
            // (load_op here stands for any user-provided elementwise functor)
            c_register_fragment(i) = load_op(c_global_tensor(global_index));
        }
    }
}
// C += A * B
BLAS().execute(a_shared_tensor, b_shared_tensor, c_register_fragment);
It would be very tedious to perform such loads manually every time, hence cuBLASDx offers optimized, auto-vectorizing global/shared ⟷ register fragment copying functions. See Copying Register Tensors for more information.
Imported Tensor Utilities#
Since cuBLASDx integrates the CuTe library, it exposes some of CuTe's tensor and layout functionality, sometimes adding compatibility layers to support its own types. Currently, the following functions are supported:
// Aliased
using cute::clear;
using cute::transform;
using cute::make_fragment_like;
// With compatibility layers
cublasdx::size; // (cute::size)
cublasdx::cosize; // (cute::cosize)
cublasdx::axpby; // (cute::axpby)
cublasdx::clear
takes a register fragment as input and sets all its values to 0.
auto partitioner = BLAS::get_partitioner();
auto c_register_fragment = partitioner.make_accumulator_fragment();
cublasdx::clear(c_register_fragment);
cublasdx::transform
applies an elementwise functor to all elements of the passed register tensor.
auto c_global_tensor = ...;
auto partitioner = BLAS::get_partitioner();
auto c_register_fragment = partitioner.make_accumulator_fragment();
cublasdx::copy_fragment<alignment_of<BLAS>::c>(c_global_tensor, c_register_fragment, partitioner);
// in-place
cublasdx::transform(c_register_fragment, [](auto v) { return 2 * v; });
// out-of-place
auto d_register_fragment = partitioner.make_accumulator_fragment();
cublasdx::transform(c_register_fragment, d_register_fragment, [](auto v) { return 2 * v; });
cublasdx::make_fragment_like
can be used to create a register fragment with the same layout and type as its argument tensor.
auto c_global_tensor = ...;
auto partitioner = BLAS::get_partitioner();
auto c_global_partition = partitioner.partition_like_C(c_global_tensor);
// Same type
auto c_register_fragment = cublasdx::make_fragment_like(c_global_partition);
// Other type
using new_type = double;
auto d_register_fragment = cublasdx::make_fragment_like<new_type>(c_global_partition);
Warning
The following functions should be used only in their cuBLASDx form, as it offers compatibility layers and automatic conversions.
cublasdx::axpby
performs the operation \(\mathbf{D}_{m\times n} = {\alpha} \times \mathbf{C}_{m\times n} + {\beta} \times \mathbf{D}_{m\times n}\).
auto c_global_tensor = ...;
auto a_shared_tensor = ...;
auto b_shared_tensor = ...;
auto partitioner = BLAS::get_partitioner();
auto c_register_fragment = partitioner.make_accumulator_fragment();
auto d_register_fragment = partitioner.make_accumulator_fragment();
// Load the original C values into the D fragment and zero the accumulator
cublasdx::copy_fragment<alignment_of<BLAS>::c>(c_global_tensor, d_register_fragment, partitioner);
cublasdx::clear(c_register_fragment);
// These 2 functions combined perform the classic GEMM: C = alpha * A * B + beta * C;
BLAS().execute(a_shared_tensor, b_shared_tensor, c_register_fragment);   // C_frag = A * B
cublasdx::axpby(alpha, c_register_fragment, beta, d_register_fragment);  // D_frag = alpha * C_frag + beta * D_frag
cublasdx::copy_fragment<alignment_of<BLAS>::c>(d_register_fragment, c_global_tensor, partitioner);
cublasdx::size
returns the number of valid elements in a tensor. This is simply the product of all shape dimensions.
auto a_shared_layout = BLAS::get_layout_smem_a();
// Same as size_of<BLAS>::m * size_of<BLAS>::k
constexpr auto layout_size = cublasdx::size(a_shared_layout);
cublasdx::cosize
returns the distance from the first element of a tensor to its last element, plus one. It describes how many elements the argument layout spans. This is the same as cublasdx::size for layouts which are compact, i.e. have no “holes” (e.g. as a result of extra leading dimension elements).
// Default leading dimension
using BLAS = decltype(... + Size<M, N, K>() + LeadingDimension<M, K, M>());
auto a_shared_layout = BLAS::get_layout_smem_a();
constexpr auto layout_size = cublasdx::size(a_shared_layout);
constexpr auto layout_cosize = cublasdx::cosize(a_shared_layout);
static_assert(layout_size == layout_cosize);
// Extra elements in the leading dimension
constexpr auto lda = M + 1; // Add padding to avoid shared memory bank conflicts
using PaddedBLAS = decltype(... + Size<M, N, K>() + LeadingDimension<lda, K, M>());
auto a_padded_layout = PaddedBLAS::get_layout_smem_a();
constexpr auto padded_size = cublasdx::size(a_padded_layout);
constexpr auto padded_cosize = cublasdx::cosize(a_padded_layout);
static_assert(padded_size < padded_cosize);
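The arithmetic behind cosize can be sketched with a small hand-built CuTe layout (an illustrative assumption, not a layout produced by cuBLASDx): for a column-major \(M \times K\) layout with strides \((1, lda)\), the last element lives at offset \((M-1) + (K-1) \times lda\), so cosize equals that offset plus one.
// With M = 4, K = 2, lda = 5: size = 4 * 2 = 8, cosize = (4-1) + (2-1)*5 + 1 = 9
auto toy_layout = cute::make_layout(cute::make_shape(cute::Int<4>{}, cute::Int<2>{}),
                                    cute::make_stride(cute::Int<1>{}, cute::Int<5>{}));
constexpr auto toy_size = cublasdx::size(toy_layout);     // 8
constexpr auto toy_cosize = cublasdx::cosize(toy_layout); // 9
static_assert(toy_size < toy_cosize);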
Copying Tensors#
Copying Register Tensors#
Register fragments can be copied to or from global and shared memory tensors either manually, using the partitioner API, or with cublasdx::copy_fragment.
The partitioner API allows for 2 distinct approaches to copying:
Create a subtensor view of calling thread’s elements and copy using register fragment index space (see Partitioner example #1).
Map local register fragment index to global index space and address global / shared tensor directly, allowing for location-aware processing (see Partitioner example #2).
The second option, cublasdx::copy_fragment, is a bidirectional copying method which works per-thread:
// #1 Store fragment: partition and copy from register fragment to global / shared memory tensor
template<unsigned AlignmentInBytes,    // Alignment (in bytes) of the global / shared memory tensor pointer
         class TRC, class CFragLayout, // Register memory fragment tensor
         class TC, class CLayout,      // Global or shared memory tensor
         class Partitioner>
__forceinline__ __device__
void copy_fragment(tensor<TRC, CFragLayout> const& tS, // Calling thread's register fragment tensor (source)
                   tensor<TC, CLayout>           & tD, // Entire non-partitioned global / shared tensor (destination)
                   Partitioner const& p);
// #2 Load fragment: partition and copy from global / shared memory tensor to register fragment
template<unsigned AlignmentInBytes,    // Alignment (in bytes) of the global / shared memory tensor pointer
         class TRC, class CFragLayout, // Register memory fragment tensor
         class TC, class CLayout,      // Global or shared memory tensor
         class Partitioner>
__forceinline__ __device__
void copy_fragment(tensor<TC, CLayout>      const& tS, // Entire non-partitioned global / shared tensor (source)
                   tensor<TRC, CFragLayout>      & tD, // Calling thread's register fragment tensor (destination)
                   Partitioner const& p);
Depending on the defined semantics, these can be thought of as gather and scatter functions:
From global / shared tensor perspective: load fragment is scatter, store fragment is gather.
From thread perspective: load fragment is gather, store fragment is scatter.
It’s important to note that each of these functions takes:
an unsigned integer template argument describing the alignment of the shared / global memory tensor,
this thread’s register fragment tensor,
an entire non-partitioned global / shared memory tensor, and
a partitioner.
Both are safe to use with thread- and value-predicated fragment GEMMs, and will auto-vectorize to the extent allowed by the passed alignment.
An example of use, performing a register API GEMM with accumulator:
auto partitioner = GEMM::get_partitioner();
auto c_register_fragment = partitioner.make_accumulator_fragment();
// Load fragment
cublasdx::copy_fragment<alignment_of<GEMM>::c>(c_global_tensor, c_register_fragment, partitioner);
GEMM().execute(a_shared_tensor, b_shared_tensor, c_register_fragment);
// Store fragment
cublasdx::copy_fragment<alignment_of<GEMM>::c>(c_register_fragment, c_global_tensor, partitioner);