Other

Tensor

cuBLASDx exposes the cublasdx::tensor class, which is an alias of the cute::Tensor class from the CuTe library (see CUTLASS).
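Since the alias is transparent, any CuTe utility that accepts a cute::Tensor can be used on a cublasdx::tensor directly. A minimal sketch, assuming a tensor t obtained from one of the creation functions described below:

// t is a cublasdx::tensor, i.e. a cute::Tensor, so CuTe utilities apply directly
auto elements = cute::size(t); // total number of elements
auto shape    = t.shape();     // the tensor's CuTe shape
t(0) = t(1);                   // element access via CuTe's operator()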

Tensor Creation

// #1: Wrapper of cute::make_tensor
template<class Iterator, class... Args>
__device__ __host__
constexpr cublasdx::tensor make_tensor(const Iterator& iter, Args const&... args);

// #2: With a pointer layout returned by the get_layout_<gmem/smem>_<a/b/c> or suggest_layout_smem_<a/b/c> methods of the BLAS description.
template<class T, class PointerLayout>
__device__ __host__
constexpr cublasdx::tensor make_tensor(T* ptr, const PointerLayout& pl);

cublasdx::make_tensor is a helper function for creating cublasdx::tensor objects.

There are two variants. The first one is simply a wrapper of cute::make_tensor(…), which usually requires manually tagging the raw pointer with its memory space. The second one works together with the Get Memory Layout and the Suggested Shared Memory Layout methods: it creates a global or shared memory tensor with the returned pointer layout. In contrast to the first variant, it picks up the memory space information from the pointer layout and tags the raw pointer accordingly.

Example
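A minimal sketch of both variants; BLAS is a complete BLAS description, and the pointers gmem_a (global memory) and smem_a (shared memory) are assumed to exist:

// Variant #1: thin wrapper over cute::make_tensor; the raw pointer has to be
// tagged with its memory space by hand (cute::make_smem_ptr / cute::make_gmem_ptr)
auto layout      = cute::make_layout(cute::make_shape(cute::Int<128>{}, cute::Int<128>{}));
auto smem_tensor = cublasdx::make_tensor(cute::make_smem_ptr(smem_a), layout);

// Variant #2: the pointer layout returned by the BLAS description already carries
// the memory space information, so an untagged raw pointer is enough
auto gmem_tensor = cublasdx::make_tensor(gmem_a, BLAS::get_layout_gmem_a());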

Copying Tensors

template<uint32_t NumThreads,       // Number of threads performing copy operation
         uint32_t AlignmentInBytes, // Pointer alignment of src and dst tensor (minimum of them if they are different)
         class SrcEngine,
         class SrcLayout,
         class DstEngine,
         class DstLayout>
__forceinline__ __device__
void copy(const unsigned int                            tid, // Thread index in CUDA block
          const cublasdx::tensor<SrcEngine, SrcLayout>& src,
          cublasdx::tensor<DstEngine, DstLayout>&       dst);

// Assumes the pointers in both dst and src tensors have no extra alignment beyond that of the underlying element type
template<uint32_t NumThreads, // Number of threads performing copy operation
         class SrcEngine,
         class SrcLayout,
         class DstEngine,
         class DstLayout>
__forceinline__ __device__
void copy(const unsigned int                            tid, // Thread index in CUDA block
          const cublasdx::tensor<SrcEngine, SrcLayout>& src,
          cublasdx::tensor<DstEngine, DstLayout>&       dst);

template<class BLAS,                // BLAS description which provides the number of threads
         uint32_t AlignmentInBytes, // Pointer alignment of src and dst tensor (minimum of them if they are different)
         class SrcEngine,
         class SrcLayout,
         class DstEngine,
         class DstLayout>
__forceinline__ __device__
void copy(const cublasdx::tensor<SrcEngine, SrcLayout>& src,
          cublasdx::tensor<DstEngine, DstLayout>&       dst);

cublasdx::copy is a helper function for copying data between tensors that reside in shared or global memory.

The copy is performed cooperatively: all threads, as indicated either by NumThreads or by BLAS::block_dim, participate in the copy. The function takes the given alignments into account and attempts to vectorize the load and store instructions whenever possible.
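For illustration, a hedged sketch of the first and the third overload; the tensor names match the example at the end of this section, while the thread count (256) and the 16-byte alignment are arbitrary, assumed values:

// Overload #1: explicit thread count, alignment, and thread index
// (threadIdx.x is a valid tid only for 1-D blocks; compute a flat index otherwise)
cublasdx::copy<256 /* NumThreads */, 16 /* AlignmentInBytes */>(threadIdx.x, gmem_tensor_a, smem_tensor_a);

// Overload #3: thread count taken from BLAS::block_dim
cublasdx::copy<BLAS, 16 /* AlignmentInBytes */>(gmem_tensor_a, smem_tensor_a);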

Requirements:

  • Data in tensors has to be in shared or global memory. Copying from or to registers is not supported.

  • Both src and dst tensors must have the same underlying element type (cublasdx::tensor<Engine, Layout>::value_type, i.e., Engine::value_type).

  • Both src and dst tensors must have the same size, i.e., the same number of elements.

  • AlignmentInBytes must be a multiple of the alignment of the underlying element type of the tensors.

  • AlignmentInBytes must be equal to 1, 2, 4, 8, or 16, or equal to the alignment of the underlying element type of the tensors.

  • Underlying pointers in src and dst tensors must be aligned to AlignmentInBytes bytes.
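As a worked illustration of the alignment rules (a sketch; the float element type and the 16-byte figure are assumptions, not the only valid choices):

// For float elements (alignof(float) == 4) the admissible AlignmentInBytes
// values are 4, 8, and 16; 16 enables 128-bit vectorized loads and stores.
// A value of 2 would be invalid, since 2 is not a multiple of alignof(float).
extern __shared__ __align__(16) char smem[]; // guarantees a 16-byte aligned shared pointer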

// Synchronization step required after cublasdx::copy and before the use of dst tensor
__forceinline__ __device__ void copy_wait();

cublasdx::copy_wait creates a synchronization point. It has to be called after a cublasdx::copy operation, before any subsequent read from or write to the dst tensor, and before any subsequent write to the src tensor; otherwise, the result of the copy operation is undefined. It is important to note that it is not always a 1-to-1 equivalent of __syncthreads(), as it also handles asynchronous data copies (see the cp.async family of instructions).

Example

An example of copying the A matrix from global to shared memory and back.

using BLAS = decltype(Size<128, 128, 128>() + Type<type::real>() + Precision<float, float, double>() + Block() + ...);
extern __shared__ __align__(16) char smem[];

// Slice shared memory
auto [smem_a, smem_b, smem_c] = cublasdx::slice_shared_memory<BLAS>(smem);

auto gmem_tensor_a = cublasdx::make_tensor(a_gmem_pointer, BLAS::get_layout_gmem_a());
auto smem_tensor_a = cublasdx::make_tensor(smem_a, BLAS::suggest_layout_smem_a());

// Copy from global to shared
using alignment = cublasdx::alignment_of<BLAS>;
cublasdx::copy<BLAS, alignment::a>(gmem_tensor_a, smem_tensor_a);
cublasdx::copy_wait();

// Copy from shared to global
cublasdx::copy<BLAS, alignment::a>(smem_tensor_a, gmem_tensor_a);
cublasdx::copy_wait();