Shared Memory Management#
cuSolverDx performs calculations on matrices that reside in shared memory. Because shared memory is a limited resource shared by all threads in a block, cuSolverDx provides tools to help users manage it efficiently and correctly.
The tools include:
A calculator for the required shared memory size for custom kernels
A set of functions for splitting shared memory between multiple matrices or arrays
Shared Storage Size Calculator#
To simplify shared memory calculations, cuSolverDx offers a shared memory size calculator:
shared_storage_calculator make_shared_storage_calculator();
API Reference#
The calculator exposes the following API:
__host__ __device__
shared_storage_calculator& add(unsigned alignment, unsigned matrix_size_bytes);
__host__ __device__
shared_storage_calculator& add(unsigned alignment, unsigned elem_size, unsigned num_elements);
// layout can be either a CuTe layout object or a layout provided by cuBLASDx
template<class Layout>
__host__ __device__
shared_storage_calculator& add(unsigned alignment, unsigned elem_size, const Layout& layout);
// Returns calculated required shared memory in bytes
__host__ __device__
unsigned int get();
Usage Guidelines#
Use the shared memory size calculator for advanced scenarios when the shared_memory_size trait is insufficient. This is particularly useful when:
You need to call cuBLASDx and cuSolverDx in the same kernel
There are additional shared memory arrays required in the kernel
You need to combine multiple memory requirements
Example Usage#
auto shared_memory_size =
cusolverdx::make_shared_storage_calculator()
// cuSolverDx input matrix
.add(alignof(typename POSVDescription::a_data_type), POSVDescription::shared_memory_size)
// A and B matrices for cuBLASDx
.add(cublasdx::alignment_of_v_a<BLAS>, sizeof(AInputType), BLAS::suggest_layout_smem_a())
.add(cublasdx::alignment_of_v_b<BLAS>, sizeof(BInputType), BLAS::suggest_layout_smem_b())
// Custom array of 64 elements
.add(alignof(float), sizeof(float), 64)
.get();
Note
This is the same calculator that the cuBLASDx Shared Storage Size Calculator provides. The add() method has an overload that accepts CuTe layouts, or any layout provided by cuBLASDx, for seamless fusion with cuBLASDx.
Shared Memory Slicing#
cuSolverDx provides advanced, generic slicing APIs that allow you to split shared memory among any number of input matrices or arrays.
Note
The following functions are available in the cusolverdx::shared_memory namespace, not in the cusolverdx namespace. The same set of functions is also available in cuBLASDx Shared Memory Slicing.
API Reference#
// 1. Use type, alignment and number of elements to slice into pointers
template <class... Ts, class... Args>
__host__ __device__ __forceinline__
cute::tuple<Ts*...> slice_into_pointers(void* smem, const Args... args);
// 2. Use type, alignment and layout to slice into CuTe tensors
template <class... Ts, class... Args>
__host__ __device__ __forceinline__
auto slice_into_tensors(void* smem, const Args... args);
// 3. Accept types, alignments and either count of elements or layouts to get appropriate combination of outputs
template <class... PointerTypes, class... Args>
__host__ __device__ __forceinline__
auto slice(void* smem, const Args... args);
Function Variants#
The general slice() function offers the most flexibility in a single call, combining the functionality of the first two variants, which serve as simplified helpers:

slice_into_pointers: for basic pointer-based memory access
- Use when you need simple pointer access to memory regions
- Suitable for basic array operations
- Provides direct memory access

slice_into_tensors: for CuTe tensor-based memory access
- Use when working with CuTe layouts
- Provides tensor-based memory access
- Better integration with CuTe operations

slice: combines both functionalities for maximum flexibility
- Can handle mixed pointer and tensor outputs
- Most versatile option
- Recommended for complex memory layouts
Example Usage#
// Slice shared memory into tensors and raw arrays (pointers)
auto [s_a_ptr, s_cublasdx_a_tensor, s_cublasdx_b_tensor, s_array_ptr] =
cusolverdx::shared_memory::slice<typename POSVDescription::a_data_type, AInputType, BInputType, float>(
smem,
alignof(typename POSVDescription::a_data_type), (lda * POSVDescription::m_size),
cublasdx::alignment_of_v_a<BLAS>, BLAS::suggest_layout_smem_a(),
cublasdx::alignment_of_v_b<BLAS>, BLAS::suggest_layout_smem_b(),
alignof(float), 64
);
For each requested type, slice() returns a CuTe tensor if you pass a layout and a raw pointer if you pass a count of elements.