Other Methods#

Overview#

This section describes additional utility methods provided by cuSolverDx for efficient memory operations and data management.

copy_2d Methods#

copy_2d is a device function that efficiently transfers matrix data between shared and global memory. It supports both explicit thread count specification and operation-based configuration.

Function Signatures#

// Variant 1: Explicit thread count
template<int Threads, unsigned int M, unsigned int N, arrangement Arrange, unsigned int Batches = 1, class DataType>
__device__ void copy_2d(const DataType* src,
                        const int ld_src,
                        DataType* dst,
                        const int ld_dst);

// Variant 2: Operation-based configuration
template<class Solver, unsigned int M, unsigned int N, arrangement Arrange, unsigned int Batches = 1, class DataType>
__device__ void copy_2d(const DataType* src,
                        const int ld_src,
                        DataType* dst,
                        const int ld_dst);

Parameters#

Threads: Number of threads participating in the copy operation
Solver: The solver operation type (the first template parameter of Variant 2) that defines thread count and leading dimensions
M: Number of rows in the matrix
N: Number of columns in the matrix
Arrange: Matrix arrangement (row_major or col_major)
Batches: Number of batches per CTA to process (default: 1)
DataType: The data type of the matrix elements
src: Pointer to the source matrix
ld_src: Leading dimension of the source matrix
dst: Pointer to the destination matrix
ld_dst: Leading dimension of the destination matrix
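The role of Arrange, ld_src and ld_dst can be illustrated with a small host-side C++ sketch. It assumes the usual BLAS/LAPACK convention, in which the leading dimension is the stride between consecutive columns (col_major) or consecutive rows (row_major). The names layout, element_offset and copy_matrix are hypothetical and are not part of cuSolverDx; the sketch models only the indexing, not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for cusolverdx::arrangement, for illustration only.
enum class layout { col_major, row_major };

// Linear offset of element (row, col) in a matrix with leading dimension ld.
// col_major: columns start ld elements apart; row_major: rows start ld elements apart.
constexpr std::size_t element_offset(layout arr, std::size_t row, std::size_t col, std::size_t ld) {
    return arr == layout::col_major ? row + col * ld : row * ld + col;
}

// Sequential model of copying an m x n matrix between two buffers whose
// leading dimensions may differ (for example, padded global memory and
// tightly packed shared memory).
template <class T>
void copy_matrix(layout arr, std::size_t m, std::size_t n,
                 const T* src, std::size_t ld_src,
                 T* dst, std::size_t ld_dst) {
    // The leading dimension must cover at least one full column (or row).
    const std::size_t min_ld = (arr == layout::col_major) ? m : n;
    assert(ld_src >= min_ld && ld_dst >= min_ld);
    for (std::size_t col = 0; col < n; ++col) {
        for (std::size_t row = 0; row < m; ++row) {
            dst[element_offset(arr, row, col, ld_dst)] = src[element_offset(arr, row, col, ld_src)];
        }
    }
}
```

Under this convention, a 3 x 2 col_major matrix stored with ld_src = 4 has one padding element after each column, and copying it with ld_dst = 3 produces a tightly packed destination.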

Function Variants#

The copy_2d functions offer two variants:

Explicit thread count specification for fine-grained control
Operation-based configuration that uses the thread count of the Solver operation (Solver::block_dim)

Note

The copy is done cooperatively. All threads, as specified either by Threads or by Solver::block_dim, participate in the copy. The function takes the alignment of the data type into account and attempts to vectorize the load and store instructions when possible.

Data in src and dst has to be in shared or global memory. Copying to or from registers is not supported.

The function requires appropriate synchronization after use (typically __syncthreads()).
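To make the cooperative behavior concrete, the following host-side C++ sketch simulates one common way of splitting a copy across threads: thread tid handles elements tid, tid + threads, tid + 2 * threads, and so on. This is an assumption made for illustration, not the actual cuSolverDx implementation; vectorized accesses are omitted, only a column-major tile is modeled, and copy_slice, cooperative_copy and elements_per_thread are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Work done by one simulated thread on a column-major m x n tile:
// it copies elements tid, tid + threads, tid + 2 * threads, ...
template <class T>
void copy_slice(std::size_t tid, std::size_t threads, std::size_t m, std::size_t n,
                const T* src, std::size_t ld_src, T* dst, std::size_t ld_dst) {
    for (std::size_t idx = tid; idx < m * n; idx += threads) {
        const std::size_t row = idx % m;
        const std::size_t col = idx / m;
        dst[row + col * ld_dst] = src[row + col * ld_src];
    }
}

// Host-side stand-in for a thread block: runs every thread's slice in turn.
// On a GPU the slices run concurrently, which is why a barrier is needed
// before any thread reads elements written by another thread.
template <class T>
void cooperative_copy(std::size_t threads, std::size_t m, std::size_t n,
                      const T* src, std::size_t ld_src, T* dst, std::size_t ld_dst) {
    assert(threads > 0);
    for (std::size_t tid = 0; tid < threads; ++tid) {
        copy_slice(tid, threads, m, n, src, ld_src, dst, ld_dst);
    }
}

// Number of elements a given thread copies under this partitioning.
constexpr std::size_t elements_per_thread(std::size_t tid, std::size_t threads, std::size_t total) {
    return tid < total ? (total - tid + threads - 1) / threads : 0;
}
```

In this model each thread writes only a fraction of the tile, so the destination is complete only after every participating thread has finished its slice. That is consistent with the requirement that all threads call the function and that the caller synchronizes afterwards.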

Usage Examples#

Basic Usage#

// Example 1: Using operation-based configuration of copy_2d functions
// copy from the global memory to shared memory
cusolverdx::copy_2d<Solver, M, N, cusolverdx::arrangement_of_v_a<Solver>, Solver::BatchesPerBlock>(Ag, lda_g, As, lda_s);
__syncthreads();

// run the solver functions
Solver::execute(As, lda_s, ...);

// copy from the shared memory to the global memory
__syncthreads();
cusolverdx::copy_2d<Solver, M, N, cusolverdx::arrangement_of_v_a<Solver>, Solver::BatchesPerBlock>(As, lda_s, Ag, lda_g);

Using Wrapper Functions#

// Example 2: Using a set of convenient wrapper functions of copy_2d provided in example/common/device_io.hpp
#include "example/common/device_io.hpp"

common::io<Solver, Solver::BatchesPerBlock>::load_a(Ag, lda_g, As, lda_s);
common::io<Solver, Solver::BatchesPerBlock>::load_b(Bg, ldb_g, Bs, ldb_s);

// run the solver functions
Solver::execute(As, lda_s, ...);

// Store results back to global memory
common::io<Solver, Solver::BatchesPerBlock>::store_a(As, lda_s, Ag, lda_g);
common::io<Solver, Solver::BatchesPerBlock>::store_b(Bs, ldb_s, Bg, ldb_g);