Other Methods#

Overview#

This section describes additional utility methods provided by cuSolverDx for efficient memory operations and data management.

copy_2d Methods#

copy_2d is a device function that efficiently transfers matrix data between shared and global memory. It supports both explicit thread count specification and operation-based configuration.

Function Signatures#

// Variant 1: Explicit thread count
template<int Threads, unsigned int M, unsigned int N, arrangement Arrange, unsigned int Batches = 1, class DataType>
__device__ void copy_2d(const DataType* src,
                        const int ld_src,
                        DataType* dst,
                        const int ld_dst);

// Variant 2: Operation-based configuration
template<class Solver, unsigned int M, unsigned int N, arrangement Arrange, unsigned int Batches = 1, class DataType>
__device__ void copy_2d(const DataType* src,
                        const int ld_src,
                        DataType* dst,
                        const int ld_dst);

Parameters#

  • Threads: Number of threads participating in the copy operation (Variant 1)

  • Solver: The solver operation type that defines thread count and leading dimensions (Variant 2)

  • M: Number of rows in the matrix

  • N: Number of columns in the matrix

  • Arrange: Matrix arrangement (row_major or col_major)

  • Batches: Number of batches per CTA to process (default: 1)

  • DataType: The data type of the matrix elements

  • src: Pointer to the source matrix

  • ld_src: Leading dimension of the source matrix

  • dst: Pointer to the destination matrix

  • ld_dst: Leading dimension of the destination matrix
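
To make the roles of Arrange, ld_src, and ld_dst concrete, here is a minimal host-side C++ sketch of how a (row, col) index could map to a linear offset. It is not taken from cuSolverDx: the Arrangement enum and element_offset are hypothetical names, and the sketch assumes the usual LAPACK-style convention that the leading dimension is the stride between consecutive columns (col_major) or consecutive rows (row_major).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the library's arrangement enum (illustration only).
enum class Arrangement { col_major, row_major };

// Linear offset of element (row, col) in an m x n matrix with leading dimension ld.
// Assumed convention: col_major stores each column contiguously (ld >= m),
// row_major stores each row contiguously (ld >= n).
inline std::size_t element_offset(Arrangement arr, std::size_t m, std::size_t n,
                                  std::size_t row, std::size_t col, std::size_t ld) {
    assert(row < m && col < n);
    assert(ld >= (arr == Arrangement::col_major ? m : n));
    return arr == Arrangement::col_major ? row + col * ld   // walk down a column first
                                         : row * ld + col;  // walk along a row first
}
```

Under these assumptions, a leading dimension larger than the matrix extent leaves padding between columns (or rows), which is why the source and destination can use different values of ld for the same M and N.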

Function Variants#

The copy_2d functions offer two variants:

  1. Explicit thread count specification for fine-grained control

  2. Operation-based configuration that takes the thread count from the Solver type (Solver::block_dim)
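
One way to read the relationship between the two variants is that the second derives the thread count that the first takes explicitly. The host-side C++ sketch below illustrates that idea only. Dim3, MockSolver, thread_count, and require_full_block are hypothetical names introduced for illustration, and the sketch assumes block_dim is a dim3-like value with x, y, and z members.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3, so the sketch compiles on the host.
struct Dim3 {
    unsigned x, y, z;
};

// Hypothetical solver description exposing a block_dim, for illustration only.
struct MockSolver {
    static constexpr Dim3 block_dim = {128, 1, 1};
};
constexpr Dim3 MockSolver::block_dim;  // out-of-class definition, needed before C++17

// Thread count implied by a description's block_dim.
template <class Solver>
constexpr unsigned thread_count() {
    return Solver::block_dim.x * Solver::block_dim.y * Solver::block_dim.z;
}

// Every thread of the block takes part in the copy, so the launch
// configuration should match the count the description reports.
template <class Solver>
void require_full_block(unsigned launched_threads) {
    assert(launched_threads == thread_count<Solver>());
}
```

With such a helper, a call written against the explicit variant with thread_count<Solver>() threads would involve the same set of threads as the operation-based variant, assuming the kernel is launched with Solver::block_dim.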

Note

The copy is done cooperatively. All threads, as given either by Threads or by Solver::block_dim, participate in the copy. The function takes the alignment of the data type into account and attempts to vectorize the load and store instructions when possible.

Data in src and dst has to be in shared or global memory. Copying to or from registers is not supported.

The function requires appropriate synchronization after use (typically __syncthreads()), so that no thread reads the destination before all participating threads have finished writing it.
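
The cooperative pattern described in this note can be sketched on the host in plain C++. This is an illustration of a thread-strided copy, not the library's implementation: copy_2d_one_thread and copy_2d_emulated are hypothetical names, the layout is assumed to be column-major, and the vectorization that the real function attempts is left out.

```cpp
#include <cassert>

// Emulates what ONE thread does in a cooperative m x n column-major copy:
// thread `tid` of `threads` handles elements tid, tid + threads, tid + 2 * threads, ...
template <class T>
void copy_2d_one_thread(int tid, int threads, int m, int n,
                        const T* src, int ld_src, T* dst, int ld_dst) {
    for (int idx = tid; idx < m * n; idx += threads) {
        const int row = idx % m;  // consecutive idx values walk down a column
        const int col = idx / m;
        dst[row + col * ld_dst] = src[row + col * ld_src];
    }
}

// Runs every emulated thread in turn. On a GPU the threads run concurrently,
// which is why a barrier is needed before any thread reads dst.
template <class T>
void copy_2d_emulated(int threads, int m, int n,
                      const T* src, int ld_src, T* dst, int ld_dst) {
    assert(threads > 0);
    for (int tid = 0; tid < threads; ++tid) {
        copy_2d_one_thread(tid, threads, m, n, src, ld_src, dst, ld_dst);
    }
}
```

In this sketch each element is written by exactly one emulated thread, and a thread usually needs elements that other threads copied. That is the reason the destination is only safe to read after all threads have passed a synchronization point.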

Usage Examples#

Basic Usage#

// Example 1: Using operation-based configuration of copy_2d functions
// copy from the global memory to shared memory
cusolverdx::copy_2d<Solver, M, N, cusolverdx::arrangement_of_v_a<Solver>, Solver::BatchesPerBlock>(Ag, lda_g, As, lda_s);
__syncthreads();

// run the solver functions
Solver::execute(As, lda_s, ...);

// make sure every thread has finished the solver step, then
// copy from the shared memory to the global memory
__syncthreads();
cusolverdx::copy_2d<Solver, M, N, cusolverdx::arrangement_of_v_a<Solver>, Solver::BatchesPerBlock>(As, lda_s, Ag, lda_g);

Using Wrapper Functions#

// Example 2: Using the convenience wrappers around copy_2d provided in example/common/device_io.hpp
#include "example/common/device_io.hpp"

common::io<Solver, Solver::BatchesPerBlock>::load_a(Ag, lda_g, As, lda_s);
common::io<Solver, Solver::BatchesPerBlock>::load_b(Bg, ldb_g, Bs, ldb_s);

// run the solver functions
Solver::execute(As, lda_s, ...);

// Store results back to global memory
common::io<Solver, Solver::BatchesPerBlock>::store_a(As, lda_s, Ag, lda_g);
common::io<Solver, Solver::BatchesPerBlock>::store_b(Bs, ldb_s, Bg, ldb_g);