Other Methods#
Overview#
This section describes additional utility methods provided by cuSolverDx for efficient memory operations and data management.
copy_2d Methods#
copy_2d is a device function template that copies matrix data between shared and global memory. The number of participating threads is either given explicitly or taken from a solver operation type.
Function Signatures#
// Variant 1: Explicit thread count
template<int Threads, unsigned int M, unsigned int N, arrangement Arrange, unsigned int Batches = 1, class DataType>
__device__ void copy_2d(const DataType* src,
                        const int       ld_src,
                        DataType*       dst,
                        const int       ld_dst);

// Variant 2: Operation-based configuration
template<class Solver, unsigned int M, unsigned int N, arrangement Arrange, unsigned int Batches = 1, class DataType>
__device__ void copy_2d(const DataType* src,
                        const int       ld_src,
                        DataType*       dst,
                        const int       ld_dst);
Parameters#
Threads: Number of threads participating in the copy operation (Variant 1)
Solver: The solver operation type that defines thread count and leading dimensions (Variant 2)
M: Number of rows in the matrix
N: Number of columns in the matrix
Arrange: Matrix arrangement (row_major or col_major)
Batches: Number of batches per CTA to process (default: 1)
DataType: The data type of the matrix elements
src: Pointer to the source matrix
ld_src: Leading dimension of the source matrix
dst: Pointer to the destination matrix
ld_dst: Leading dimension of the destination matrix
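The relationship between Arrange and the leading dimensions can be illustrated with a small host-side C++ sketch. It follows the usual BLAS/LAPACK convention for leading dimensions. The enum and both helper functions are hypothetical names introduced only for this illustration and are not part of cuSolverDx.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the library's arrangement type, used only in this sketch.
enum class sketch_arrangement { col_major, row_major };

// Linear offset of element (row, col) in a matrix stored with leading dimension ld.
// col_major: elements of one column are adjacent, consecutive columns are ld apart.
// row_major: elements of one row are adjacent, consecutive rows are ld apart.
constexpr std::size_t element_offset(sketch_arrangement arr,
                                     std::size_t row,
                                     std::size_t col,
                                     std::size_t ld) {
    return arr == sketch_arrangement::col_major ? row + col * ld : row * ld + col;
}

// Smallest leading dimension that can hold an m x n matrix without padding.
constexpr std::size_t min_leading_dimension(sketch_arrangement arr,
                                            std::size_t m,
                                            std::size_t n) {
    return arr == sketch_arrangement::col_major ? m : n;
}
```

A leading dimension larger than this minimum leaves padding between consecutive columns (or rows). This is why the source and destination of a copy can use different values for ld_src and ld_dst.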
Function Variants#
The copy_2d functions offer two variants:
Variant 1: Explicit thread count specification, for fine-grained control
Variant 2: Operation-based configuration, which uses the operation's thread count
Note
The copy is cooperative: all threads, as given either by Threads or by Solver::block_dim, participate in it. The function takes the alignment of the data type into account and attempts to vectorize the load and store instructions when possible.
Data in src and dst must reside in shared or global memory. Copying to or from registers is not supported.
The function requires appropriate synchronization after use (typically __syncthreads()).
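To make the cooperative pattern concrete, here is a minimal host-side C++ sketch of how the work of such a copy could be divided among threads for a column-major matrix. It is not the cuSolverDx implementation: it ignores vectorization and batching, and the names copy_2d_thread_share and copy_2d_emulated are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One emulated thread's share of a cooperative copy of an m x n column-major
// matrix. Thread `tid` out of `threads` handles logical elements
// tid, tid + threads, tid + 2 * threads, ... of the m * n elements.
// Hypothetical helper for illustration; not part of cuSolverDx.
template<class T>
void copy_2d_thread_share(int tid, int threads, int m, int n,
                          const T* src, int ld_src, T* dst, int ld_dst) {
    for (int idx = tid; idx < m * n; idx += threads) {
        const int row = idx % m;  // position within the column
        const int col = idx / m;  // column index
        dst[row + col * ld_dst] = src[row + col * ld_src];
    }
}

// Runs every emulated thread in turn. On a GPU the threads run concurrently,
// so a barrier such as __syncthreads() is needed before any thread reads
// elements of the destination that another thread wrote.
template<class T>
void copy_2d_emulated(int threads, int m, int n,
                      const T* src, int ld_src, T* dst, int ld_dst) {
    assert(ld_src >= m && ld_dst >= m);  // column-major: ld must cover all rows
    for (int tid = 0; tid < threads; ++tid) {
        copy_2d_thread_share(tid, threads, m, n, src, ld_src, dst, ld_dst);
    }
}
```

Each thread writes only its own subset of the destination, so no single thread can assume the whole matrix has arrived until all participants have passed a barrier.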
Usage Examples#
Basic Usage#
// Example 1: Using operation-based configuration of copy_2d functions
// copy from the global memory to shared memory
cusolverdx::copy_2d<Solver, M, N, cusolverdx::arrangement_of_v_a<Solver>, Solver::BatchesPerBlock>(Ag, lda_g, As, lda_s);
__syncthreads();
// run the solver functions
Solver::execute(As, lda_s, ...);
// copy from the shared memory to the global memory
__syncthreads();
cusolverdx::copy_2d<Solver, M, N, cusolverdx::arrangement_of_v_a<Solver>, Solver::BatchesPerBlock>(As, lda_s, Ag, lda_g);
Using Wrapper Functions#
// Example 2: Using a set of convenient wrapper functions of copy_2d provided in example/common/device_io.hpp
#include "example/common/device_io.hpp"
common::io<Solver, Solver::BatchesPerBlock>::load_a(Ag, lda_g, As, lda_s);
common::io<Solver, Solver::BatchesPerBlock>::load_b(Bg, ldb_g, Bs, ldb_s);
// run the solver functions
Solver::execute(As, lda_s, ...);
// Store results back to global memory
common::io<Solver, Solver::BatchesPerBlock>::store_a(As, lda_s, Ag, lda_g);
common::io<Solver, Solver::BatchesPerBlock>::store_b(Bs, ldb_s, Bg, ldb_g);