cuDSS Data Types
Opaque Data Structures
cudssHandle_t
The structure holds the cuDSS library context (device properties, system information, execution controls such as cudaStream_t, etc.). The handle must be initialized with cudssCreate() before any other cuDSS API is called. Once cuDSS is no longer needed, the handle must be destroyed with cudssDestroy() to free its resources.
cudssMatrix_t
The structure is a lightweight wrapper around standard dense/sparse matrix parameters and does not own any data arrays. Matrix objects are used to pass the matrix of the linear system, as well as the solution and the right-hand side (even if these are in fact vectors). Currently, a cuDSS matrix object can have one of two underlying matrix formats: dense or 3-array CSR (sparse).
Matrix objects should be created via cudssMatrixCreateDn() (for dense matrices), cudssMatrixCreateBatchDn() (for a batch of dense matrices), cudssMatrixCreateCsr() (for sparse matrices in CSR format) or cudssMatrixCreateBatchCsr() (for a batch of sparse matrices in CSR format). After use, matrix objects should be destroyed via cudssMatrixDestroy().
Matrix objects can be modified after creation via cudssMatrixSetValues() and cudssMatrixSetCsrPointers().
Information can be retrieved from a matrix object by calling cudssMatrixGetFormat() followed by either cudssMatrixGetDn() or cudssMatrixGetCsr(), depending on the format returned.
cudssData_t
The structure holds internal data (e.g., data structures related to the factors) as well as pointers to user-provided data. A single object of this type should be associated with solving a specific linear system. If multiple systems with the same datatype(!) are solved consecutively, the object can be re-used (all necessary internal buffers will be re-created as needed).
Note: by default, the library internally allocates the device memory required for performing the LU factorization and storing the LU factors. All data buffers of this kind are kept inside the data object. To change this default behavior, one can set a cudssDeviceMemHandler_t, which will then be used for allocating device memory inside the solver.
The object should be created via cudssDataCreate() and destroyed via cudssDataDestroy().
During execution of any stage of cudssExecute(), configuration settings of the solver are read from cudssConfig_t and thus affect the execution and the internal data stored in the cudssData_t object.
Data parameters can be updated or retrieved by calling cudssDataSet() or cudssDataGet() respectively.
cudssConfig_t
The structure stores configuration settings for the solver. This object is a lightweight (host-side) wrapper around common solver settings. While it can be re-used for solving different linear systems, it is recommended to have one per linear system.
The object should be created via cudssConfigCreate() and destroyed via cudssConfigDestroy().
During execution of any stage of cudssExecute(), configuration settings of the solver are read from cudssConfig_t and thus affect the execution.
Configuration settings can be updated or retrieved by calling cudssConfigSet() or cudssConfigGet() respectively. Note: certain settings need to be set before the corresponding solver stage is executed (e.g., the reordering algorithm must be set prior to the phase CUDSS_PHASE_ANALYSIS).
Non-opaque Data Structures
cudssDeviceMemHandler_t
This structure holds information about the user-provided, stream-ordered device memory pool (mempool).
The object can be created by setting the struct members described below.
Once created, a device memory handler can be set for the cuDSS library handle via cudssSetDeviceMemhandler().
Once set for the cuDSS library handle, information about the device memory handler can be retrieved via cudssGetDeviceMemhandler().
Members:
void *ctx
A pointer to the user-owned mempool/context object.
int (*device_alloc)(void *ctx, void **ptr, size_t size, cudaStream_t stream)
A function pointer to the user-provided routine for allocating size bytes of device memory on stream.
The allocated memory should be made accessible to the current device (or more precisely, to the current CUDA context bound to the library handle).
This interface supports any stream-ordered memory allocator ctx. Upon success, the allocated memory can be used immediately on the given stream by any operations enqueued/ordered on the same stream after this call.
It is the caller's responsibility to ensure a proper stream order is established.
The allocated memory should be at least 256-byte aligned.
Parameters | In/Out | Description |
---|---|---|
ctx | In | A pointer to the user-owned mempool object. |
ptr | Out | On success, a pointer to the allocated buffer. |
size | In | The amount of memory in bytes to be allocated. |
stream | In | The CUDA stream on which the memory is allocated (and the stream order is established). |
Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.
int (*device_free)(void *ctx, void *ptr, size_t size, cudaStream_t stream)
A function pointer to the user-provided routine for deallocating size bytes of device memory on stream.
This interface supports any stream-ordered memory allocator. Upon success, any subsequent access to the memory pointed to by ptr that is ordered after this call is undefined behavior.
It is the caller's responsibility to ensure a proper stream order is established.
If the arguments ctx and size are not the same as those passed to device_alloc to allocate the memory pointed to by ptr, the behavior is undefined.
The argument stream need not be identical to the one used for allocating ptr, as long as the stream order is correctly established. The behavior is undefined if this assumption does not hold.
Parameters | In/Out | Description |
---|---|---|
ctx | In | A pointer to the user-owned mempool object. |
ptr | In | The pointer to the allocated buffer. |
size | In | The size of the allocated memory. |
stream | In | The CUDA stream on which the memory is deallocated (and the stream order is established). |
Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.
char name[CUDSS_ALLOCATOR_NAME_LEN]
The name of the provided mempool (must not exceed 64 characters).
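The callback contract described above (return 0 on success and nonzero otherwise, hand back memory that is at least 256-byte aligned, and free with the same ctx and size) can be sketched with a host-only mock. This is an illustration rather than a working device allocator: mock_stream_t is a stand-in for cudaStream_t, mock_pool_t is a hypothetical user context, and aligned_alloc() stands in for a real stream-ordered device allocator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for cudaStream_t so the sketch compiles without CUDA (assumption). */
typedef void *mock_stream_t;

/* Hypothetical user-owned pool context: only tracks outstanding bytes. */
typedef struct {
    size_t bytes_in_use;
} mock_pool_t;

/* Same shape as device_alloc: returns 0 on success, nonzero otherwise. */
static int mock_device_alloc(void *ctx, void **ptr, size_t size, mock_stream_t stream)
{
    mock_pool_t *pool = (mock_pool_t *)ctx;
    (void)stream; /* a real allocator would order the allocation on this stream */
    if (pool == NULL || ptr == NULL || size == 0) return 1;

    /* aligned_alloc requires the size to be a multiple of the alignment. */
    size_t rounded = (size + 255u) / 256u * 256u;
    void *p = aligned_alloc(256, rounded);
    if (p == NULL) return 2;

    pool->bytes_in_use += size;
    *ptr = p;
    return 0;
}

/* Same shape as device_free: ctx and size must match the allocation call. */
static int mock_device_free(void *ctx, void *ptr, size_t size, mock_stream_t stream)
{
    mock_pool_t *pool = (mock_pool_t *)ctx;
    (void)stream; /* a real allocator would order the release on this stream */
    if (pool == NULL || ptr == NULL || size > pool->bytes_in_use) return 1;

    free(ptr);
    pool->bytes_in_use -= size;
    return 0;
}
```

In a real handler the two function pointers would be stored in the device_alloc and device_free members, with ctx pointing to the pool object.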
Enumerators
cudssStatus_t
The enumerator specifies possible status values (on the host) which can be returned from calls to cuDSS routines.
Note: device-side failures are reported via CUDSS_DATA_INFO from cudssDataParam_t.
Value |
Description |
---|---|
|
The operation completed successfully. |
|
One of the input operands was not properly initialized prior to the call to a cuDSS routine. This can usually be one of the opaque objects like cudssHandle_t, cudssData_t or others. |
|
Resource allocation failed inside the cuDSS library. This is usually caused by a failed device memory allocation (cudaMalloc()) or a failed host memory allocation. |
|
An incorrect value or parameter was passed to the function (for example, a negative vector size or a NULL pointer for a must-have buffer). |
|
An unsupported (but otherwise reasonable) parameter was passed to the function. |
|
The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can have multiple causes. |
|
An internal cuDSS operation failed. |
cudssConfigParam_t
The enumerator specifies possible names of solver configuration settings. For each setting there is a matching type to be used in cudssConfigSet() or cudssConfigGet().
Value |
Description |
---|---|
|
Algorithm for the reordering phase Associated parameter type: cudssAlgType_t Note: Note: Default value: |
|
Algorithm for the factorization phase Associated parameter type: cudssAlgType_t Default value: |
|
Algorithm for the solving phase Associated parameter type: cudssAlgType_t Default value: |
|
Algorithm for the pivot epsilon calculation Associated parameter type: cudssAlgType_t
Note: Default value: |
|
Type of matching (on/off) Associated parameter type: Default value: |
|
Potential modifier on the system matrix (e.g., transpose or conjugate transpose) Associated parameter type: Default value: |
|
Number of steps during the iterative refinement Associated parameter type: Default value: |
|
Iterative refinement tolerance Associated parameter type: Currently it is ignored (exactly |
|
Type of pivoting The exact meaning of this parameter depends on the choice of the reordering algorithm
which can be changed through Note that in the latter case, if the matrix type is symmetric but
not positive-definite, then the pivoting is also symmetric and the pivot is searched for
on the diagonal of the block, if only the pivot type is not equal to Associated parameter type: cudssPivotType_t Default value: |
|
Pivoting threshold \(p_{threshold}\), which is used to determine whether a diagonal element is subject to pivoting and will be swapped with the maximum element in the row (or column), depending on the type of pivoting. The diagonal element will be swapped if: \(p_{threshold} \cdot \max_{(sub)row \, or \, col} |a_{ij}| \geq |a_{ii}|\) Associated parameter type: Default value: Currently it is only supported for |
|
Pivoting epsilon, absolute value to replace singular diagonal elements Associated parameter type: Default value: |
|
Upper limit on the number of nonzero entries in LU factors. This is only relevant for
non-symmetric matrices and reordering algorithm set to Associated parameter type: Default value: |
|
Memory mode: Note: Hybrid memory mode should be enabled before the analysis phase
(cudssExecute() with Note: Unlike the hybrid execution mode (see CUDSS_CONFIG_HYBRID_EXECUTE_MODE) which controls where compute kernels are executed, the hybrid memory mode (‘CUDSS_CONFIG_HYBRID_MODE’) only allows cuDSS to keep part of the factor values (internal data) on the host and always uses GPU for factorization and solve. For more details regarding the hybrid memory mode, see Hybrid mode feature. Associated parameter type: Default value: |
|
User-defined device memory limit (number of bytes) for the hybrid memory mode. This setting only affects execution when the hybrid memory mode is enabled. For more details regarding the hybrid memory mode, see Hybrid mode feature. Associated parameter type: Default value: |
|
A flag to enable or disable usage of cudaHostRegister() by cuDSS hybrid memory mode. Since the hybrid memory mode of cuDSS uses host memory to store the factors, it can use
cudaHostRegister() (if the HW supports it) to speed up associated host-to-device and
device-to-host memory transfers. However, registering host memory has limitations and
in some cases might lead to slowdowns.
If the flag is not set to This setting only affects execution when the hybrid memory mode is enabled. For more details regarding the hybrid memory mode, see Hybrid mode feature. Associated parameter type: Default value: |
|
Number of threads to be used by cuDSS in MT mode. This setting only affects execution when the multi-threaded mode is enabled. Associated parameter type: Default value: |
|
Execute mode: Hybrid execute mode allows cuDSS to perform calculations on both GPU and CPU. Currently it is used to speed up execution parts with low parallelization capacity. Note: Reordering part of the analysis step is performed on CPU regardless of execute mode value. Note: Hybrid execute mode should be enabled before the analysis phase
(cudssExecute() with Note: Unlike the hybrid execution mode ( Note: If hybrid execute mode is enabled, the input matrix, right-hand side and solution can be host memory pointers. For more details regarding the hybrid execute mode, see Hybrid execute mode feature. Note: Host memory data is supported for Associated parameter type: Default value: Currently not supported when |
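The pivoting threshold condition in the table above can be checked with a few lines of host code. The following is a minimal sketch of the inequality only, not of how cuDSS evaluates it internally; the function name needs_pivot and the way the candidate entries are passed are assumptions made for illustration.

```c
#include <stdbool.h>

/* Returns true when the diagonal entry a_ii should be swapped, i.e. when
 * p_threshold * max|a_ij| >= |a_ii| over the candidate (sub)row or column. */
static bool needs_pivot(double p_threshold, const double *candidates, int n, double a_ii)
{
    double max_abs = 0.0;
    for (int k = 0; k < n; ++k) {
        double v = candidates[k] < 0.0 ? -candidates[k] : candidates[k];
        if (v > max_abs) max_abs = v;
    }
    double diag_abs = a_ii < 0.0 ? -a_ii : a_ii;
    return p_threshold * max_abs >= diag_abs;
}
```

With candidates {0.5, -8.0, 2.0} and a diagonal entry of 0.5, a threshold of 0.1 triggers a swap (0.8 >= 0.5), while a threshold of 0.01 does not (0.08 < 0.5).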
cudssDataParam_t
The enumerator specifies possible parameter names which can be set or retrieved in the cudssData_t object. For each parameter name there is an associated type to be used in cudssDataSet() or cudssDataGet(). Each parameter name is marked with “in”, “out” or “inout” depending on whether the parameter can only be set, can only be retrieved, or both.
Value |
Description |
---|---|
|
Device-side error code. One of the noticeable use cases is when a matrix of the system is passed with an Direction: out Memory: host Associated parameter type: int |
|
Number of non-zero entries in LU factors. Direction: out Memory: host Associated parameter type: |
|
Number of pivots encountered during factorization. Direction: out Memory: host Associated parameter type: same as for the indices of the sparse matrix of the system |
|
Positive and negative indices of inertia for the system matrix Direction: out Memory: host Associated parameter type: same as for the indices of the sparse matrix of the system |
|
Row permutation Direction: out Memory: host or device Associated parameter type: same as for the indices of the sparse matrix of the system |
|
Column permutation Direction: out Memory: host or device Associated parameter type: same as for the indices of the sparse matrix of the system |
|
Final row permutation P (includes effects of both reordering and pivoting) which is applied to the original right-hand side of the system in the form \(b_{new} = b_{old} \circ P\) Direction: out Memory: host or device Associated parameter type: same as for the indices of the sparse matrix of the system Currently supported only when |
|
Final column permutation Q (includes effects of both reordering and pivoting) which is applied to transform the solution of the permuted system into the original solution \(x_{old} = x_{new} \circ Q^{-1}\) Direction: out Memory: host or device Associated parameter type: same as for the indices of the sparse matrix of the system Currently supported only when |
|
Diagonal of the factorized matrix Direction: out Memory: host or device Associated parameter type: same as for the values of the sparse matrix of the system Currently supported only when |
|
User permutation to be used instead of running the reordering algorithms. Direction: in Memory: host or device Associated parameter type: same as for the indices of the sparse matrix of the system Currently not supported when |
|
Memory estimates (in bytes) for host and device memory required for the chosen memory mode. The chosen memory mode is defined as the memory mode detected during the last executed
analysis phase for the current Note: the returned memory estimate depends not just on the settings from
Values returned in the output array at position: 0 - permanent device memory 1 - peak device memory 2 - permanent host memory 3 - peak host memory 4 - (if in hybrid memory mode) minimum device memory for the hybrid memory mode 5 - (if in hybrid memory mode) maximum host memory for the hybrid memory mode 6, … 15 - reserved for future use This query must be done after the analysis phase and will return status
Direction: out Memory: host Associated parameter type: |
|
Minimal amount of device memory (number of bytes) required in the hybrid memory mode. This query must be done after the analysis phase and will return status
Direction: out Memory: host Associated parameter type: For more details regarding the hybrid memory mode, see Hybrid mode feature. |
|
Communicator for MGMN mode. The actual type of the communicator must match the communication layer, which must be set for the cuDSS library handle via cudssSetCommLayer(). Direction: in Memory: host Associated parameter type: For more details regarding the MGMN mode, see MGMN mode. |
Note: In case of batchCount > 1 (see cudssMatrixCreateBatchDn() and cudssMatrixCreateBatchCsr()), for CUDSS_DATA_LU_NNZ, CUDSS_DATA_INERTIA and CUDSS_DATA_DIAG, cudssDataGet() returns the accumulated number over all matrices in the batch.
Note: In case of batchCount > 1 (see cudssMatrixCreateBatchDn() and cudssMatrixCreateBatchCsr()), CUDSS_DATA_USER_PERM, CUDSS_DATA_PERM_COL, CUDSS_DATA_PERM_ROW, CUDSS_DATA_PERM_REORDER_COL and CUDSS_DATA_PERM_REORDER_ROW are not supported.
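The position-based layout of the memory estimates array in the table above is easy to misread, so a small decoding helper may be useful. The sketch below assumes 64-bit integer elements (the element type is not stated in the text above) and uses hypothetical names (mem_estimates_t, decode_mem_estimates); only the meaning of positions 0 to 5 is taken from the table.

```c
#include <stdint.h>

enum { MEM_ESTIMATES_LEN = 16 }; /* positions 0..15; 6..15 are reserved */

/* Named view of the estimates (all values in bytes). */
typedef struct {
    int64_t device_permanent;  /* position 0 */
    int64_t device_peak;       /* position 1 */
    int64_t host_permanent;    /* position 2 */
    int64_t host_peak;         /* position 3 */
    int64_t hybrid_min_device; /* position 4, hybrid memory mode only */
    int64_t hybrid_max_host;   /* position 5, hybrid memory mode only */
} mem_estimates_t;

/* Copies the documented positions of the raw output array into named fields. */
static mem_estimates_t decode_mem_estimates(const int64_t raw[MEM_ESTIMATES_LEN])
{
    mem_estimates_t e;
    e.device_permanent  = raw[0];
    e.device_peak       = raw[1];
    e.host_permanent    = raw[2];
    e.host_peak         = raw[3];
    e.hybrid_min_device = raw[4];
    e.hybrid_max_host   = raw[5];
    return e;
}
```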
cudssPhase_t
The enumerator specifies the solver phase to be performed in the main cuDSS routine cudssExecute().
Value |
Description |
---|---|
|
Reordering and symbolic factorization |
|
Numerical factorization |
|
Numerical re-factorization. Note: For now it is only used if reordering algorithm is set to |
|
Full solving phase (forward substitution + diagonal solve + backward substitution) and (optional) iterative refinement. Note: If a new sparse matrix is given as input for this phase, it is used for computing the residual (and thus the solver can be part of an LU-based preconditioner). |
|
Forward substitution sub-step of the solving phase. Currently not supported. |
|
Diagonal solve sub-step of the solving phase. Currently not supported. |
|
Backward substitution sub-step of the solving phase. Currently not supported. |
Note: in the future, it might become possible to combine different phases, e.g., to call cudssExecute() with CUDSS_PHASE_FACTORIZATION | CUDSS_PHASE_SOLVE and benefit from extra optimization. Currently, this usage mode is not supported.
cudssMatrixFormat_t
The enumerator specifies the underlying matrix formats inside a cuDSS matrix object.
Value |
Description |
---|---|
|
Matrix is dense (applies to a single matrix and a batch equally) |
|
Matrix is in CSR format (applies to a single matrix and a batch equally) Note: Only 3-array CSR is supported. |
|
Matrix object represents a batch of matrices |
Note: The format flags can be combined. E.g., creating a cudssMatrix_t via cudssMatrixCreateBatchCsr() would set both the CUDSS_MFORMAT_CSR and CUDSS_MFORMAT_BATCH flags. One can check for a mixture of flags via bit-wise operations, e.g., CUDSS_MFORMAT_CSR | CUDSS_MFORMAT_BATCH for a batch of CSR matrices.
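The bit-wise check mentioned in the note can be sketched as follows. The enum below is a stand-in with illustrative names and values, chosen only so that each flag occupies its own bit; real code would use the constants from the cuDSS header and the format returned by cudssMatrixGetFormat().

```c
/* Stand-in flags for illustration; names and numeric values are assumptions. */
enum demo_mformat {
    DEMO_MFORMAT_DENSE = 1,
    DEMO_MFORMAT_CSR   = 2,
    DEMO_MFORMAT_BATCH = 4
};

/* Returns 1 if every flag in mask is set in format, 0 otherwise. */
static int has_flags(int format, int mask)
{
    return (format & mask) == mask;
}
```

Comparing against the mask (rather than testing for a nonzero result) matters here: `format & mask` alone would also be nonzero for a single CSR matrix that is not a batch.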
cudssMatrixType_t
The enumerator specifies available matrix types for sparse matrices. Matrix type should be used to describe the properties of the underlying matrix storage. Matrix type affects the decision about what type of factorization is computed by the solver. E.g., when the matrix type is one of the positive-definite types, checks for singular values on the diagonal are not performed.
Value |
Description |
---|---|
|
General matrix [default] |
|
Real symmetric matrix. |
|
Complex Hermitian matrix. |
|
Symmetric positive-definite matrix. Cholesky factorization will be computed with optional local pivoting. Note: if the matrix passed with this matrix type appears to have zero minors (at least numerically), one can get the 1-based index of the first encountered non-positive minor by calling cudssDataGet() with |
|
Hermitian positive-definite matrix. Complex Cholesky factorization will be computed with optional local pivoting. Note: if the matrix passed with this matrix type appears to have zero minors (at least numerically), one can get the 1-based index of the first non-positive minor by calling cudssDataGet() with |
cudssMatrixViewType_t
The enumerator specifies available matrix view types for sparse matrices. Matrix view defines how the matrix is treated by the main cuDSS routine cudssExecute(). E.g., to provide only upper-triangle data for a symmetric matrix, one can use CUDSS_MTYPE_SYMMETRIC as the matrix type combined with CUDSS_MVIEW_UPPER as the matrix view. If the accompanying matrix type is CUDSS_MTYPE_GENERAL, the matrix view is ignored.
Value |
Description |
---|---|
|
Full matrix [default] |
|
Lower-triangular matrix (including the diagonal). All values above the main diagonal will be ignored. |
|
Upper-triangular matrix (including the diagonal). All values below the main diagonal will be ignored. |
cudssIndexBase_t
The enumerator specifies the indexing base (0 or 1) for sparse matrix indices (row start/end offsets and column indices). Once set for a sparse matrix, cudssExecute() will use the indexing base from the input sparse matrix for all index-related data (e.g., output from cudssDataGet() called with CUDSS_DATA_PERM_REORDER).
Value |
Description |
---|---|
|
Zero-based indexing [default] |
|
One-based indexing |
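To make the two indexing bases concrete, the sketch below converts a one-based 3-array CSR structure to zero-based in place. It assumes int indices and a row offsets array of length nrows + 1; the function name is hypothetical, and cuDSS itself does not require such a conversion since either base can be declared for the matrix.

```c
/* Shifts CSR row offsets and column indices from one-based to zero-based. */
static void csr_one_to_zero_based(int *row_offsets, int *col_indices, int nrows)
{
    /* With one-based offsets, the last offset equals nnz + 1. */
    int nnz = row_offsets[nrows] - 1;

    for (int i = 0; i <= nrows; ++i) {
        row_offsets[i] -= 1;
    }
    for (int k = 0; k < nnz; ++k) {
        col_indices[k] -= 1;
    }
}
```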
cudssLayout_t
The enumerator specifies the dense matrix layout.
Value |
Description |
---|---|
|
Column-major layout [default] |
|
Row-major layout. Currently not supported. |
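For reference, column-major layout means that consecutive elements of a column are adjacent in memory. A minimal sketch of the index computation follows; the parameter name ld (leading dimension, at least the number of rows) and the helper name are assumptions for illustration.

```c
#include <stddef.h>

/* Linear offset of element (row, col) in a column-major array with leading
 * dimension ld, where ld >= number of rows. */
static size_t colmajor_index(size_t row, size_t col, size_t ld)
{
    return row + col * ld;
}
```

For a 3-by-2 matrix stored as {1, 2, 3, 4, 5, 6} with ld = 3, the first column is {1, 2, 3} and the second column is {4, 5, 6}.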
cudssAlgType_t
The enumerator specifies algorithm choices to be made for the solver.
Value |
Description |
---|---|
|
Default value [default] For reordering, this option is a customized nested dissection algorithm based on METIS. For factorization, this option chooses the best fitting factorization algorithm based on the choice of the reordering algorithm and sparsity structure produced by it. |
|
First algorithm For reordering, this option is a custom combination of block triangular reordering and
COLAMD algorithms which can be used together with global pivoting to increase solution
accuracy for non-symmetric matrices where |
|
Second algorithm For reordering, this option is similar to |
|
Third algorithm For reordering, this option is approximate minimum degree (AMD) reordering. |
Different values represent different algorithms (for reordering, factorization, etc.) and can lead to significant differences in accuracy and performance. It is currently recommended to use CUDSS_ALG_DEFAULT and to experiment with other values only if accuracy or performance is not sufficient.
cudssPivotType_t
The enumerator specifies the type of pivoting to be performed.
Value |
Description |
---|---|
|
Column-based pivoting [default] |
|
Row-based pivoting |
|
No pivoting |
Communication Layer (Distributed Interface) Types
cudssDistributedInterface_t
This struct defines all communication primitives which need to be implemented in any implementation of the cuDSS communication layer.
Note: all communication layer API functions below take an argument of type void * for comm. This parameter should be interpreted in the implementation based on the underlying communication backend to be used with the particular communication layer. E.g., for OpenMPI, comm should be treated as the OpenMPI communicator.
Note: most of the APIs below take an argument called stream of type cudaStream_t and must be stream-ordered. For GPU-aware communication backends like OpenMPI, this implies the need to do an explicit cudaStreamSynchronize() in the communication layer implementation.
Members:
int (*cudssCommRank)(void *comm, int *rank)
A function pointer to a routine which returns the rank of the process in a communicator.
Parameters | In/Out | Description |
---|---|---|
comm | In | A pointer to the communicator. |
rank | Out | Rank of the calling process in the communicator. |
Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.
int (*cudssCommSize)(void *comm, int *size)
A function pointer to a routine which returns the number of processes in a communicator.
Parameters | In/Out | Description |
---|---|---|
comm | In | A pointer to the communicator. |
size | Out | Number of processes in the communicator. |
Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.
int (*cudssSend)(const void *buffer, int count, cudaDataType_t datatype, int dest, int tag, void *comm, cudaStream_t stream)
A function pointer to a routine which performs a blocking send.
Parameters | In/Out | Description |
---|---|---|
buffer | In | Initial address of the send device buffer. |
count | In | Number of elements (of type datatype) to be sent. |
datatype | In | CUDA datatype of elements to be sent. |
dest | In | Rank of the receiving process (destination). |
tag | In | Message tag. |
comm | In | A pointer to the communicator. |
stream | In | CUDA stream. |
Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.
int (*cudssRecv)(void *buffer, int count, cudaDataType_t datatype, int source, int tag, void *comm, cudaStream_t stream)
A function pointer to a routine which performs a blocking receive for a message.
Parameters | In/Out | Description |
---|---|---|
buffer | Out | Initial address of the receive device buffer. |
count | In | Number of elements (of type datatype) to be received. |
datatype | In | CUDA datatype of elements to be received. |
source | In | Rank of the sending process (source). |
tag | In | Message tag. |
comm | In | A pointer to the communicator. |
stream | In | CUDA stream. |
Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.
int (*cudssBcast)(void *buffer, int count, cudaDataType_t datatype, int root, void *comm, cudaStream_t stream)
A function pointer to a routine which performs a broadcast of a message from the root process to all other processes of the communicator.
Parameters | In/Out | Description |
---|---|---|
buffer | In/Out | Address of the device buffer to be broadcast. |
count | In | Number of elements (of type datatype) to be broadcast. |
datatype | In | CUDA datatype of elements to be broadcast. |
root | In | Rank of the sending process (source). |
comm | In | A pointer to the communicator. |
stream | In | CUDA stream. |
Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.
int (*cudssReduce)(const void *sendbuf, void *recvbuf, int count, cudaDataType_t datatype, cudssOpType_t op, int root, void *comm, cudaStream_t stream)
A function pointer to a routine which performs a reduction of values on all processes to a single value on the root process.
Parameters | In/Out | Description |
---|---|---|
sendbuf | In | Address of the send buffer. |
recvbuf | Out | Address of the receive buffer. |
count | In | Number of elements (of type datatype) to be received. |
datatype | In | CUDA datatype of elements to be received. |
op | In | Type of the reduction operation to be performed, see cudssOpType_t for supported values. |
root | In | Rank of the root process (which receives the reduced result). |
comm | In | A pointer to the communicator. |
stream | In | CUDA stream. |
Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.
int (*cudssAllreduce)(const void *sendbuf, void *recvbuf, int count, cudaDataType_t datatype, cudssOpType_t op, void *comm, cudaStream_t stream)
A function pointer to a routine which performs a reduction of values on all processes to a single value and distributes the result back to all processes.
Parameters | In/Out | Description |
---|---|---|
sendbuf | In | Address of the send buffer. |
recvbuf | Out | Address of the receive buffer. |
count | In | Number of elements (of type datatype) to be received. |
datatype | In | CUDA datatype of elements to be received. |
op | In | Type of the reduction operation to be performed, see cudssOpType_t for supported values. |
comm | In | A pointer to the communicator. |
stream | In | CUDA stream. |
Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.
int (*cudssScatterv)(const void *sendbuf, const int *sendcounts, const int *displs, cudaDataType_t sendtype, void *recvbuf, int recvcount, cudaDataType_t recvtype, int root, void *comm, cudaStream_t stream)
A function pointer to a routine which performs a scatter operation on a buffer in parts to all processes in a communicator.
Parameters | In/Out | Description |
---|---|---|
sendbuf | In | Address of the send buffer. |
sendcounts | In | Non-negative integer array (of length communicator size) specifying the number of elements to send to each rank. |
displs | In | An array of integers of length communicator size. Entry i specifies the displacement (relative to sendbuf) from which to take the outgoing data to process i. |
sendtype | In | CUDA datatype of elements to be sent. |
recvbuf | Out | Address of the receive buffer. |
recvcount | In | Number of elements in receive buffer (non-negative integer). |
recvtype | In | CUDA datatype of elements to be received. |
root | In | Rank of the sending process (source). |
comm | In | A pointer to the communicator. |
stream | In | CUDA stream. |
Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.
int (*cudssCommSplit)(const void *comm, int color, int key, void *new_comm)
A function pointer to a routine which creates new communicators based on colors and keys.
Parameters | In/Out | Description |
---|---|---|
comm | In | A pointer to the communicator to be split. |
color | In | Control of the subset assignment. Processes with the same color are grouped together. |
key | In | Control of the rank assignment. Processes in the new communicator are ordered based on the keys. |
new_comm | Out | A pointer to the new communicator defined with respect to colors. |
Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.
int (*cudssCommFree)(void *comm)
A function pointer to a routine which deallocates resources of a communicator.
Parameters | In/Out | Description |
---|---|---|
comm | In/Out | A pointer to the communicator to be freed. |
Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.
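The roles of sendcounts and displs in the cudssScatterv primitive can be illustrated with a host-only reference that computes what a single rank would receive. This sketch makes several assumptions that are not stated above: displacements are counted in elements (as in MPI), send and receive element types are identical, and the buffers live in host memory. The function name and the elem_size parameter are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Copies into recvbuf the slice of sendbuf destined for `rank`.
 * Returns 0 on success and a nonzero value otherwise. */
static int scatterv_slice(const void *sendbuf, const int *sendcounts, const int *displs,
                          size_t elem_size, int rank, void *recvbuf, int recvcount)
{
    if (sendbuf == NULL || sendcounts == NULL || displs == NULL || recvbuf == NULL) {
        return 1;
    }
    /* The receive buffer must be large enough for this rank's slice. */
    if (sendcounts[rank] < 0 || sendcounts[rank] > recvcount) {
        return 2;
    }
    const char *src = (const char *)sendbuf + (size_t)displs[rank] * elem_size;
    memcpy(recvbuf, src, (size_t)sendcounts[rank] * elem_size);
    return 0;
}
```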
cudssOpType_t
The enumerator specifies the reduction operation to be used when calling the communication layer APIs cudssReduce() or cudssAllreduce().
Value |
Description |
---|---|
|
Reduced elements are added together. |
|
Maximum element is found among the reduced elements. |
|
Minimum element is found among the reduced elements. |
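The semantics of the three reduction operations can be sketched with a host-side reference over an array of doubles. The enum below is a stand-in with illustrative names, not the cuDSS constants, and a real communication layer would map the operation to its backend's reduction type instead of looping on the host.

```c
#include <stddef.h>

/* Stand-in for cudssOpType_t; names and values here are illustrative only. */
typedef enum { DEMO_OP_SUM, DEMO_OP_MAX, DEMO_OP_MIN } demo_op_t;

/* Reduces n doubles into *out. Returns 0 on success, nonzero otherwise. */
static int demo_reduce(demo_op_t op, const double *values, size_t n, double *out)
{
    if (values == NULL || out == NULL || n == 0) return 1;

    double acc = values[0];
    for (size_t i = 1; i < n; ++i) {
        switch (op) {
        case DEMO_OP_SUM: acc += values[i]; break;
        case DEMO_OP_MAX: if (values[i] > acc) acc = values[i]; break;
        case DEMO_OP_MIN: if (values[i] < acc) acc = values[i]; break;
        default: return 2; /* unknown operation */
        }
    }
    *out = acc;
    return 0;
}
```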
Threading Layer Types
cudssThreadingInterface_t
This struct defines all threading primitives which need to be implemented in any implementation of the cuDSS threading layer; for more details, see threading layer and MT mode.
Members:
int (*cudssGetMaxThreads)()
A function pointer to a routine which returns (as int) the maximum number of threads on the CPU that can be used by cuDSS for parallel execution.
void (*cudssParallelFor)(int nthr_requested, int ntasks, void *ctx, cudss_thr_func_t f)
A function pointer to a routine which opens a parallel for section with the requested number of threads and calls the cudss_thr_func_t function f (see cudss_threading_interface.h for details) in a for loop with ntasks iterations.
Parameters | In/Out | Description |
---|---|---|
nthr_requested | In | Requested number of threads for the parallel section. |
ntasks | In | Number of tasks in the parallel for loop. |
ctx | In | A pointer to the input data for the cudss_thr_func_t function f. |
f | In | A pointer to the cudss_thr_func_t function to be called in the parallel for loop. |
The threading interface struct uses the following typedef declaration of a task function which defines parallel units of work, also called tasks:
typedef void (*cudss_thr_func_t)(int task, void *ctx)
Parameters | In/Out | Description |
---|---|---|
task | In | The task number (or iteration count of the parallel loop). |
ctx | In | A pointer to the input data. |
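The shape of the two threading primitives can be illustrated with a trivial sequential backend that runs every task on the calling thread. This is a sketch of the calling convention only, not a useful threading layer: the names seq_get_max_threads, seq_parallel_for and add_task_id are hypothetical, and a real implementation would distribute the loop over threads (in which case task functions must be safe to run concurrently).

```c
/* Task function type as declared in the text above. */
typedef void (*cudss_thr_func_t)(int task, void *ctx);

/* Single-thread backend: always reports one available thread. */
static int seq_get_max_threads(void)
{
    return 1;
}

/* Runs all tasks in order on the calling thread; nthr_requested is ignored. */
static void seq_parallel_for(int nthr_requested, int ntasks, void *ctx, cudss_thr_func_t f)
{
    (void)nthr_requested;
    for (int task = 0; task < ntasks; ++task) {
        f(task, ctx);
    }
}

/* Example task: adds its task number to an int accumulator passed via ctx. */
static void add_task_id(int task, void *ctx)
{
    *(int *)ctx += task;
}
```

Calling seq_parallel_for(4, 5, &sum, add_task_id) with sum initialized to 0 invokes the task for indices 0 through 4 and leaves 10 in sum.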