cuDSS Data Types

Opaque Data Structures

cudssHandle_t

The structure holds the cuDSS library context (device properties, system information, execution controls like cudaStream_t, etc.).
The handle must be initialized prior to calling any other cuDSS API with cudssCreate(). The handle must be destroyed to free up resources after using cuDSS with cudssDestroy().

cudssMatrix_t

The structure is a lightweight wrapper around standard dense/sparse matrix parameters and does not own any data arrays. Matrix objects are used to pass matrix of the linear system, as well as solution and right-hand side (even if these are in fact vectors). Currently, cuDSS matrix objects can have either one of the two underlying matrix formats: dense and 3-array CSR (sparse).
Matrix objects should be created via cudssMatrixCreateDn() (for dense matrices) or cudssMatrixCreateCsr() (for sparse matrices in CSR format). After use, matrix objects should be destroyed via cudssMatrixDestroy().
Matrix objects can be modified after creation via cudssMatrixSetValues() and cudssMatrixSetCsrPointers().
Information can be retrieved from a matrix object by calling cudssMatrixGetFormat() followed by either cudssMatrixGetDn() or cudssMatrixGetCsr() depending on the format returned.

cudssData_t

The structure holds internal data (e.g., factors related data structures) as well as pointers to user-provided data. A single object of this type should be associated with solving a specific linear system. If multiple systems with the same datatype(!) are solved consecutively the object can be re-used (all necessary internal buffers will be re-created per necessity).
Note: by default, the library allocates device memory required for performing LU factorization and storing the LU factors internally. All data buffers are of this kind are kept inside the data object. To change this default behavior, one can set a cudssDeviceMemHandler_t which will then be used for allocating device memory inside the solver.
The object should be created via cudssDataCreate() and destroyed via cudssDataDestroy().
During execution of any of the stages cudssExecute(), configuration settings of the solver are read from cudssConfig_t and thus affect the execution and internal data stored in the cudssData_t object.
Data parameters can be updated or retrieved by calling cudssDataSet() or cudssDataGet() respectively.

cudssConfig_t

The structure stores configuration settings for the solver. This object is a lightweight (host-side) wrapper around common solver settings. While it can be re-used for solving different linear systems, it is recommended to have one per linear system.
The object should be created via cudssConfigCreate() and destroyed via cudssConfigDestroy().
During execution of any of the stages cudssExecute(), configuration settings of the solver are read from cudssConfig_t and thus affect the execution.
Configuration settings can be updated or retrieved by calling cudssConfigSet() or cudssConfigGet() respectively. Note: certain settings need to be set before a corresponding solver stage is executed (e.g., reordering algorithm must be set prior to the phase CUDSS_PHASE_ANALYSIS).

Non-opaque Data Structures

cudssDeviceMemHandler_t

This structure holds information about the user-provided, stream-ordered device memory pool (mempool).
The object can be created by setting the struct members described below.
Once created, a device memory handler can be set for the cuDSS library handle via cudssSetDeviceMemhandler().
Once set for the cuDSS library handle, information about the set device memory handler can be retrieved via cudssGetDeviceMemhandler().

Members:

void *ctx
A pointer to the user-owned mempool/context object.
int (*device_alloc)(void *ctx, void **ptr, size_t size, cudaStream_t stream)
A function pointer to the user-provided routine for allocating device memory of size on stream.
The allocated memory should be made accessible to the current device (or more precisely, to the current CUDA context bound to the library handle).
This interface supports any stream-ordered memory allocator ctx. Upon success, the allocated memory can be immediately used on the given stream by any operations enqueued/ordered on the same stream after this call.
It is the caller’s responsibility to ensure a proper stream order is established.
The allocated memory should be at least 256-byte aligned.

Parameters

In/Out

Description

ctx

In

A pointer to the user-owned mempool object.

ptr

Out

On success, a pointer to the allocated buffer.

size

In

The amount of memory in bytes to be allocated.

stream

In

The CUDA stream on which the memory is allocated (and the stream order is established)

Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int (*device_free)(void *ctx, void *ptr, size_t size, cudaStream_t stream)
A function pointer to the user-provided routine for deallocating device memory of size on stream.
This interface supports any stream-ordered memory allocator. Upon success, any subsequent accesses (of the memory pointed to by the pointer ptr) ordered after this call are undefined behaviors.
It is the caller’s responsibility to ensure a proper stream order is established.
If the arguments ctx and size are not the same as those passed to device_alloc to allocate the memory pointed to by ptr, the behavior is undefined.
The argument stream need not be identical to the one used for allocating ptr, as long as the stream order is correctly established. The behavior is undefined if this assumption is not held.

Parameters

In/Out

Description

ctx

IN

A pointer to the user-owned mempool object.

ptr

IN

The pointer to the allocated buffer.

size

IN

The size of the allocated memory.

stream

IN

The CUDA stream on which the memory is deallocated (and the stream order is established).

Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

char name[CUDSS_ALLOCATOR_NAME_LEN]
The name of the provided mempool (must not exceed 64 characters).

Enumerators

cudssStatus_t

The enumerator specifies possible status values (on the host) which can be returned from calls to cuDSS routines.
Note: device side failures are returned via CUDSS_DATA_INFO from cudssDataParam_t.

Value

Description

CUDSS_STATUS_SUCCESS

The operation completed successfully.

CUDSS_STATUS_NOT_INITIALIZED

One of the input operands was not properly initialized prior to the call to a cuDSS routine. This can usually be one of the opaque objects like cudssHandle_t, cudssData_t or others.

CUDSS_STATUS_ALLOC_FAILED

Resource allocation failed inside the cuDSS library. This is usually caused by a device memory allocation (cudaMalloc()) or by a host memory allocation failure

CUDSS_STATUS_INVALID_VALUE

An incorrect value or parameter was passed to the function (a negative vector size, or a a NULL pointer for a must-have buffer,for example)

CUDSS_STATUS_NOT_SUPPORTED

An unsupported (but otherwise reasonable parameter was passed to the function

CUDSS_STATUS_EXECUTION_FAILED

The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons

CUDSS_STATUS_INTERNAL_ERROR

An internal cuDSS operation failed

cudssConfigParam_t

The enumerator specifies possible names of solver configuration settings. For each setting there is a matching type to be used in cudssConfigSet() or cudssConfigGet().

Value

Description

CUDSS_CONFIG_REORDERING_ALG

Algorithm for the reordering phase

Associated parameter type: cudssAlgType_t

Note: CUDSS_ALG_1 and CUDSS_ALG_2 are only supported for general (non-symmetric or non-hermitian) matrices.

Note: CUDSS_ALG_1 uses an upper bound on the number of non-zero entries in the factors factors. If this bound appears to be not sufficient during the factorization phase, a runtime device error is returned (which can be checked by synchronizing the stream and calling cudssDataGet() with CUDSS_DATA_INFO which will output the device error). In order to set a non-default upper bound, one should call cudssConfigSet() with CUDSS_CONFIG_MAX_LU_NNZ setting.

Default value: CUDSS_ALG_DEFAULT

CUDSS_CONFIG_FACTORIZATION_ALG

Algorithm for the factorization phase

Associated parameter type: cudssAlgType_t

Default value: CUDSS_ALG_DEFAULT

CUDSS_CONFIG_SOLVE_ALG

Algorithm for the solving phase

Associated parameter type: cudssAlgType_t

Default value: CUDSS_ALG_DEFAULT

CUDSS_CONFIG_MATCHING_TYPE

Type of matching (on/off)

Associated parameter type: int

Default value: 0. Other values are not supported.

CUDSS_CONFIG_SOLVE_MODE

Potential modificator on the system matrix (e.g. transpose or conjugate transpose)

Associated parameter type: int

Default value: 0 (no modificator). Other values are not supported.

CUDSS_CONFIG_IR_N_STEPS

Number of steps during the iterative refinement

Associated parameter type: int

Default value: 0

CUDSS_CONFIG_IR_N_TOL

Iterative refinement tolerance

Associated parameter type: double

Currently it is ignored (exactly CUDSS_CONFIG_IR_N_STEPS steps are made)

CUDSS_CONFIG_PIVOT_TYPE

Type of pivoting

Associated parameter type: cudssPivotType_t

Default value: CUDSS_PIVOT_COL.

CUDSS_CONFIG_PIVOT_THRESHOLD

Pivoting threshold \(p_{threshold}\) which is used to determine if diagonal element is subject to pivoting and will be swapped with the maximum element in the row (or column) depending on the type of pivoting.

The diagonal element will be swapped if: \(p_{threshold} \cdot max_{(sub)row \, or \, col} |a_{ij}| \geq |a_{ii}|\)

Associated parameter type: double

Default value: 1.0f.

CUDSS_CONFIG_PIVOT_EPSILON

Pivoting epsilon, absolute value to replace singular diagonal elements

Associated parameter type: double

Default value: 1e-5 for single precision and 1e-13 for double precision.

CUDSS_CONFIG_MAX_LU_NNZ

Upper limit on the number of nonzero entries in LU factors. This is only relevant for non-symmetric matrices and reordering algorithm set to CUDSS_ALG_1 or CUDSS_ALG_2. If the number of non-zero entries in L and U exceeds the set limit, a runtime error happen. See also the note for CUDSS_ALG_1 in the table entry for CUDSS_CONFIG_REORDERING_ALG.

Associated parameter type: int64_t

Default value: -1 (then the value is ignored).

CUDSS_CONFIG_HYBRID_MODE

Memory mode: 0 (default = device-only) or 1 (hybrid = host/device).

Note: Hybrid memory mode should be enabled before the analysis phase (cudssExecute() with CUDSS_PHASE_ANALYSIS). If the decision to use the hybrid mode is done after the analysis phase, the hybrid memory mode should be enabled and analysis phase must be re-done (which is sub-optimal).

For more details regarding the hybrid memory mode, see Hybrid mode feature.

Associated parameter type: int

Default value: 0 (disabled). Currently not supported when CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering, or, when CUDSS_ALG_1 is used for the factorization.

CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT

User-defined device memory limit (number of bytes) for the hybrid memory mode.

This setting only affects execution when the hybrid memory mode is enabled.

For more details regarding the hybrid memory mode, see Hybrid mode feature.

Associated parameter type: int64_t

Default value: -1 (use the internal default heuristic).

CUDSS_CONFIG_USE_CUDA_REGISTER_MEMORY

A flag to enable or disable usage of cudaHostRegister() by cuDSS hybrid memory mode.

Since the hybrid memory mode of cuDSS uses host memory to store the factors, it can use cudaHostRegister() (if the HW supports it) to speedup associated host-to-device and device-to-host memory transfers. However, registering host memory has limitations and in some cases might lead to slowdowns. If the flag is not set to 0, cuDSS will use cudaHostRegister() whenever the HW supports it. If the flag is set to 0, cuDSS will not attempt to use cudaHostRegister() even if the HW supports it.

This setting only affects execution when the hybrid memory mode is enabled.

For more details regarding the hybrid memory mode, see Hybrid mode feature.

Associated parameter type: int

Default value: 1 (use cudaHostRegister() if the device supports it)


cudssDataParam_t

The enumerator specifies possible parameter names which can set or get in the cudssData_t object. For each parameter name there is an associated type to be used in cudssDataSet() or cudssDataGet(). Each parameter name is marked with “in”, “out” or “inout” depending on whether a parameter can be only set, get or be involved in both.

Value

Description

CUDSS_DATA_INFO

Device-side error information.

Direction: out

Memory: host

Associated parameter type: int

CUDSS_DATA_LU_NNZ

Number of non-zero entries in LU factors.

Direction: out

Memory: host

Associated parameter type: int64_t

CUDSS_DATA_NPIVOTS

Number of pivots encountered during factorization.

Direction: out

Memory: host

Associated parameter type: same as for the indices of the sparse matrix of the system

CUDSS_DATA_INERTIA

Positive and negative indices of inertia for the system matrix A (two integer values). Valid only for symmetric/Hermitian non positive-definite matrix types.

Direction: out

Memory: host

Associated parameter type: same as for the indices of the sparse matrix of the system

CUDSS_DATA_PERM_REORDER_ROW

Row permutation P after reordering such that A[P,Q] is factorized.

Direction: out

Memory: host or device

Associated parameter type: same as for the indices of the sparse matrix of the system

CUDSS_DATA_PERM_REORDER_COL

Column permutation Q after reordering such that A[P,Q] is factorized.

Direction: out

Memory: host or device

Associated parameter type: same as for the indices of the sparse matrix of the system

CUDSS_DATA_PERM_ROW

Final row permutation P (includes effects of both reordering and pivoting) which is applied to the original right-hand side of the system in the form \(b_{new} = b_{old} \circ P\)

Direction: out

Memory: host or device

Associated parameter type: same as for the indices of the sparse matrix of the system

Currently supported only when CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.

CUDSS_DATA_PERM_COL

Final column permutation Q (includes effects of both reordering and pivoting) which is applied to transform the solution of the permuted system into the original solution \(x_{old} = x_{new} \circ Q^{-1}\)

Direction: out

Memory: host or device

Associated parameter type: same as for the indices of the sparse matrix of the system

Currently supported only when CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.

CUDSS_DATA_DIAG

Diagonal of the factorized matrix

Direction: out

Memory: host or device

Associated parameter type: same as for the values of the sparse matrix of the system

Currently supported only when CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.

CUDSS_DATA_USER_PERM

User permutation to be used instead of running the reordering algorithms.

Direction: in

Memory: host or device

Associated parameter type: same as for the indices of the sparse matrix of the system

Currently not supported when CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.

CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN

Minimal amount of device memory (number of bytes) required in the hybrid memory mode.

This query must be done after the analysis phase and will return status CUDSS_STATUS_NOT_SUPPORTED if it cannot be processed.

Direction: out

Memory: host or device

Associated parameter type: int64_t

For more details regarding the hybrid memory mode, see Hybrid mode feature.

CUDSS_DATA_COMM

Communicator for MGMN mode.

The actual type of the communicator must match the communication layer which must be set via calling cudssSetCommLayer() for the cuDSS library handle via cudssSetCommLayer().

Direction: in

Memory: host

Associated parameter type: void*

For more details regarding the MGMN mode, see MGMN mode.


cudssPhase_t

The enumerator specifies solver phase to be performed in the main cuDSS routine cudssExecute().

Value

Description

CUDSS_PHASE_ANALYSIS

Reordering and symbolic factorization

CUDSS_PHASE_FACTORIZATION

Numerical factorization

CUDSS_PHASE_REFACTORIZATION

Numerical re-factorization.

Note: For now it is only used if reordering algorithm is set to CUDSS_ALG_1 or CUDSS_ALG_2. Otherwise it is the same as FACTORIZATION phase.

CUDSS_PHASE_SOLVE

Full solving phase (forward substitution + diagonal solve + backward substitution) and (optional) iterative refinement

Note: If a new sparse matrix is given as input for this phase, it would be used for computing the residual (and thus the solver can be a part of LU-based preconditioner)

CUDSS_PHASE_SOLVE_FWD

Forward substitution sub-step of the solving phase

Currently not supported.

CUDSS_PHASE_SOLVE_DIAG

Diagonal solve sub-step of the solving phase

Currently not supported.

CUDSS_PHASE_SOLVE_BWD

Backward substitution sub-step of the solving phase

Currently not supported.

Note: in the future, it might become possible to combine different phases, e.g. to call cudssExecute() with CUDSS_PHASE_FACTORIZATION | CUDSS_PHASE_SOLVE and benefit from extra optimization. Currently such usage mode is not supported.

cudssMatrixFormat_t

The enumerator specifies the underlying data format inside a cuDSS matrix object.

Value

Description

CUDSS_MFORMAT_DENSE

Dense matrix format

CUDSS_MFORMAT_CSR

CSR sparse matrix format.

Note: Only 3-array CSR is supported.


cudssMatrixType_t

The enumerator specifies available matrix types for sparse matrices. Matrix type should be used to describe the properties of the underlying matrix storage. Matrix type affects the decision about what type of factorization is computed by the solver. E.g, when matrix type is one of the positive-definite types, checks for singular values on the diagonal is not done.

Value

Description

CUDSS_MTYPE_GENERAL

General matrix [default] LDU factorization will be computed with optional local or global pivoting

CUDSS_MTYPE_SYMMETRIC

Real symmetric matrix. LDL^T factorization will be computed with optional local pivoting

CUDSS_MTYPE_HERMITIAN

Complex Hermitian matrix. LDL^H factorization will be computed with optional local pivoting

CUDSS_MTYPE_SPD

Symmetric positive-definite matrix Cholesky factorization will be computed with optional local pivoting

CUDSS_MTYPE_HPD

Hermitian positive-definite matrix Complex Cholesky factorization will be computed with optional local pivoting


cudssMatrixViewType_t

The enumerator specifies available matrix view types for sparse matrices. Matrix view defines how the matrix is treated by the main cuDSS routine cudssExecute(). E.g., to provide only upper-triangle data for a symmetric matrix one can use as CUDSS_MTYPE_SYMMETRIC as matrix type combined with CUDSS_MVIEW_UPPER as the matrix view. If the accompanying matrix type is CUDSS_MTYPE_GENERAL, the matrix view is ignored.

Value

Description

CUDSS_MVIEW_FULL

Full matrix [default]

CUDSS_MVIEW_LOWER

Lower-triangular matrix (including the diagonal) All values above the main diagonal will be ignored.

CUDSS_MVIEW_UPPER

Upper-triangular matrix (including the diagonal) All values below the main diagonal will be ignored.


cudssIndexBase_t

The enumerator specifies indexing base (0 or 1) for sparse matrix indices (row start/end offsets and column indices). Once set for a sparse matrix, cudssExecute() will use the indexing base from the input sparse matrix for all index-related data (e.g. output from cudssDataGet() called with CUDSS_DATA_PERM_REORDER).

Value

Description

CUDSS_BASE_ZERO

Zero-based indexing [default]

CUDSS_BASE_ONE

One-based indexing


cudssLayout_t

The enumerator specifies dense matrix layout.

Value

Description

CUDSS_LAYOUT_COL_MAJOR

Column-major layout [default]

CUDSS_LAYOUT_ROW_MAJOR

Row-major layout.

Currently not supported.


cudssAlgType_t

The enumerator specifies algorithm choices to be made for the solver.

Value

Description

CUDSS_ALG_DEFAULT

Default value [default]

For reordering, this option is a customized nested dissection algorithm based on METIS.

For factorization, this option chooses the best fitting factorization algorithm based on the choice of the reordering algorithm and sparsity structure produced by it.

CUDSS_ALG_1

First algorithm

For reordering, this option is a custom combination of block triangular reordering and COLAMD algorithms which can be used together with global pivoting to increase solution accuracy for non-symmetric matrices where CUDSS_ALG_DEFAULT produces too many perturbed pivots. When this option is used for reordering, cuDSS uses an appropriate custom factorization algorithm (without the need to change the factorization setting CUDSS_CONFIG_FACTORIZATION_ALG.

CUDSS_ALG_2

Second algorithm

For reordering, this option is similar to CUDSS_ALG_1 that it implies using a special factorization algorithm which is tailored for a block-triangular representation but, unlike CUDSS_ALG_1, this option uses a trivial block structure.

CUDSS_ALG_3

Third algorithm

For reordering, this option is approximate minimum degree (AMD) reordering.

Different values represent different algorithms (for reordering, factorization, etc.) and can lead to significant differences in accuracy and performance. It is currently recommended to use CUDSS_ALG_DEFAULT and only in case accuracy or performance are not sufficient, one can experiment with other values.

cudssPivotType_t

The enumerator specifies type of pivoting to be performed.

Value

Description

CUDSS_PIVOT_COL

Column-based pivoting [default]

CUDSS_PIVOT_ROW

Row-based pivoting

CUDSS_PIVOT_NONE

No pivoting


Communication Layer (Distributed Interface) Types

cudssDistributedInterface_t

This struct defines all communication primitives which need to be implemented
in (any) implementation of cuDSS communication layer, see for more details
Note: all communication layer API functions below take an argument of type void * for comm.
This parameter should be interpreted in the implementation based on the underlying communication
backend to be used with the particular communication layer. E.g., for OpenMPI, comm should be
treated as the OpenMPI communicator.
Note: most of the APIs below take an argument called stream of type
cudaStream_t and must be stream-ordered. For GPU-aware communication backends like OpenMPI,
this implies the need to do explicit cudaStreamSynchronize() in the communication layer implementation.

Members:

int (*cudssCommRank)(void *comm, int *rank)
A function pointer to a routine which returns the rank of the process in a communicator.

Parameters

In/Out

Description

comm

In

A pointer to the communicator.

rank

Out

Rank of the calling process in the communicator.

Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int (*cudssCommSize)(void *comm, int *size)
A function pointer to a routine which returns number of processes in a communicator.

Parameters

In/Out

Description

comm

In

A pointer to the communicator.

size

Out

Number of processes in the communicator.

Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int (*cudssSend)(const void *buffer, int count, cudaDataType_t datatype, int dest, int tag, void *comm, cudaStream_t stream)
A function pointer to a routine which performs a blocking send.

Parameters

In/Out

Description

buffer

In

Initial address of the send device buffer.

count

In

Number of elements (of type datatype) to be sent.

datatype

In

CUDA datatype of elements to be sent.

dest

In

Rank of the receiving process (destination).

tag

In

Message tag.

comm

In

A pointer to the communicator.

stream

In

CUDA stream.

Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int (*cudssRecv)(void *buffer, int count, cudaDataType_t datatype, int source, int tag, void *comm, cudaStream_t stream)
A function pointer to a routine which performs a blocking receive for a message.

Parameters

In/Out

Description

buffer

Out

Initial address of the receive device buffer.

count

In

Number of elements (of type datatype) to be received.

datatype

In

CUDA datatype of elements to be received.

source

In

Rank of the sending process (source).

tag

In

Message tag.

comm

In

A pointer to the communicator.

stream

In

CUDA stream.

Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int (*cudssBcast)(void *buffer, int count, cudaDataType_t datatype, int root, void *comm, cudaStream_t stream)
A function pointer to a routine which performs a broadcast for a message from the root process
to all other processes of the communicator.

Parameters

In/Out

Description

buffer

In/Out

Address of the device buffer to be broadcast.

count

In

Number of elements (of type datatype) to be received.

datatype

In

CUDA datatype of elements to be received.

root

In

Rank of the sending process (source).

comm

In

A pointer to the communicator.

stream

In

CUDA stream.

Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int (*cudssReduce)(const void *sendbuf, void *recvbuf, int count, cudaDataType_t datatype, cudssOpType_t op, int root, void *comm, cudaStream_t stream)
A function pointer to a routine which performs a reduction of values on all processes to a single value on
the root process.

Parameters

In/Out

Description

sendbuf

In

Address of the send buffer

recvbuf

Out

Address of the receive buffer

count

In

Number of elements (of type datatype) to be received.

datatype

In

CUDA datatype of elements to be received.

op

In

Type of the reduction operation to be performed, see cudssOpType_t for supported values.

root

In

Rank of the root process (source).

comm

In

A pointer to the communicator.

stream

In

CUDA stream.

Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int (*cudssAllreduce)(const void *sendbuf, void *recvbuf, int count, cudaDataType_t datatype, cudssOpType_t op, void *comm, cudaStream_t stream)
A function pointer to a routine which performs a reduction of values on all processes to a single value and
distributes the result back to all processes.

Parameters

In/Out

Description

sendbuf

In

Address of the send buffer.

recvbuf

Out

Address of the receive buffer.

count

In

Number of elements (of type datatype) to be received.

datatype

In

CUDA datatype of elements to be received.

op

In

Type of the reduction operation to be performed, see cudssOpType_t for supported values.

comm

In

A pointer to the communicator.

stream

In

CUDA stream.

Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int (*cudssScatterv)(const void *sendbuf, const int *sendcounts, const int *displs, cudaDataType_t sendtype, void *recvbuf, int recvcount, cudaDataType_t recvtype, int root, void *comm, cudaStream_t stream)
A function pointer to a routine which performs a scatter operation on a buffer in parts to all processes in a communicator.

Parameters

In/Out

Description

sendbuf

In

Address of the send buffer.

sendcounts

In

Non-negative integer array (of length communicator size) specifying the number of elements to send to each rank.

displs

In

An array of integers of length communicator size. Entry i specifies the displacement (relative to sendbuf) from which to take the outgoing data to process i.

sendtype

In

CUDA datatype of elements to be sent.

recvbuf

Out

Address of the receive buffer.

recvcount

In

Number of elements in receive buffer (non-negative integer).

recvtype

In

CUDA datatype of elements to be received.

root

In

Rank of the sending process (source).

comm

In

A pointer to the communicator.

stream

In

CUDA stream.

Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int (*cudssCommSplit)(const void *comm, int color, int key, void *new_comm)
A function pointer to a routine which creates new communicators based on colors and keys.

Parameters

In/Out

Description

comm

In

A pointer to the communicator to be split.

color

In

Control of the subset assignment. Processes with the same color are grouped together.

key

In

Control of the rank assignment. Processes in the new communicator are ordered based on the keys.

new_comm

Out

A pointer to the new communicator defined w.r.t to colors.

Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int (*cudssCommFree)(void *comm)
A function pointer to a routine which deallocates resources of a communicator.

Parameters

In/Out

Description

comm

In/Out

A pointer to the communicator to be freed.

Returns error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.


cudssOpType_t

The enumerator specifies reduction operation to be used when calling
communication layer APIs cudssReduce() or cudssAllreduce().

Value

Description

CUDSS_SUM

Reduced elements are added together.

CUDSS_MAX

Maximum element is found among the reduced elements.

CUDSS_MIN

Minimum element is found among the reduced elements.