cuDSS Data Types#

Opaque Data Structures#

cudssHandle_t#

struct cudssHandle_t#
The structure holds the cuDSS library context (device properties, system information, execution controls like cudaStream_t, etc.).
The handle must be initialized with cudssCreate() or cudssCreateMg() prior to calling any other cuDSS API, and destroyed with cudssDestroy() to free up resources after using cuDSS.
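
As an illustration, a minimal lifecycle sketch (error checking of the returned cudssStatus_t omitted):

#include <cudss.h>

cudssHandle_t handle;
cudssCreate(&handle);      /* initialize the library context */
/* ... create config/data/matrix objects and call cudssExecute() ... */
cudssDestroy(handle);      /* release all resources owned by the handle */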

cudssMatrix_t#

struct cudssMatrix_t#
The structure is a lightweight wrapper around standard dense/sparse matrix parameters and does not own any data arrays. Matrix objects are used to pass the matrix of the linear system, as well as the solution and right-hand side (even if these are in fact vectors). Currently, cuDSS matrix objects can have either of two underlying matrix formats: dense or 3-array CSR (sparse). Additionally, they can represent non-uniform batches of matrices or distributed matrices (in the MGMN mode).
Matrix objects should be created via cudssMatrixCreateDn() (for dense matrices) or cudssMatrixCreateBatchDn() (for a batch of dense matrices) or cudssMatrixCreateCsr() (for sparse matrices in CSR format) or cudssMatrixCreateBatchCsr() (for a batch of sparse matrices in CSR format). After use, matrix objects should be destroyed via cudssMatrixDestroy(). For distributed matrices, one can additionally call cudssMatrixSetDistributionRow1d() to define how the matrix is distributed.
Matrix objects can be modified after creation via cudssMatrixSetValues() and cudssMatrixSetCsrPointers() (and similar APIs for batches).
Information can be retrieved from a matrix object by calling cudssMatrixGetFormat() followed by either cudssMatrixGetDn() or cudssMatrixGetCsr() depending on the format returned.
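
For illustration, a sketch of creating a sparse system matrix and a dense right-hand side (the variables n, nnz and the device buffers csr_offsets_d, csr_columns_d, csr_values_d, b_values_d are assumed to exist; the argument order follows the cuDSS getting-started example and should be checked against cudss.h):

cudssMatrix_t A, b;
cudssMatrixCreateCsr(&A, n, n, nnz,
                     csr_offsets_d, NULL /* 3-array CSR: no separate row-end array */,
                     csr_columns_d, csr_values_d,
                     CUDA_R_32I, CUDA_R_64F,
                     CUDSS_MTYPE_SPD, CUDSS_MVIEW_UPPER, CUDSS_BASE_ZERO);
cudssMatrixCreateDn(&b, n, 1 /* nrhs */, n /* leading dimension */,
                    b_values_d, CUDA_R_64F, CUDSS_LAYOUT_COL_MAJOR);
/* ... use the objects in cudssExecute() ... */
cudssMatrixDestroy(A);
cudssMatrixDestroy(b);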

cudssData_t#

struct cudssData_t#
The structure holds internal data (e.g., factors related data structures) as well as pointers to user-provided data. A single object of this type should be associated with solving a specific linear system. If multiple systems with the same datatype(!) are solved consecutively, the object can be re-used (all necessary internal buffers will be re-created as needed).
Note: by default, the library allocates device memory required for performing LU factorization and storing the LU factors internally. All data buffers of this kind are kept inside the data object. To change this default behavior, one can set a cudssDeviceMemHandler_t which will then be used for allocating device memory inside the solver.
The object should be created via cudssDataCreate() and destroyed via cudssDataDestroy().
During execution of any of the stages cudssExecute(), configuration settings of the solver are read from cudssConfig_t and thus affect the execution and internal data stored in the cudssData_t object.
Data parameters can be updated or retrieved by calling cudssDataSet() or cudssDataGet() respectively.

cudssConfig_t#

struct cudssConfig_t#
The structure stores configuration settings for the solver. This object is a lightweight (host-side) wrapper around common solver settings. While it can be re-used for solving different linear systems, it is recommended to have one per linear system.
The object should be created via cudssConfigCreate() and destroyed via cudssConfigDestroy().
During execution of any of the stages cudssExecute(), configuration settings of the solver are read from cudssConfig_t and thus affect the execution.
Configuration settings can be updated or retrieved by calling cudssConfigSet() or cudssConfigGet() respectively. Note: certain settings need to be set before a corresponding solver stage is executed (e.g., reordering algorithm must be set prior to the phase CUDSS_PHASE_ANALYSIS).
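
For illustration, a sketch of the set/get pattern (assuming the param/value/size calling convention used in the cuDSS examples):

cudssConfig_t config;
cudssConfigCreate(&config);

int ir_steps = 2;   /* e.g., request two iterative refinement steps */
cudssConfigSet(config, CUDSS_CONFIG_IR_N_STEPS, &ir_steps, sizeof(ir_steps));

int ir_steps_out = 0;
size_t size_written = 0;
cudssConfigGet(config, CUDSS_CONFIG_IR_N_STEPS, &ir_steps_out,
               sizeof(ir_steps_out), &size_written);

cudssConfigDestroy(config);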

Non-opaque Data Structures#

cudssDeviceMemHandler_t#

struct cudssDeviceMemHandler_t#
This structure holds information about the user-provided, stream-ordered device memory pool (mempool).
The object can be created by setting the struct members described below.
Once created, a device memory handler can be set for the cuDSS library handle via cudssSetDeviceMemhandler().
Once set for the cuDSS library handle, information about the set device memory handler can be retrieved via cudssGetDeviceMemhandler().
void *ctx#
A pointer to the user-owned mempool/context object.
int device_alloc(
void *ctx,
void **ptr,
size_t size,
cudaStream_t stream
)#
A function pointer to the user-provided routine for allocating device memory of size on stream.
The allocated memory should be made accessible to the current device (or more precisely, to the current CUDA context bound to the library handle).
This interface supports any stream-ordered memory allocator. Upon success, the allocated memory can be immediately used on the given stream by any operations enqueued/ordered on the same stream after this call.
It is the caller’s responsibility to ensure a proper stream order is established.
The allocated memory should be at least 256-byte aligned.
Param ctx:

[in] A pointer to the user-owned mempool object.

Param ptr:

[out] On success, a pointer to the allocated buffer.

Param size:

[in] The amount of memory in bytes to be allocated.

Param stream:

[in] The CUDA stream on which the memory is allocated (and the stream order is established)

Returns:

[out] The error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int device_free(
void *ctx,
void *ptr,
size_t size,
cudaStream_t stream
)#
A function pointer to the user-provided routine for deallocating device memory of size on stream.
This interface supports any stream-ordered memory allocator. Upon success, any subsequent access (of the memory pointed to by the pointer ptr) ordered after this call results in undefined behavior.
It is the caller’s responsibility to ensure a proper stream order is established.
If the arguments ctx and size are not the same as those passed to device_alloc to allocate the memory pointed to by ptr, the behavior is undefined.
The argument stream need not be identical to the one used for allocating ptr, as long as the stream order is correctly established. The behavior is undefined if this assumption does not hold.
Param ctx:

[in] A pointer to the user-owned mempool object.

Param ptr:

[in] The pointer to the allocated buffer.

Param size:

[in] The size of the allocated memory.

Param stream:

[in] The CUDA stream on which the memory is deallocated (and the stream order is established).

Returns:

[out] The error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

char name[CUDSS_ALLOCATOR_NAME_LEN]
The name of the provided mempool (must not exceed 64 characters).
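
As an illustration, a minimal sketch of a handler backed by the stream-ordered CUDA runtime allocator (cudaMallocAsync()/cudaFreeAsync()); the my_* helper names are hypothetical and no user mempool context is needed for this variant:

#include <cuda_runtime.h>
#include <string.h>

static int my_device_alloc(void *ctx, void **ptr, size_t size, cudaStream_t stream)
{
    (void)ctx;  /* no user mempool context in this sketch */
    return (cudaMallocAsync(ptr, size, stream) == cudaSuccess) ? 0 : 1;
}

static int my_device_free(void *ctx, void *ptr, size_t size, cudaStream_t stream)
{
    (void)ctx; (void)size;
    return (cudaFreeAsync(ptr, stream) == cudaSuccess) ? 0 : 1;
}

/* fill the struct members and register the handler for the library handle
   via the setter described above */
cudssDeviceMemHandler_t handler;
handler.ctx          = NULL;
handler.device_alloc = my_device_alloc;
handler.device_free  = my_device_free;
strncpy(handler.name, "cudaMallocAsync handler", CUDSS_ALLOCATOR_NAME_LEN);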

Enumerators#

cudssStatus_t#

enum cudssStatus_t#
The enumerator specifies possible status values (on the host) which can be returned from calls to cuDSS routines.
Note: device side failures are returned via CUDSS_DATA_INFO from cudssDataParam_t.
enumerator CUDSS_STATUS_SUCCESS#

The operation completed successfully.

enumerator CUDSS_STATUS_NOT_INITIALIZED#

One of the input operands was not properly initialized prior to the call to a cuDSS routine. This usually refers to one of the opaque objects like cudssHandle_t, cudssData_t, or others.

enumerator CUDSS_STATUS_ALLOC_FAILED#

Resource allocation failed inside the cuDSS library. This is usually caused by a device memory allocation (cudaMalloc()) failure or by a host memory allocation failure.

enumerator CUDSS_STATUS_INVALID_VALUE#

An incorrect value or parameter was passed to the function (for example, a negative vector size or a NULL pointer for a required buffer).

enumerator CUDSS_STATUS_NOT_SUPPORTED#

An unsupported (but otherwise reasonable) parameter was passed to the function.

enumerator CUDSS_STATUS_EXECUTION_FAILED#

The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can have multiple causes.

enumerator CUDSS_STATUS_INTERNAL_ERROR#

An internal cuDSS operation failed.

cudssConfigParam_t#

enum cudssConfigParam_t#
The enumerator specifies possible names of solver configuration settings. For each setting there is a matching type to be used in cudssConfigSet() or cudssConfigGet().
enumerator CUDSS_CONFIG_REORDERING_ALG#

Algorithm for the reordering phase. Supported options are:

  • CUDSS_ALG_DEFAULT - a customized nested dissection algorithm based on METIS.

  • CUDSS_ALG_1 - a custom combination of block triangular reordering and COLAMD algorithms which can be used together with global pivoting to improve the accuracy of the solution. When this option is used for reordering, cuDSS uses an appropriate custom factorization algorithm (without the need to change the factorization setting CUDSS_CONFIG_FACTORIZATION_ALG).

  • CUDSS_ALG_2 - similar to CUDSS_ALG_1 in that it implies using a special factorization algorithm tailored for a block-triangular representation but, unlike CUDSS_ALG_1, this option uses a trivial block structure.

  • CUDSS_ALG_3 - an approximate minimum degree (AMD) reordering.

Note: CUDSS_ALG_1 and CUDSS_ALG_2 are only supported for general (non-symmetric or non-hermitian) matrices.
Note: CUDSS_ALG_1 uses an upper bound on the number of non-zero entries in the factors. If this bound turns out to be insufficient during the factorization phase, a runtime device error is returned (which can be checked by synchronizing the stream and calling cudssDataGet() with CUDSS_DATA_INFO, which will output the device error). In order to set a non-default upper bound, one should call cudssConfigSet() with the CUDSS_CONFIG_MAX_LU_NNZ setting.
Note: CUDSS_ALG_3 does not support the matrix index type int64_t.

Associated parameter type: cudssAlgType_t

Default value: CUDSS_ALG_DEFAULT
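
For example, switching to the AMD reordering before the analysis phase might look like this (continuing the cudssConfigSet() sketch above):

cudssAlgType_t reorder_alg = CUDSS_ALG_3;   /* AMD reordering */
cudssConfigSet(config, CUDSS_CONFIG_REORDERING_ALG,
               &reorder_alg, sizeof(reorder_alg));
/* must happen before cudssExecute(..., CUDSS_PHASE_ANALYSIS, ...) */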

enumerator CUDSS_CONFIG_FACTORIZATION_ALG#

Algorithm for the factorization phase. Supported options are:

  • CUDSS_ALG_DEFAULT - the default factorization algorithm. This option chooses the best fitting factorization algorithm based on the choice of the reordering algorithm and the sparsity structure produced by it. It is not recommended to use other options.

  • CUDSS_ALG_1 - a modification of the default algorithm.

Associated parameter type: cudssAlgType_t

Default value: CUDSS_ALG_DEFAULT

enumerator CUDSS_CONFIG_SOLVE_ALG#

Algorithm for the solving phase. Supported options are:

  • CUDSS_ALG_DEFAULT - the default solve algorithm.

Other options are not supported.

Associated parameter type: cudssAlgType_t

Default value: CUDSS_ALG_DEFAULT

enumerator CUDSS_CONFIG_PIVOT_EPSILON_ALG#

Algorithm for the pivot epsilon calculation. Supported options are:

  • CUDSS_ALG_DEFAULT - pivots with magnitude smaller than epsilon will be replaced by the appropriately signed epsilon.

  • CUDSS_ALG_1 - pivots with magnitude smaller than epsilon will be replaced by the appropriately signed and scaled epsilon. The scale will be computed as the maximum element in the corresponding row and column of the original matrix.

Note: CUDSS_ALG_1 for CUDSS_CONFIG_PIVOT_EPSILON_ALG is not supported when CUDSS_CONFIG_REORDERING_ALG is set to CUDSS_ALG_1 or CUDSS_ALG_2

Associated parameter type: cudssAlgType_t

Default value: CUDSS_ALG_DEFAULT

enumerator CUDSS_CONFIG_USE_MATCHING#

Flag for enabling/disabling matching. Matching is an optional preprocessing step which computes a (non-symmetric column) permutation to put larger values on the diagonal, which often improves the accuracy of the solution. This permutation is then combined with the permutation from the reordering. While matching often reduces the number of perturbed pivots and improves the accuracy of the solution (especially for ill-conditioned and badly scaled matrices), there are no guarantees that the accuracy will improve. Enabling matching brings an overhead during the analysis step and changes the non-zero pattern of the factors and can therefore slow down factorization and solve.

Note: matching routines require their internal workspace size to fit into the limit of 32-bit integers. The workspace limit depends on the choice of the matching algorithm. The biggest requirement among them is to have 10 * nrows + nnz < INT_MAX (nrows and nnz for the input matrix without accounting for symmetry).

Associated parameter type: int

Default value: 0 (matching disabled)

enumerator CUDSS_CONFIG_MATCHING_ALG#

Algorithm for matching. The matching algorithm setting is only used if CUDSS_CONFIG_USE_MATCHING is not equal to 0.

Supported options are:

  • CUDSS_ALG_DEFAULT - same as CUDSS_ALG_5 (the most robust option).

  • CUDSS_ALG_1 - this option is based on job = 1 from MC64 algorithm (HSL). Computes a column permutation of the matrix so that the permuted matrix has as many entries on its diagonal as possible. The values on the diagonal are of arbitrary size. Note: this option does not use matrix values and thus can actually make the accuracy worse. It is not recommended to use this algorithm for matching unless there is a justified need.

  • CUDSS_ALG_2 - this option is based on job = 2 from MC64 algorithm (HSL). Computes a column permutation of the matrix so that the smallest value on the diagonal of the permuted matrix is maximized.

  • CUDSS_ALG_3 - this option is based on job = 3 from MC64 algorithm (HSL). Computes a column permutation of the matrix so that the smallest value on the diagonal of the permuted matrix is maximized. This algorithm is different from the one used for CUDSS_ALG_2.

  • CUDSS_ALG_4 - this option is based on job = 4 from MC64 algorithm (HSL). Computes a column permutation of the matrix so that the sum of the diagonal entries of the permuted matrix is maximized.

  • CUDSS_ALG_5 - this option is based on job = 5 from MC64 algorithm (HSL). Computes a column permutation of the matrix so that the product of the diagonal entries of the permuted matrix is maximized. In addition, this algorithm computes the row/col scaling vectors which can further improve accuracy of the solution.

Note: one of the options, CUDSS_ALG_5, computes scaling vectors in addition to the matching permutation. This option is considered to be the most impactful and it is recommended to use this option if accuracy of the solution needs to be improved. However, this option requires matrix values to be present during the analysis step.

Note: matching is not supported for CUDSS_ALG_1 and CUDSS_ALG_2 reordering algorithms (which use global pivoting to make the solution more accurate) or distributed matrices.

Associated parameter type: cudssAlgType_t

Default value: CUDSS_ALG_DEFAULT (matching with scaling, requires matrix values)

enumerator CUDSS_CONFIG_SOLVE_MODE#

Potential modifier on the system matrix (e.g. transpose or conjugate transpose)

Associated parameter type: int

Default value: 0 (no modifier).

Other values are not supported.

enumerator CUDSS_CONFIG_IR_N_STEPS#

Number of steps during the iterative refinement

Associated parameter type: int

Default value: 0

enumerator CUDSS_CONFIG_IR_TOL#

Iterative refinement tolerance

Associated parameter type: double

Currently it is ignored (exactly CUDSS_CONFIG_IR_N_STEPS steps are made)

enumerator CUDSS_CONFIG_PIVOT_TYPE#

Type of pivoting

The exact meaning of this parameter depends on the choice of the reordering algorithm which can be changed through CUDSS_CONFIG_REORDERING_ALG. For CUDSS_ALG_1 and CUDSS_ALG_2 the parameter refers to the global pivoting, while for CUDSS_ALG_DEFAULT and others it refers to the partial 1x1 pivoting procedure.

Note that in the latter case, if the matrix type is symmetric but not positive-definite, then the pivoting is also symmetric and the pivot is searched for on the diagonal of the block, provided the pivot type is not equal to CUDSS_PIVOT_NONE.

For more details, see the description of the pivoting strategies.

Associated parameter type: cudssPivotType_t

Default value: CUDSS_PIVOT_COL.

enumerator CUDSS_CONFIG_PIVOT_THRESHOLD#

Pivoting threshold \(p_{threshold}\) which is used to determine if a diagonal element is subject to pivoting and will be swapped with the maximum element in the row (or column) depending on the type of pivoting. The diagonal element will be swapped if: \(p_{threshold} \cdot \max_{(sub)row \, or \, col} |a_{ij}| \geq |a_{ii}|\)

For more details, see the description of the pivoting strategies.

Note: this parameter is only supported when reordering algorithm is set to CUDSS_ALG_1 or CUDSS_ALG_2.

Associated parameter type: double

Default value: 1.0f.

enumerator CUDSS_CONFIG_PIVOT_EPSILON#

Pivoting epsilon.

By default, this is the absolute value to test and replace small diagonal elements encountered during numerical factorization.

In case CUDSS_CONFIG_PIVOT_EPSILON_ALG is set to CUDSS_ALG_1, this value will be additionally scaled (see more details in the description of CUDSS_CONFIG_PIVOT_EPSILON_ALG).

Associated parameter type: double

Default value: 1e-5 for single precision and 1e-13 for double precision.

enumerator CUDSS_CONFIG_MAX_LU_NNZ#

Upper limit on the number of nonzero entries in the LU factors. This is only relevant for non-symmetric matrices with the reordering algorithm set to CUDSS_ALG_1 or CUDSS_ALG_2. If the number of non-zero entries in L and U exceeds the set limit, a runtime error occurs. See also the note for CUDSS_ALG_1 in the table entry for CUDSS_CONFIG_REORDERING_ALG.

Associated parameter type: int64_t

Default value: -1 (then the parameter value is ignored and cuDSS uses an estimate of 100 * nnz, where nnz is the number of non-zero entries in the input matrix).

enumerator CUDSS_CONFIG_HYBRID_MODE#

Memory mode: 0 (default = device-only) or 1 (hybrid = host/device).

Note: Hybrid memory mode should be enabled before the analysis phase (cudssExecute() with CUDSS_PHASE_ANALYSIS). If the decision to use the hybrid memory mode is made after the analysis phase, the mode should be enabled and the analysis phase re-done (which is sub-optimal).

Note: Unlike the hybrid execution mode (see CUDSS_CONFIG_HYBRID_EXECUTE_MODE) which controls where compute kernels are executed, the hybrid memory mode only allows cuDSS to keep part of the factor values (internal data) on the host and always uses GPU for factorization and solve.

For more details regarding the hybrid memory mode, see Hybrid mode feature.

Associated parameter type: int

Default value: 0 (disabled).

Currently not supported when CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering, or, when CUDSS_ALG_1 is used for the factorization.

enumerator CUDSS_CONFIG_HYBRID_DEVICE_MEMORY_LIMIT#

User-defined device memory limit (number of bytes) for the hybrid memory mode.

This setting only affects execution when the hybrid memory mode is enabled. For more details regarding the hybrid memory mode, see Hybrid mode feature.

Note: In case of multiple devices (cudssCreateMg()) this value must be set for each device separately by calling cudaSetDevice() prior to cudssConfigSet()

Associated parameter type: int64_t

Default value: -1 (use the internal default heuristic).

enumerator CUDSS_CONFIG_USE_CUDA_REGISTER_MEMORY#

A flag to enable or disable usage of cudaHostRegister() by the cuDSS hybrid memory mode. Since the hybrid memory mode of cuDSS uses host memory to store the factors, it can use cudaHostRegister() (if the HW supports it) to speed up the associated host-to-device and device-to-host memory transfers. However, registering host memory has limitations and in some cases might lead to slowdowns. If the flag is set to 0, cuDSS will not attempt to use cudaHostRegister() even if the HW supports it. This setting only affects execution when the hybrid memory mode is enabled. For more details regarding the hybrid memory mode, see Hybrid mode feature.

Associated parameter type: int

Default value: 1 (use cudaHostRegister() if the device supports it)

enumerator CUDSS_CONFIG_HOST_NTHREADS#

Number of threads to be used by cuDSS in MT mode. This setting only affects execution when the multi-threaded mode is enabled.

Associated parameter type: int

Default value: -1 (use number of threads returned by cudssGetMaxThreads())

enumerator CUDSS_CONFIG_HYBRID_EXECUTE_MODE#

Execute mode: 0 (default = device-only) or 1 (hybrid = host/device). Hybrid execute mode allows cuDSS to perform calculations on both GPU and CPU. Currently it is used to speed up execution parts with low parallelization capacity.

Note: Reordering part of the analysis step is performed on CPU regardless of execute mode value.

Note: Hybrid execute mode should be enabled before the analysis phase (cudssExecute() with CUDSS_PHASE_ANALYSIS).

Note: Unlike the hybrid execution mode which controls where compute kernels are executed (and allows greater flexibility in placement of the input data), the hybrid memory mode CUDSS_CONFIG_HYBRID_MODE only allows cuDSS to keep part of the factor values (internal data) on the host while only GPU is used for factorization and solve.

Note: If hybrid execute mode is enabled, the input matrix, right-hand side and solution can be host memory pointers.

For more details regarding the hybrid execute mode, see Hybrid execute mode feature.

Associated parameter type: int

Default value: 0 (disabled). Currently not supported when nrhs > 1, or CUDSS_CONFIG_HYBRID_MODE or MGMN mode is used, or, when batchCount is greater than 1.

enumerator CUDSS_CONFIG_ND_NLEVELS#

Minimum number of levels for the nested dissection reordering. The value for this parameter should be a positive integer. Additionally, for the MGMN mode the number of levels will be automatically increased (if needed) to satisfy the requirement: \(2^{n_{levels}-1} \geq n_{proc}\) where \(n_{proc}\) is the number of processes in the communicator. This setting only works when reordering algorithm is CUDSS_ALG_DEFAULT.

Note: This is considered an advanced performance knob and it is recommended to use a non-default value only when optimizing for performance. Typical values to try should not be too far from the default value, e.g. from the range 8 - 11.

Associated parameter type: int

Default value: 10.

enumerator CUDSS_CONFIG_UBATCH_SIZE#

The number of matrices in a uniform batch of systems to be processed by cuDSS. A uniform batch of matrices is defined as a batch of matrices with the same non-zero pattern but potentially different values. To create a uniform batch of matrices (unlike a non-uniform batch), one can simply use the usual cudssMatrixCreateCsr() or cudssMatrixCreateDn() with one change: as the pointer to the matrix values one should provide a buffer which holds the values for all matrices in the batch. Thus, it should have nnz * CUDSS_CONFIG_UBATCH_SIZE elements for sparse matrices (csr_values only) and nrows * ncols * CUDSS_CONFIG_UBATCH_SIZE elements for dense matrices.

Note: a single system can be viewed as a uniform batch with CUDSS_CONFIG_UBATCH_SIZE set to 1.

There are two ways cuDSS can process a uniform batch, based on the value of CUDSS_CONFIG_UBATCH_INDEX: either factorizing (or solving) all matrices at once or just one at a time.

For details, see the description of CUDSS_CONFIG_UBATCH_INDEX.

Note: CUDSS_CONFIG_UBATCH_SIZE must be set before calling CUDSS_PHASE_ANALYSIS.

Associated parameter type: int

Default value: 1.

Currently not supported when either CUDSS_CONFIG_HYBRID_MODE (see here) or CUDSS_CONFIG_HYBRID_EXECUTE_MODE (see here) are enabled, or MGMN mode is used, or, with CUDSS_ALG_1 and CUDSS_ALG_2 reordering algorithms.
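
As an illustration, a sketch of preparing a uniform batch of sparse systems (config, n, nnz and the device buffers are hypothetical names; the values for all batch instances are stored back-to-back in csr_values_batch_d):

int ubatch_size = 8;
cudssConfigSet(config, CUDSS_CONFIG_UBATCH_SIZE,
               &ubatch_size, sizeof(ubatch_size));   /* before CUDSS_PHASE_ANALYSIS */

/* csr_values_batch_d holds nnz * ubatch_size elements (one set of values per matrix);
   the sparsity pattern arrays are shared by all matrices in the batch */
cudssMatrix_t A;
cudssMatrixCreateCsr(&A, n, n, nnz,
                     csr_offsets_d, NULL, csr_columns_d, csr_values_batch_d,
                     CUDA_R_32I, CUDA_R_64F,
                     CUDSS_MTYPE_GENERAL, CUDSS_MVIEW_FULL, CUDSS_BASE_ZERO);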

enumerator CUDSS_CONFIG_UBATCH_INDEX#

-1 or a 0-based index of a matrix in a uniform batch which will be processed during the factorization or solve phase. The special value -1 can be used to tell cuDSS to process all matrices in the uniform batch at once.

Note: CUDSS_CONFIG_UBATCH_INDEX must be less than CUDSS_CONFIG_UBATCH_SIZE.

Note: if CUDSS_CONFIG_UBATCH_INDEX is set then cuDSS will treat the input sparse matrix, right-hand side and solution as regular matrices (with just one set of values). Thus, it allows processing a uniform batch of matrices one by one while re-using the result of the analysis and keeping the corresponding factors for all matrices in a single cudssData_t object.

Note: CUDSS_CONFIG_UBATCH_INDEX can be set after CUDSS_PHASE_FACTORIZATION; in that case one can solve only one specific system.

Associated parameter type: int

Default value: -1

Currently not supported when either CUDSS_CONFIG_HYBRID_MODE (see here) or CUDSS_CONFIG_HYBRID_EXECUTE_MODE (see here) are enabled, or MGMN mode is used, or, with CUDSS_ALG_1 and CUDSS_ALG_2 reordering algorithms.

enumerator CUDSS_CONFIG_USE_SUPERPANELS#

Use superpanel optimization: 1 (default = enabled) or 0 (disabled).

Note: superpanel optimization should be enabled before the analysis phase (cudssExecute() with CUDSS_PHASE_ANALYSIS). If the decision to use the superpanel optimization is made after the analysis phase, the optimization should be enabled and the analysis phase re-done (which is sub-optimal).

Associated parameter type: int

Default value: 1 (enabled).

enumerator CUDSS_CONFIG_DEVICE_COUNT#

Device count in case of multiple devices (see cudssCreateMg())

Note: Maximum supported device count is 16

Associated parameter type: int

Default value: 1

enumerator CUDSS_CONFIG_DEVICE_INDICES#

A list of device indices as an integer array.

Note: The calling device must be equal to the first device index from device_indices or 0 (if device_indices were NULL in the required prior call to cudssCreateMg())

Associated parameter type: int

Default value: NULL (cuDSS will use devices from 0 to device_count - 1).

enumerator CUDSS_CONFIG_SCHUR_MODE#

Schur complement mode: 0 (default = disabled) or 1 (enabled).

For more details regarding the Schur complement feature, see Schur complement feature.

Associated parameter type: int

Default value: 0 (disabled).

Currently not supported when CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering, when MGMN mode or multi-GPU mode is used, or, when CUDSS_ALG_1 is used for the factorization. It is also not supported when a user permutation is set, for uniform and non-uniform batches, or when matching is enabled.

enumerator CUDSS_CONFIG_DETERMINISTIC_MODE#

Enable deterministic mode. In deterministic mode, cuDSS is guaranteed to provide the bit-wise identical result at every run when executed on GPUs with the same architecture and the same number of SMs (assuming the input data and solver settings are also bit-wise identical).

Note: in deterministic mode cuDSS uses a different set of kernels which often might be slower than the kernels used in the default mode.

Associated parameter type: int

Default value: 0 (disabled).

Currently the feature is supported only for single-gpu, single rhs and with hybrid memory mode (CUDSS_CONFIG_HYBRID_MODE) and hybrid execute mode (CUDSS_CONFIG_HYBRID_EXECUTE_MODE) disabled.

cudssDataParam_t#

enum cudssDataParam_t#
The enumerator specifies possible parameter names which can be set or retrieved in the cudssData_t object. For each parameter name there is an associated type to be used in cudssDataSet() or cudssDataGet(). Each parameter name is marked with “in” or “out” depending on whether the parameter can only be set, only be retrieved, or both.
enumerator CUDSS_DATA_INFO#

By default, this parameter can be used to detect device-side asynchronous errors (primarily during factorization and solve). However, in the special case when, in hybrid execute mode, the factorization is done at least partially on the host, this parameter can be used to retrieve host-side errors. In case execution is successful, the value is 0.

Note: the error returned via CUDSS_DATA_INFO is independent from the host side status of type cudssStatus_t returned by all cuDSS routines, including cudssExecute(). One of the noticeable use cases is when a matrix of the system is passed with an mtype for positive-definite matrices but it appears to have non-positive minors (at least numerically). In this case, calling cudssDataGet() with CUDSS_DATA_INFO after the factorization phase will return the 1-based index of the first encountered non-positive minor. Then one can set the device error back to zero via cudssDataSet(), and either change the matrix type or adjust the matrix values (to make the matrix positive-definite) and call the factorization phase again.

Note that the returned index is 1-based and is for the reordered matrix. To get the corresponding original index, it should be combined with (inverse) permutation which can be extracted via cudssDataGet() for CUDSS_DATA_PERM_REORDER_ROW.

Direction: out

Memory: host

Associated parameter type: int
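
For illustration, a sketch of checking the device-side status after the factorization phase (handle, data and stream are assumed to exist; the cudssDataGet() value/size/size-written convention follows the cuDSS examples):

cudaStreamSynchronize(stream);   /* make sure the asynchronous phase has finished */
int info = 0;
size_t size_written = 0;
cudssDataGet(handle, data, CUDSS_DATA_INFO, &info, sizeof(info), &size_written);
if (info != 0) {
    /* e.g., for CUDSS_MTYPE_SPD input: 1-based index of the first non-positive minor */
}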

enumerator CUDSS_DATA_LU_NNZ#

Number of non-zero entries in LU factors.

Note: in case batchCount > 1 (non-uniform batch) cudssDataGet() returns accumulated number over all matrices in the batch.

Direction: out

Memory: host

Associated parameter type: int64_t

enumerator CUDSS_DATA_NPIVOTS#

Number of pivots encountered during factorization.

Direction: out

Memory: host

Associated parameter type: same as for the indices of the sparse matrix of the system

enumerator CUDSS_DATA_INERTIA#

Positive and negative indices of inertia for the system matrix A (two integer values). Valid only for symmetric/Hermitian non positive-definite matrix types.

Note: in case batchCount > 1 (non-uniform batch) cudssDataGet() returns the accumulated number over all matrices in the batch.

Direction: out

Memory: host

Associated parameter type: same as for the indices of the sparse matrix of the system

enumerator CUDSS_DATA_PERM_REORDER_ROW#

Row permutation P after reordering such that A[P,Q] is factorized.

Note: using this parameter in cudssDataGet() in case batchCount > 1 is not supported.

Direction: out

Memory: host or device

Associated parameter type: same as for the indices of the sparse matrix of the system

enumerator CUDSS_DATA_PERM_REORDER_COL#

Column permutation Q after reordering such that A[P,Q] is factorized.

Note: using this parameter in cudssDataGet() in case batchCount > 1 is not supported.

Direction: out

Memory: host or device

Associated parameter type: same as for the indices of the sparse matrix of the system

enumerator CUDSS_DATA_PERM_ROW#

Final row permutation P (includes effects of both reordering and pivoting) which is applied to the original right-hand side of the system in the form \(b_{new} = b_{old} \circ P\)

Note: using this parameter in cudssDataGet() in case batchCount > 1 is not supported.

Direction: out

Memory: host or device

Associated parameter type: same as for the indices of the sparse matrix of the system

Currently supported only when CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.

enumerator CUDSS_DATA_PERM_COL#

Final column permutation Q (includes effects of both reordering and pivoting) which is applied to transform the solution of the permuted system into the original solution \(x_{old} = x_{new} \circ Q^{-1}\)

Note: using this parameter in cudssDataGet() in case batchCount > 1 is not supported.

Direction: out

Memory: host or device

Associated parameter type: same as for the indices of the sparse matrix of the system

Currently supported only when CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.

enumerator CUDSS_DATA_PERM_MATCHING#

Matching (column) permutation Q such that A[:,Q] is reordered and then factorized.

Direction: out

Memory: host or device

Associated parameter type: same as for the indices of the sparse matrix of the system

enumerator CUDSS_DATA_DIAG#

Diagonal of the factorized matrix

Note: in case batchCount > 1 (non-uniform batch) cudssDataGet() returns the accumulated values over all matrices in the batch.

Direction: out

Memory: host or device

Associated parameter type: same as for the values of the sparse matrix of the system

Currently supported only when CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.

enumerator CUDSS_DATA_SCALE_ROW#

Row scaling of the factorized matrix (corresponding to the rows of the original matrix)

Direction: out

Memory: host or device

Associated parameter type: floating point for the absolute values of the matrix of the system (i.e., always real, either float or double)

Only supported when matching is enabled and matching algorithm computes the scaling.

enumerator CUDSS_DATA_SCALE_COL#

Column scaling of the factorized matrix (corresponding to the columns of the original matrix).

Direction: out

Memory: host or device

Associated parameter type: floating point for the absolute values of the matrix of the system (i.e., always real, either float or double)

Only supported when matching is enabled and the matching algorithm computes the scaling.

enumerator CUDSS_DATA_USER_PERM#

User permutation to be used instead of running the reordering algorithms. The user permutation can be disabled by providing a NULL with zero size with cudssDataSet(). The provided buffer should be an integer array of size n, where n is the number of rows/columns of the system matrix. The integer type must match the one used for the system matrix. The values should represent a valid permutation vector for the set \(\{index\_base, ..., n - 1 + index\_base\}\)

Note: after cudssDataSet() is called with CUDSS_DATA_USER_PERM, the data are copied into an internal buffer so that the user buffer can be deallocated or re-used.

Note: using this parameter in cudssDataGet() in case batchCount > 1 is not supported.

Direction: in

Memory: host or device

Associated parameter type: same as for the indices of the sparse matrix of the system

Currently not supported when CUDSS_ALG_1 or CUDSS_ALG_2 is used for reordering.

enumerator CUDSS_DATA_ELIMINATION_TREE#

Elimination tree information, also known as separator sizes, which are computed during the reordering phase and used for improving parallelization. The number of elements in this array is \(2 \cdot n_{levels} - 1 = 2 \cdot (\log_2(ND\_NLEVELS) - 1) - 1\) with \(ND\_NLEVELS\) from CUDSS_CONFIG_ND_NLEVELS. The retrieved array can later be passed back to cuDSS via CUDSS_DATA_USER_PERM and CUDSS_DATA_USER_ELIMINATION_TREE for a matrix with the same sparsity structure. This avoids regenerating the same reordering information while retaining the same performance for the factorization and solve phases.

See the Elimination Tree section for more details.

Direction: out

Memory: host or device

Associated parameter type: same as for the indices of the sparse matrix of the system

enumerator CUDSS_DATA_USER_ELIMINATION_TREE#

User provided elimination tree information, which is used instead of running the reordering algorithm (therefore, saving runtime). It must be used in combination with CUDSS_DATA_USER_PERM to have an effect.

See CUDSS_DATA_ELIMINATION_TREE for the size and usage information. The user elimination tree can be disabled by providing a NULL with zero size with cudssDataSet().

Note: While it is possible to construct this array outside cuDSS (e.g., extracting the so-called sizes array from METIS / ParMETIS), we recommend using this feature only by extracting the elimination tree from cuDSS and passing it back in subsequent calls.

Direction: in

Memory: host or device

Associated parameter type: same as for the indices of the sparse matrix of the system.

enumerator CUDSS_DATA_MEMORY_ESTIMATES#

Memory estimates (in bytes) for host and device memory required for the chosen memory mode. The chosen memory mode is defined as the memory mode detected during the last executed analysis phase for the current cudssData_t object and the state of the corresponding cudssConfigParam_t object during the call.

See cudssConfigParam_t for more details.

Note: the returned memory estimate depends not just on the settings from the cudssConfig_t object but also on the input matrix and rhs (specifically, nrhs), so if the objects (or the config) change after the analysis phase, the memory estimates might no longer be accurate.

Values returned in the output array at position:

  • 0 - permanent device memory

  • 1 - peak device memory

  • 2 - permanent host memory

  • 3 - peak host memory

  • 4 - (if in hybrid memory mode) minimum device memory for the hybrid memory mode

  • 5 - (if in hybrid memory mode) maximum host memory for the hybrid memory mode

  • 6, … 15 - reserved for future use

This query must be done after the analysis phase and will return status CUDSS_STATUS_NOT_SUPPORTED if it cannot be processed.

Note: In case of multiple devices (cudssCreateMg()), the memory estimate will be done for the calling device only. Thus, cudaSetDevice() must be called prior to cudssDataGet() to get the memory estimate on the corresponding device.

Direction: out

Memory: host

Associated parameter type: int64_t[16]
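
For illustration, a sketch of querying the estimates after the analysis phase (handle and data assumed to exist):

int64_t mem_estimates[16] = {0};
size_t size_written = 0;
cudssDataGet(handle, data, CUDSS_DATA_MEMORY_ESTIMATES,
             mem_estimates, sizeof(mem_estimates), &size_written);
/* mem_estimates[0]/[1]: permanent/peak device memory (bytes),
   mem_estimates[2]/[3]: permanent/peak host memory (bytes) */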

enumerator CUDSS_DATA_HYBRID_DEVICE_MEMORY_MIN#

Minimal amount of device memory (number of bytes) required in the hybrid memory mode. This query must be done after the analysis phase and will return status CUDSS_STATUS_NOT_SUPPORTED if it cannot be processed.

Note: In case of multiple devices (cudssCreateMg()), the minimal memory will be returned for the calling device only. Thus, cudaSetDevice() must be called prior to cudssDataGet() to get the minimal memory on the corresponding device.

Direction: out

Memory: host

Associated parameter type: int64_t

For more details regarding the hybrid memory mode, see Hybrid mode feature.

enumerator CUDSS_DATA_COMM#

Communicator for MGMN mode. The actual type of the communicator must match the communication layer, which must be set for the cuDSS library handle via cudssSetCommLayer().

Direction: in

Memory: host

Associated parameter type: void*

For more details regarding the MGMN mode, see MGMN mode.

enumerator CUDSS_DATA_NSUPERPANELS#

Number of superpanels in the matrix.

Direction: out

Memory: host

Associated parameter type: same as for the indices of the sparse matrix of the system

enumerator CUDSS_DATA_USER_SCHUR_INDICES#

User-provided Schur complement indices. The provided buffer should be an integer array of size n, where n is the number of rows (columns) of the system matrix, and the integer type should match the one used for the system matrix. The values should be equal to 1 for the rows/columns which are part of the Schur complement and 0 for the rest.

Note: after cudssDataSet() is called with CUDSS_DATA_USER_SCHUR_INDICES, the data are copied into an internal buffer so that the user buffer can be deallocated or re-used.

Direction: in

Memory: host or device

Associated parameter type: same as for the indices of the sparse matrix of the system

enumerator CUDSS_DATA_SCHUR_SHAPE#

Shape of the Schur complement matrix as a triplet (nrows, ncols, sparse nnz). The last entry is only relevant when the Schur complement needs to be exported as a sparse matrix. By querying cudssDataGet() with CUDSS_DATA_SCHUR_SHAPE, users can get the necessary allocation sizes for the Schur complement buffers, which can then be used to create a cudssMatrix_t object and passed back to cuDSS to fill in the Schur complement data via cudssDataSet() with CUDSS_DATA_SCHUR_MATRIX.

Direction: out

Memory: host

Associated parameter type: int64_t[3]

enumerator CUDSS_DATA_SCHUR_MATRIX#

Schur complement matrix as a cudssMatrix_t object.

Direction: in

Memory: host (with underlying buffers on host or device)

Associated parameter type: cudssMatrix_t

enumerator CUDSS_DATA_USER_HOST_INTERRUPT#

User-provided host interrupt pointer. Setting this pointer to a non-NULL value enables user interruption of cuDSS routines on the host. Specifically, in the calls to cudssExecute() cuDSS will perform checks on the host for the value of the pointer. If it is non-zero, execution stops and cuDSS returns early (before the phase execution completes) with the status CUDSS_STATUS_EXECUTION_FAILED. In case the application needs to continue cuDSS execution after the interruption, users can simply set the pointer’s value to 0 and repeat the call. To remove the (small) overhead of checking for the host interruption, the interruption can be completely disabled by setting CUDSS_DATA_USER_HOST_INTERRUPT to NULL.

Note: since most of device kernels inside cuDSS are launched asynchronously, execution on the GPU may continue some time after the host execution has been interrupted.

Direction: in

Memory: host

Default value: NULL (user interruption disabled)

Associated parameter type: int

cudssPhase_t#

enum cudssPhase_t#
The enumerator specifies the solver phases to be performed in the main cuDSS routine cudssExecute(). Phases can be combined with the binary OR operator (|) to specify multiple phases at once: CUDSS_PHASE_FACTORIZATION | CUDSS_PHASE_SOLVE.
enumerator CUDSS_PHASE_REORDERING#

Reordering

enumerator CUDSS_PHASE_SYMBOLIC_FACTORIZATION#

Symbolic factorization

Note: it is not allowed to call symbolic factorization twice without calling the reordering phase in-between.

enumerator CUDSS_PHASE_ANALYSIS#

Reordering and symbolic factorization combined

enumerator CUDSS_PHASE_FACTORIZATION#

Numerical factorization

enumerator CUDSS_PHASE_REFACTORIZATION#

Numerical re-factorization.

Note: For now it is only used if the reordering algorithm is set to CUDSS_ALG_1 or CUDSS_ALG_2. Otherwise, it is the same as the CUDSS_PHASE_FACTORIZATION phase.

enumerator CUDSS_PHASE_SOLVE_FWD_PERM#

Applying the reordering permutation to the right-hand side before the forward solve (forward substitution).

Note: solve sub-phases are not supported when CUDSS_ALG_1 or CUDSS_ALG_2 are used for reordering.

enumerator CUDSS_PHASE_SOLVE_FWD#

Forward substitution sub-step of the solving phase.

Note: this phase includes the local permutation due to the partial pivoting. To remove this effect, if undesired, one can use a combination of reordering and final permutation (see CUDSS_DATA_PERM_REORDER_ROW and CUDSS_DATA_PERM_ROW).

Note: solve sub-phases are not supported when CUDSS_ALG_1 or CUDSS_ALG_2 are used for reordering.

enumerator CUDSS_PHASE_SOLVE_DIAG#

Diagonal solve sub-step of the solving phase (only needed for symmetric/Hermitian matrix types).

Note: solve sub-phases are not supported when CUDSS_ALG_1 or CUDSS_ALG_2 are used for reordering.

enumerator CUDSS_PHASE_SOLVE_BWD#

Backward substitution sub-step of the solving phase.

Note: this phase includes the local permutation due to the partial pivoting. To remove this effect, if undesired, one can use a combination of reordering and final permutation (see CUDSS_DATA_PERM_REORDER_ROW and CUDSS_DATA_PERM_ROW).

Note: solve sub-phases are not supported when CUDSS_ALG_1 or CUDSS_ALG_2 are used for reordering.

enumerator CUDSS_PHASE_SOLVE_BWD_PERM#

Applying the inverse reordering permutation to the intermediate solution after the backward solve (backward substitution). If matching (and scaling) is enabled, this phase also includes applying the inverse matching permutation and inverse scaling (as the matching permutation and scalings were used to modify the matrix before the factorization).

Note: solve sub-phases are not supported when CUDSS_ALG_1 or CUDSS_ALG_2 are used for reordering.

enumerator CUDSS_PHASE_SOLVE_REFINEMENT#

Iterative refinement (see configuration settings CUDSS_CONFIG_IR_N_STEPS and CUDSS_CONFIG_IR_TOL).

Note: solve sub-phases are not supported when CUDSS_ALG_1 or CUDSS_ALG_2 are used for reordering.

enumerator CUDSS_PHASE_SOLVE#

Full solving phase, combining all of the above (forward permutation + forward solve + diagonal solve + backward permutation + backward solve) and (optional) iterative refinement.

Note: combining the solve sub-phases should be preferred to calling sub-phases separately.

Note: Changing the sparse matrix for this phase is allowed but with restrictions, see limitations for cudssExecute().
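
As an illustration, a sketch of a typical phase sequence, fusing factorization and solve into one call (handle, config, data and the matrix objects A, x, b are assumed to exist; the cudssExecute() argument order follows the cuDSS getting-started example):

cudssExecute(handle, CUDSS_PHASE_ANALYSIS, config, data, A, x, b);
cudssExecute(handle, CUDSS_PHASE_FACTORIZATION | CUDSS_PHASE_SOLVE,
             config, data, A, x, b);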

cudssMatrixFormat_t#

enum cudssMatrixFormat_t#
The enumerator specifies the underlying matrix formats inside a cuDSS matrix object.
enumerator CUDSS_MFORMAT_DENSE#

Matrix is dense (applies to a single matrix and a batch equally)

enumerator CUDSS_MFORMAT_CSR#

Matrix is in CSR format (applies to a single matrix and a batch equally)

Note: Only 3-array CSR is supported.

enumerator CUDSS_MFORMAT_BATCH#

Matrix object represents a batch of matrices

Note: The format flags can be combined. E.g., creating a cudssMatrix_t via cudssMatrixCreateBatchCsr() would set both CUDSS_MFORMAT_CSR and CUDSS_MFORMAT_BATCH flags. One can check for a mixture of flags via bit-wise operations, e.g. CUDSS_MFORMAT_CSR | CUDSS_MFORMAT_BATCH for a batch of CSR matrices.
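
For illustration, a sketch of such a check (assuming, as implied by the note above, that the combined flags are returned as an integer mask by cudssMatrixGetFormat()):

int format = 0;
cudssMatrixGetFormat(A, &format);
if ((format & CUDSS_MFORMAT_CSR) && (format & CUDSS_MFORMAT_BATCH)) {
    /* A is a batch of CSR matrices */
}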

cudssMatrixType_t#

enum cudssMatrixType_t#
The enumerator specifies available matrix types for sparse matrices. The matrix type should be used to describe the properties of the underlying matrix storage. The matrix type affects the decision about what type of factorization is computed by the solver. E.g., when the matrix type is one of the positive-definite types, checks for singular values on the diagonal are not done.
enumerator CUDSS_MTYPE_GENERAL#

General matrix [default]. LDU factorization will be computed with optional local or global pivoting.

enumerator CUDSS_MTYPE_SYMMETRIC#

Real symmetric matrix. LDL^T factorization will be computed with optional local pivoting

enumerator CUDSS_MTYPE_HERMITIAN#

Complex Hermitian matrix. LDL^H factorization will be computed with optional local pivoting

enumerator CUDSS_MTYPE_SPD#

Symmetric positive-definite matrix. Cholesky factorization will be computed with optional local pivoting.

Note: if the matrix passed with this matrix type appears to have zero minors (at least numerically), one can get the 1-based index of the first encountered non-positive minor by calling cudssDataGet() with CUDSS_DATA_INFO. As this would be a device-side error, the call to cudssExecute() may still return CUDSS_STATUS_SUCCESS.

enumerator CUDSS_MTYPE_HPD#

Hermitian positive-definite matrix. Complex Cholesky factorization will be computed with optional local pivoting.

Note: if the matrix passed with this matrix type appears to have zero minors (at least numerically), one can get the 1-based index of the first non-positive minor by calling cudssDataGet() with CUDSS_DATA_INFO. As this would be a device-side error, the call to cudssExecute() may still return CUDSS_STATUS_SUCCESS.

cudssMatrixViewType_t#

enum cudssMatrixViewType_t#
The enumerator specifies available matrix view types for sparse matrices. The matrix view defines how the matrix is treated by the main cuDSS routine cudssExecute(). E.g., to provide only upper-triangle data for a symmetric matrix one can use CUDSS_MTYPE_SYMMETRIC as the matrix type combined with CUDSS_MVIEW_UPPER as the matrix view. If the accompanying matrix type is CUDSS_MTYPE_GENERAL, the matrix view is ignored.
enumerator CUDSS_MVIEW_FULL#

Full matrix [default]

enumerator CUDSS_MVIEW_LOWER#

Lower-triangular matrix (including the diagonal). All values above the main diagonal will be ignored.

enumerator CUDSS_MVIEW_UPPER#

Upper-triangular matrix (including the diagonal). All values below the main diagonal will be ignored.

cudssIndexBase_t#

enum cudssIndexBase_t#
The enumerator specifies the indexing base (0 or 1) for sparse matrix indices (row start/end offsets and column indices). Once set for a sparse matrix, cudssExecute() will use the indexing base from the input sparse matrix for all index-related data (e.g. output from cudssDataGet() called with CUDSS_DATA_PERM_REORDER_ROW).
enumerator CUDSS_BASE_ZERO#

Zero-based indexing [default]

enumerator CUDSS_BASE_ONE#

One-based indexing

cudssLayout_t#

enum cudssLayout_t#
The enumerator specifies dense matrix layout.
enumerator CUDSS_LAYOUT_COL_MAJOR#

Column-major layout [default]

enumerator CUDSS_LAYOUT_ROW_MAJOR#

Row-major layout. Currently not supported.

cudssAlgType_t#

enum cudssAlgType_t#
The enumerator specifies algorithm choices to be made for different solver settings like reordering, factorization and other algorithms.
enumerator CUDSS_ALG_DEFAULT#

Default value [default]. Uses the default algorithm (decided by cuDSS).

enumerator CUDSS_ALG_1#

First algorithm. See the description of the specific configuration setting (algorithm for reordering, factorization, etc.).

enumerator CUDSS_ALG_2#

Second algorithm. See the description of the specific configuration setting (algorithm for reordering, factorization, etc.).

enumerator CUDSS_ALG_3#

Third algorithm. See the description of the specific configuration setting (algorithm for reordering, factorization, etc.).

enumerator CUDSS_ALG_4#

Fourth algorithm. See the description of the specific configuration setting (algorithm for reordering, factorization, etc.).

enumerator CUDSS_ALG_5#

Fifth algorithm. See the description of the specific configuration setting (algorithm for reordering, factorization, etc.).

Different values represent different algorithms (for reordering, factorization, etc.) and can lead to significant differences in accuracy and performance. It is currently recommended to use CUDSS_ALG_DEFAULT and only in case accuracy or performance are not sufficient, one can experiment with other values.

cudssPivotType_t#

enum cudssPivotType_t#
The enumerator specifies the type of pivoting to be performed.
enumerator CUDSS_PIVOT_COL#

Column-based pivoting [default]

enumerator CUDSS_PIVOT_ROW#

Row-based pivoting

enumerator CUDSS_PIVOT_NONE#

No pivoting

Communication Layer (Distributed Interface) Types#

cudssDistributedInterface_t#

struct cudssDistributedInterface_t#
This struct defines all communication primitives which need to be implemented in (any) implementation of the cuDSS communication layer; see the MGMN mode documentation for more details.
Note: all communication layer API functions below take an argument of type void * for comm.
This parameter should be interpreted in the implementation based on the underlying communication
backend to be used with the particular communication layer. E.g., for OpenMPI, comm should be
treated as the OpenMPI communicator.
Note: most of the APIs below take an argument called stream of type
cudaStream_t and must be stream-ordered. For GPU-aware communication backends like OpenMPI,
this implies the need to do explicit cudaStreamSynchronize() in the communication layer implementation.
int cudssCommRank(void *comm, int *rank)#
A function pointer to a routine which returns the rank of the process in a communicator.
Param comm:

[in] A pointer to the communicator.

Param rank:

[out] Rank of the calling process in the communicator.

Returns:

[out] The error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int cudssCommSize(void *comm, int *size)#
A function pointer to a routine which returns number of processes in a communicator.
Param comm:

[in] A pointer to the communicator.

Param size:

[out] Number of processes in the communicator.

Returns:

[out] The error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.
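
For illustration, a minimal sketch of these two query primitives for an MPI-based layer (the my_* helper names are hypothetical; comm is treated as a pointer to an MPI_Comm, matching how the communicator is passed via CUDSS_DATA_COMM):

#include <mpi.h>

static int my_comm_rank(void *comm, int *rank)
{
    return (MPI_Comm_rank(*(MPI_Comm *)comm, rank) == MPI_SUCCESS) ? 0 : 1;
}

static int my_comm_size(void *comm, int *size)
{
    return (MPI_Comm_size(*(MPI_Comm *)comm, size) == MPI_SUCCESS) ? 0 : 1;
}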

int cudssSend(
const void *buffer,
int count,
cudaDataType_t datatype,
int dest,
int tag,
void *comm,
cudaStream_t stream
)#
A function pointer to a routine which performs a blocking send.
Param buffer:

[in] Initial address of the send device buffer.

Param count:

[in] Number of elements (of type datatype) to be sent.

Param datatype:

[in] CUDA datatype of elements to be sent.

Param dest:

[in] Rank of the receiving process (destination).

Param tag:

[in] Message tag.

Param comm:

[in] A pointer to the communicator.

Param stream:

[in] CUDA stream.

Returns:

[out] The error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int cudssRecv(
void *buffer,
int count,
cudaDataType_t datatype,
int source,
int tag,
void *comm,
cudaStream_t stream
)#
A function pointer to a routine which performs a blocking receive for a message.
Param buffer:

[out] Initial address of the receive device buffer.

Param count:

[in] Number of elements (of type datatype) to be received.

Param datatype:

[in] CUDA datatype of elements to be received.

Param source:

[in] Rank of the sending process (source).

Param tag:

[in] Message tag.

Param comm:

[in] A pointer to the communicator.

Param stream:

[in] CUDA stream.

Returns:

[out] The error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int cudssBcast(
void *buffer,
int count,
cudaDataType_t datatype,
int root,
void *comm,
cudaStream_t stream
)#
A function pointer to a routine which performs a broadcast for a message from the root process
to all other processes of the communicator.
Param buffer:

[in] Initial address of the broadcast device buffer.

Param count:

[in] Number of elements (of type datatype) to be broadcast.

Param datatype:

[in] CUDA datatype of elements to be broadcast.

Param root:

[in] Rank of the sending process (source).

Param comm:

[in] A pointer to the communicator.

Param stream:

[in] CUDA stream.

Returns:

[out] The error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int cudssReduce(
const void *sendbuf,
void *recvbuf,
int count,
cudaDataType_t datatype,
cudssOpType_t op,
int root,
void *comm,
cudaStream_t stream
)#
A function pointer to a routine which performs a reduction of values from all processes to a single value on the root process.
Param sendbuf:

[in] Address of the send buffer.

Param recvbuf:

[out] Address of the receive buffer.

Param count:

[in] Number of elements (of type datatype) to be received.

Param datatype:

[in] CUDA datatype of elements to be received.

Param op:

[in] Type of the reduction operation to be performed, see cudssOpType_t for supported values.

Param root:

[in] Rank of the root process (source).

Param comm:

[in] A pointer to the communicator.

Param stream:

[in] CUDA stream.

Returns:

[out] The error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int cudssAllreduce(
const void *sendbuf,
void *recvbuf,
int count,
cudaDataType_t datatype,
cudssOpType_t op,
void *comm,
cudaStream_t stream
)#
A function pointer to a routine which performs a reduction of values on all processes to a single value and
distributes the result back to all processes.
Param sendbuf:

[in] Address of the send buffer.

Param recvbuf:

[out] Address of the receive buffer.

Param count:

[in] Number of elements (of type datatype) to be received.

Param datatype:

[in] CUDA datatype of elements to be received.

Param op:

[in] Type of the reduction operation to be performed, see cudssOpType_t for supported values.

Param comm:

[in] A pointer to the communicator.

Param stream:

[in] CUDA stream.

Returns:

[out] The error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int cudssScatterv(
const void *sendbuf,
const int *sendcounts,
const int *displs,
cudaDataType_t sendtype,
void *recvbuf,
int recvcount,
cudaDataType_t recvtype,
int root,
void *comm,
cudaStream_t stream
)#
A function pointer to a routine which performs a scatter operation on a buffer in parts to all processes in a communicator.
Param sendbuf:

[in] Address of the send buffer.

Param sendcounts:

[in] Non-negative integer array (of length communicator size) specifying the number of elements to send to each rank.

Param displs:

[in] An array of integers of length communicator size. Entry i specifies the displacement (relative to sendbuf) from which to take the outgoing data to process i.

Param sendtype:

[in] CUDA datatype of elements to be sent.

Param recvbuf:

[out] Address of the receive buffer.

Param recvcount:

[in] Number of elements in receive buffer (non-negative integer).

Param recvtype:

[in] CUDA datatype of elements to be received.

Param root:

[in] Rank of the sending process (source).

Param comm:

[in] A pointer to the communicator.

Param stream:

[in] CUDA stream.

Returns:

[out] The error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int cudssCommSplit(
const void *comm,
int color,
int key,
void *new_comm
)#
A function pointer to a routine which creates new communicators based on colors and keys.
Param comm:

[in] A pointer to the communicator to be split.

Param color:

[in] Control of the subset assignment. Processes with the same color are grouped together.

Param key:

[in] Control of the rank assignment. Processes in the new communicator are ordered based on the keys.

Param new_comm:

[out] A pointer to the new communicator defined w.r.t to colors.

Returns:

[out] The error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

int cudssCommFree(void *comm)#
A function pointer to a routine which deallocates resources of a communicator.
Param comm:

[in] A pointer to the communicator to be freed.

Returns:

[out] The error status (as int) of the invocation. Must return 0 on success and any nonzero integer otherwise.

cudssOpType_t#

enum cudssOpType_t#
The enumerator specifies reduction operation to be used when calling
communication layer APIs cudssReduce() or cudssAllreduce().
enumerator CUDSS_SUM#

Reduced elements are added together.

enumerator CUDSS_MAX#

Maximum element is found among the reduced elements.

enumerator CUDSS_MIN#

Minimum element is found among the reduced elements.

Threading Layer Types#

cudssThreadingInterface_t#

struct cudssThreadingInterface_t#
This struct defines all threading primitives which need to be implemented in (any) implementation of the cuDSS threading layer; see the MT mode documentation for more details.
int cudssGetMaxThreads()#
A function pointer to a routine which returns the maximum number of threads on the CPU that can be used by cuDSS for parallel execution.
Returns:

[out] The maximum number of threads on the CPU that can be used by cuDSS for parallel execution.

void cudssParallelFor(
int nthr_requested,
int ntasks,
void *ctx,
cudss_thr_func_t f
)#

A function pointer to a routine which opens a parallel for section with the requested number of threads and calls the cudss_thr_func_t function f (see cudss_threading_interface.h for details) in the for loop with ntasks iterations.

Param nthr_requested:

[in] Requested number of threads for the parallel section.

Param ntasks:

[in] Number of tasks in the parallel for loop.

Param ctx:

[in] A pointer to the input data for cudss_thr_func_t f, which defines parallel units of work, also called tasks:

void cudss_thr_func_t(int task, void *ctx)#
Parameters:
  • task – [in] The task number (or iteration count of the parallel loop).

  • ctx – [in] A pointer to the input data.
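
For illustration, a minimal OpenMP-based sketch of the two threading primitives (the my_* helper names are hypothetical; the actual layer should follow the prototypes from cudss_threading_interface.h):

#include <omp.h>
#include "cudss_threading_interface.h"   /* defines cudss_thr_func_t */

static int my_get_max_threads(void)
{
    return omp_get_max_threads();
}

static void my_parallel_for(int nthr_requested, int ntasks, void *ctx,
                            cudss_thr_func_t f)
{
    #pragma omp parallel for num_threads(nthr_requested)
    for (int task = 0; task < ntasks; task++)
        f(task, ctx);   /* each iteration executes one task */
}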