cuDSS debugging tips and tricks

Default mode

Two different types of errors can occur when using cuDSS:

host-side errors: These are reported through the return value. All cuDSS API functions return a value of type cudssStatus_t, which should be compared against CUDSS_STATUS_SUCCESS.
device-side errors: These can occur asynchronously and thus require extra effort to localize precisely. The main routine of cuDSS, cudssExecute(), can detect some device-side errors; they are reported when the user calls cudssDataGet() with CUDSS_DATA_INFO.

Note: cudssDataGet() performs a stream synchronization internally, so each call carries the cost of that synchronization. For performance reasons, keep calls to cudssDataGet() to a minimum when not debugging.

When debugging issues that may be related to cuDSS, the following general tips also apply:

Turn on logging. For example, re-running with the environment variable CUDSS_LOG_LEVEL=5 might provide useful information about the cause of the issue.
Since cuDSS supports asynchronous execution, insert synchronization between cuDSS calls (in particular, around cudssExecute()) to locate the failing call more precisely.

Specifically, one can call cudaStreamSynchronize(stream) followed by cudaGetLastError() to check for potential CUDA errors raised inside cuDSS. This can be combined with checking for cuDSS-specific device-side errors via cudssDataGet(), as mentioned above.
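
The logging tip above can be applied without modifying the application, because CUDSS_LOG_LEVEL is picked up from the environment of the process. A minimal shell sketch follows; the wrapper name run_with_cudss_log and the binary ./my_cudss_app are hypothetical placeholders and not part of cuDSS.

```shell
# Hypothetical helper: run a command with cuDSS logging enabled for that
# command only, leaving the calling shell's environment unchanged.
run_with_cudss_log() {
    CUDSS_LOG_LEVEL=5 "$@"
}

# Example (./my_cudss_app is a placeholder for your own binary):
#   run_with_cudss_log ./my_cudss_app 2>&1 | tee cudss_debug.log
```

Scoping the variable to a single command keeps verbose logging out of later runs in the same shell, which matters when the same session is also used for performance measurements.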

MGMN mode

While the MGMN (multi-GPU multi-node) mode of cuDSS is subject to all the debugging tips described above, some extra tips are specific to this mode.

A large variety of errors can occur in multi-GPU/multi-node environments. If issues occur when using the MGMN mode of cuDSS, first check the health of the system configuration by running a simple benchmark of the communication backend used in the application, with exactly the same launch parameters.

For example, for NCCL one could use the NCCL tests.
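
For an NCCL backend, the health check described above could look like the following sketch. It rests on several assumptions: the NCCL tests are already built in a directory of your choosing, the all_reduce_perf benchmark is representative enough, message sizes from 8 bytes to 128 MB with one GPU per rank are adequate, and the launcher is passed in as arguments so that it matches the application's launch line. The helper name nccl_health_check and the paths shown are hypothetical.

```shell
# Hypothetical helper: run the NCCL all-reduce benchmark under the same
# launcher and launch parameters as the cuDSS application.
#   $1    directory containing a built copy of the NCCL tests
#   $2... launcher and its arguments, exactly as used for the application
nccl_health_check() {
    nccl_tests_dir="$1"
    shift
    # -b/-e: min/max message size, -f: size multiplication factor,
    # -g: GPUs per thread
    "$@" "$nccl_tests_dir/build/all_reduce_perf" -b 8 -e 128M -f 2 -g 1
}

# Example (path and launcher arguments are placeholders):
#   nccl_health_check "$HOME/nccl-tests" mpirun -np 4
```

Passing echo as the launcher prints the assembled command instead of running it, which is a cheap way to confirm that the launch line matches the one used for the application.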