cuDSS debugging tips and tricks
Default mode
It is helpful to understand that there are two different types of errors which can occur when using cuDSS:
- host-side errors: These can be detected by checking the host-side status, since every cuDSS API function returns a value of type cudssStatus_t which can be compared against CUDSS_STATUS_SUCCESS.
- device-side errors: These can occur asynchronously and thus require extra effort to localize precisely. The main routine of cuDSS, cudssExecute(), can detect and report some of the device-side errors if the user calls cudssDataGet() for CUDSS_DATA_INFO.
  - Note: calling cudssDataGet() performs a stream synchronization internally, so every call comes at the cost of that synchronization. For performance reasons, it is suggested that users insert as few calls to cudssDataGet() as possible when not debugging.
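The host-side check described above can be wrapped in a small helper so that each failing call reports its own location. The following is a minimal sketch, not part of cuDSS: status_failed and STATUS_FAILED are hypothetical names, and the helper works on plain integer status codes so that it compiles without the cuDSS headers. The cuDSS calls appear only in the closing comment. That comment assumes valid handle, config, data, A, x and b objects, and assumes that CUDSS_DATA_INFO is returned as an int in which a nonzero value indicates a device-side problem.

```c
#include <stdio.h>

/* Hypothetical helper (not a cuDSS function): compares a status code with
 * the expected success value and reports where a mismatch happened.
 * Returns 1 on failure, 0 on success. */
static int status_failed(int status, int success, const char *expr,
                         const char *file, int line)
{
    if (status == success) {
        return 0;
    }
    fprintf(stderr, "%s:%d: %s returned status %d\n", file, line, expr, status);
    return 1;
}

/* Evaluates `call` exactly once and yields 1 if it did not return `success`. */
#define STATUS_FAILED(call, success) \
    status_failed((int)(call), (int)(success), #call, __FILE__, __LINE__)

/* Intended use with cuDSS (requires cudss.h and valid cuDSS objects):
 *
 *   if (STATUS_FAILED(cudssExecute(handle, CUDSS_PHASE_FACTORIZATION,
 *                                  config, data, A, x, b),
 *                     CUDSS_STATUS_SUCCESS)) {
 *       handle the host-side error here
 *   }
 *
 *   int    info    = 0;   assumed to be int-sized
 *   size_t written = 0;
 *   if (!STATUS_FAILED(cudssDataGet(handle, data, CUDSS_DATA_INFO,
 *                                   &info, sizeof(info), &written),
 *                      CUDSS_STATUS_SUCCESS) && info != 0) {
 *       a device-side error was reported (assumption: nonzero means error)
 *   }
 */
```

Because the cudssDataGet() query synchronizes the stream, it is probably best placed only after the phases under investigation rather than after every call.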
Also, when debugging issues which may be related to cuDSS, the following general tips apply:
- It is advisable to turn on logging. For example, re-running with CUDSS_LOG_LEVEL=5 might provide useful information regarding the cause of the issue.
- Since cuDSS supports asynchronous execution, one can insert synchronization between cuDSS calls (in particular, around calls to cudssExecute()) to get a more exact location of the cuDSS call which fails. Specifically, one can call cudaStreamSynchronize(stream) followed by cudaGetLastError() to check for potential CUDA API errors inside cuDSS. This can be combined with checking for cuDSS-specific device-side errors via cudssDataGet() as mentioned above.
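Since the extra synchronization is only wanted while debugging, one option is to guard it with an application-level switch. The sketch below is one possible way to do this and is not something cuDSS provides: APP_CUDSS_DEBUG is a hypothetical environment variable read by the application itself, and debug_flag_is_on and debug_sync_enabled are hypothetical helper names. The CUDA calls are shown only in the closing comment, which assumes a valid stream.

```c
#include <stdlib.h>
#include <string.h>

/* Interprets the value of a debug switch: unset, empty, or "0" means off;
 * any other value means on. */
static int debug_flag_is_on(const char *value)
{
    return value != NULL && value[0] != '\0' && strcmp(value, "0") != 0;
}

/* APP_CUDSS_DEBUG is a hypothetical, application-defined variable.
 * It is not read by cuDSS itself. */
static int debug_sync_enabled(void)
{
    return debug_flag_is_on(getenv("APP_CUDSS_DEBUG"));
}

/* Intended use after each cudssExecute() call
 * (requires stdio.h, cuda_runtime.h and a valid stream):
 *
 *   if (debug_sync_enabled()) {
 *       cudaError_t err = cudaStreamSynchronize(stream);
 *       if (err == cudaSuccess) {
 *           err = cudaGetLastError();
 *       }
 *       if (err != cudaSuccess) {
 *           fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
 *       }
 *       a cudssDataGet() query for CUDSS_DATA_INFO could follow here
 *   }
 */
```

With such a guard in place, a debugging run could set both the hypothetical APP_CUDSS_DEBUG and the cuDSS variable CUDSS_LOG_LEVEL=5 in the environment, while regular runs leave both unset and pay no synchronization cost.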
MGMN mode
While the MGMN mode of cuDSS is subject to all debugging tips described above, there are extra tips specific to this mode.
- A large variety of errors can occur in multi-GPU/multi-node environments. If issues occur when using the MGMN mode of cuDSS, it is advised to first check the health of the system configuration by running a simple benchmark of the specific communication backend used in the application, with the exact same launch parameters. E.g., for NCCL one could use NCCL tests.