Troubleshooting
Use the cuDNN error reporting and API logging utility to troubleshoot applications developed with the cuDNN backend API.
Error Reporting And API Logging
cuDNN error reporting and API logging is a utility for recording cuDNN API execution and error information. For each cuDNN API function call, all input parameters are reported in the API log. If errors occur during execution of the cuDNN API, a traceback of the error conditions can also be reported to help with troubleshooting. This functionality is disabled by default and can be enabled by setting CUDNN_LOGDEST_DBG and CUDNN_LOGLEVEL_DBG. This section provides more details on how to set up this logging.
CUDNN_LOGLEVEL_DBG can be set to one of four levels, which are aligned with the numerical values of enum cudnnSeverity_t. The accepted numerical values are:
0: no error and no performance penalty
1: error
2: warning and error
3: info, warning and error
Out-of-bounds integer values are clipped to either 0 or 3, whichever is closest.
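The clipping rule and the level-to-severity mapping above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not cuDNN code; the names `clamp_log_level`, `ENABLED`, and `enabled_severities` are hypothetical.

```python
def clamp_log_level(value: int) -> int:
    """Clip an integer to the documented range 0..3."""
    return max(0, min(3, value))


# Severities reported at each level, copied from the list in the text.
ENABLED = {
    0: (),
    1: ("error",),
    2: ("warning", "error"),
    3: ("info", "warning", "error"),
}


def enabled_severities(value: int) -> tuple:
    """Return the severities that would be logged for a requested level."""
    return ENABLED[clamp_log_level(value)]


print(enabled_severities(7))   # 7 is clipped to 3
print(enabled_severities(-2))  # -2 is clipped to 0
```

Each level reports everything the level below it reports plus one less severe category, which is why level 3 is the most verbose setting.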
Note that the logging levels are nested: enabling a less severe level also enables the more severe ones. The legacy logging environment variables CUDNN_LOGERR_DBG, CUDNN_LOGWARN_DBG, and CUDNN_LOGINFO_DBG are deprecated and will continue to work during the v9 grace period for compatibility.
The log output contains variable names, data types, parameter values, device pointers, process ID, thread ID, cuDNN handle, CUDA stream ID, and metadata such as time of the function call in microseconds.
With severity level 3, the log will include an entry after each public API call (that is, “info” logging), as in this example:
cuDNN (v90000) function cudnnSetActivationDescriptor() called:
    mode: type=cudnnActivationMode_t; val=CUDNN_ACTIVATION_RELU (1);
    reluNanOpt: type=cudnnNanPropagation_t; val=CUDNN_NOT_PROPAGATE_NAN (0);
    coef: type=double; val=1000.000000;
Time: 2024-01-21T14:14:21.366171 (0d+0h+1m+5s since start)
Process: 21264, Thread: 21264, cudnn_handle: NULL, cudnn_stream: NULL.
This logging includes the public API name and the arguments or attributes (since v8) used in the API call, confirming what cuDNN has received. It also provides information that helps users understand the sequence of events, such as time, process, thread, and stream.
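When many such entries need to be inspected, it can help to pull the function name and parameters out of each one. The sketch below parses the sample entry shown above. The regular expressions are assumptions inferred from that single sample, so the format may differ in other cuDNN versions; `parse_info_entry` is a hypothetical helper, not part of cuDNN.

```python
import re

# The "info" sample from the text, joined into one string.
SAMPLE = (
    "cuDNN (v90000) function cudnnSetActivationDescriptor() called: "
    "mode: type=cudnnActivationMode_t; val=CUDNN_ACTIVATION_RELU (1); "
    "reluNanOpt: type=cudnnNanPropagation_t; val=CUDNN_NOT_PROPAGATE_NAN (0); "
    "coef: type=double; val=1000.000000; "
    "Time: 2024-01-21T14:14:21.366171 (0d+0h+1m+5s since start) "
    "Process: 21264, Thread: 21264, cudnn_handle: NULL, cudnn_stream: NULL."
)

# Assumed patterns: a header with version and function name, then
# "name: type=T; val=V;" triples for each argument.
HEADER_RE = re.compile(r"cuDNN \(v(\d+)\) function (\w+)\(\) called:", re.IGNORECASE)
PARAM_RE = re.compile(r"(\w+): type=([^;]+); val=([^;]+);")


def parse_info_entry(entry: str) -> dict:
    """Extract version, function name, and parameters from one log entry."""
    header = HEADER_RE.search(entry)
    if header is None:
        raise ValueError("not a cuDNN API log entry")
    params = {
        name: {"type": typ, "val": val}
        for name, typ, val in PARAM_RE.findall(entry)
    }
    return {
        "version": int(header.group(1)),
        "function": header.group(2),
        "params": params,
    }


parsed = parse_info_entry(SAMPLE)
print(parsed["function"], parsed["params"]["coef"]["val"])
```

Because the patterns use `\s`-agnostic matching on each triple, the same function works whether the entry is on one line or spread over several.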
With severity level 2 or 3, the log will include warnings, as shown below:
CuDNN (v90100) function cudnnBackendFinalize() called:
    Warning: CUDNN_STATUS_NOT_SUPPORTED_SHAPE; Reason: Embedding dim per head for q and k is not a multiple of 8 at: d_qk % 8 != 0
    Warning: CUDNN_STATUS_NOT_SUPPORTED_SHAPE; Reason: generate_attention_desc()
    Warning: CUDNN_STATUS_NOT_SUPPORTED_SHAPE; Reason: generate_flash_fprop_mha_fort()
    Warning: CUDNN_STATUS_NOT_SUPPORTED_SHAPE; Reason: generate_mha_fort()
    Warning: CUDNN_STATUS_NOT_SUPPORTED_SHAPE; Reason: construct_fort_tree()
    Warning: CUDNN_STATUS_NOT_SUPPORTED_SHAPE; Reason: kernelGeneration()
    Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: ptr.isSupported()
    Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: engine_post_checks(*engine_iface, engine.getPerfKnobs(), req_size, engine.getTargetSMCount())
    Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: finalize_internal()
Time: 2024-03-07T14:56:33.925524 (0d+0h+0m+0s since start)
Process=1098828; Thread=1098828; GPU=NULL; Handle=NULL; StreamId=NULL.
In addition to the status code enum name for the warning or error, the log includes a reason message. For many warnings and errors, the library will provide a more explanatory message, like the first warning in the preceding example. Other reason messages may be auto-generated from the cuDNN source code, and may be harder to interpret. Over time, we plan to add more explanatory messages throughout the codebase.
A traceback is also included to provide additional coverage. Where a plain-English message is missing, the names of the function calls may still provide hints. Within the traceback, each message may have its own severity and is only reported when the respective severity level is enabled. The traceback messages are printed in reverse order of execution, so the messages at the top are closest to the root cause and tend to be more helpful for debugging. The previous example is a warning traceback. The following example is an error traceback.
cuDNN (v8300) function cudnnBackendFinalize() called:
    Info: Traceback contains 5 message(s)
        Error: CUDNN_STATUS_BAD_PARAM; reason: out <= 0
        Error: CUDNN_STATUS_BAD_PARAM; reason: is_valid_spacial_dim(xSpatialDimA[dim], wSpatialDimA[dim], ySpatialDimA[dim], cDesc.getPadLowerA()[dim], cDesc.getPadUpperA()[dim], cDesc.getStrideA()[dim], cDesc.getDilationA()[dim])
        Error: CUDNN_STATUS_BAD_PARAM; reason: is_valid_convolution(xDesc, wDesc, cDesc, yDesc)
        Error: CUDNN_STATUS_BAD_PARAM; reason: convolution.init(xDesc, wDesc, cDesc, yDesc)
        Error: CUDNN_STATUS_BAD_PARAM; reason: finalize_internal()
Time: 2021-10-05T17:11:07.935640 (0d+0h+0m+15s since start)
Process=87720; Thread=87720; GPU=NULL; Handle=NULL; StreamId=NULL.
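Since the top of a traceback is the most useful part, a log-processing script might extract the messages and surface the first one. The following is a minimal sketch that parses the error traceback above. The regular expression is an assumption based on the two samples in this section (which spell "Reason" and "reason" differently), and `parse_traceback` is a hypothetical helper, not a cuDNN function.

```python
import re

# The error traceback sample from the text, joined into one string.
ERROR_LOG = (
    "cuDNN (v8300) function cudnnBackendFinalize() called: "
    "Info: Traceback contains 5 message(s) "
    "Error: CUDNN_STATUS_BAD_PARAM; reason: out <= 0 "
    "Error: CUDNN_STATUS_BAD_PARAM; reason: is_valid_spacial_dim("
    "xSpatialDimA[dim], wSpatialDimA[dim], ySpatialDimA[dim], "
    "cDesc.getPadLowerA()[dim], cDesc.getPadUpperA()[dim], "
    "cDesc.getStrideA()[dim], cDesc.getDilationA()[dim]) "
    "Error: CUDNN_STATUS_BAD_PARAM; reason: is_valid_convolution(xDesc, wDesc, cDesc, yDesc) "
    "Error: CUDNN_STATUS_BAD_PARAM; reason: convolution.init(xDesc, wDesc, cDesc, yDesc) "
    "Error: CUDNN_STATUS_BAD_PARAM; reason: finalize_internal() "
    "Time: 2021-10-05T17:11:07.935640 (0d+0h+0m+15s since start) "
    "Process=87720; Thread=87720; GPU=NULL; Handle=NULL; StreamId=NULL."
)

# Assumed message shape: "<Severity>: <STATUS>; reason: <text>", where the
# text runs until the next message, the "Time:" field, or the end of input.
MESSAGE_RE = re.compile(
    r"(Error|Warning): (CUDNN_STATUS_\w+); reason: (.*?)"
    r"(?=\s+(?:Error|Warning|Info|Time): |\s*$)",
    re.IGNORECASE | re.DOTALL,
)


def parse_traceback(entry: str) -> list:
    """Return (severity, status, reason) tuples in printed order."""
    return MESSAGE_RE.findall(entry)


messages = parse_traceback(ERROR_LOG)
root_cause = messages[0]  # printed first, so closest to the root cause
print(len(messages), root_cause)
```

The lookahead accepts any whitespace between messages, so the same pattern should handle both single-line and multi-line renderings of the log.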
There are two methods, described below, to enable error/warning reporting and API logging. For convenience, the log output can be handled by the built-in default callback function, which directs the output to a log file or to standard I/O, as designated by the user. Users may also write their own callback function to handle this information programmatically and pass its function pointer to cudnnSetCallback().
Method 1: Using Environment Variables
To enable API logging using environment variables, follow these steps:
1. Set the environment variable CUDNN_LOGLEVEL_DBG to 0, 1, 2, or 3.
2. Set the environment variable CUDNN_LOGDEST_DBG to either NULL, stdout, stderr, or a user-desired file path. For example, /home/userName1/log.txt.
3. Include the conversion specifiers in the file name. For example:
   - To include date and time in the file name, use the date and time conversion specifiers: log_%Y_%m_%d_%H_%M_%S.txt. The conversion specifiers will be automatically replaced with the date and time when the program is initiated, resulting in log_2017_11_21_09_41_00.txt.
   - To include the process id in the file name, use the %i conversion specifier: log_%Y_%m_%d_%H_%M_%S_%i.txt for the result: log_2017_11_21_09_41_00_21264.txt when the process id is 21264. When you have several processes running, using the process id conversion specifier will prevent these processes from writing to the same file at the same time.
Note
The supported conversion specifiers are similar to the strftime function. If the file already exists, the log will overwrite the existing file.
These environment variables are only checked once, at initialization. Any subsequent changes to them will not take effect in the current run. Also note that these environment settings can be overridden by Method 2 below.
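The steps above can be sketched in Python. The first helper imitates the documented file-name expansion so that a script can predict which log file a run will produce; it is only an approximation built on Python's `strftime`, not the library's own implementation. The second helper builds an environment for a child process, since the variables must be set before cuDNN initializes. Both function names are hypothetical.

```python
import os
from datetime import datetime


def expand_log_path(template: str, started_at: datetime, pid: int) -> str:
    """Approximate the documented expansion of a log file name."""
    # Substitute %i first because it is not a strftime directive in Python.
    # A literal "%%i" is not handled; this is a sketch, not a full parser.
    return started_at.strftime(template.replace("%i", str(pid)))


def cudnn_logging_env(level: int, dest: str, base_env=None) -> dict:
    """Return a copy of the environment with the two logging variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDNN_LOGLEVEL_DBG"] = str(level)
    env["CUDNN_LOGDEST_DBG"] = dest
    return env


# Reproduces the example from the text: process id 21264, started at
# 2017-11-21 09:41:00.
example_path = expand_log_path(
    "log_%Y_%m_%d_%H_%M_%S_%i.txt", datetime(2017, 11, 21, 9, 41, 0), 21264
)
print(example_path)  # log_2017_11_21_09_41_00_21264.txt

# An empty base environment is used here only to keep the printed result small.
env = cudnn_logging_env(2, "/home/userName1/log_%Y_%m_%d_%i.txt", base_env={})
print(env)
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching the application that uses cuDNN. Changing `os.environ` after the library has initialized in the current process would have no effect, as noted above.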
| Environment Variable | | |
|---|---|---|
| | No logging output and no performance loss | Logging to |
| | No logging output and no performance loss | No logging output and no performance loss |
| | No logging output and no performance loss | Logging to |
| | No logging output and no performance loss | Logging to |
Method 2: Using the API
To use API function calls to enable API logging, refer to the API description of cudnnSetCallback() and cudnnGetCallback().
cudnnGetLastErrorString() fetches and clears the latest error message. Inside the cuDNN library, the messages are stored in thread-local buffers.
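The fetch-and-clear behavior combined with thread-local storage has a practical consequence: a message has to be retrieved on the thread where the error occurred, and it can be retrieved only once. The toy model below illustrates that idea in plain Python. It does not call cuDNN, the names `record_error` and `fetch_and_clear_last_error` are hypothetical, and returning an empty string when nothing is stored is an assumption of this sketch.

```python
import threading

# One independent buffer per thread, standing in for the library's storage.
_buffers = threading.local()


def record_error(message: str) -> None:
    """Store a message in the calling thread's buffer."""
    _buffers.last_error = message


def fetch_and_clear_last_error() -> str:
    """Return the calling thread's latest message, then clear it."""
    message = getattr(_buffers, "last_error", "")
    _buffers.last_error = ""
    return message


# The main thread records an error.
record_error("CUDNN_STATUS_BAD_PARAM; reason: out <= 0")

# A different thread sees nothing, because its buffer is separate.
seen_by_worker = []
worker = threading.Thread(
    target=lambda: seen_by_worker.append(fetch_and_clear_last_error())
)
worker.start()
worker.join()

first = fetch_and_clear_last_error()   # the recorded message
second = fetch_and_clear_last_error()  # already cleared by the first fetch
print(repr(seen_by_worker[0]), repr(first), repr(second))
```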