Troubleshooting
The following sections help answer the most commonly asked questions regarding typical use cases.
Error Reporting And API Logging
cuDNN error reporting and API logging is a utility for recording cuDNN API execution and error information. For each cuDNN API function call, all input parameters are reported in the API log. If errors occur during the execution of the cuDNN API, a traceback of the error conditions can also be reported to help with troubleshooting. This functionality is disabled by default and can be enabled by setting CUDNN_LOGDEST_DBG and CUDNN_LOGLEVEL_DBG. This section provides more details on how to set up this logging.
CUDNN_LOGLEVEL_DBG can be set to one of four levels, which are aligned with the numerical values of enum cudnnSeverity_t. The numerical values accepted are:
- 0: no error and no performance penalty
- 1: error
- 2: warning and error
- 3: info, warning and error
Out-of-bounds integer values are clipped to either 0 or 3, whichever is closest.
Notice that the logging levels are nested, so enabling a less severe level also enables the more severe levels. The legacy logging environment variables CUDNN_LOGERR_DBG, CUDNN_LOGWARN_DBG, and CUDNN_LOGINFO_DBG are deprecated and will continue to work during the v9 grace period for compatibility.
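As a minimal sketch of the level semantics described above (a model of the documented behavior, not cuDNN's implementation), the following Python shows how an integer CUDNN_LOGLEVEL_DBG value is clipped to the range 0 to 3 and which severities each level enables. The helper names `clip_log_level` and `enabled_severities` are hypothetical and exist only for illustration.

```python
# Illustrative model of the documented CUDNN_LOGLEVEL_DBG semantics.
# Both helper names are hypothetical; cuDNN does not expose them.

SEVERITIES = ["error", "warning", "info"]  # severities added by levels 1, 2, 3

def clip_log_level(value):
    """Clip an integer level to the nearest accepted value in [0, 3]."""
    return max(0, min(3, int(value)))

def enabled_severities(value):
    """Return the severities reported at the given level (levels are nested)."""
    return SEVERITIES[:clip_log_level(value)]

for requested in (-5, 0, 1, 2, 3, 42):
    print(requested, "->", clip_log_level(requested), enabled_severities(requested))
```

Because the levels are nested, each level is a prefix of the next, which is why a single slice is enough to model them.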
The log output contains variable names, data types, parameter values, device pointers, process ID, thread ID, cuDNN handle, CUDA stream ID, and metadata such as time of the function call in microseconds.
With severity level 3, the log will include an entry after each public API call (that is, “info” logging), as in this example:

    cuDNN (v90000) function cudnnSetActivationDescriptor() called:
        mode: type=cudnnActivationMode_t; val=CUDNN_ACTIVATION_RELU (1);
        reluNanOpt: type=cudnnNanPropagation_t; val=CUDNN_NOT_PROPAGATE_NAN (0);
        coef: type=double; val=1000.000000;
    Time: 2024-01-21T14:14:21.366171 (0d+0h+1m+5s since start)
    Process: 21264, Thread: 21264, cudnn_handle: NULL, cudnn_stream: NULL.

This logging will include the public API name and arguments or attributes (since v8) used in the API call to confirm what cuDNN has received. It also tries to provide information essential to help the users’ understanding of the sequence of events, such as time, process, thread, and stream.
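To confirm programmatically what cuDNN received, an info entry can be split into its function name and parameters. The sketch below is an assumption-laden illustration: the regular expressions are derived only from the single example entry in this section, the helper name `parse_info_entry` is hypothetical, and entries from other API calls may need different patterns.

```python
import re

# One info-level entry, copied from the example in this section.
ENTRY = (
    "cuDNN (v90000) function cudnnSetActivationDescriptor() called: "
    "mode: type=cudnnActivationMode_t; val=CUDNN_ACTIVATION_RELU (1); "
    "reluNanOpt: type=cudnnNanPropagation_t; val=CUDNN_NOT_PROPAGATE_NAN (0); "
    "coef: type=double; val=1000.000000; "
    "Time: 2024-01-21T14:14:21.366171 (0d+0h+1m+5s since start) "
    "Process: 21264, Thread: 21264, cudnn_handle: NULL, cudnn_stream: NULL."
)

# Patterns inferred from the sample entry only (an assumption, not a spec).
FUNCTION = re.compile(r"function (\w+)\(\) called")
PARAM = re.compile(r"(\w+): type=([^;]+); val=([^;]+);")

def parse_info_entry(entry):
    """Return (function_name, {param: (type, value)}) for one info entry."""
    match = FUNCTION.search(entry)
    name = match.group(1) if match else None
    params = {p: (t, v) for p, t, v in PARAM.findall(entry)}
    return name, params

name, params = parse_info_entry(ENTRY)
print(name)
for param, (ctype, value) in params.items():
    print(f"  {param}: {ctype} = {value}")
```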
With severity level 2 or 3, the log will include warnings, as shown below:

    CuDNN (v90100) function cudnnBackendFinalize() called:
        Warning: CUDNN_STATUS_NOT_SUPPORTED_SHAPE; Reason: Embedding dim per head for q and k is not a multiple of 8 at: d_qk % 8 != 0
        Warning: CUDNN_STATUS_NOT_SUPPORTED_SHAPE; Reason: generate_attention_desc()
        Warning: CUDNN_STATUS_NOT_SUPPORTED_SHAPE; Reason: generate_flash_fprop_mha_fort()
        Warning: CUDNN_STATUS_NOT_SUPPORTED_SHAPE; Reason: generate_mha_fort()
        Warning: CUDNN_STATUS_NOT_SUPPORTED_SHAPE; Reason: construct_fort_tree()
        Warning: CUDNN_STATUS_NOT_SUPPORTED_SHAPE; Reason: kernelGeneration()
        Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: ptr.isSupported()
        Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: engine_post_checks(*engine_iface, engine.getPerfKnobs(), req_size, engine.getTargetSMCount())
        Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: finalize_internal()
    Time: 2024-03-07T14:56:33.925524 (0d+0h+0m+0s since start)
    Process=1098828; Thread=1098828; GPU=NULL; Handle=NULL; StreamId=NULL.

In addition to the status code enum name for the warning or error, the log includes a reason message. For many warnings and errors, the library will provide a more explanatory message, like the first warning in the preceding example. Other reason messages may be auto-generated from the cuDNN source code, and may be harder to interpret. Over time, we plan to add more explanatory messages throughout the codebase.
A traceback is also included to provide additional coverage: the names of the function calls may provide hints when a plain-English message is missing. Within the traceback, each message may have its own severity and is reported only when the respective severity level is enabled. The traceback messages are printed in reverse order of execution, so the messages at the top are the root cause and tend to be more helpful for debugging. The previous example is a warning traceback. The following example is an error traceback.

    cuDNN (v8300) function cudnnBackendFinalize() called:
        Info: Traceback contains 5 message(s)
            Error: CUDNN_STATUS_BAD_PARAM; reason: out <= 0
            Error: CUDNN_STATUS_BAD_PARAM; reason: is_valid_spacial_dim(xSpatialDimA[dim], wSpatialDimA[dim], ySpatialDimA[dim], cDesc.getPadLowerA()[dim], cDesc.getPadUpperA()[dim], cDesc.getStrideA()[dim], cDesc.getDilationA()[dim])
            Error: CUDNN_STATUS_BAD_PARAM; reason: is_valid_convolution(xDesc, wDesc, cDesc, yDesc)
            Error: CUDNN_STATUS_BAD_PARAM; reason: convolution.init(xDesc, wDesc, cDesc, yDesc)
            Error: CUDNN_STATUS_BAD_PARAM; reason: finalize_internal()
    Time: 2021-10-05T17:11:07.935640 (0d+0h+0m+15s since start)
    Process=87720; Thread=87720; GPU=NULL; Handle=NULL; StreamId=NULL.

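Since the topmost traceback message is the root cause, a log-processing script can pull it out directly. The following is a minimal sketch, assuming the entry layout seen in the two tracebacks above (`<Severity>: <STATUS>; reason: <text>`); the function names are hypothetical and the pattern is inferred from these samples rather than from a format specification.

```python
import re

# The error traceback from this section.
SAMPLE = """cuDNN (v8300) function cudnnBackendFinalize() called:
    Info: Traceback contains 5 message(s)
        Error: CUDNN_STATUS_BAD_PARAM; reason: out <= 0
        Error: CUDNN_STATUS_BAD_PARAM; reason: is_valid_spacial_dim(xSpatialDimA[dim], wSpatialDimA[dim], ySpatialDimA[dim], cDesc.getPadLowerA()[dim], cDesc.getPadUpperA()[dim], cDesc.getStrideA()[dim], cDesc.getDilationA()[dim])
        Error: CUDNN_STATUS_BAD_PARAM; reason: is_valid_convolution(xDesc, wDesc, cDesc, yDesc)
        Error: CUDNN_STATUS_BAD_PARAM; reason: convolution.init(xDesc, wDesc, cDesc, yDesc)
        Error: CUDNN_STATUS_BAD_PARAM; reason: finalize_internal()
Time: 2021-10-05T17:11:07.935640 (0d+0h+0m+15s since start)
Process=87720; Thread=87720; GPU=NULL; Handle=NULL; StreamId=NULL."""

# The samples spell "Reason"/"reason" differently, so both are accepted.
# A reason ends where the next entry, the "Time:" field, or the text ends.
ENTRY = re.compile(
    r"(Error|Warning|Info): (CUDNN_STATUS_\w+); [Rr]eason: "
    r"(.*?)(?=\s+(?:Error|Warning|Info): |\s+Time: |\s*$)",
    re.DOTALL,
)

def parse_traceback(log_text):
    """Return (severity, status, reason) tuples in printed order."""
    return [(m.group(1), m.group(2), m.group(3).strip())
            for m in ENTRY.finditer(log_text)]

def root_cause(log_text):
    """Return the first printed entry, or None if no entry is found."""
    entries = parse_traceback(log_text)
    return entries[0] if entries else None

print(root_cause(SAMPLE))
```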
There are two methods, described below, to enable error/warning reporting and API logging. For convenience, the log output can be handled by the built-in default callback function, which directs the output to a log file or to standard I/O as designated by the user. Users may also write their own callback function to handle this information programmatically, and use cudnnSetCallback() to pass in its function pointer.
Method 1: Using Environment Variables
To enable API logging using environment variables, follow these steps:
1. Set the environment variable CUDNN_LOGLEVEL_DBG to 0, 1, 2, or 3.
2. Set the environment variable CUDNN_LOGDEST_DBG to either NULL, stdout, stderr, or a user-desired file path. For example, /home/userName1/log.txt.
3. Optionally, include conversion specifiers in the file name. For example:
   - To include date and time in the file name, use the date and time conversion specifiers: log_%Y_%m_%d_%H_%M_%S.txt. The conversion specifiers will be automatically replaced with the date and time when the program is initiated, resulting in log_2017_11_21_09_41_00.txt.
   - To include the process ID in the file name, use the %i conversion specifier: log_%Y_%m_%d_%H_%M_%S_%i.txt results in log_2017_11_21_09_41_00_21264.txt when the process ID is 21264. When you have several processes running, using the process ID conversion specifier will prevent these processes from writing to the same file at the same time.
Note
The supported conversion specifiers are similar to those of the strftime function. If the file already exists, the log will overwrite the existing file.
These environment variables are checked only once, at initialization. Any subsequent changes to them will not take effect in the current run. Also note that these environment settings can be overridden by Method 2 below.
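To preview which file name a given pattern would produce, the substitution can be approximated in Python. This is a sketch under stated assumptions: it uses Python's own `strftime`, so the set of supported specifiers may differ from cuDNN's, and the `%i` process ID specifier is substituted by hand because Python's `strftime` does not define it. The helper name `expand_log_name` is hypothetical.

```python
import os
from datetime import datetime

def expand_log_name(pattern, when=None, pid=None):
    """Approximate the file name produced for a CUDNN_LOGDEST_DBG pattern.

    %i is replaced with the process ID first; the remaining specifiers
    are handed to strftime.
    """
    when = datetime.now() if when is None else when
    pid = os.getpid() if pid is None else pid
    return when.strftime(pattern.replace("%i", str(pid)))

# Reproduces the example from the steps above.
start = datetime(2017, 11, 21, 9, 41, 0)
print(expand_log_name("log_%Y_%m_%d_%H_%M_%S_%i.txt", when=start, pid=21264))
```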
| Environment Variable | | |
|---|---|---|
| | No logging output and no performance loss | Logging to |
| | No logging output and no performance loss | No logging output and no performance loss |
| | No logging output and no performance loss | Logging to |
| | No logging output and no performance loss | Logging to |
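Because the variables are read only once at initialization, they need to be present in the environment before the process that uses cuDNN starts. The following launcher sketch illustrates one way to do that from Python. The helper name `cudnn_logging_env` is hypothetical, and the child process here is a stand-in that only echoes the two variables; a real launch would run the application that uses cuDNN instead.

```python
import os
import subprocess
import sys

def cudnn_logging_env(level, dest, base_env=None):
    """Return a copy of the environment with the two logging variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDNN_LOGLEVEL_DBG"] = str(level)
    env["CUDNN_LOGDEST_DBG"] = dest
    return env

# Stand-in child process: prints the variables it inherited.
child = [
    sys.executable,
    "-c",
    "import os; print(os.environ['CUDNN_LOGLEVEL_DBG'], os.environ['CUDNN_LOGDEST_DBG'])",
]
out = subprocess.run(
    child,
    env=cudnn_logging_env(2, "stderr"),  # warnings and errors, written to stderr
    capture_output=True,
    text=True,
    check=True,
).stdout.strip()
print(out)
```

Building a fresh environment for the child, rather than mutating the parent's, keeps the logging settings scoped to the one run being debugged.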
Method 2: Using the API
To use API function calls to enable API logging, refer to the API description of cudnnSetCallback() and cudnnGetCallback().
cudnnGetLastErrorString() fetches and clears the latest error message. Inside the cuDNN library, the messages are stored in thread-local buffers.
FAQs
- Q: I’m not sure if I should use cuDNN for inference or training. How does it compare with TensorRT?
A: cuDNN provides the building blocks for common routines such as convolution, matmul, normalization, attention, pooling, activation, and RNN/LSTMs. You can use cuDNN for both training and inference. TensorRT differs in that it is a programmable inference accelerator, much like a framework: it sees the whole graph and optimizes the network by fusing/combining layers and optimizing kernel selection to improve latency, throughput, and power efficiency and to reduce memory requirements.
A rule of thumb is to check out TensorRT first and see if it meets your inference needs; if it doesn’t, look at cuDNN for a lower-level abstraction.
- Q: How do heuristics in cuDNN work? How does cuDNN know the optimal solution for a given problem?
A: NVIDIA actively monitors the Deep Learning space for important problem specifications such as commonly used models. The heuristics are produced by sampling a portion of these problem specifications with available computational choices. Over time, more models are discovered and incorporated into the heuristics.
- Q: Is cuDNN going to support running arbitrary graphs?
A: No, we don’t plan to become a framework and execute the whole graph one op at a time. At this time, we are focused on a subgraph given by the user, where we try to produce an optimized fusion kernel. We have documented the rules regarding what can be fused and what cannot. The goal is to support general and flexible fusion. If the current release cannot support your use case, share your use cases with us and we will try to improve it in a later release.
- Q: What’s the difference between TensorRT, TensorFlow/XLA’s fusion, and cuDNN’s fusion?
A: TensorRT and TensorFlow are frameworks; they see the whole graph and can do global optimization. However, they generally only fuse pointwise ops together or pattern-match to a limited set of pre-compiled fixed fusion patterns like conv-bias-relu. cuDNN, on the other hand, targets a subgraph but can fuse convolutions with pointwise ops, thus providing potentially better performance. cuDNN fusion kernels can be utilized by TensorRT and TensorFlow/XLA as part of their global graph optimization.
- Q: Can I write an application that calls cuDNN directly?
A: Yes, you can call the C/C++ API directly. Usually, data scientists would wait for framework integration and use the Python API which is more convenient. However, if your use case requires better performance, you can target the cuDNN API directly.
- Q: How does mixed precision training work?
A: Several components need to work together to make mixed precision training possible. cuDNN needs to support the layers with the required datatype configuration and have optimized kernels that run very fast. In addition, frameworks include a module called automatic mixed precision (AMP), which decides which ops can run in lower precision without affecting convergence and minimizes the number of type conversions/transposes in the entire graph. Together, these give you a speedup. For more information, refer to Mixed Precision Numerical Accuracy.
- Q: How can I pick the fastest convolution kernels with cuDNN version 9.0.0?
A: In the cuDNN Graph API, introduced in cuDNN v8, convolution kernels are grouped by similar computation and numerical properties into engines. Every engine has a queryable set of performance tuning knobs. A computation case such as a convolution operation graph can be computed using different valid combinations of engines and their knobs, known as engine configurations. Users can query an array of engine configurations for any given computation case, ordered by performance from fastest to slowest according to cuDNN’s own heuristics. Alternatively, users can generate all possible engine configurations by querying the engine count and available knobs for each engine. This generated list could be used for auto-tuning, or users could create their own heuristics.
- Q: Why is a cuDNN version 9.0 convolution API call much slower on the first call than on subsequent calls?
A: Due to the library split, the cuDNN version 9.0 API loads the necessary kernels only on the first API call that requires them. In versions prior to 8.0, this load would have been observed in the first cuDNN API call that triggers CUDA context initialization, typically cudnnCreate(). Starting in version 8.0, this is delayed until the first sub-library call that triggers CUDA context initialization. Users who want the CUDA context preloaded can call the new cudnn*VersionCheck API, which has the side effect of initializing a CUDA context. This will reduce the run time for all subsequent API calls.
- Q: How do I build the cuDNN version 9.0.0 split library?
A: The cuDNN v9.0 library is split into multiple sub-libraries. Each library contains a subset of the APIs. Users can link directly against the individual libraries or link with a dlopen layer which follows a plugin architecture.
To link against an individual library, users can directly specify it and its dependencies on the linker command line. For example, to link with just the graph API part of the library, use -lcudnn_graph.
Alternatively, the user can continue to link against a shim layer (libcudnn) which can dlopen the correct library that provides the implementation of the function. When the function is called for the first time, the dynamic loading of the library takes place. The linker argument is -lcudnn.
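Returning to the question above about picking the fastest convolution kernels: the auto-tuning idea can be sketched as timing every candidate and keeping the quickest. This is a generic illustration, not cuDNN code. The function `pick_fastest` and the labels `config_a` and `config_b` are hypothetical, and the callables are placeholders for code that would execute one engine configuration and synchronize the GPU.

```python
import time

def pick_fastest(candidates, warmup=1, repeats=5):
    """Return (label, best_seconds) for the candidate with the lowest time.

    `candidates` maps a label to a zero-argument callable. Each callable
    is a placeholder for running one engine configuration to completion.
    """
    results = {}
    for label, run in candidates.items():
        for _ in range(warmup):      # untimed: the first call may include one-time loading
            run()
        best = float("inf")
        for _ in range(repeats):     # keep the best of several timed runs
            start = time.perf_counter()
            run()
            best = min(best, time.perf_counter() - start)
        results[label] = best
    winner = min(results, key=results.get)
    return winner, results[winner]

# Placeholder workloads standing in for engine configurations.
candidates = {
    "config_a": lambda: time.sleep(0.02),
    "config_b": lambda: time.sleep(0.001),
}
print(pick_fastest(candidates))
```

The untimed warmup run matters here for the reason given in the first-call question above: one-time loading costs would otherwise be charged to whichever candidate happens to run first.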
Support
Support, resources, and information about cuDNN can be found online at https://developer.nvidia.com/cudnn. This includes downloads, webinars, NVIDIA Developer Forums, and more.
We appreciate all types of feedback. Consider posting on the forums with questions, comments, feature requests, and suspected bugs that are appropriate to discuss publicly. cuDNN-related posts are reviewed by the cuDNN engineering team, and internally we will file bugs where appropriate. It’s helpful if you can paste or attach an API log (refer to Error Reporting And API Logging) to help us reproduce the issue.
External users can also file bugs directly by following these steps:
1. Register for the NVIDIA Developer website.
2. Log in to the developer site.
3. Click on your name in the upper right corner.
4. Click My account > My Bugs and select Submit a New Bug.
5. Fill out the bug reporting page. Be descriptive and, if possible, provide the steps that you are following to help reproduce the problem. If possible, paste or attach an API log.
6. Click Submit a bug.