Troubleshooting

The following sections help answer the most commonly asked questions regarding typical use cases.

Error Reporting And API Logging

The cuDNN error reporting and API logging utility records cuDNN API execution and error information. For each cuDNN API function call, all input parameters are reported in the API log. If errors occur during the execution of the cuDNN API, a traceback of the error conditions can also be reported to help with troubleshooting. This functionality is disabled by default and can be enabled with the environment variables CUDNN_LOGDEST_DBG and CUDNN_LOGLEVEL_DBG, or through the API, as described later in this section.

Since v9.0.0, to improve visibility and adhere to common practices, the logging levels are nested so that enabling a less severe level also enables the more severe levels. The legacy logging environment variables CUDNN_LOGERR_DBG, CUDNN_LOGWARN_DBG, and CUDNN_LOGINFO_DBG are deprecated but will continue to work during the v9 grace period for compatibility.

The log output contains variable names, data types, parameter values, device pointers, process ID, thread ID, cuDNN handle, CUDA stream ID, and metadata such as time of the function call in microseconds.

The logging environment variable CUDNN_LOGLEVEL_DBG, introduced in v9.0.0, has four levels that are aligned with the numerical values of the enum cudnnSeverity_t. The accepted values are 0 (logging disabled, no performance penalty), 1 (error), 2 (warning and error), and 3 (info, warning, and error). Out-of-range integer values are clamped to 0 or 3, whichever is closer.
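The nesting and clamping rules above can be modeled in a few lines of Python. This is only an illustrative sketch of the documented behavior: the helper names `clamp_log_level` and `enabled_severities` are hypothetical and are not part of cuDNN.

```python
# Illustrative model of the CUDNN_LOGLEVEL_DBG semantics described above.
# These helpers are hypothetical; cuDNN does not expose them.

# Severities reported at each level (levels are nested).
SEVERITIES_BY_LEVEL = {
    0: (),
    1: ("error",),
    2: ("error", "warning"),
    3: ("error", "warning", "info"),
}


def clamp_log_level(value: int) -> int:
    """Clamp an integer to the accepted range 0..3."""
    return max(0, min(3, value))


def enabled_severities(value: int) -> tuple:
    """Return the severities that are reported for a requested level."""
    return SEVERITIES_BY_LEVEL[clamp_log_level(value)]


print(enabled_severities(2))
print(enabled_severities(42))  # out of range, treated as level 3
```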

For example, when CUDNN_LOGLEVEL_DBG=3 is set, the log output includes API call records such as:

cuDNN (v90000) function cudnnSetActivationDescriptor() called:
    mode: type=cudnnActivationMode_t; val=CUDNN_ACTIVATION_RELU (1);
    reluNanOpt: type=cudnnNanPropagation_t; val=CUDNN_NOT_PROPAGATE_NAN (0);
    coef: type=double; val=1000.000000;
Time: 2024-01-21T14:14:21.366171 (0d+0h+1m+5s since start)
Process: 21264, Thread: 21264, cudnn_handle: NULL, cudnn_stream: NULL.

Starting in cuDNN 8.3.0, when the warning or error severity level is enabled, the log output additionally reports an error traceback such as the example below. The traceback reports the relevant error and warning conditions to give the user hints for troubleshooting. Each message in the traceback has its own severity and is reported only when the respective severity level is enabled. The messages are printed in reverse order of execution, so the messages at the top are the root cause and tend to be more helpful for debugging.

cuDNN (v8300) function cudnnBackendFinalize() called:
            Info: Traceback contains 5 message(s)
                Error: CUDNN_STATUS_BAD_PARAM; reason: out <= 0
                Error: CUDNN_STATUS_BAD_PARAM; reason: is_valid_spacial_dim(xSpatialDimA[dim], wSpatialDimA[dim], ySpatialDimA[dim], cDesc.getPadLowerA()[dim], cDesc.getPadUpperA()[dim], cDesc.getStrideA()[dim], cDesc.getDilationA()[dim])
                Error: CUDNN_STATUS_BAD_PARAM; reason: is_valid_convolution(xDesc, wDesc, cDesc, yDesc)
                Error: CUDNN_STATUS_BAD_PARAM; reason: convolution.init(xDesc, wDesc, cDesc, yDesc)
                Error: CUDNN_STATUS_BAD_PARAM; reason: finalize_internal()
        Time: 2021-10-05T17:11:07.935640 (0d+0h+0m+15s since start)
        Process=87720; Thread=87720; GPU=NULL; Handle=NULL; StreamId=NULL.
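When a log contains many tracebacks, it can help to extract the messages programmatically. The following Python sketch parses the sample above and picks the topmost message as the root cause. The `parse_traceback` helper is hypothetical, and the regular expression assumes every message follows the `Severity: STATUS; reason: text` layout seen in this one sample; other cuDNN versions may format messages differently.

```python
import re

# Sample traceback copied from the log output above.
SAMPLE_LOG = """\
cuDNN (v8300) function cudnnBackendFinalize() called:
            Info: Traceback contains 5 message(s)
                Error: CUDNN_STATUS_BAD_PARAM; reason: out <= 0
                Error: CUDNN_STATUS_BAD_PARAM; reason: is_valid_spacial_dim(xSpatialDimA[dim], wSpatialDimA[dim], ySpatialDimA[dim], cDesc.getPadLowerA()[dim], cDesc.getPadUpperA()[dim], cDesc.getStrideA()[dim], cDesc.getDilationA()[dim])
                Error: CUDNN_STATUS_BAD_PARAM; reason: is_valid_convolution(xDesc, wDesc, cDesc, yDesc)
                Error: CUDNN_STATUS_BAD_PARAM; reason: convolution.init(xDesc, wDesc, cDesc, yDesc)
                Error: CUDNN_STATUS_BAD_PARAM; reason: finalize_internal()
"""

# Assumed message layout: "<Severity>: <CUDNN_STATUS_...>; reason: <text>"
MESSAGE_RE = re.compile(
    r"^[ \t]*(?P<severity>\w+): (?P<status>CUDNN_STATUS_\w+); reason: (?P<reason>.*)$",
    re.MULTILINE,
)


def parse_traceback(log_text: str) -> list:
    """Return the traceback messages in the order they appear in the log."""
    return [m.groupdict() for m in MESSAGE_RE.finditer(log_text)]


messages = parse_traceback(SAMPLE_LOG)
# Messages are printed in reverse order of execution, so the first is the root cause.
root_cause = messages[0]
print(root_cause["status"], "-", root_cause["reason"])
```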

There are two methods, described below, to enable error/warning reporting and API logging. For convenience, the log output can be handled by the built-in default callback function, which directs the output to a log file or to the standard I/O stream designated by the user. Users may also write their own callback function to handle this information programmatically, and use cudnnSetCallback() to pass in the function pointer of their callback.

Method 1: Using Environment Variables

To enable API logging using environment variables, follow these steps:

Set the environment variable CUDNN_LOGLEVEL_DBG to 0, 1, 2, or 3.
Set the environment variable CUDNN_LOGDEST_DBG to NULL, stdout, stderr, or a file path of your choice, for example, /home/userName1/log.txt.
Optionally, include conversion specifiers in the file name. For example:
1. To include the date and time in the file name, use the date and time conversion specifiers: log_%Y_%m_%d_%H_%M_%S.txt. The conversion specifiers are automatically replaced with the date and time at which the program started, resulting in, for example, log_2017_11_21_09_41_00.txt.
2. To include the process ID in the file name, use the %i conversion specifier: log_%Y_%m_%d_%H_%M_%S_%i.txt yields log_2017_11_21_09_41_00_21264.txt when the process ID is 21264. When several processes are running, the process ID conversion specifier prevents them from writing to the same file at the same time.
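To check what file name a pattern will produce before running a job, the expansion described above can be approximated in Python. This is a sketch, not cuDNN's implementation: `preview_log_filename` is a hypothetical helper, and it assumes that `%i` is simply replaced by the process ID while the remaining specifiers behave like their strftime counterparts.

```python
from datetime import datetime


def preview_log_filename(pattern: str, start_time: datetime, pid: int) -> str:
    """Approximate how a CUDNN_LOGDEST_DBG file-name pattern is expanded."""
    # Substitute %i first, because strftime does not define that specifier.
    with_pid = pattern.replace("%i", str(pid))
    # Expand the date and time specifiers (%Y, %m, %d, %H, %M, %S).
    return start_time.strftime(with_pid)


start = datetime(2017, 11, 21, 9, 41, 0)
print(preview_log_filename("log_%Y_%m_%d_%H_%M_%S_%i.txt", start, 21264))
```

With the start time and process ID from the examples above, this reproduces the file name log_2017_11_21_09_41_00_21264.txt.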
Note

The supported conversion specifiers are similar to those of the strftime function. If the file already exists, the log overwrites it.

These environment variables are checked only once, at initialization. Subsequent changes to them have no effect in the current run. Also note that these environment settings can be overridden by Method 2 below.
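Because the variables are read only once at initialization, they must be in place before the application starts using cuDNN. One way to guarantee this is to launch the application as a child process with a prepared environment, sketched below in Python. The helper `cudnn_logging_env` is hypothetical, and the child command here is only a stand-in that echoes the two variables; replace it with the command line of the actual cuDNN application.

```python
import os
import subprocess
import sys


def cudnn_logging_env(level: int, dest: str, base_env=None) -> dict:
    """Return a copy of the environment with the cuDNN logging variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDNN_LOGLEVEL_DBG"] = str(level)
    env["CUDNN_LOGDEST_DBG"] = dest
    return env


if __name__ == "__main__":
    env = cudnn_logging_env(3, "stderr")
    # Stand-in child process: prints the variables it inherited.
    child = [
        sys.executable,
        "-c",
        "import os; print(os.environ['CUDNN_LOGLEVEL_DBG'], os.environ['CUDNN_LOGDEST_DBG'])",
    ]
    result = subprocess.run(child, env=env, capture_output=True, text=True, check=True)
    print(result.stdout.strip())
```

Building a fresh dictionary rather than mutating `os.environ` keeps the logging settings scoped to the child process.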

Impact on the Performance of API Logging Using Environment Variables

| Environment Variable | `CUDNN_LOGLEVEL_DBG=0` | `CUDNN_LOGLEVEL_DBG=1` or `CUDNN_LOGLEVEL_DBG=2` or `CUDNN_LOGLEVEL_DBG=3` |
| --- | --- | --- |
| `CUDNN_LOGDEST_DBG` not set | No logging output and no performance loss | Logging to `stderr` and some performance loss |
| `CUDNN_LOGDEST_DBG=NULL` | No logging output and no performance loss | No logging output and no performance loss |
| `CUDNN_LOGDEST_DBG=stdout` or `stderr` | No logging output and no performance loss | Logging to `stdout` or `stderr` and some performance loss |
| `CUDNN_LOGDEST_DBG=filename.txt` | No logging output and no performance loss | Logging to `filename.txt` and some performance loss |
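The table can be read as a small decision function. The Python sketch below encodes it; `logging_destination` is a hypothetical helper, `dest=None` stands for an unset CUDNN_LOGDEST_DBG, and the level is assumed to be already within 0 to 3.

```python
from typing import Optional


def logging_destination(level: int, dest: Optional[str] = None) -> Optional[str]:
    """Return where log output goes per the table, or None when logging is off.

    A None result corresponds to "no logging output and no performance loss";
    any other result implies some performance loss.
    """
    if level == 0 or dest == "NULL":
        return None
    if dest is None:
        # CUDNN_LOGDEST_DBG not set: output defaults to stderr.
        return "stderr"
    return dest


for level, dest in [(0, "stdout"), (2, None), (3, "NULL"), (1, "filename.txt")]:
    print(level, dest, "->", logging_destination(level, dest))
```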

Method 2: Using the API

To use API function calls to enable API logging, refer to the API description of cudnnSetCallback() and cudnnGetCallback().

Since v9.0.0, a new API cudnnGetLastErrorString() has also been introduced to fetch and clear the latest error message. Inside the cuDNN library, the messages are stored in thread local buffers.

FAQs

Q: I’m not sure if I should use cuDNN for inference or training. How does it compare with TensorRT?

A: cuDNN provides the building blocks for common routines such as convolution, matmul, normalization, attention, pooling, activation, and RNN/LSTMs. You can use cuDNN for both training and inference. It differs from TensorRT in that TensorRT is a programmable inference accelerator, much like a framework. TensorRT sees the whole graph and optimizes the network by fusing or combining layers and by optimizing kernel selection to improve latency, throughput, and power efficiency and to reduce memory requirements.

As a rule of thumb, check out TensorRT first and see if it meets your inference needs. If it doesn’t, look at cuDNN for a lower-level abstraction.

Q: How do heuristics in cuDNN work? How does cuDNN know the optimal solution for a given problem?

A: NVIDIA actively monitors the Deep Learning space for important problem specifications such as commonly used models. The heuristics are produced by sampling a portion of these problem specifications with available computational choices. Over time, more models are discovered and incorporated into the heuristics.

Q: Is cuDNN going to support running arbitrary graphs?

A: No, we don’t plan to become a framework and execute the whole graph one op at a time. At this time, we are focused on a subgraph given by the user, where we try to produce an optimized fusion kernel. We have documented the rules regarding what can be fused and what cannot. The goal is to support general and flexible fusion. If the current release cannot support your use case, share your use cases with us and we will try to improve it in a later release.

Q: What’s the difference between TensorRT, TensorFlow/XLA’s fusion, and cuDNN’s fusion?

A: TensorRT and TensorFlow are frameworks; they see the whole graph and can do global optimization. However, they generally only fuse pointwise ops together or pattern match to a limited set of pre-compiled fixed fusion patterns like conv-bias-relu. cuDNN, on the other hand, targets a subgraph but can fuse convolutions with pointwise ops, thus providing potentially better performance. cuDNN fusion kernels can be utilized by TensorRT and TensorFlow/XLA as part of their global graph optimization.

Q: Can I write an application that calls cuDNN directly?

A: Yes, you can call the C/C++ API directly. Data scientists usually wait for framework integration and use the Python API, which is more convenient. However, if your use case requires better performance, you can target the cuDNN API directly.

Q: How does mixed precision training work?

A: Several components need to work together to make mixed precision training possible. cuDNN needs to support the layers with the required datatype configuration and provide optimized kernels that run very fast. In addition, frameworks include a module called automatic mixed precision (AMP), which decides which ops can run in lower precision without affecting convergence and minimizes the number of type conversions and transposes in the entire graph. Together, these give you a speedup. For more information, refer to Mixed Precision Numerical Accuracy.

Q: How can I pick the fastest convolution kernels with cuDNN version 9.0.0?

A: In the cuDNN Graph API, introduced in cuDNN v8, convolution kernels are grouped by similar computation and numerical properties into engines. Every engine has a queryable set of performance tuning knobs. A computation case such as a convolution operation graph can be computed using different valid combinations of engines and their knobs, each known as an engine configuration. Users can query an array of engine configurations for any given computation case, ordered by performance from fastest to slowest according to cuDNN’s own heuristics. Alternatively, users can generate all possible engine configurations by querying the engine count and the available knobs for each engine. This generated list can be used for auto-tuning, or users can create their own heuristics from it.
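The auto-tuning idea mentioned above amounts to timing each candidate and keeping the fastest. The Python sketch below shows that loop in a library-agnostic form. It does not call cuDNN: `pick_fastest`, the `candidates` list, and the `run` callable are all hypothetical stand-ins for engine configurations and for executing one of them. For real GPU work, `run` would also need to synchronize with the device before returning so that the measured time covers the kernel execution.

```python
import time
from statistics import median


def pick_fastest(candidates, run, warmup=1, repeats=5):
    """Time every candidate with run(candidate) and return (best, median_seconds)."""
    best, best_time = None, float("inf")
    for candidate in candidates:
        # Warm-up runs absorb one-time costs such as lazy loading.
        for _ in range(warmup):
            run(candidate)
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            run(candidate)
            samples.append(time.perf_counter() - start)
        elapsed = median(samples)
        if elapsed < best_time:
            best, best_time = candidate, elapsed
    return best, best_time


# Fake "engine configurations": each one just sleeps for a fixed delay.
fake_configs = [
    {"name": "cfg_a", "delay": 0.06},
    {"name": "cfg_b", "delay": 0.005},
    {"name": "cfg_c", "delay": 0.03},
]
best, seconds = pick_fastest(fake_configs, lambda cfg: time.sleep(cfg["delay"]), repeats=3)
print(best["name"])
```

Using the median rather than the mean makes the comparison less sensitive to occasional slow outliers.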

Q: Why is the first cuDNN version 9.0 convolution API call much slower than subsequent calls?

A: Due to the library split, the cuDNN version 9.0 API loads the necessary kernels only on the first API call that requires them. In versions prior to 8.0, this load would have been observed in the first cuDNN API call that triggers CUDA context initialization, typically cudnnCreate(). Starting in version 8.0, this is delayed until the first sub-library call that triggers CUDA context initialization. Users who want the CUDA context preloaded can call the new cudnn*VersionCheck API, which has the side effect of initializing a CUDA context. This reduces the run time of all subsequent API calls.

Q: How do I build with the cuDNN version 9.0.0 split library?

A: The cuDNN v9.0 library is split into multiple sub-libraries. Each sub-library contains a subset of the APIs. Users can link directly against the individual libraries or link with a dlopen layer that follows a plugin architecture.

To link against an individual library, users can directly specify it and its dependencies on the linker command line. For example, for linking with just the graph API part of the library: -lcudnn_graph

Alternatively, the user can continue to link against a shim layer (libcudnn), which can dlopen the correct library that provides the implementation of the function. The dynamic loading of the library takes place when the function is called for the first time.

The linker argument is -lcudnn.

Support

Support, resources, and information about cuDNN can be found online at https://developer.nvidia.com/cudnn. This includes downloads, webinars, NVIDIA Developer Forums, and more.

We appreciate all types of feedback. Consider posting on the forums with questions, comments, feature requests, and suspected bugs that are appropriate to discuss publicly. cuDNN-related posts are reviewed by the cuDNN engineering team, and internally we will file bugs where appropriate. It’s helpful if you can paste or attach an error report and API log (see Error Reporting And API Logging) to help us reproduce the issue.

External users can also file bugs directly by following these steps:

Register for the NVIDIA Developer website.
Log in to the developer site.
Click on your name in the upper right corner.
Click My account > My Bugs and select Submit a New Bug.
Fill out the bug reporting page. Be descriptive and if possible, provide the steps that you are following to help reproduce the problem. If possible, paste or attach an API log.
Click Submit a bug.