Odds and Ends#

This section includes a random set of topics and concepts.

cuDNN Library Configuration#

cuDNN is delivered as a collection of sub-libraries. The default usage of cuDNN requires all sub-libraries; however, there are some sub-libraries that can be dropped and cuDNN will still work, saving binary size with some reduction in support surface and performance. Here, we document which subsets of sub-libraries are valid configurations of the library.

A configuration is enabled through an environment variable, CUDNN_LIB_CONFIG. The following table lists the valid configurations. With the environment variable set appropriately, you can remove all but the required sub-libraries from your installation of cuDNN. Over time, we expect to support more configurations.

cuDNN Sub-Library Configuration#

  • FULL: Default usage of cuDNN. Required sub-libraries: all.
  • GRAPH_JIT_ONLY: Support is limited to the graph API, using only JIT engines (no precompiled kernels). Required sub-libraries: libcudnn.so, libcudnn_graph.so, and libcudnn_engines_runtime_compiled.so (dynamic libraries only; does not work with static libraries).

The library overview section of the API reference has more details on the sub-library breakdown and dependency graph.

FULL Configuration#

This is the default usage of cuDNN. We expect most users to continue using this, unless they want to explore tradeoffs for binary size reduction.

GRAPH_JIT_ONLY Configuration#

This configuration provides:

  • Significant binary size reduction vs FULL

  • JIT engines only:

    • Support for GPU architectures NVIDIA Ampere and later

    • Functional coverage of the graph API (with some caveats, listed below)

    • Performance approaching FULL in many cases, with caveats listed below

Because the GRAPH_JIT_ONLY configuration does not include engines with precompiled kernels, which reside in the largest sub-library, it has significantly smaller binary size vs FULL. To achieve this reduction, retain only the libraries in the Required Sub-Libraries column in the table above, and remove the remaining, non-required, cuDNN sub-libraries.

The tradeoff for this reduced binary size is that, depending on your use case, you may see functional or performance gaps vs FULL. In particular, you may see a runtime error due to missing functional support for graph patterns supported by the Specialized Pre-compiled Engines. All other graph patterns supported by FULL are also supported by GRAPH_JIT_ONLY.

Also, for functionally supported patterns, there may be a performance difference between FULL and GRAPH_JIT_ONLY configs, depending on the graph pattern and problem size. As a rough guidance:

  • Single-op ConvolutionBwdFilter has large performance gaps, on average.

  • Single-op Matmul and single-op ConvolutionBwdData have small performance gaps, on average.

  • Single-op ConvolutionFwd is at performance parity on average, but you may see performance gaps depending on the problem size.

  • Multi-op fusions should be at performance parity when moving to GRAPH_JIT_ONLY.

This is an initial version of GRAPH_JIT_ONLY. We are actively working on closing gaps with FULL. In the future, we expect GRAPH_JIT_ONLY to have the same functional coverage of the graph API as FULL, with some performance gaps for some specialized cases that have highly optimized pre-compiled kernels in FULL only.

Note

GRAPH_JIT_ONLY supports NVIDIA Ampere and later, but does not support GPU architectures prior to Ampere.

Sub-Library Configuration Instructions#

Because the FULL configuration is the default, there’s no action needed to enable it.

To use a non-default configuration (and currently, the only one is GRAPH_JIT_ONLY), do the following:

  • Set the environment variable CUDNN_LIB_CONFIG to GRAPH_JIT_ONLY.

  • Ensure the required sub-libraries are in the LD_LIBRARY_PATH (or PATH on Windows).

Note

The GRAPH_JIT_ONLY configuration is only supported with the cuDNN dynamic libraries and does not currently work with static libraries. We are working on adding this support.

Thread Safety#

The cuDNN library is thread-safe. Its functions can be called from multiple host threads, so long as the threads do not share the same cuDNN handle simultaneously.

When creating a per-thread cuDNN handle, it is recommended that a single synchronous call of cudnnCreate() be made first before each thread creates its own handle asynchronously.

Per cudnnCreate(), for multi-threaded applications that use the same device from different threads, the recommended programming model is to create one (or a few, as is convenient) cuDNN handles per thread and use that cuDNN handle for the entire life of the thread.
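As a minimal sketch of this recommendation (error handling omitted; the worker function and thread count are illustrative assumptions, not part of the cuDNN API):

#include <cudnn.h>
#include <thread>
#include <vector>

// Hypothetical worker: each thread creates its own cuDNN handle and reuses it
// for every cuDNN call it makes during its lifetime.
void workerThread() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);
    // ... issue this thread's cuDNN calls with `handle` ...
    cudnnDestroy(handle);
}

int main() {
    // Recommended: one synchronous cudnnCreate() call up front, before the
    // worker threads create their own handles concurrently.
    cudnnHandle_t warmup;
    cudnnCreate(&warmup);
    cudnnDestroy(warmup);

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back(workerThread);
    }
    for (auto &t : workers) {
        t.join();
    }
    return 0;
}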

Reproducibility (Determinism)#

By design, most of cuDNN’s routines from a given version generate the same bit-wise results across runs when executed on GPUs with the same architecture. There are some exceptions. For example, the following routines do not guarantee reproducibility across runs, even on the same architecture, because they use atomic operations in a way that introduces truly random floating point rounding errors:

  • cudnnConvolutionBackwardFilter when CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 or CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3 is used

  • cudnnConvolutionBackwardData when CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 is used

  • cudnnPoolingBackward when CUDNN_POOLING_MAX is used

  • cudnnSpatialTfSamplerBackward

  • cudnnCTCLoss and cudnnCTCLoss_v8 when CUDNN_CTC_LOSS_ALGO_NON_DETERMINISTIC is used

Across different architectures, no cuDNN routines guarantee bitwise reproducibility. For example, there is no guarantee of bitwise reproducibility when comparing the same routine run on the NVIDIA Volta and NVIDIA Turing architectures.

Scaling Parameters#

Many cuDNN routines like cudnnConvolutionForward() accept pointers in host memory to scaling factors alpha and beta. These scaling factors are used to blend the computed values with the prior values in the destination tensor as follows:

dstValue = alpha*computedValue + beta*priorDstValue

The dstValue is written to after being read.

Scaling Parameters for Convolution

When beta is zero, the output is not read and may contain uninitialized data (including NaN).

These parameters are passed using a host memory pointer. The storage data types for alpha and beta are:

  • float for HALF and FLOAT tensors, and

  • double for DOUBLE tensors.

For improved performance, use beta = 0.0. Use a non-zero value for beta only when you need to blend the current output tensor values with the prior values of the output tensor.
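For example, a minimal sketch of passing alpha and beta to cudnnConvolutionForward() might look like the following (the handle, descriptors, device buffers, algorithm choice, and workspace are assumed to have been set up already):

// FLOAT (or HALF) tensors, so the scaling factors are stored as float in host memory.
float alpha = 1.0f;  // scale applied to the computed convolution result
float beta  = 0.0f;  // 0.0: overwrite y without reading its prior (possibly uninitialized) contents

cudnnConvolutionForward(handle,
                        &alpha, xDesc, x,            // input tensor
                        wDesc, w,                    // filter
                        convDesc, algo,
                        workspace, workspaceSize,
                        &beta, yDesc, y);            // y = alpha * conv(x, w) + beta * y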

Type Conversion

When the data input x, the filter input w, and the output y are all in INT8 data type, the function cudnnConvolutionBiasActivationForward() will perform the type conversion. Accumulators are 32-bit integers that wrap on overflow.

INT8 for cudnnConvolutionBiasActivationForward
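As an illustrative sketch of such a setup (the shapes are arbitrary assumptions, and descriptor creation error checks are omitted), x, w, and y are all declared with the INT8 data type, typically in the NHWC layout:

cudnnTensorDescriptor_t xDesc, yDesc;
cudnnFilterDescriptor_t wDesc;
cudnnCreateTensorDescriptor(&xDesc);
cudnnCreateTensorDescriptor(&yDesc);
cudnnCreateFilterDescriptor(&wDesc);

// All three descriptors use CUDNN_DATA_INT8; the convolution accumulates in 32-bit integers internally.
cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NHWC, CUDNN_DATA_INT8, 1, 32, 28, 28);
cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_INT8, CUDNN_TENSOR_NHWC, 64, 32, 3, 3);
cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NHWC, CUDNN_DATA_INT8, 1, 64, 26, 26);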

Deprecation Policy#

cuDNN uses a streamlined, two-step deprecation policy for all API and enum changes to enable a fast pace of innovation:
  • Step 1: Label for deprecation
    • The current major version marks an API function or enum as deprecated without changing its behavior.

    • A deprecated enum value is marked with the CUDNN_DEPRECATED_ENUM macro.

      • If it is simply renamed, the old name will map to the new name, and the old name will be marked with the CUDNN_DEPRECATED_ENUM macro.

    • A deprecated API function is marked with the CUDNN_DEPRECATED macro.

  • Step 2: Removal
    • The next major version removes the deprecated API function or enum value and its name is never reused.

This deprecation scheme allows us to retire the deprecated API in just one major release. Functionality that is deprecated in the current major release can be compiled without any changes. Backward compatibility ends when the next major cuDNN release is introduced.

Prototypes of deprecated functions are prepended in cuDNN's headers with the CUDNN_DEPRECATED macro. When the -DCUDNN_WARN_DEPRECATED switch is passed to the compiler, any deprecated function call in your code will emit a compiler warning, for example:

warning: 'cudnnStatus_t cudnnRNNSetClip_v8(cudnnRNNDescriptor_t, cudnnRNNClipMode_t, ...)' is deprecated [-Wdeprecated-declarations]

or

warning C4996: 'cudnnRNNSetClip_v8': was declared deprecated

The above warnings are disabled by default to avoid potential build breaks in software setups where compiler warnings are treated as errors.

Similarly, for deprecated enum values, the compiler emits a warning when attempting to use a deprecated value:

warning: 'EXAMPLE_VAL' is deprecated: value not allowed [-Wdeprecated-declarations]

or

warning C4996: 'EXAMPLE_VAL': was declared deprecated

Special Case: API Behavior Change

To help ease the transition and avoid surprises to developers, a behavior change between two major versions of a specific API function is accommodated by suffixing the function with a _v tag followed by the major cuDNN version in which the behavior changed. In the next major release, the deprecated function is removed, and its name is never reused. (Brand-new APIs are first introduced without the _v tag.)

Updating a function's behavior in this way uses the API's name to embed the cuDNN version where the API call was modified. As a result, API changes are easier to track and document.

Let us explain this process through an example using two consecutive major cuDNN releases, version 8 and 9. In this example, an API function foo() changes its behavior in cuDNN v8 (relative to v7).

Major Release 8

The updated API is introduced as foo_v8(). The deprecated API foo() is kept unchanged to maintain backward compatibility until the next major release.

Major Release 9

The deprecated API foo() is permanently removed and its name is not reused. The foo_v8() function supersedes the retired call foo().
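As a hedged illustration of how calling code might select between the two (foo(), its argument, and the status variable are hypothetical, following the example above):

#include <cudnn.h>   // provides CUDNN_MAJOR and CUDNN_VERSION

#if CUDNN_MAJOR >= 8
    // foo_v8() carries the updated behavior; in release 9 it is the only remaining variant.
    status = foo_v8(arg);
#else
    // Pre-v8 headers: only the original foo() exists.
    status = foo(arg);
#endif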

GPU And Driver Requirements#

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

Conventions And Features For Convolutions#

The convolution functions are:

Convolution Formulas#

This section describes the various convolution formulas implemented in cuDNN convolution functions for the cudnnConvolutionForward() path.

The convolution terms described in the table below apply to all the convolution formulas that follow.

Convolution Terms#

  • x: Input (image) Tensor
  • w: Weight Tensor
  • y: Output Tensor
  • n: Current Batch Size
  • c: Current Input Channel
  • C: Total Input Channels
  • H: Input Image Height
  • W: Input Image Width
  • k: Current Output Channel
  • K: Total Output Channels
  • p: Current Output Height Position
  • q: Current Output Width Position
  • G: Group Count
  • pad: Padding Value
  • u: Vertical Subsample Stride (along Height)
  • v: Horizontal Subsample Stride (along Width)
  • \(dil_h\): Vertical Dilation (along Height)
  • \(dil_w\): Horizontal Dilation (along Width)
  • r: Current Filter Height
  • R: Total Filter Height
  • s: Current Filter Width
  • S: Total Filter Width
  • \(C_g\): \(\frac{C}{G}\)
  • \(K_g\): \(\frac{K}{G}\)

Convolution (convolution mode set to CUDNN_CROSS_CORRELATION)

\(y_{n,k,p,q}=\sum_{c}^{C}\sum_{r}^{R}\sum_{s}^{S} x_{n,c,p+r,q+s}\times W_{k,c,r,s}\)

Convolution with Padding

\(x_{\lt 0,\lt 0}=0\) \(x_{\gt H,\gt W}=0\) \(y_{n,k,p,q}=\sum_{c}^{C}\sum_{r}^{R}\sum_{s}^{S} x_{n,c,p+r-pad,q+s-pad}\times W_{k,c,r,s}\)

Convolution with Subsample-Striding

\(y_{n,k,p,q}=\sum_{c}^{C}\sum_{r}^{R}\sum_{s}^{S} x_{n,c,\left( p\ast u \right)+r,\left( q\ast v \right)+s}\times W_{k,c,r,s}\)

Convolution with Dilation

\(y_{n,k,p,q}=\sum_{c}^{C}\sum_{r}^{R}\sum_{s}^{S} x_{n,c,p+\left( r\ast dil_{h} \right),q+\left( s\ast dil_{w} \right)}\times W_{k,c,r,s}\)

Convolution (convolution mode set to CUDNN_CONVOLUTION)

\(y_{n,k,p,q}=\sum_{c}^{C}\sum_{r}^{R}\sum_{s}^{S} x_{n,c,p+r,q+s}\times W_{k,c,R-r-1,S-s-1}\)

Convolution using Grouped Convolution

\(C_{g}=\frac{C}{G}\) \(K_{g}=\frac{K}{G}\) \(y_{n,k,p,q}=\sum_{c}^{C_{g}}\sum_{r}^{R}\sum_{s}^{S} x_{n,C_{g}\ast floor\left( k/K_{g} \right)+c,p+r,q+s}\times W_{k,c,r,s}\)

Grouped Convolutions#

cuDNN supports grouped convolutions by setting groupCount > 1 for the convolution descriptor convDesc, using cudnnSetConvolutionGroupCount(). By default, the convolution descriptor convDesc is set to groupCount of 1.

Basic Idea

Conceptually, in grouped convolutions, the input channels and the filter channels are split into groupCount independent groups, with each group having a reduced number of channels. The convolution operation is then performed separately on these input and filter groups. For example, consider a case where the number of input channels is 4 and the number of filter channels is 12. For a normal, ungrouped convolution, the number of computation operations performed is 12*4.

If the groupCount is set to 2, then there are now two input channel groups of two input channels each, and two filter channel groups of six filter channels each. As a result, each grouped convolution will now perform 2*6 computation operations, and two such grouped convolutions are performed. Hence the computation savings are 2x: (12*4)/(2*(2*6)).

cuDNN Grouped Convolution

When using groupCount for grouped convolutions, you must still define all tensor descriptors so that they describe the size of the entire convolution, instead of specifying the sizes per group.

Grouped convolutions are supported for all formats that are currently supported by the functions cudnnConvolutionForward(), cudnnConvolutionBackwardData(), and cudnnConvolutionBackwardFilter().

The tensor stridings that are set for groupCount of 1 are also valid for any group count.

By default, the convolution descriptor convDesc is set to groupCount of 1. Refer to Convolution Formulas for the math behind the cuDNN grouped convolution.

Example

Below is an example showing the dimensions and strides for grouped convolutions for NCHW format, for 2D convolution. The symbols * and / are used to indicate multiplication and division.

xDesc or dxDesc

  • Dimensions: [batch_size, input_channels, x_height, x_width]

  • Strides: [input_channels*x_height*x_width, x_height*x_width, x_width, 1]

wDesc or dwDesc

  • Dimensions: [output_channels, input_channels/groupCount, w_height, w_width]

  • Format: NCHW

convDesc

  • Group Count: groupCount

yDesc or dyDesc

  • Dimensions: [batch_size, output_channels, y_height, y_width]

  • Strides: [output_channels*y_height*y_width, y_height*y_width, y_width, 1]
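A minimal sketch of setting up the descriptors for a 2D grouped convolution along these lines (the sizes are illustrative assumptions: batch_size = 32, input_channels = 4, output_channels = 12, groupCount = 2, 3x3 filters; padding/stride setup and error handling are omitted):

cudnnTensorDescriptor_t xDesc, yDesc;
cudnnFilterDescriptor_t wDesc;
cudnnConvolutionDescriptor_t convDesc;

cudnnCreateTensorDescriptor(&xDesc);
cudnnCreateTensorDescriptor(&yDesc);
cudnnCreateFilterDescriptor(&wDesc);
cudnnCreateConvolutionDescriptor(&convDesc);

// Tensor descriptors describe the size of the entire convolution, not per-group sizes.
cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                           32, 4, 56, 56);      // [batch_size, input_channels, x_height, x_width]
// The filter's channel dimension is input_channels / groupCount = 4 / 2 = 2.
cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                           12, 2, 3, 3);        // [output_channels, input_channels/groupCount, w_height, w_width]
cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                           32, 12, 54, 54);     // [batch_size, output_channels, y_height, y_width]

// Enable grouped convolution on the convolution descriptor (default groupCount is 1).
// The cudnnSetConvolution2dDescriptor() call for pad/stride/dilation is omitted for brevity.
cudnnSetConvolutionGroupCount(convDesc, 2);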

Best Practices for 3D Convolutions#

Attention

These guidelines are applicable to 3D convolution and deconvolution functions starting in NVIDIA cuDNN v7.6.3.

The following guidelines are for setting the cuDNN library parameters to enhance the performance of 3D convolutions. Specifically, these guidelines focus on settings such as filter sizes, padding, and dilation. Additionally, an application-specific use case, namely medical imaging, is presented to demonstrate the performance enhancement of 3D convolutions with these recommended settings.

Specifically, these guidelines are applicable to the following functions and their associated data types:

For more information, refer to the cuDNN API Reference.

Limitations#

Your application will still be functional, but may be less performant, if the model has channel counts lower than 32 (performance worsens as the channel count decreases). If such layers are present in the network, use the cudnnFind* APIs to get the best option.

Environment Variables#

cuDNN's behavior can be influenced through a set of environment variables. The following environment variables are officially supported by cuDNN; each is listed with the documentation section that describes its usage:

Environment Variables#

  • NVIDIA_TF32_OVERRIDE: Graph API
  • CUDNN_LOGDEST_DBG: Troubleshooting
  • CUDNN_LOGLEVEL_DBG: Troubleshooting
  • CUDNN_LOGINFO_DBG: Troubleshooting (deprecated)
  • CUDNN_LOGWARN_DBG: Troubleshooting (deprecated)
  • CUDNN_LOGERR_DBG: Troubleshooting (deprecated)
  • CUDNN_LIB_CONFIG: cuDNN Sub-Library Configuration
  • CUDNN_FORWARD_COMPAT_DISABLE: cuDNN Hardware Forward Compatibility

Note

Except for the environment variables listed above, we provide no support or guarantee on the use of any other environment variables prefixed by CUDNN_.

SM Carveout#

Starting in cuDNN 8.9.5, SM carveout is supported on NVIDIA Hopper GPUs, allowing expert users to reserve SMs for concurrent execution on a separate CUDA stream. Users can pass a target SM count to cuDNN heuristics and get a list of engine configs that will use that number of SMs during execution. For advanced use cases without cuDNN heuristics, users can also create the engine config from scratch with the SM carveout configured (the engines that support this feature are listed below).

The following code snippet is a sample for heuristics use cases.

// Create heuristics descriptor
// (assumes opGraph is an already-finalized operation graph descriptor and
// heurMode is a chosen cudnnBackendHeurMode_t value)
cudnnBackendDescriptor_t engHeur;
cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR, &engHeur);
cudnnBackendSetAttribute(engHeur, CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH, CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &opGraph);
cudnnBackendSetAttribute(engHeur, CUDNN_ATTR_ENGINEHEUR_MODE, CUDNN_TYPE_HEUR_MODE, 1, &heurMode);
// SM carveout
int32_t targetSMCount = 66;
cudnnBackendSetAttribute(engHeur, CUDNN_ATTR_ENGINEHEUR_SM_COUNT_TARGET, CUDNN_TYPE_INT32, 1, &targetSMCount);
cudnnBackendFinalize(engHeur);
// Create engine config descriptor
cudnnBackendDescriptor_t engConfig;
cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &engConfig);
// Retrieve optimal engine config(s) from heuristics
int64_t returnedCount = -1;
cudnnBackendGetAttribute(engHeur, CUDNN_ATTR_ENGINEHEUR_RESULTS, CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &returnedCount, engConfig);
// "engConfig" should now be ready with target SM count as 66

This feature is currently supported by normal convolutions (Fprop, Dgrad, and Wgrad) as well as the Conv-Bias-Act fusions.

cuDNN Backend Engines that Support SM Carveout#

  • Convolution Forward: engines 6, 58, 61, 62, 64, 65, 66, 67, 68, 69
  • Convolution Backward Data: engines 7, 63, 66, 67, 68, 69, 70, 71, 72, 73, 75, 76
  • Convolution Backward Filter: engines 17, 62, 64, 65, 66, 68
  • cudnnConvolutionBiasActivationForward: engines 14, 39, 40, 41, 42, 43

Version Checking Against CUDNN_VERSION#

The definition of CUDNN_VERSION has changed from

CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL

to

CUDNN_MAJOR * 10000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL

Therefore, any version checking using CUDNN_VERSION should be updated accordingly. For example, if users want to exercise a code path only if cuDNN is greater than or equal to 9.0.0, they should use macro conditionals such as CUDNN_VERSION >= 90000 rather than CUDNN_VERSION >= 9000.
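For example, a guard for such a code path would look like this under the new scheme (a minimal sketch):

#include <cudnn.h>   // defines CUDNN_VERSION, CUDNN_MAJOR, CUDNN_MINOR, CUDNN_PATCHLEVEL

#if CUDNN_VERSION >= 90000
    // 9.0.0 or later under the new MAJOR * 10000 scheme
#else
    // fallback for older cuDNN releases
#endif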

cuDNN Symbol Server#

Obfuscated symbols for cuDNN libraries that are being debugged or profiled in your application can be downloaded from a repository of symbols for Linux. The repository hosts symbol files (.sym) that contain obscured symbol names (debug data is not distributed).

When an issue appears with a cuDNN API, using the symbol server to symbolize its stack trace can help speed up the debug process.

There are two recommended ways to use obfuscated symbols for each cuDNN library with the GNU Debugger (GDB):

  • By unstripping the library

  • By deploying the .sym file as a separate debug information file

The following code illustrates the recommended ways to use obfuscated symbols on x86_64 Ubuntu 22.04:

# Determine the Build ID of the library
$ readelf -n /usr/lib/x86_64-linux-gnu/libcudnn_graph.so

# ... Build ID: 457c8f5dea095b0f90af2abddfcb69946df61b76
# Browse to https://cudatoolkit-symbols.nvidia.com/libcudnn_graph.so/457c8f5dea095b0f90af2abddfcb69946df61b76/index.html to determine .sym filename to download
$ wget https://cudatoolkit-symbols.nvidia.com/libcudnn_graph.so/457c8f5dea095b0f90af2abddfcb69946df61b76/libcudnn_graph.so.9.0.0.sym

# Then with appropriate permissions, either unstrip,
$ eu-unstrip /usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.0.0 libcudnn_graph.so.9.0.0.sym -o /usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.0.0

# Or, with appropriate permissions, deploy as a separate debug info file
# By splitting the Build ID into two parts, with the first two characters as the directory
# And the remaining characters as the filename with the ".debug" extension
$ cp libcudnn_graph.so.9.0.0.sym /usr/lib/debug/.build-id/45/7c8f5dea095b0f90af2abddfcb69946df61b76.debug

Example: Symbolizing#

Here is a simplified example to show the uses of symbolizing. A sample application named test_shared calls the cuDNN API cudnnDestroy() which leads to a segmentation fault. With a default install of cuDNN and no obscured symbols, the output in GDB might look like the following:

Thread 1 "test_shared" received signal SIGSEGV, Segmentation fault.
0x00007ffff7a4ac01 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9
(gdb) bt
#0  0x00007ffff7a4ac01 in ?? () from /usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9
#1  0x00007ffff7a4c919 in cudnnDestroy () from /usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9
#2  0x00000000004007b7 in main ()

After applying the obscured symbols in one of the ways described earlier, the stack trace will look like the following:

Thread 1 "test_shared" received signal SIGSEGV, Segmentation fault.
0x00007ffff7a4ac01 in libcudnn_graph_148ced18265f5231d89551dcbdcf5cf3fe6d77d1 () from /usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9
(gdb) bt
#0  0x00007ffff7a4ac01 in libcudnn_graph_148ced18265f5231d89551dcbdcf5cf3fe6d77d1 () from /usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9
#1  0x00007ffff7a4c919 in cudnnDestroy () from /usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9
#2  0x00000000004007b7 in main ()

The symbolized call stack can then be documented as part of a bug description provided to NVIDIA for analysis.

API Usage During Initialization and Termination#

cuDNN uses a global state that is initialized during host program initiation and destroyed during host program termination. cuDNN cannot detect whether this state is still valid, so calling any cuDNN interface (implicitly or explicitly) during program initiation, or during termination after main() has returned, may result in undefined behavior. CUDA has the same restriction.