cuDNN Release 8.x.x

cuDNN Release 8.0.0 Preview

Attention: This is the cuDNN 8.0.0 Preview release. This Preview release is for early testing and feedback; therefore, for production use of cuDNN, continue to use cuDNN 7.6.5. This release is subject to change based on ongoing performance tuning and functional testing. For feedback on the new backend API and deprecations, email
These release notes are applicable to JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).
Note: cuDNN 8.0.0 passed GA quality testing and validation for TensorRT and JetPack users.

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

The following features and enhancements have been added to this release:

cuDNN library
  • The cuDNN library has been split into the following libraries:
    • cudnn_ops_infer - This entity contains the routines related to cuDNN context creation and destruction, tensor descriptor management, tensor utility routines, and the inference portion of common machine learning algorithms such as batch normalization, softmax, dropout, etc.

    • cudnn_ops_train - This entity contains common training routines and algorithms, such as batch normalization, softmax, dropout, etc. The cudnn_ops_train library depends on cudnn_ops_infer.

    • cudnn_cnn_infer - This entity contains all routines related to convolutional neural networks needed at inference time. The cudnn_cnn_infer library depends on cudnn_ops_infer.

    • cudnn_cnn_train - This entity contains all routines related to convolutional neural networks needed during training time. The cudnn_cnn_train library depends on cudnn_ops_infer, cudnn_ops_train, and cudnn_cnn_infer.

    • cudnn_adv_infer - This entity contains all other features and algorithms. This includes RNNs, CTC loss, and multi-head attention. The cudnn_adv_infer library depends on cudnn_ops_infer.

    • cudnn_adv_train - This entity contains all the training counterparts of cudnn_adv_infer. The cudnn_adv_train library depends on cudnn_ops_infer, cudnn_ops_train, and cudnn_adv_infer.

    • cudnn - This is an optional shim layer between the application layer and the cuDNN code. This layer opportunistically opens the correct library for the API at runtime.

  • cuDNN does not support mixing sub-library versions. If there is a mismatch in the cuDNN version numbers in the cuDNN sub-library header files, the build will crash. The versions need to match on the major number and minor number, as well as the patch level.

  • The cuDNN sub-libraries must be installed under a single directory.
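The version-matching rule above can be illustrated with a short sketch. The version tuples below are hypothetical stand-ins for the major, minor, and patch values that each sub-library header would report, and the helper name find_version_mismatches is an assumption made for this example; neither is part of cuDNN.

```python
# Hypothetical (major, minor, patch) versions, one per sub-library header.
# Real values would come from the installed headers; these are examples only.
SUBLIB_VERSIONS = {
    "cudnn_ops_infer": (8, 0, 0),
    "cudnn_ops_train": (8, 0, 0),
    "cudnn_cnn_infer": (8, 0, 0),
    "cudnn_cnn_train": (8, 0, 0),
    "cudnn_adv_infer": (8, 0, 0),
    "cudnn_adv_train": (8, 0, 0),
}


def find_version_mismatches(versions, reference="cudnn_ops_infer"):
    """Return the sub-libraries whose version differs from the reference.

    All three components (major, minor, patch) must match exactly.
    """
    expected = versions[reference]
    return sorted(name for name, ver in versions.items() if ver != expected)


mismatched = find_version_mismatches(SUBLIB_VERSIONS)
print("mismatched sub-libraries:", mismatched or "none")
```

A check like this could run before a build to report a mixed installation early, rather than discovering it when the build stops.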

Multiple dynamic libraries
In order to link against a subset of cuDNN, you need to know which subset of the API you are using and then link against the appropriate cuDNN sub-components. The cuDNN sub-components are as follows:
cuDNN linking options
There are two different linking options:
  • Linking against individual sub-libraries: Users who link against individual sub-libraries must be able to identify the API exposed by each cuDNN sub-library. Users also need to know the hierarchy of the different cuDNN sub-libraries. Each .so or .a needs to be specified explicitly in the user’s linking command, as well as any external dependencies cuDNN requires. For more information, refer to the Limitations section below.

  • Linking against the full cuDNN (compatibility option): This allows users to use -lcudnn. The cudnn library is provided as a shim layer that opens the appropriate cuDNN sub-library for any particular cuDNN API call. libcudnn.a is largely unchanged; it is a statically linked file for all of cuDNN.
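For the first option, the sub-library hierarchy listed earlier determines which libraries must appear on the link line. The following is a minimal sketch, assuming only the dependencies stated in these notes; the link_flags helper is a hypothetical name, and external dependencies that cuDNN requires are not covered.

```python
# Direct dependencies as listed in the sub-library descriptions above.
DEPENDS_ON = {
    "cudnn_ops_infer": [],
    "cudnn_ops_train": ["cudnn_ops_infer"],
    "cudnn_cnn_infer": ["cudnn_ops_infer"],
    "cudnn_cnn_train": ["cudnn_ops_infer", "cudnn_ops_train", "cudnn_cnn_infer"],
    "cudnn_adv_infer": ["cudnn_ops_infer"],
    "cudnn_adv_train": ["cudnn_ops_infer", "cudnn_ops_train", "cudnn_adv_infer"],
}


def link_flags(needed):
    """Return -l flags for the needed sub-libraries plus their dependencies.

    Dependents are listed before the libraries they depend on, which is the
    order that static linkers traditionally expect.
    """
    ordered = []  # depth-first post-order: dependencies come first

    def visit(lib):
        if lib in ordered:
            return
        for dep in DEPENDS_ON[lib]:
            visit(dep)
        ordered.append(lib)

    for lib in needed:
        visit(lib)
    return ["-l" + lib for lib in reversed(ordered)]


print(" ".join(link_flags(["cudnn_cnn_infer"])))
# prints "-lcudnn_cnn_infer -lcudnn_ops_infer"
```

An application that only runs convolution inference would, under these assumptions, need two sub-libraries rather than the full set.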

cuDNN loading options
For users who want a smaller memory footprint, there are two ways of loading the library.
  • Cherry-pick loading: Each sub-library is loaded only when accessed. This will cause the first reference to that sub-library to take a long time but will ensure the user isn’t loading more libraries than they need.

  • All access loading: All available cuDNN sub-libraries are loaded early during runtime.
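The difference between the two loading strategies can be sketched generically. The example below does not load cuDNN; the SubLibraryLoader class and the fake_open stand-in are hypothetical names used only to illustrate the pattern by recording which sub-libraries were opened.

```python
class SubLibraryLoader:
    """Illustrative loader; `open_library` stands in for a real dlopen call."""

    def __init__(self, names, open_library, eager=False):
        self._names = list(names)
        self._open = open_library
        self._handles = {}
        if eager:
            # All access loading: open every sub-library up front.
            for name in self._names:
                self.get(name)

    def get(self, name):
        # Cherry-pick loading: open a sub-library on first access only.
        if name not in self._handles:
            self._handles[name] = self._open(name)
        return self._handles[name]

    def loaded(self):
        return sorted(self._handles)


opened = []  # records every call made to the stand-in opener


def fake_open(name):
    opened.append(name)
    return object()  # placeholder for a library handle


names = ["cudnn_ops_infer", "cudnn_cnn_infer", "cudnn_adv_infer"]

lazy = SubLibraryLoader(names, fake_open)
lazy.get("cudnn_cnn_infer")
print(lazy.loaded())   # only the sub-library that was accessed

eager = SubLibraryLoader(names, fake_open, eager=True)
print(eager.loaded())  # every sub-library, opened at construction
```

The lazy variant pays the opening cost on first use of each sub-library, while the eager variant pays it once at startup.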

New API functions

For a list of functions and data types that were added in this release, see API Changes For cuDNN 8.0.0.

General Support of CUDA Graph Capture
CUDA Graphs are now supported for all functions in this release, with the following restrictions:
  • CUDA Toolkit 10.2 or higher is required.
  • cuDNN 8.0.0 graphs are captured via the CUDA graph-capture APIs.
  • Any non-default use of textures by users of cuDNN needs to be disabled prior to capture.

cuDNN 8.0.0 does not at this time offer API support to add operations to an existing CUDA graph directly; however, the captured graph may be added to an existing graph through the existing CUDA Graphs API.

Regarding texture usage, cuDNN 8.0.0 by default will not enable texture usage; expert users may enable texture usage where allowed, but that usage will prevent a successful CUDA Graph capture until disabled. In order for cuDNN 8.0.0 to be graph-capture compatible library-wide, the cuDNN 8.0.0 CTC API was updated as described elsewhere.

The usual restrictions for CUDA Graphs apply in addition to these restrictions here.

New APIs for convolution

This release adds a new set of API functions that provides a new approach to cuDNN, offering more fine-grained control of performance, numerical properties, and so on, for convolution. Using this API, users directly access various engines that compute convolution forward propagation, backward data, backward filter, and generic support for fusion, starting with limited support in this cuDNN 8.0.0 release and expanding in follow-up releases. Each engine has performance tuning knobs such as GEMM tiling and split-K. Users can use this API to fine-tune their network by querying cuDNN’s heuristics, or doing their own, to find the optimal engine configuration with which cuDNN computes each network layer.

NVIDIA Ampere GPU architecture support (not applicable for Jetson platforms)
  • Added support for A100 GPU based on NVIDIA Ampere architecture.
  • cuDNN 8.0.0 shows significant performance improvements on A100 GPUs compared to Volta V100 with cuDNN 7.6.
  • Added support for Tensor Float 32 (TF32) for 1D and 2D convolutions. Full support for TF32, such as grouped convolutions and 3D convolutions, will come in future releases, in addition to further performance tuning.
  • Increased performance for the legacy Tensor Cores (mixed precision) for 1D, 2D, 3D, and grouped convolutions.
Turing and Volta architecture improvements
  • New kernels for Tensor Cores and a heuristics update for 1D convolution, resulting in performance improvements for speech networks such as Jasper, Tacotron2, and WaveGlow, in addition to support for grouped 1D convolutions (QuartzNet).
  • Added NHWC support for 3D convolutions and improved heuristics and kernels for Tensor Cores in NCHW, resulting in performance improvements for VNet, UNet-Medical, and UNet-Industrial. Additionally, FP16 3D convolutions are supported.
  • Better utilization of Tensor Cores and improved heuristics for grouped convolutions result in improvements for ResNext.
  • More tuning for vision networks like ResNet-50 ([MXNet] [PyTorch] [TensorFlow]) and SSD ([PyTorch] [TensorFlow]) with new updated heuristics.
Operation fusion

Operation fusion can be achieved via the backend API. The general workflow is similar to running unfused operations, except that instead of creating a single operation Operation Graph, the user may specify a multi-operation Operation Graph. For more information, see Operation Fusion Via The Backend API in the cuDNN Developer Guide.

Depthwise convolution extension

We’ve extended the fprop and dgrad NHWC depthwise kernels to support more combinations (filter sizes/strides) such as 5x5/1x1, 5x5/2x2, 7x7/1x1, 7x7/2x2 (in addition to what we already have, 1x1/1x1, 3x3/1x1, 3x3/2x2), which provides good performance.


Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.0.0.


Limitations

  • Samples must be installed in a writable location; otherwise, the samples can crash.

  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN 8.0.0 library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section in the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory, while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • Some data types are not widely supported by all cuDNN APIs. For example, CUDNN_DATA_INT8x4 is not supported by many functions. In such cases, support is available by using cudnnTransformTensor() to transform the tensors from the desired type to a type supported by the API. For example, a user is able to transform input tensors from CUDNN_DATA_INT8x4 to CUDNN_DATA_INT8, run the desired API, and then transform output tensors from CUDNN_DATA_INT8 to CUDNN_DATA_INT8x4. Note that this transformation will incur an extra round trip to memory.

  • In the cuDNN library, tensor pointers and filter pointers require at least 4-byte alignment, including for INT8 data.

  • Some computational options in cuDNN 8.0.0 now require increased alignment on tensors in order to run performantly. As always, cuDNN recommends users to align tensors to 128-bit boundaries which will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.0 compared to cuDNN v7.6.

  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.0 users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
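The CUBLAS_WORKSPACE_CONFIG setting described in the RNN and multi-head attention limitation above can be sketched as follows. The two helper functions are hypothetical and only interpret the :size:count format as these notes describe it (sizes in KB); the sketch also assumes the variable must be set before the process initializes cuBLAS, which is an assumption about when the library reads it.

```python
import os


def parse_workspace_config(value):
    """Parse a string such as ':16:8:4096:2' into (size_kb, count) pairs."""
    fields = [f for f in value.split(":") if f]
    if not fields or len(fields) % 2 != 0:
        raise ValueError(f"malformed workspace config: {value!r}")
    numbers = [int(f) for f in fields]
    return list(zip(numbers[0::2], numbers[1::2]))


def has_single_buffer_size(value):
    """True when only one buffer size is configured."""
    return len({size for size, _ in parse_workspace_config(value)}) == 1


# Assumption: set this before any cuBLAS handle is created in the process.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:2"

print(parse_workspace_config(":16:8:4096:2"))
# prints [(16, 8), (4096, 2)]
print(has_single_buffer_size(os.environ["CUBLAS_WORKSPACE_CONFIG"]))
# prints True
```

The default :16:8:4096:2 configuration fails the single-size check, which matches the condition under which these notes say results may be non-deterministic in multi-stream setups.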

Deprecated Features

The following features are deprecated in cuDNN 8.0.0:
  • Support for Ubuntu 14.04 has been deprecated in this release. Upgrade to 16.04 or 18.04 for continued support.

  • Support for Mac OS X has been deprecated in this release. Operating systems that are currently supported are Linux and Windows.

  • cuDNN version 8 introduces a new API deprecation policy to enable a faster pace of innovation. A streamlined, two-step, deprecation policy will be used for all API changes starting with cuDNN version 8. For details about this new deprecation policy, see Backward Compatibility And Deprecation Policy in the cuDNN Developer Guide.

  • Removed and deprecated API changes. For a list of removed and deprecated APIs, see API Changes For cuDNN 8.0.0.

Fixed Issues

The following issues have been fixed in this release:

  • cudnnDestroy() did not destroy everything that cudnnCreate() created. Calling cudnnDestroy() after cudnnCreate() leaked about 1.6 MB of host memory in some tests. This issue has been fixed in cuDNN 8.0.0.

  • Starting in cuDNN 7.6.1, when using the experimental multi-head attention API, it is possible that the forward and backward paths produce different results for the BERT model, when the batch size is greater than one and/or the number of heads is greater than one. This issue has been fixed in cuDNN 8.0.0.

  • The description of cudnnSetCTCLossDescriptorEx() is not clear. This issue has been fixed in cuDNN 8.0.0.

  • Documentation affecting 1x1 convolution functions is not clear, for example cudnnFindConvolutionBackwardDataAlgorithm(). This issue has been fixed in cuDNN 8.0.0.

  • cuDNN forward convolution with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM does not propagate NaNs in weights. This issue has been fixed in cuDNN 8.0.0.

  • Mathematical definitions of the operations in cuDNN were not documented. Full mathematical descriptions are now included for the convolution functions.

  • The functions cudnnGetConvolutionForwardAlgorithm_v7() and cudnnGetConvolutionForwardWorkspaceSize() may return CUDNN_STATUS_SUCCESS while the execution of the same convolution returns CUDNN_STATUS_NOT_SUPPORTED. Similar issues may also happen for convolutionBackwardData() and convolutionBackwardFilter(). This issue is present in cuDNN 7.2.2 library and later versions. This has been fixed in cuDNN 8.0.0.

  • Algorithms returned by cudnnGetConvolution*Algorithm() may, in some limited use cases, fail to execute when they are actually run. This is a cuDNN library-wide issue and applies for convolution forward, convolution backward data, and convolution backward filter operations. This issue is also present in versions prior to cuDNN 8.0.0 EA.

  • Previously, cuDNN did not support CUDA graphs: launching a CUDA graph constructed via a stream capture that includes a cudnnConvolutionForward() operation could produce a cudaErrorLaunchFailure error. The user can proceed.

  • There was a known performance drop in 3D convolutions for some cases on Turing GPUs since cuDNN 7.4.2. This has been fixed on T4. (not applicable for Jetson platforms)

  • There are rare cases where cudnnConvolution* will return CUDNN_STATUS_NOT_SUPPORTED when cudnn*GetWorkspaceSize might return success for a given algorithm. This has been fixed in cuDNN 8.0.0.

  • In previous versions of cuDNN, CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM did not propagate NaN values in some cases. This is fixed in the current release. Users desiring the old behavior can configure ReLU activation and set the floor to be -Inf.

  • The multiHeadAttention sample code was added to the cuDNN 7.6.3 release. The sample code includes a simple NumPy/Autograd reference model of the multi-head attention block that computes the forward response and all derivatives. The test code demonstrates how to use the multi-head attention API, access attention weights, and sequence data.

Known Issues

  • Performance regressions on V100 are observed in this release on SSD inference use cases if not using TensorRT.

  • There are significant performance regressions on pre-Volta GPUs and some Turing GPUs based on the TU102 architecture. This performance regression is not applicable to T4, JetPack, and Tegra.

  • Sub-optimal performance is present in this release for all INT8 convolutions for all GPUs.

  • The performance of cudnnConvolutionBiasActivationForward() is slower than v7.6 in most cases. This is being actively worked on and performance optimizations will be available in the upcoming releases.

  • On K80 GPUs, when cudnnConvolutionForward() is used with the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types, a silent error might occur.

  • There are some peer-to-peer documentation links that are broken within the cuDNN API Reference. These links will be fixed in the next release.

  • cudnnCnnTrainVersionCheck() and cudnnCnnInferVersionCheck() are missing in this release and will be added in the GA release.

  • Documentation of RNN new APIs and deprecations is not complete. The cudnnRNNBackwardData_v8() and cudnnRNNBackwardWeights_v8() functions will be implemented in the next release.

  • cuDNN 8.0.0 Preview will not work with GA10x NVIDIA Ampere GPU architectures. This will be fixed in the next release.

  • cuDNN 8.0.0 Preview build with Windows and CUDA 11.0 RC has reduced performance on 2D, 3D, and grouped convolutions compared to Linux.

  • Updated: June 12, 2020

    There is a known issue in cuDNN 8.0.0 when linking statically to cuDNN and using the library's 3D algo1 backward filter convolutions. Users will see the library emit an internal error or incorrectly state that a shared library is missing. This is a bug that will be fixed in a future release.

  • Updated: June 25, 2020
    When using an RPM file on RedHat for installation, installing cuDNN v8 directly or via TensorRT 7.1.3 will enable users to build their application with cuDNN v8. However, in order for the user to compile an application with cuDNN v7 after cuDNN v8 is installed, the user will need to perform the following steps:
    1. Issue sudo mv /usr/include/cudnn.h /usr/include/cudnn_v8.h.
    2. Issue sudo ln -s /etc/alternatives/libcudnn /usr/include/cudnn.h.
    3. Switch to cuDNN v7 by issuing sudo update-alternatives --config libcudnn and choose cuDNN v7 from the list.

    Steps 1 and 2 are required for the user to be able to switch between v7 and v8 installations. After steps 1 and 2 are performed once, step 3 can be used repeatedly and the user can choose the appropriate cuDNN version to work with. For more information, refer to the Installing From An RPM File and Upgrading From v7 To v8 sections in the cuDNN Installation Guide.