Attention: This is the cuDNN 8.0.0 Preview release. This Preview release
is for early testing and feedback; for production use of cuDNN,
continue to use
cuDNN 7.6.5. This release is
subject to change based on ongoing performance tuning and functional testing.
For feedback on the new backend API and deprecations, email
cudnn@nvidia.com.
These release notes are applicable to JetPack users of cuDNN unless appended
specifically with
(not applicable for Jetson platforms).
Note: cuDNN 8.0.0
passed GA quality testing and validation for TensorRT and JetPack
users.
For previous cuDNN documentation, see the cuDNN Archived Documentation.
Key Features and Enhancements
The following features and enhancements have been added to this release:
- cuDNN library
-
The
cuDNN library has been split into the
following libraries:
-
cudnn_ops_infer - This entity
contains the routines related to cuDNN context
creation and destruction, tensor descriptor
management, tensor utility routines, and the
inference portion of common machine learning
algorithms such as batch normalization, softmax,
dropout, etc.
-
cudnn_ops_train - This entity
contains common training routines and algorithms,
such as batch normalization, softmax, dropout,
etc. The cudnn_ops_train library
depends on cudnn_ops_infer.
-
cudnn_cnn_infer - This entity
contains all routines related to convolutional
neural networks needed at inference time. The
cudnn_cnn_infer library depends
on cudnn_ops_infer.
-
cudnn_cnn_train - This entity
contains all routines related to convolutional
neural networks needed during training time. The
cudnn_cnn_train library depends
on cudnn_ops_infer,
cudnn_ops_train, and
cudnn_cnn_infer.
-
cudnn_adv_infer - This entity
contains all other features and algorithms. This
includes RNNs, CTC loss, and multi-head attention.
The cudnn_adv_infer library
depends on cudnn_ops_infer.
-
cudnn_adv_train - This entity
contains all the training counterparts of
cudnn_adv_infer. The
cudnn_adv_train library depends
on cudnn_ops_infer,
cudnn_ops_train, and
cudnn_adv_infer.
-
cudnn - This is an optional shim
layer between the application layer and the cuDNN
code. This layer opportunistically opens the
correct library for the API at runtime.
-
cuDNN does not support mixing sub-library versions. If there is a mismatch in the cuDNN version numbers in the cuDNN sub-library header files, the build will fail. The versions need to match on the major number and minor number, as well as the patch level. (A runtime check for version consistency is sketched after this list.)
-
The cuDNN sub-libraries must be installed
under a single directory.
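Because all sub-libraries must carry the same version, a simple runtime sanity check is to compare the version the application was compiled against with the version of the library actually loaded. The following is a minimal illustrative sketch, not an official diagnostic; it relies only on the documented CUDNN_VERSION macro and cudnnGetVersion() routine:

```cpp
// Minimal sketch: verify that the cuDNN library loaded at runtime matches
// the headers the application was compiled against. CUDNN_VERSION comes
// from the cuDNN headers; cudnnGetVersion() reports the loaded library.
#include <cudnn.h>
#include <cstdio>

int main() {
    size_t runtime = cudnnGetVersion();
    if (runtime != CUDNN_VERSION) {
        std::printf("cuDNN version mismatch: compiled %d, loaded %zu\n",
                    CUDNN_VERSION, runtime);
        return 1;
    }
    std::printf("cuDNN %zu OK\n", runtime);
    return 0;
}
```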
- Multiple dynamic libraries
-
In order to link against a subset of
cuDNN, you need to
know which subset of the API you are using and then link against the
appropriate
cuDNN sub-components. The cuDNN
sub-components are as follows:
- cudnn_ops_infer.so
- cudnn_ops_train.so
- cudnn_cnn_infer.so
- cudnn_cnn_train.so
- cudnn_adv_infer.so
- cudnn_adv_train.so
- cuDNN linking options
-
There are two different linking options:
-
Linking against individual sub-libraries: Users who link against individual sub-libraries must be able to identify the API exposed by each cuDNN sub-library. Users also need to know the hierarchy of the different cuDNN sub-libraries. Each .so or .a needs to be specified explicitly in the user's linking command, as well as any external dependencies cuDNN requires. For more information, refer to the Limitations section below. (A sketch of this option follows this list.)
-
Linking against the full cuDNN (compatibility option): This option allows users to continue using -lcudnn. libcudnn.so is provided as a shim layer that opens the appropriate cuDNN sub-library for any particular cuDNN API call. libcudnn.a is largely unchanged: it remains a single statically linked archive containing all of cuDNN.
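As an illustration of the first option, the sketch below is a minimal inference-only program that touches only handle and tensor management (cudnn_ops_infer) and convolution descriptor management (cudnn_cnn_infer). The link line in the comment is an assumption for a typical Linux install, not an official command:

```cpp
// Illustrative inference-only program that exercises only the
// cudnn_ops_infer (handle management) and cudnn_cnn_infer (convolution
// descriptors) sub-libraries. A plausible link line, shown here as an
// assumption rather than an official command:
//   g++ app.cpp -lcudnn_cnn_infer -lcudnn_ops_infer -lcudart
#include <cudnn.h>
#include <cstdio>

int main() {
    cudnnHandle_t handle;
    if (cudnnCreate(&handle) != CUDNN_STATUS_SUCCESS) return 1;  // ops_infer

    cudnnConvolutionDescriptor_t conv;                            // cnn_infer
    cudnnCreateConvolutionDescriptor(&conv);
    cudnnDestroyConvolutionDescriptor(conv);

    cudnnDestroy(handle);
    std::printf("linked against ops_infer + cnn_infer only\n");
    return 0;
}
```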
- cuDNN loading options
-
For users who want a smaller memory footprint, there are two ways of loading the library (illustrated in the sketch after this list).
-
Cherry-pick loading: Each sub-library is loaded only when
accessed. This will cause the first reference to that
sub-library to take a long time but will ensure the user
isn’t loading more libraries than they need.
-
All access loading: All available cuDNN
sub-libraries are loaded early during runtime.
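The difference between the two modes is observable from the application. The sketch below is a rough, assumption-laden illustration: under cherry-pick loading, the first call that touches a sub-library should carry that sub-library's load cost; exact timings depend on the installation and on which loading mode the shim layer uses.

```cpp
// Sketch: observe lazy sub-library loading by timing the first call into
// each sub-library. Behavior depends on the install and loading mode.
#include <cudnn.h>
#include <chrono>
#include <cstdio>

static double ms_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

int main() {
    auto t0 = std::chrono::steady_clock::now();
    cudnnHandle_t handle;
    cudnnCreate(&handle);                    // first touch of cudnn_ops_infer
    std::printf("first ops_infer call: %.2f ms\n", ms_since(t0));

    t0 = std::chrono::steady_clock::now();
    cudnnConvolutionDescriptor_t conv;
    cudnnCreateConvolutionDescriptor(&conv); // first touch of cudnn_cnn_infer
    std::printf("first cnn_infer call: %.2f ms\n", ms_since(t0));

    cudnnDestroyConvolutionDescriptor(conv);
    cudnnDestroy(handle);
    return 0;
}
```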
- New API functions
-
For a list of functions and data types that were added in this release,
see API Changes For cuDNN
8.0.0.
- General Support of CUDA Graph Capture
-
CUDA Graphs are now supported for all functions in this release, with the following restrictions:
- CUDA Toolkit 10.2 or higher is required
- cuDNN 8.0.0 graphs are captured via the CUDA graph-capture
APIs
- Any non-default use of textures by users of cuDNN needs to be disabled prior to capture
cuDNN 8.0.0 does not at this time offer API support to add operations to
an existing CUDA graph directly; however, the captured graph may be
added to an existing graph through the existing CUDA Graphs API.
Regarding texture usage, cuDNN 8.0.0 by default will not enable texture
usage; expert users may enable texture usage where allowed, but that
usage will prevent a successful CUDA Graph capture until disabled. In
order for cuDNN 8.0.0 to be graph-capture compatible library-wide, the
cuDNN 8.0.0 CTC API was updated as described elsewhere.
The usual restrictions for CUDA Graphs apply in addition to the restrictions listed here. A sketch of capturing a cuDNN call with the stream-capture APIs follows.
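The following is a minimal sketch of the supported pattern, capturing a simple cudnnAddTensor() call with the standard CUDA stream-capture APIs and replaying it. Error checking is trimmed, and the choice of operation is illustrative only:

```cpp
// Sketch: capture a cuDNN call into a CUDA graph via stream capture
// (requires CUDA 10.2+). Error checking is trimmed for brevity.
#include <cudnn.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudnnSetStream(handle, stream);

    // Small float tensor, NCHW 1x1x1x8.
    cudnnTensorDescriptor_t desc;
    cudnnCreateTensorDescriptor(&desc);
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               1, 1, 1, 8);
    float *a, *c;
    cudaMalloc(&a, 8 * sizeof(float));
    cudaMalloc(&c, 8 * sizeof(float));
    float alpha = 1.0f, beta = 1.0f;

    // Warm up once outside capture so lazy initialization happens first.
    cudnnAddTensor(handle, &alpha, desc, a, &beta, desc, c);
    cudaStreamSynchronize(stream);

    // Record C = alpha*A + beta*C into a graph instead of running it.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    cudnnAddTensor(handle, &alpha, desc, a, &beta, desc, c);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(exec, stream);   // replay the captured work
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(a); cudaFree(c);
    cudnnDestroyTensorDescriptor(desc);
    cudaStreamDestroy(stream);
    cudnnDestroy(handle);
    std::printf("captured and replayed a cuDNN call\n");
    return 0;
}
```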
- New APIs for convolution
-
A new set of API functions provides a brand new approach to cuDNN, offering more fine-grained control of performance, numerical properties, and so on for convolution. Using this API, users directly access various engines that compute convolution forward propagation, backward data, and backward filter, plus generic support for fusion, starting with limited support in this cuDNN 8.0.0 release and expanding in follow-up releases. Each engine has performance tuning knobs such as GEMM tiling and split-K. Users can use this API to fine-tune their network by querying cuDNN's heuristics, or doing their own, to find the optimal engine configuration with which cuDNN computes each network layer. (The common descriptor pattern of this API is sketched below.)
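All of the new functions operate on backend descriptors through a uniform create/set-attribute/finalize pattern. The sketch below shows that pattern for a backend tensor descriptor only; building a full execution plan chains the same calls through operation, operation graph, engine configuration, and variant pack descriptors. It is an illustration of the calling convention, not a complete convolution setup:

```cpp
// Sketch of the backend descriptor pattern: create, set attributes,
// finalize. Shown for a tensor descriptor only; engines, engine configs,
// execution plans, and variant packs follow the same pattern.
#include <cudnn.h>
#include <cstdint>
#include <cstdio>

int main() {
    cudnnBackendDescriptor_t tensor;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_TENSOR_DESCRIPTOR, &tensor);

    // A 4D NCHW float tensor: 1x32x128x128, fully packed strides.
    int64_t dims[]    = {1, 32, 128, 128};
    int64_t strides[] = {32 * 128 * 128, 128 * 128, 128, 1};
    cudnnDataType_t dtype = CUDNN_DATA_FLOAT;
    int64_t uid = 'x';       // unique id chosen by the application
    int64_t alignment = 16;  // bytes; 128-bit alignment as recommended

    cudnnBackendSetAttribute(tensor, CUDNN_ATTR_TENSOR_DATA_TYPE,
                             CUDNN_TYPE_DATA_TYPE, 1, &dtype);
    cudnnBackendSetAttribute(tensor, CUDNN_ATTR_TENSOR_DIMENSIONS,
                             CUDNN_TYPE_INT64, 4, dims);
    cudnnBackendSetAttribute(tensor, CUDNN_ATTR_TENSOR_STRIDES,
                             CUDNN_TYPE_INT64, 4, strides);
    cudnnBackendSetAttribute(tensor, CUDNN_ATTR_TENSOR_UNIQUE_ID,
                             CUDNN_TYPE_INT64, 1, &uid);
    cudnnBackendSetAttribute(tensor, CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT,
                             CUDNN_TYPE_INT64, 1, &alignment);

    cudnnStatus_t st = cudnnBackendFinalize(tensor);
    std::printf("finalize: %s\n", cudnnGetErrorString(st));

    cudnnBackendDestroyDescriptor(tensor);
    return 0;
}
```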
- NVIDIA Ampere GPU architecture support (not applicable for Jetson
platforms)
-
- Added support for the A100 GPU based on the NVIDIA Ampere architecture.
- cuDNN 8.0.0 delivers significant performance improvements on A100 GPUs compared to Volta V100 with cuDNN 7.6.
- Added support for Tensor Float 32 (TF32) for 1D and 2D convolutions (see the sketch after this list). Full TF32 support, including grouped convolutions and 3D convolutions as well as further performance tuning, will come in future releases.
- Increased performance for the legacy Tensor Cores (mixed precision) for 1D, 2D, 3D, and grouped convolutions.
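As a sketch of how an application opts in or out of TF32 on a convolution, assuming (as we read the 8.0.0 behavior) that the default math type permits TF32 for eligible FP32 convolutions on NVIDIA Ampere GPUs while CUDNN_FMA_MATH requests the classical FP32 FMA path:

```cpp
// Sketch: math-type selection on a convolution descriptor. Under our
// assumptions, CUDNN_DEFAULT_MATH permits TF32 on NVIDIA Ampere GPUs for
// eligible FP32 convolutions; CUDNN_FMA_MATH opts out in favor of FP32 FMA.
#include <cudnn.h>
#include <cstdio>

int main() {
    cudnnConvolutionDescriptor_t conv;
    cudnnCreateConvolutionDescriptor(&conv);
    cudnnSetConvolution2dDescriptor(conv, /*pad*/1, 1, /*stride*/1, 1,
                                    /*dilation*/1, 1, CUDNN_CROSS_CORRELATION,
                                    CUDNN_DATA_FLOAT);

    // Allow TF32 where eligible (default), or force FP32 FMA instead.
    cudnnSetConvolutionMathType(conv, CUDNN_DEFAULT_MATH);
    // cudnnSetConvolutionMathType(conv, CUDNN_FMA_MATH);

    cudnnDestroyConvolutionDescriptor(conv);
    std::printf("math type configured\n");
    return 0;
}
```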
- Turing and Volta architecture improvements
-
- New kernels for Tensor Cores and heuristics update for 1D
convolution resulting in performance improvements for speech
networks such as Jasper and
Tacotron2 and
WaveGlow, in addition to support for grouped 1D
conv (QuartzNet).
- Added support for 3D convolutions in NHWC, plus improved heuristics and kernels for Tensor Cores in NCHW, resulting in performance improvements for VNet, UNet-Medical, and UNet-Industrial. Additionally, FP16 3D convolutions are supported as well.
- Better utilization of Tensor Cores and heuristics for grouped
convolutions result in improvements for ResNext.
- More tuning for vision networks like ResNet-50 ([MXNet] [PyTorch] [TensorFlow]) and SSD
([PyTorch] [TensorFlow]) with new
updated heuristics.
- Operation fusion
-
Operation fusion can be achieved via the backend API. The general workflow is similar to running unfused operations, except that instead of creating a single-operation operation graph, the user may specify a multi-operation operation graph (sketched below). For more information, see Operation Fusion Via The Backend API in the cuDNN Developer Guide.
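The sketch below shows only the multi-operation step of that workflow: attaching two already-finalized backend operation descriptors to one operation graph. Creating and finalizing the individual operations is elided, so this is a fragment illustrating the attribute calls rather than a complete fusion example:

```cpp
// Sketch: build a multi-operation operation graph from two finalized
// backend operation descriptors (op1, op2). Creating and finalizing the
// individual operations is elided here.
#include <cudnn.h>

cudnnStatus_t makeFusedGraph(cudnnHandle_t handle,
                             cudnnBackendDescriptor_t op1,
                             cudnnBackendDescriptor_t op2,
                             cudnnBackendDescriptor_t* graphOut) {
    cudnnBackendDescriptor_t graph;
    cudnnStatus_t st =
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR,
                                     &graph);
    if (st != CUDNN_STATUS_SUCCESS) return st;

    cudnnBackendDescriptor_t ops[] = {op1, op2};   // the fused sequence
    cudnnBackendSetAttribute(graph, CUDNN_ATTR_OPERATIONGRAPH_OPS,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 2, ops);
    cudnnBackendSetAttribute(graph, CUDNN_ATTR_OPERATIONGRAPH_HANDLE,
                             CUDNN_TYPE_HANDLE, 1, &handle);

    st = cudnnBackendFinalize(graph);              // validates the fusion
    if (st != CUDNN_STATUS_SUCCESS) {
        cudnnBackendDestroyDescriptor(graph);
        return st;
    }
    *graphOut = graph;
    return CUDNN_STATUS_SUCCESS;
}
```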
- Depthwise convolution extension
-
We've extended the fprop and dgrad NHWC depthwise kernels to support more filter size/stride combinations, such as 5x5/1x1, 5x5/2x2, 7x7/1x1, and 7x7/2x2 (in addition to the existing 1x1/1x1, 3x3/1x1, and 3x3/2x2), all of which provide good performance. A sketch of the descriptor setup follows.
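These kernels are reached through the grouped-convolution path. The sketch below assumes the usual recipe of setting the group count equal to the channel count; the shapes are illustrative only:

```cpp
// Sketch: express a depthwise convolution as a grouped convolution by
// setting the group count equal to the channel count (C). With C = 32,
// a 5x5 filter and stride 2x2 would map to the extended NHWC kernels.
#include <cudnn.h>
#include <cstdio>

int main() {
    const int C = 32;  // channels == groups for a depthwise convolution

    cudnnConvolutionDescriptor_t conv;
    cudnnCreateConvolutionDescriptor(&conv);
    cudnnSetConvolution2dDescriptor(conv, /*pad*/2, 2, /*stride*/2, 2,
                                    /*dilation*/1, 1, CUDNN_CROSS_CORRELATION,
                                    CUDNN_DATA_FLOAT);
    cudnnSetConvolutionGroupCount(conv, C);  // one group per channel

    // Depthwise filter: one 5x5 filter per channel, NHWC layout.
    cudnnFilterDescriptor_t filt;
    cudnnCreateFilterDescriptor(&filt);
    cudnnSetFilter4dDescriptor(filt, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NHWC,
                               /*k*/C, /*c per group*/1, /*h*/5, /*w*/5);

    cudnnDestroyFilterDescriptor(filt);
    cudnnDestroyConvolutionDescriptor(conv);
    std::printf("depthwise conv configured as grouped conv\n");
    return 0;
}
```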
Compatibility
For the latest compatibility software versions of the OS, CUDA, the
CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.0.0.
Limitations
-
Samples must be installed in a writable location; otherwise, the samples can crash.
-
RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN 8.0.0 library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section in the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by the two buffer sizes (16 KB and 4 MB) used in the default configuration.
When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.
The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, i.e., there are two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups. (A sketch of setting the variable from the application follows.)
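The variable can be exported in the shell before launching the application or, as in the sketch below, set from the process itself before cuBLAS is first initialized (directly or through cuDNN RNN and multi-head attention calls):

```cpp
// Sketch: pin cuBLAS to a single workspace buffer size for deterministic
// multi-stream behavior. Must happen before cuBLAS is first initialized
// (directly or through cuDNN RNN / multi-head attention calls).
#include <cudnn.h>
#include <cstdlib>

int main() {
    // Eight 16 KB buffers; ":4096:2" (two 4 MB buffers) also works.
    setenv("CUBLAS_WORKSPACE_CONFIG", ":16:8", /*overwrite=*/1);

    cudnnHandle_t handle;
    cudnnCreate(&handle);
    // ... RNN / multi-head attention work is now reproducible across runs ...
    cudnnDestroy(handle);
    return 0;
}
```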
-
Some data types are not supported by all cuDNN APIs. For example, CUDNN_DATA_INT8x4 is not supported by many functions. In such cases, support is available by using cudnnTransformTensor() to transform the tensors from the desired type to a type supported by the API. For example, a user is able to transform input tensors from CUDNN_DATA_INT8x4 to CUDNN_DATA_INT8, run the desired API, and then transform output tensors from CUDNN_DATA_INT8 back to CUDNN_DATA_INT8x4 (a sketch follows). Note that this transformation will incur an extra round trip to memory.
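A sketch of the input-side transformation is shown below; the shapes are illustrative, and C must be a multiple of 4 for the vectorized INT8x4 layout:

```cpp
// Sketch: widen a CUDNN_DATA_INT8x4 tensor (NCHW_VECT_C layout) to plain
// CUDNN_DATA_INT8 (NCHW) with cudnnTransformTensor() so it can feed APIs
// without INT8x4 support. The reverse transform works the same way.
#include <cudnn.h>
#include <cuda_runtime.h>

int main() {
    const int N = 1, C = 32, H = 16, W = 16;  // C must be a multiple of 4

    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t vecDesc, int8Desc;
    cudnnCreateTensorDescriptor(&vecDesc);
    cudnnCreateTensorDescriptor(&int8Desc);
    cudnnSetTensor4dDescriptor(vecDesc, CUDNN_TENSOR_NCHW_VECT_C,
                               CUDNN_DATA_INT8x4, N, C, H, W);
    cudnnSetTensor4dDescriptor(int8Desc, CUDNN_TENSOR_NCHW,
                               CUDNN_DATA_INT8, N, C, H, W);

    void *src, *dst;
    cudaMalloc(&src, N * C * H * W);  // 1 byte per element in both layouts
    cudaMalloc(&dst, N * C * H * W);

    float alpha = 1.0f, beta = 0.0f;
    cudnnTransformTensor(handle, &alpha, vecDesc, src, &beta, int8Desc, dst);
    // ... run the INT8-only API on dst, then transform the result back ...

    cudaFree(src); cudaFree(dst);
    cudnnDestroyTensorDescriptor(vecDesc);
    cudnnDestroyTensorDescriptor(int8Desc);
    cudnnDestroy(handle);
    return 0;
}
```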
-
The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including for INT8 data, in the cuDNN library.
-
Some computational options in cuDNN 8.0.0 now require increased alignment on tensors in order to run performantly. As always, cuDNN recommends that users align tensors to 128-bit boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.0 compared to cuDNN v7.6.
-
For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.0 users can target the backend API to query the numerical notes of the algorithms and get this information programmatically (a sketch follows). There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
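The sketch below illustrates the programmatic query on a finalized backend engine descriptor; the helper function is hypothetical, but the attribute and note names are from the backend API:

```cpp
// Sketch: query the numerical notes of a finalized backend engine
// descriptor and test for reduced-precision accumulation.
#include <cudnn.h>
#include <cstdint>

bool usesReducedPrecision(cudnnBackendDescriptor_t engine) {
    cudnnBackendNumericalNote_t notes[CUDNN_NUMERICAL_NOTE_TYPE_COUNT];
    int64_t count = 0;
    cudnnBackendGetAttribute(engine, CUDNN_ATTR_ENGINE_NUMERICAL_NOTE,
                             CUDNN_TYPE_NUMERICAL_NOTE,
                             CUDNN_NUMERICAL_NOTE_TYPE_COUNT, &count, notes);
    for (int64_t i = 0; i < count; ++i) {
        if (notes[i] == CUDNN_NUMERICAL_NOTE_REDUCED_PRECISION_REDUCTION)
            return true;
    }
    return false;
}
```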
Deprecated Features
The following features are deprecated in
cuDNN 8.0.0:
-
Support for Ubuntu 14.04 has been deprecated in this release. Upgrade to
16.04 or 18.04 for continued support.
-
Support for Mac OS X has been deprecated in this release. Operating
systems that are currently supported are Linux and Windows.
-
cuDNN version 8 introduces a new API deprecation policy to
enable a faster pace of innovation. A streamlined, two-step, deprecation
policy will be used for all API changes starting with cuDNN version 8. For details about this new deprecation policy, see Backward Compatibility And
Deprecation Policy in the cuDNN Developer
Guide.
-
Several APIs have been removed or deprecated. For a list of removed and deprecated APIs, see API Changes For cuDNN 8.0.0.
Fixed Issues
The following issues have been fixed in this release:
-
cudnnDestroy() did not destroy everything that cudnnCreate() created; calling cudnnDestroy() after cudnnCreate() leaked about 1.6 MB of host memory in some tests. This issue has been fixed in cuDNN 8.0.0.
-
Starting in cuDNN 7.6.1, when using the experimental
multi-head attention API, it is possible that the forward and backward paths
produce different results for the BERT model, when the batch size is greater
than one and/or the number of heads is greater than one. This issue has been
fixed in cuDNN 8.0.0.
-
The description of cudnnSetCTCLossDescriptorEx() is not
clear. This issue has been fixed in cuDNN 8.0.0.
-
Documentation affecting 1x1 convolution functions is not clear, for example
cudnnFindConvolutionBackwardDataAlgorithm(). This issue
has been fixed in cuDNN 8.0.0.
-
cuDNN forward convolution with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM does not propagate NaNs in weights. This issue has been fixed in cuDNN 8.0.0.
-
Mathematical definitions of cuDNN operations were previously undocumented. Full mathematical descriptions are now included for the convolution functions.
-
The functions cudnnGetConvolutionForwardAlgorithm_v7() and
cudnnGetConvolutionForwardWorkspaceSize() may return
CUDNN_STATUS_SUCCESS while the execution of the same
convolution returns CUDNN_STATUS_NOT_SUPPORTED. Similar
issues may also happen for convolutionBackwardData() and
convolutionBackwardFilter(). This issue is present in the cuDNN 7.2.2 library and later versions. This has been fixed in cuDNN 8.0.0.
-
Algorithms returned by cudnnGetConvolution*Algorithm() may,
in some limited use cases, fail to execute when they are actually run. This
is a cuDNN library-wide issue and applies for convolution
forward, convolution backward data, and convolution backward filter
operations. This issue is also present in versions prior to cuDNN 8.0.0 EA.
-
Previously, cuDNN did not support CUDA graphs: when launching a CUDA graph constructed via a stream capture that included a cudnnConvolutionForward() operation, users could see a cudaErrorLaunchFailure error. cuDNN 8.0.0 adds support for CUDA graph capture, so the user can proceed.
-
There was a known performance drop in 3D convolutions for some cases on
Turing GPUs since cuDNN 7.4.2. This has been fixed on T4. (not applicable
for Jetson platforms)
-
There were rare cases where cudnnConvolution* would return CUDNN_STATUS_NOT_SUPPORTED even though cudnn*GetWorkspaceSize returned success for the given algorithm. This has been fixed in cuDNN 8.0.0.
-
In previous versions of cuDNN,
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM did
not propagate NaN values in some cases. This is fixed in the current
release. Users desiring the old behavior can configure ReLU activation and
set the floor to be -Inf.
-
The multiHeadAttention sample code was added to the cuDNN
7.6.3 release. The sample code includes a simple NumPy/Autograd reference
model of the multi-head attention block that computes the forward response
and all derivatives. The test code demonstrates how to use the multi-head attention API and how to access attention weights and sequence data.
Known Issues
-
Performance regressions on V100 are observed in this release on SSD
inference use cases if not using TensorRT.
-
There are significant performance regressions on pre-Volta GPUs and some
Turing GPUs based on the TU102 architecture. This performance regression
is not applicable to T4, JetPack, and Tegra.
-
Sub-optimal performance is present in this release for all INT8
convolutions for all GPUs.
-
The performance of cudnnConvolutionBiasActivationForward() is slower than in v7.6 in most cases. This is being actively worked on, and performance optimizations will be available in upcoming releases.
-
On K80 GPUs, when cudnnConvolutionForward() is used with the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types, a silent error might occur.
-
There are some peer-to-peer documentation links that are broken within
the cuDNN API Reference.
These links will be fixed in the next release.
-
cudnnCnnTrainVersionCheck() and
cudnnCnnInferVersionCheck() are missing in this
release and will be added in the GA release.
-
Documentation of RNN new APIs and deprecations is not complete. The
cudnnRNNBackwardData_v8() and
cudnnRNNBackwardWeights_v8() functions will be
implemented in the next release.
-
cuDNN 8.0.0 Preview will not work with GA10x NVIDIA Ampere GPU
architectures. This will be fixed in the next release.
-
cuDNN 8.0.0 Preview build with Windows and CUDA 11.0 RC has reduced
performance on 2D, 3D, and grouped convolutions compared to Linux.
- Updated: June 12, 2020
There is a known issue in cuDNN 8.0.0 when
linking statically to cuDNN and using the library's 3D algo1 backward
filter convolutions. Users will see the library emit an internal error
or incorrectly state that a shared library is missing. This is a bug
that will be fixed in a future release.
- Updated: June 25, 2020
When using an RPM file for installation on Red Hat, installing cuDNN v8 directly or via TensorRT 7.1.3 will enable users to build their application with cuDNN v8. However, in order to compile an application with cuDNN v7 after cuDNN v8 is installed, the user will need to perform the following steps:
- Issue sudo mv /usr/include/cudnn.h
/usr/include/cudnn_v8.h.
- Issue sudo ln -s /etc/alternatives/libcudnn
/usr/include/cudnn.h.
- Switch to cuDNN v7 by issuing sudo update-alternatives
--config libcudnn and choose cuDNN v7 from the
list.
Steps 1 and 2 are required for the user to be able to switch
between v7 and v8 installations. After steps 1 and 2 are performed once,
step 3 can be used repeatedly and the user can choose the appropriate
cuDNN version to work with. For more information, refer to the Installing From An RPM
File and Upgrading From v7 To v8
sections in the cuDNN Installation Guide.