These are the release notes for cuDNN 8.0.2, the first GA release of cuDNN 8.x. This
release includes fixes from the previous cuDNN 8.0.x releases as well as the following
additional changes. These release notes apply to both cuDNN and JetPack users of cuDNN
unless specifically appended with (not applicable for Jetson platforms).
Key Features and Enhancements
- cuDNN 8.0.1 Preview and 8.0.0 Preview
-
The key features mentioned in cuDNN 8.0.1 Preview and
8.0.0 Preview are
now GA quality in this release.
- Added new API functions to the documentation
-
cudnnRNNBackwardData_v8() and
cudnnRNNBackwardWeights_v8() are now documented
in the cudnn_adv_train.so Library. For a list of functions and data
types that were added in this release, see API Changes For cuDNN
8.0.2.
- TF32 performance
-
- TF32 performance for 3D convolutions and deconvolutions is
significantly better, up to 3.9x, compared to cuDNN
8.0.1.
- TF32 performance for grouped convolutions on A100 improved by up to
1.5x compared to cuDNN 8.0.1 on ResNeXt
convolution layers, and by up to 3x compared to
V100 with cuDNN v7.6. (not applicable for Jetson
platforms)
The above performance improvements were measured using only cuDNN
operations. The observed performance improvements will depend on a
number of factors, such as non-cuDNN operations, kernel run time,
and model architecture type.
- Performance improvements
-
This release includes performance improvements on all architectures
for 2D and 3D grouped convolutions compared with version 7.6.
Additionally, we improved kernel selection heuristics on several
known Deep Learning GitHub Examples
(also known as model scripts).
Compatibility
For the latest compatibility software versions of the OS, CUDA, the
CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.x.x.
Limitations
-
Samples must be installed in a writable location; otherwise, they
may crash.
-
RNN and multi-head attention API calls may exhibit non-deterministic
behavior when the cuDNN 8.0.2 library is built with CUDA Toolkit 10.2 or
higher. This is the result of new buffer management and heuristics in
the cuBLAS library. As described in the Results Reproducibility
section of the cuBLAS Library User Guide, numerical results may
not be deterministic when cuBLAS APIs are launched in more than one CUDA
stream via the same cuBLAS handle. This is caused by the two buffer sizes
(16 KB and 4 MB) used in the default configuration.
When a larger buffer size is not available at runtime, instead of waiting
for a buffer of that size to be released, a smaller buffer may be used
with a different GPU kernel. The kernel selection may affect numerical
results. The user can eliminate the non-deterministic behavior of the cuDNN
RNN and multi-head attention APIs by setting a single buffer size in
the CUBLAS_WORKSPACE_CONFIG environment variable, for
example, :16:8 or :4096:2.
The first configuration instructs cuBLAS to allocate eight buffers of 16
KB each in GPU memory while the second setting creates two buffers of 4
MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is
:16:8:4096:2, i.e., two buffer sizes are available. Earlier
cuBLAS libraries, such as cuBLAS 10.0, used a
non-adjustable :16:8 configuration. When buffers of
only one size are available, the behavior of cuBLAS calls is
deterministic in multi-stream setups.
-
Some data types are not widely supported by all cuDNN APIs. For example,
CUDNN_DATA_INT8x4 is not supported by many
functions. In such cases, support is available by using cudnnTransformTensor()
to transform the tensors from the desired type to a type supported by
the API. For example, a user can transform input tensors from
CUDNN_DATA_INT8x4 to
CUDNN_DATA_INT8, run the desired API, and then
transform output tensors from CUDNN_DATA_INT8 back to
CUDNN_DATA_INT8x4. Note that this transformation
will incur an extra round trip to memory.
-
Tensor pointers and filter pointers require a minimum of 4-byte
alignment in the cuDNN library, including for INT8 data.
-
Some computational options in cuDNN 8.0.2 now require increased alignment
on tensors in order to run performantly. As always, cuDNN recommends
that users align tensors to 128-bit boundaries, which is sufficient
for any computational option in cuDNN. Doing otherwise may cause
performance regressions in cuDNN 8.0.2 compared to cuDNN v7.6.
-
For the _ALGO_0 algorithm of convolution backward data
and backward filter, grouped convolution with groups larger than 1 and
with an odd product of the dimensions C, D
(if 3D convolution), H, and W is not
supported on devices older than Volta. To prevent a potential illegal
memory access by an instruction that only has a 16-bit version in Volta
and above, pad at least one of the dimensions to an even value.
-
On K80 GPUs when cudnnConvolutionForward() is used with
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
algorithm and half input/output data types, a silent error might occur
when the output width Q is 1 and both
height and width padding are zero.
-
Several cuDNN APIs are unable to directly support computations using
integer types (
CUDNN_DATA_INT8,
CUDNN_DATA_INT8x4,
CUDNN_DATA_INT8x32 or
CUDNN_DATA_INT32). Floating types (particularly
CUDNN_DATA_FLOAT) are much more widely supported.
If an API does not support the desired type,
cudnnTransformTensor() can be used to support the
use case by converting between the desired type and a supported type.
Here are the steps for doing so:
- Convert all input tensors from their native type to a supported
type (CUDNN_DATA_FLOAT is recommended).
- Run the cuDNN API using the converted input tensors and output
tensor descriptors set as
CUDNN_DATA_FLOAT.
- Convert all output tensors from a supported type to your desired
output type.
Note: This will require extra memory use for the temporary buffers.
Further, this will introduce an additional round trip to memory
which might noticeably impact performance.
-
In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are
limited to W >= (R-1) * dilationW && H >= (S-1) *
dilationH, whereas cuDNN v8.0.x no longer supports the
boundary cases W == (R-1) * dilationW || H == (S-1) *
dilationH.
-
In prior versions of cuDNN, some convolution algorithms could use texture-based load
instructions for performance improvements, particularly on older hardware
architectures. Users could opt out of texture loading with the
environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x,
this variable is removed and texture loading is turned off by default.
Users who wish to continue using texture-based loads can adopt the new
backend API and set the engine knob
CUDNN_KNOB_TYPE_USE_TEX to 1 for
engines that support texture-based load instructions.
Fixed Issues
The following issues have been fixed in this release:
-
The implementation of cudnnLRNCrossChannelBackward() for even-sized
normalization windows was incorrect in all previous releases. This issue
has been fixed in this release.
-
There isn't a dedicated API to query the supported or the most performant
algo for cudnnConvolutionBiasActivationForward() in
cuDNN, and querying it via
cudnnGetConvolutionForwardAlgorithm_v7() is not recommended. Instead, we
recommend using the cuDNN version 8 backend API. The number of supported
engines can be queried using the attribute
CUDNN_ATTR_OPERATIONGRAPH_ENGINE_GLOBAL_COUNT from
an operation graph descriptor via
cudnnBackendGetAttribute().
-
A memcheck error may have occurred on cuDNN version 7.x
builds when calling cudnnConvolutionBackwardFilter()
on Volta or Turing GPUs. This issue has been fixed in this release.
-
Various convolutions which exhibited sub-optimal performance on GA100 GPU
are now achieving ideal performance. (not applicable for Jetson
platforms)
-
cudnnCnnTrainVersionCheck() and
cudnnCnnInferVersionCheck() were missing in past
releases. This issue has been fixed in this release.
-
Documentation of the new RNN APIs and deprecations was incomplete.
cudnnRNNBackwardData_v8() and
cudnnRNNBackwardWeights_v8() have been documented in
this release.
-
cuDNN 8.0.1 built with Windows and CUDA 11.0 RC had reduced performance
on 2D, 3D, and grouped convolutions compared to Linux. This issue has
been fixed in this release. (not applicable for Jetson
platforms)
-
There was a known issue in cuDNN 8.0.1 when linking statically to cuDNN
and using the library's 3D algo1 backward filter convolutions. Users
would see the library emit an internal error or incorrectly state that a
shared library was missing. This issue has been fixed in this
release.
-
When using an RPM file on RedHat for installation, upgrading from cuDNN
v7 to cuDNN v8 directly or indirectly via TensorRT 7.1.3 would cause
installation errors. This issue has been fixed in this release.
-
The implementation of cudnnLRNCrossChannelBackward was
inconsistent with the implementation of
cudnnLRNCrossChannelForward and returned incorrect
results when the normalization window was even. This issue has been
fixed in this release.
-
RNN APIs in cuDNN v8.0.1, compiled with CUDA 11.0, used an incorrect
default down-conversion on GPUs with CUDA SM version SM80 (NVIDIA Ampere
GPU family) when supplied input data and weights have the
CUDNN_DATA_FLOAT type and
cudnnMathType_t set via
cudnnSetRNNMatrixMathType() is
CUDNN_DEFAULT_MATH or
CUDNN_TENSOR_OP_MATH. Instead of using the default
TF32 computation when Tensor Cores are used, a down-conversion to FP16
(half-precision) was performed, the same as in the
CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION mode. This
introduced a lower dynamic range of intermediate data but possibly
faster execution. To disable the automatic down-conversion of
CUDNN_DATA_FLOAT weights and data in RNN APIs, the
user needed to set the environment variable
NVIDIA_TF32_OVERRIDE to 0 (notice
this would have disabled the use of TF32 in the entire library, which
might have a performance impact on CNNs that are not affected by this
issue). Another workaround was to assign the
CUDNN_FMA_MATH mode to the
cudnnMathType_t argument in
cudnnSetRNNMatrixMathType(). Due to this, the A100
GPU TF32 feature was not accessible for RNNs in cuDNN v8.0.1. This issue
has been fixed in this release. (not applicable for Jetson
platforms)
-
cuDNN convolution APIs may return
CUDNN_STATUS_EXECUTION_FAILED when the number of
input or output channels equals or exceeds 2097152. This issue has
been fixed in this release.
-
Since version 8.0.0 Preview, cudnnConvolutionForward(),
cudnnConvolutionBackwardData(), and
cudnnConvolutionBackwardFilter() erroneously
returned CUDNN_STATUS_INTERNAL_ERROR when the workspace
size argument value was less than the required workspace size as
returned by their respective cudnnGetWorkspace() API.
This issue has been fixed and CUDNN_STATUS_BAD_PARAM
is returned as documented.
Known Issues
-
The performance of cudnnConvolutionBiasActivationForward() is
still lower than in version 7.6 for true-half use cases on Pascal and for
INT8x4 use cases on Volta and Turing. In addition, FP32 and
pseudo-FP16 performance on Volta, Turing, and the NVIDIA Ampere GPU
architecture is still not fully optimized.
-
The new RNN APIs: cudnnRNNForward(),
cudnnRNNBackwardData_v8(), and
cudnnRNNBackwardWeights_v8() are available as a
preview in the cuDNN 8.0.2 release.
-
Occasionally, inaccurate results were observed in outputs of the
cudnnRNNBackwardWeights() and
cudnnRNNBackwardWeightsEx() functions when the RNN
cell type was GRU and the NVIDIA Ampere GPU architecture was used with
FP32 I/O and mathType of
CUDNN_DEFAULT_MATH or
CUDNN_TENSOR_OP_MATH. Users may switch to
CUDNN_FMA_MATH as a temporary workaround. This
issue is being investigated.
-
cudnnRNN*() with LSTM mode may produce inaccurate
results on the cy outputs when clipping is enabled. This
occurs on all GPUs and exists in previous cuDNN releases as well.
-
On Volta and Pascal architectures, performance regressions may be present for various
TRUE_HALF convolutions.
-
When using cudnnRNN*Ex() APIs, if the user uses
CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED or
CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED as the
layout of the RNN data descriptors, and if the batch size is larger than
6144 on Volta or NVIDIA Ampere A100 GPUs, or larger than 4096 on Turing
GPUs, CUDNN_STATUS_EXECUTION_FAILED may be
returned.
-
Currently, there are
libcudnn_ops/cnn/adv_infer/train_static.a
binaries in the cuDNN Debian and tgz packages. Users are advised not to
link against those; link against libcudnn_static.a
instead. Those binaries will be removed from the release packages in the
next release.
-
When using cudnnRNN*Ex() APIs, if the user plans to use
CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED or
CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED as the
layout of the RNN data descriptors, the user should call
cudnnSetRNNPaddingMode() to set the mode to
CUDNN_RNN_PADDED_IO_ENABLED after initializing an
RNNDescriptor but before calling
cudnnGetRNNWorkspaceSize(). Not doing this may
result in CUDNN_STATUS_EXECUTION_FAILED.