Attention: This is the cuDNN 8.0.1 Preview release. This Preview release
is for early testing and feedback; for production use of cuDNN,
continue to use cuDNN 7.6.5. This release is
subject to change based on ongoing performance tuning and functional testing.
For feedback on the new backend API and deprecations, email
cudnn@nvidia.com.
These release notes are applicable to JetPack users of cuDNN unless appended specifically with
(not applicable for Jetson platforms).
For previous cuDNN documentation, see the cuDNN Archived Documentation.
Key Features and Enhancements
- Added new kernels to improve the performance of fusion.
Compatibility
For the latest compatible software versions of the OS, CUDA, the
CUDA driver, and NVIDIA hardware, see the cuDNN Support Matrix for 8.0.1.
Limitations
-
Samples must be installed in a writable location; otherwise, they
can crash.
-
RNN and multi-head attention API calls may exhibit non-deterministic
behavior when the cuDNN 8.0.1 library is built with CUDA Toolkit 10.2 or
higher. This is the result of new buffer management and heuristics in
the cuBLAS library. As described in the Results Reproducibility
section of the cuBLAS Library User Guide, numerical results may
not be deterministic when cuBLAS APIs are launched in more than one CUDA
stream via the same cuBLAS handle. This is caused by the two buffer sizes
(16 KB and 4 MB) used in the default configuration.
When a buffer of the larger size is not available at runtime, instead of
waiting for a buffer of that size to be released, cuBLAS may use a smaller
buffer with a different GPU kernel. The kernel selection may affect numerical
results. Users can eliminate the non-deterministic behavior of the cuDNN
RNN and multi-head attention APIs by setting a single buffer size in
the CUBLAS_WORKSPACE_CONFIG environment variable, for
example, :16:8 or :4096:2.
The first configuration instructs cuBLAS to allocate eight buffers of 16
KB each in GPU memory, while the second creates two buffers of 4
MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is
:16:8:4096:2, that is, two buffer sizes are available. Earlier
cuBLAS libraries, such as cuBLAS 10.0, used the fixed
:16:8 configuration. When buffers of
only one size are available, the behavior of cuBLAS calls is
deterministic in multi-stream setups.
-
Some data types are not widely supported by every cuDNN API. For example,
CUDNN_DATA_INT8x4 is not supported by many
functions. In such cases, support is available by using cudnnTransformTensor()
to transform the tensors from the desired type to a type supported by
the API. For example, a user can transform input tensors from
CUDNN_DATA_INT8x4 to
CUDNN_DATA_INT8, run the desired API, and then
transform the output tensors from CUDNN_DATA_INT8 back to
CUDNN_DATA_INT8x4. Note that this transformation
incurs an extra round trip to memory.
-
The tensor pointers and the filter pointers require a minimum of 4-byte
alignment, including for INT8 data, in the cuDNN library.
-
Some computational options in cuDNN 8.0.1 now require increased tensor
alignment in order to run performantly. As always, cuDNN recommends
aligning tensors to 128-bit boundaries, which is sufficient for any
computational option in cuDNN. Doing otherwise may cause
performance regressions in cuDNN 8.0.1 compared to cuDNN v7.6.
-
For certain algorithms, when the computation is in float (32-bit float)
and the output is in FP16 (half precision), numerical accuracy might
differ between algorithms. cuDNN 8.0.1 users can query the numerical
notes of the algorithms programmatically through the backend API. When
targeting the legacy API, there are cases where algo0 and algo1 use
reduced-precision accumulation. In all cases, these numerical differences
are not known to affect training accuracy, even though they might show up
in unit tests.
-
For the _ALGO_0 algorithm of convolution backward data
and backward filter, grouped convolution with groups larger than 1 and
with an odd product of the dimensions C, D
(if 3D convolution), H, and W is not
supported on devices older than Volta. To prevent a potential illegal
memory access by an instruction that only has a 16-bit version on Volta
and later, pad at least one of the dimensions to an even value.
-
On K80 GPUs, when cudnnConvolutionForward() is used with the
CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
algorithm and half input/output data types, a silent error might occur
when the output width Q is 1 and both the
height and width padding are zero.
Fixed Issues
The following issues have been fixed in this release:
-
The dimA and strideA parameters in cudnnSetTensorNdDescriptor() did not document the tensor
layout. The documentation has been updated to include this information.
-
cuDNN 8.0.0 Preview did not work with GA10x NVIDIA Ampere GPU architectures.
This has been fixed in 8.0.1 Preview.
-
cuDNN 8.0.0 Preview removed a restriction on convolution backward filter for
output filters with an odd product of dimensions (N*C*D*H*W)
for an algo0 kernel on pre-Volta GPUs, which could
potentially lead to an illegal memory access error. This restriction is
restored in cuDNN 8.0.1 Preview, and cuDNN now uses a kernel that does not
have this restriction for this computation case.
-
Fixed performance issues on pre-Volta architectures for convolutions (except
when the compute type is half).
- Mitigated the performance regression to less than 10% end-to-end.
Known Issues
-
On pre-Volta architectures, there are significant performance issues in
convolution layers when the compute type is half.
-
Sub-optimal performance is present in this release for all INT8
convolutions for all GPUs.
-
The performance of cudnnConvolutionBiasActivationForward() is slower
than in v7.6 in most cases. This is being actively worked on, and
performance optimizations will be available in upcoming
releases.
-
Some peer-to-peer documentation links are broken within
the cuDNN API Reference.
These links will be fixed in the next release.
-
cudnnCnnTrainVersionCheck() and
cudnnCnnInferVersionCheck() are missing in this
release and will be added in the GA release.
-
Documentation of the new RNN APIs and deprecations is not complete. The
cudnnRNNBackwardData_v8() and
cudnnRNNBackwardWeights_v8() functions will be
implemented in the next release.
-
The cuDNN 8.0.1 Preview build for Windows with CUDA 11.0 RC has reduced
performance for 2D, 3D, and grouped convolutions compared to the Linux build.
-
There is a known issue in cuDNN 8.0.1 when linking statically to cuDNN
and using the library's 3D algo1 backward filter convolutions: the
library may emit an internal error or incorrectly state that a
shared library is missing. This is a bug that will be fixed in a future
release.
-
When using an RPM file on RedHat for installation, installing cuDNN v8
directly or via TensorRT 7.1.3 enables users to build their
application with cuDNN v8. However, in order to compile an
application with cuDNN v7 after cuDNN v8 is installed, the user will
need to perform the following steps:
1. Issue sudo mv /usr/include/cudnn.h
/usr/include/cudnn_v8.h.
2. Issue sudo ln -s /etc/alternatives/libcudnn
/usr/include/cudnn.h.
3. Switch to cuDNN v7 by issuing sudo update-alternatives
--config libcudnn and choosing cuDNN v7 from the
list.
Steps 1 and 2 are required to be able to switch between the v7
and v8 installations. After steps 1 and 2 are performed once, step 3 can
be used repeatedly to choose the appropriate cuDNN version
to work with. For more information, refer to the Installing From An RPM
File and Upgrading From v7 To v8
sections in the cuDNN Installation Guide.
-
When an FFT tiling algorithm (that is,
CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING in forward
convolution or
CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING for
backward data) is used for 3D convolution, an intermittent silent
failure might happen due to an incorrect stream being used for kernel
execution. In some cases, this manifests as undefined values
in the output.
-
The implementation of cudnnLRNCrossChannelBackward is
inconsistent with the implementation of
cudnnLRNCrossChannelForward and returns incorrect
results when the normalization window is even. This will be fixed in a
future release.
-
RNN APIs in cuDNN v8.0.1, compiled with CUDA 11.0, use an incorrect default down-conversion
on GPUs with CUDA SM version SM80 (NVIDIA Ampere GPU family) when the
supplied input data and weights have the
CUDNN_DATA_FLOAT type and the
cudnnMathType_t set via
cudnnSetRNNMatrixMathType() is
CUDNN_DEFAULT_MATH or
CUDNN_TENSOR_OP_MATH. Instead of using the default
TF32 computation when Tensor Cores are used, a down-conversion to FP16
(half precision) is performed, the same as in the
CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION mode. This
introduces a lower dynamic range for intermediate data but possibly
faster execution. To disable the automatic down-conversion of
CUDNN_DATA_FLOAT weights and data in RNN APIs, set
the environment variable NVIDIA_TF32_OVERRIDE to
0 (note that this disables the use of TF32 in the
entire library, which might have a performance impact on CNNs that are
not affected by this issue). Another workaround is to assign the
CUDNN_FMA_MATH mode to the
cudnnMathType_t argument in
cudnnSetRNNMatrixMathType(). Due to this issue, the A100
TF32 feature is not accessible for RNNs in cuDNN v8.0.1.
-
Several cuDNN APIs are unable to directly support computations using
integer types (
CUDNN_DATA_INT8,
CUDNN_DATA_INT8x4,
CUDNN_DATA_INT8x32, or
CUDNN_DATA_INT32). Floating-point types (particularly
CUDNN_DATA_FLOAT) are much more widely supported.
If an API does not support the desired type,
cudnnTransformTensor() can be used to convert between
the desired type and a supported type.
Here are the steps for doing so:
- Convert all input tensors from their native type to a supported
type (CUDNN_DATA_FLOAT is recommended).
- Run the cuDNN API using the converted input tensors, with the output
tensor descriptors set to
CUDNN_DATA_FLOAT.
- Convert all output tensors from the supported type to the desired
output type.
Note: This requires extra memory for the temporary buffers, and the
additional round trip to memory might noticeably impact performance.
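The steps above can be sketched with cudnnTransformTensor(). This fragment is illustrative only: handle, the tensor descriptors (int8Desc, floatDesc, floatOutDesc, int8OutDesc), and the device buffers (d_*) are assumed to have been created elsewhere, and cudnnSomeOp is a hypothetical stand-in for whichever API lacks integer support:

```c
float alpha = 1.0f, beta = 0.0f;

/* Step 1: convert the integer input tensor to CUDNN_DATA_FLOAT. */
cudnnTransformTensor(handle, &alpha, int8Desc, d_int8In,
                     &beta, floatDesc, d_floatIn);

/* Step 2: run the desired API entirely in CUDNN_DATA_FLOAT,
   with the output descriptor also set to CUDNN_DATA_FLOAT. */
/* cudnnSomeOp(handle, ..., floatDesc, d_floatIn,
               ..., floatOutDesc, d_floatOut); */

/* Step 3: convert the float output back to the desired integer type. */
cudnnTransformTensor(handle, &alpha, floatOutDesc, d_floatOut,
                     &beta, int8OutDesc, d_int8Out);
```

In a real application, each call's cudnnStatus_t return value should be checked before proceeding.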