These are the cuDNN 7.6.1 release notes. This release includes fixes from the previous
cuDNN v7.x.x releases as well as the following additional changes.
Key Features and Enhancements
The following features and enhancements have been added to this release:
- Performance is enhanced for 3D convolutions using Tensor Cores with FP16 input and output data
types, whenever they are supported. Moreover, for single-precision (FP32)
input/output, cuDNN 7.6.1 will use these enhanced kernels whenever possible, but
only when cudnnMathType_t is set to
CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION. See cudnnConvolutionForward(), cudnnConvolutionBackwardData(), and
cudnnConvolutionBackwardFilter().
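For FP32 input/output, the opt-in described above is a single descriptor setting. The fragment below is a minimal sketch only: it assumes the handle, tensor/filter/convolution descriptors, device pointers, algorithm, and workspace have all been created and configured elsewhere, and it omits error checking.

```c
#include <cudnn.h>

/* Sketch: opt an FP32 convolution into the enhanced Tensor Core kernels.
 * All descriptors, device pointers, and the workspace are assumed to be
 * set up elsewhere; status checks are omitted for brevity. */
void run_fp32_tensor_op_conv(cudnnHandle_t handle,
                             cudnnTensorDescriptor_t xDesc, const void *x,
                             cudnnFilterDescriptor_t wDesc, const void *w,
                             cudnnConvolutionDescriptor_t convDesc,
                             cudnnConvolutionFwdAlgo_t algo,
                             void *workspace, size_t workspaceSize,
                             cudnnTensorDescriptor_t yDesc, void *y)
{
    const float alpha = 1.0f, beta = 0.0f;

    /* Without this setting, FP32 input/output stays on the
     * non-Tensor-Core path. */
    cudnnSetConvolutionMathType(convDesc,
                                CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);

    cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, w, convDesc,
                            algo, workspace, workspaceSize,
                            &beta, yDesc, y);
}
```

The same cudnnSetConvolutionMathType() call applies to the backward-data and backward-filter paths via their shared convolution descriptor.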
- On Maxwell and Pascal architectures only, the performance of 3D convolutions with a kernel
size of 128^3 is enhanced when used with
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1.
- API logging is now fully implemented for
the experimental multi-head attention API functions.
- Performance of the experimental multihead attention forward API is enhanced. See cudnnMultiHeadAttnForward().
- Performance is enhanced for the fused convolution and fused wgrad fallback
path. See cudnnFusedOps_t.
Fixed Issues
The following issues have been
fixed in this release:
- In cuDNN 7.6.0, the function cudnnGetConvolutionBackwardDataWorkspaceSize() returns a workspace size for
which cudnnConvolutionBackwardData(), when
used with CUDNN_CONVOLUTION_BWD_DATA_ALGO_0, returns
CUDNN_STATUS_NOT_SUPPORTED. This is fixed in cuDNN 7.6.1, so
that cudnnGetConvolutionBackwardDataWorkspaceSize() now returns
a workspace size that cudnnConvolutionBackwardData() accepts.
- In cuDNN 7.6.0 and earlier versions, when all of the following conditions are true:
- the RNN model is bi-directional,
- the cell type is LSTM,
- cudnnRNNAlgo_t = CUDNN_RNN_ALGO_STANDARD, and
- the dropout probability is greater than zero,
then the cudnnRNNBackwardWeights() function produces inaccurate and occasionally non-deterministic results.
This is fixed in cuDNN 7.6.1.
The underlying cause was that the same buffer was used for the left-to-right and right-to-left directions
when re-computing forward dropout results passed from one RNN layer to the next.
- A bug in cuDNN 7.6.0 and earlier versions, in the cudnnRNNForwardTraining() function,
related to dropout, is fixed in cuDNN 7.6.1.
When all of the following
conditions are true:
- cudnnRNNAlgo_t = CUDNN_RNN_ALGO_PERSIST_STATIC,
- cudnnMathType_t is
CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION, and
- the input data type is CUDNN_DATA_FLOAT,
then an FP32-to-FP16 conversion might be applied as a performance optimization.
When this down conversion was scheduled, a GPU kernel invoked by cudnnDropoutForward() would crash due to incorrect parameters being passed to it. In this case the CUDA runtime reported a "misaligned address" error when
reading the data from global memory.
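For reference, the three conditions above correspond to the descriptor settings sketched below. The hidden size, layer count, cell type, and dropout descriptor are placeholder assumptions for illustration only; the three flagged settings are what mattered for the bug.

```c
#include <cudnn.h>

/* Sketch of the settings that together triggered the
 * cudnnRNNForwardTraining() dropout crash in cuDNN 7.6.0 (fixed in
 * 7.6.1). Hidden size, layer count, and cell type are placeholders;
 * the dropout descriptor is assumed to be configured elsewhere. */
void configure_affected_rnn(cudnnHandle_t handle,
                            cudnnRNNDescriptor_t rnnDesc,
                            cudnnDropoutDescriptor_t dropoutDesc)
{
    /* Conditions 1 and 3: persistent static algorithm, FP32 data. */
    cudnnSetRNNDescriptor(handle, rnnDesc,
                          /*hiddenSize=*/512, /*numLayers=*/2,
                          dropoutDesc,
                          CUDNN_LINEAR_INPUT, CUDNN_UNIDIRECTIONAL,
                          CUDNN_LSTM,
                          CUDNN_RNN_ALGO_PERSIST_STATIC,
                          CUDNN_DATA_FLOAT);

    /* Condition 2: allow the FP32-to-FP16 down conversion. */
    cudnnSetRNNMatrixMathType(rnnDesc,
                              CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
}
```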
- In cuDNN 7.6.0, on RHEL7 only, the
/usr/src/cudnn_samples_v7/samples_common.mk file is missing, so a
workaround is required to compile the cuDNN samples. This is fixed in cuDNN 7.6.1 and
the workaround is no longer needed.
- In cuDNN 7.6.0, on pre-Volta hardware only, the function cudnnGetConvolutionBackwardFilterWorkspaceSize() can erroneously
return CUDNN_STATUS_SUCCESS for cudnnConvolutionBackwardFilter() for
3D convolutions, using CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 with
NDHWC layout. When this occurs, the
cudnnConvolutionBackwardFilter() function will process the
data using a kernel that expects the data in NCDHW layout (the only format
supported by wDesc in this case), leading to incorrect results.
In cuDNN 7.6.1, this is fixed so that
cudnnGetConvolutionBackwardFilterWorkspaceSize() will now
return CUDNN_STATUS_NOT_SUPPORTED.
- In cuDNN 7.5.x and 7.6.0 on the Jetson platform, the function cudnnConvolutionBackwardData(), when
used with CUDNN_CONVOLUTION_BWD_DATA_ALGO_WINOGRAD, might
in some cases return incorrect results. This is fixed in cuDNN 7.6.1.
- When the data type configuration is FLOAT_CONFIG,
cudnnGetConvolution*Algorithm() incorrectly returns a slow algorithm
for a few convolution sizes on the Pascal
architecture. This is fixed in cuDNN 7.5.0 and later versions.
- When using the fusedOps API with the enum
CUDNN_FUSED_SCALE_BIAS_ACTIVATION_CONV_BNSTATS or
CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD, and the input
tensor is in NCHW format or is not fully packed, incorrect results may be
produced. This is now fixed in cuDNN 7.6.1.
Known Issues
The following issues and limitations exist in this release:
- Algorithms returned by cudnnGetConvolution*Algorithm()
may, in some limited use cases, fail when they are actually executed. This is a
cuDNN library-wide issue that applies to the convolution forward, convolution backward
data, and convolution backward filter operations. The issue is also present in
versions prior to cuDNN 7.6.1.
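One defensive pattern for this limitation is to retry with a basic algorithm when the heuristic's suggestion fails at execution time. The sketch below illustrates this for the forward path only; all descriptors, device pointers, and the workspace are assumed to be set up elsewhere, and falling back to CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM is this note's suggestion, not guidance from the release notes.

```c
#include <cudnn.h>

/* Sketch: run the heuristically chosen forward algorithm, and if it
 * fails at execution time, retry with implicit GEMM, which has the
 * broadest support. Descriptors, pointers, and workspace are assumed
 * to be configured elsewhere. */
cudnnStatus_t forward_with_fallback(cudnnHandle_t handle,
                                    const float *alpha, const float *beta,
                                    cudnnTensorDescriptor_t xDesc, const void *x,
                                    cudnnFilterDescriptor_t wDesc, const void *w,
                                    cudnnConvolutionDescriptor_t convDesc,
                                    void *workspace, size_t workspaceSize,
                                    cudnnTensorDescriptor_t yDesc, void *y)
{
    cudnnConvolutionFwdAlgo_t algo;
    cudnnGetConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                        CUDNN_CONVOLUTION_FWD_PREFER_FASTEST,
                                        /*memoryLimitInBytes=*/0, &algo);

    cudnnStatus_t status = cudnnConvolutionForward(
        handle, alpha, xDesc, x, wDesc, w, convDesc, algo,
        workspace, workspaceSize, beta, yDesc, y);

    if (status != CUDNN_STATUS_SUCCESS) {
        /* The suggested algorithm failed to execute; retry with the
         * most widely supported algorithm. */
        status = cudnnConvolutionForward(
            handle, alpha, xDesc, x, wDesc, w, convDesc,
            CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
            workspace, workspaceSize, beta, yDesc, y);
    }
    return status;
}
```

The same retry pattern can be applied to the backward-data and backward-filter operations with their respective *_ALGO_0/*_ALGO_1 enums.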
- When the input and output tensors are in NHWC format and the filter is 1x1 and in NCHW format, the
performance of the function cudnnConvolutionBackwardData() might
be degraded.
- In cuDNN 7.6.1, when using the experimental multi-head attention API, it is possible
that the forward and backward paths produce different results for the BERT model,
when the batch size is greater than one and/or the number of heads is greater than
one.
- In cuDNN 7.6.1, on Volta architecture only, there may be a performance degradation when the
function cudnnConvolutionBackwardFilter() is
used for 3D convolutions with
CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1.
- In cuDNN 7.6.1, on Turing and Pascal architectures, performance may be degraded for cudnnConvolutionBackwardData() when
all of the following conditions are true:
- CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 is used for 3D convolutions,
- wDesc, dyDesc and
dxDesc are all in NCDHW format, and
- the data type configuration is FLOAT_CONFIG (i.e., single-precision
data and compute).