cuDNN Release 7.6.3

These are the cuDNN 7.6.3 release notes. This release includes fixes from the previous cuDNN v7.x.x releases as well as the following additional changes. These release notes apply to both cuDNN and JetPack users unless an item is specifically marked with (not applicable for Jetson platforms).

For previous cuDNN release notes, see the cuDNN Archived Documentation.

Key Features and Enhancements

The following features and enhancements have been added to this release:

  • The cuDNN 7.6.3 library now supports auto-padding for the NHWC layout. The functional behavior and the benefits of auto-padding are as follows (a sketch follows this list): (not applicable for Jetson platforms)
    • For use cases where the C and K dimensions of the input and filter tensors are not multiples of 8, the auto-padding feature increases the tensor size so that the tensor dimensions are multiples of 8.
    • With auto-padding, the cuDNN library invokes faster kernels, thereby improving performance.
    • With auto-padding, the performance with the NHWC data layout is now comparable to that of the NCHW layout.
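
    Below is a minimal sketch of the case that auto-padding targets, assuming hypothetical shapes (N=32, C=3, H=W=224) and omitting error checks. The point is only that C is not a multiple of 8 and no user-side padding is required; the library pads internally.

      #include <cudnn.h>

      int main() {
          cudnnHandle_t handle;
          cudnnCreate(&handle);

          cudnnTensorDescriptor_t xDesc;
          cudnnCreateTensorDescriptor(&xDesc);
          // N=32, C=3, H=224, W=224 in NHWC layout; C is not a multiple of 8.
          // As of cuDNN 7.6.3 the library pads such dimensions internally, so
          // no user-side change is needed to reach the faster kernels.
          cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NHWC,
                                     CUDNN_DATA_HALF, 32, 3, 224, 224);

          // ... create the filter/convolution/output descriptors and call
          // cudnnConvolutionForward as usual.

          cudnnDestroyTensorDescriptor(xDesc);
          cudnnDestroy(handle);
          return 0;
      }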
  • Added support for dataType=CUDNN_DATA_HALF and computePrec=CUDNN_DATA_HALF in the multi-head attention forward (cudnnMultiHeadAttnForward()) and backward gradient (cudnnMultiHeadAttnBackwardData() and cudnnMultiHeadAttnBackwardWeights()) API functions. (not applicable for Jetson platforms)

  • The multi-head attention API now supports bias after the projections on Q, K, V, and O in the cudnnMultiHeadAttnForward() call (the backward bias gradient is not yet supported). (not applicable for Jetson platforms)

    The new feature required a small API change in cudnnSetAttnDescriptor(): the cudnnAttnQueryMap_t queryMap argument is replaced with unsigned attnMode, which passes various on/off options as bit flags. This change is backward compatible with earlier API versions. (not applicable for Jetson platforms)

  • Significantly improved the performance of typical multi-head attention use cases in forward inference and training, especially when the vector length of each head is a multiple of 32, up to 128. (not applicable for Jetson platforms)

  • Tensor Core support is added for true half-precision and single-precision use cases in multi-head attention. Users can enable it by setting the mathType argument in cudnnSetAttnDescriptor() to CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION; see the sketch below. (not applicable for Jetson platforms)
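
    A minimal sketch, with hypothetical sizes, combining the attention items above: half-precision data and compute, projection biases enabled through attnMode, and Tensor Core math. Status checks are omitted, and passing NULL to disable the dropout descriptors is an assumption of this sketch.

      #include <cudnn.h>

      int main() {
          cudnnAttnDescriptor_t attnDesc;
          cudnnCreateAttnDescriptor(&attnDesc);

          // New bit-flag argument: enable bias after the Q/K/V/O projections
          // and keep the all-to-one query map (replaces cudnnAttnQueryMap_t).
          unsigned attnMode = CUDNN_ATTN_QUERYMAP_ALL_TO_ONE |
                              CUDNN_ATTN_ENABLE_PROJ_BIASES;

          cudnnSetAttnDescriptor(
              attnDesc,
              attnMode,
              8,                    // nHeads
              1.0,                  // smScaler (softmax scaling factor)
              CUDNN_DATA_HALF,      // dataType: half precision now supported
              CUDNN_DATA_HALF,      // computePrec: half precision now supported
              CUDNN_TENSOR_OP_MATH, // enable Tensor Cores
              nullptr,              // attnDropoutDesc (assumption: no dropout)
              nullptr,              // postDropoutDesc (assumption: no dropout)
              512, 512, 512,        // qSize, kSize, vSize
              64, 64, 64,           // qProjSize, kProjSize, vProjSize
                                    //   (64 per head: a multiple of 32, <=128)
              512,                  // oProjSize
              128, 128,             // qoMaxSeqLength, kvMaxSeqLength
              32,                   // maxBatchSize
              1);                   // maxBeamSize

          cudnnDestroyAttnDescriptor(attnDesc);
          return 0;
      }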

  • The multiHeadAttention sample code is added. The sample includes a compact NumPy/Autograd reference model of the multi-head attention block that computes the forward response and all first-order derivatives. The test code demonstrates how to use the multi-head attention API and how to access attention weights and sequence data. (not applicable for Jetson platforms)

  • Improved depth-wise convolution for forward, dgrad, and wgrad under the following conditions (see the sketch after this list):
    • Algorithm is algo1
    • Tensor format for the filter is NCHW (wgrad also supports NHWC)
    • Inputs and outputs are in FP16 and computation is in FP32
    • Filter size: 1x1, 3x3, 5x5, 7x7 (dgrad only supports stride 1)
    • Math type is CUDNN_DEFAULT_MATH
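
    A minimal sketch, with hypothetical shapes, of a depth-wise convolution configured for the improved path above: FP16 inputs/outputs, FP32 compute, NCHW filters, CUDNN_DEFAULT_MATH, and, for the forward pass, the "algo1" kernels (CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM). Error checks are omitted.

      #include <cudnn.h>

      int main() {
          const int C = 64;   // input channels (hypothetical)

          cudnnConvolutionDescriptor_t convDesc;
          cudnnCreateConvolutionDescriptor(&convDesc);
          cudnnSetConvolution2dDescriptor(convDesc,
                                          1, 1,    // padding for a 3x3 filter
                                          1, 1,    // stride
                                          1, 1,    // dilation
                                          CUDNN_CROSS_CORRELATION,
                                          CUDNN_DATA_FLOAT);  // FP32 compute
          // Depth-wise: one group per input channel.
          cudnnSetConvolutionGroupCount(convDesc, C);
          cudnnSetConvolutionMathType(convDesc, CUDNN_DEFAULT_MATH);

          cudnnFilterDescriptor_t wDesc;
          cudnnCreateFilterDescriptor(&wDesc);
          // K = C for depth-wise; per-group channel count is 1.
          // 3x3 filter in NCHW layout with FP16 data.
          cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW,
                                     C, 1, 3, 3);

          // ... set FP16 input/output tensor descriptors, then call
          // cudnnConvolutionForward with
          // CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM.

          cudnnDestroyFilterDescriptor(wDesc);
          cudnnDestroyConvolutionDescriptor(convDesc);
          return 0;
      }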
  • Improved grouped convolution for cudnnConvolutionBackwardFilter() in the configuration below:
    • Algorithm is CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1
    • Math type is CUDNN_DEFAULT_MATH
    • Tensor format for the filter is NCHW
    • Inputs and outputs are in FP16 and computation is in FP32
    • Filter size: 1x1, 3x3, 5x5, 7x7
  • Improved the performance of grouped convolution for cudnnConvolutionForward() in the configuration below (see the sketch after these configuration lists):
    • Algorithm is CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
    • Math type is CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION
    • Tensor format for the filter is NHWC
    • Inputs and outputs are in FP16 and computation is in FP16/FP32
    • Per-group C and K == 4, 8, 16, or 32
    • Filter size: 3x3
  • Improved the performance of grouped convolution for cudnnConvolutionBackwardFilter() in the configuration below (see the sketch below):
    • Algorithm is CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1
    • Math type is CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION
    • Tensor format for the filter is NHWC
    • Inputs and outputs are in FP16 and computation is in FP32
    • On NVIDIA Volta (compute capability 7.0)
    • Per-group C and K == 4, 8, 16, or 32
    • Filter size: 1x1, 3x3
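
    A minimal sketch, with hypothetical shapes, covering the two grouped-convolution Tensor Core configurations above: NHWC filters, FP16 data, FP32 compute, CUDNN_TENSOR_OP_MATH, and per-group C and K equal to 32. Error checks are omitted.

      #include <cudnn.h>

      int main() {
          const int groups = 4;
          const int C = 32 * groups;   // per-group C == 32 (hypothetical)
          const int K = 32 * groups;   // per-group K == 32 (hypothetical)

          cudnnConvolutionDescriptor_t convDesc;
          cudnnCreateConvolutionDescriptor(&convDesc);
          cudnnSetConvolution2dDescriptor(convDesc,
                                          1, 1, 1, 1, 1, 1, // pad/stride/dilation
                                          CUDNN_CROSS_CORRELATION,
                                          CUDNN_DATA_FLOAT); // FP32 compute
          cudnnSetConvolutionGroupCount(convDesc, groups);
          cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH);

          cudnnFilterDescriptor_t dwDesc;
          cudnnCreateFilterDescriptor(&dwDesc);
          // 3x3 filters in NHWC layout; the channel argument is per-group.
          cudnnSetFilter4dDescriptor(dwDesc, CUDNN_DATA_HALF, CUDNN_TENSOR_NHWC,
                                     K, C / groups, 3, 3);

          // ... with FP16 NHWC x/dy tensor descriptors, call
          // cudnnConvolutionBackwardFilter with
          // CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1, or cudnnConvolutionForward
          // with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM.

          cudnnDestroyFilterDescriptor(dwDesc);
          cudnnDestroyConvolutionDescriptor(convDesc);
          return 0;
      }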

Fixed Issues

The following issues have been fixed in this release:

  • Fixed an issue where cudnnMultiHeadAttnBackwardData() produced incorrect results when the K sequence length is longer than 32.

  • Fixed a race condition in cudnnMultiHeadAttnBackwardData() that intermittently produced incorrect results.

  • The function cudnnCTCLoss() produced an incorrect gradient result for labels whose length is smaller than the maximal sequence length in the batch. This is fixed in cuDNN 7.6.3.