cuDNN Release Notes v7.6.3

Key Features and Enhancements

The following features and enhancements have been added to this release:

  • The cuDNN 7.6.3 library now supports auto-padding for the NHWC layout. The functional behavior and the benefits of auto-padding are as follows:
    • For use cases where the C and K dimensions of the input and filter tensors are not multiples of 8, the auto-padding feature increases the tensor size so that the tensor dimensions are multiples of 8.
    • With auto-padding, the cuDNN library invokes faster kernels, thereby improving performance.
    • With auto-padding, the performance with the NHWC data layout is now comparable to that of the NCHW layout.
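
    For illustration, the following is a minimal sketch (with hypothetical dimensions) of a descriptor that benefits from auto-padding: an NHWC tensor in FP16 whose channel count is not a multiple of 8. No opt-in code is required; cuDNN 7.6.3 pads such tensors internally.

      #include <cudnn.h>

      // Hypothetical NHWC activation tensor with C = 3, not a multiple of 8.
      // cuDNN 7.6.3 auto-pads the C dimension internally so that faster
      // kernels can be selected; the user-visible tensor shape is unchanged.
      void describeInput(void) {
          cudnnTensorDescriptor_t xDesc;
          cudnnCreateTensorDescriptor(&xDesc);
          cudnnSetTensor4dDescriptor(xDesc,
                                     CUDNN_TENSOR_NHWC, // layout covered by auto-padding
                                     CUDNN_DATA_HALF,   // FP16 data
                                     /*n=*/32, /*c=*/3, /*h=*/224, /*w=*/224);
      }
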
  • Added support for dataType=CUDNN_DATA_HALF and computePrec=CUDNN_DATA_HALF in the multi-head attention forward (cudnnMultiHeadAttnForward) and backward gradient (cudnnMultiHeadAttnBackwardData and cudnnMultiHeadAttnBackwardWeights) API functions.
  • The multi-head attention API now supports bias after the projections on Q, K, V, and O in the cudnnMultiHeadAttnForward() call (the backward bias gradient is not yet supported).

    The new feature required a small API change in cudnnSetAttnDescriptor(): the cudnnAttnQueryMap_t queryMap argument was replaced with an unsigned attnMode argument that passes various on/off options. This change is backward compatible with earlier API versions.

  • Significantly improved the performance of typical multi-head attention use cases in forward inference and training, especially when the vector length of each head is a multiple of 32, up to 128.
  • Added Tensor Core support for true half and single precision use cases in multi-head attention. Users can enable it by setting the mathType argument in cudnnSetAttnDescriptor() to CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION, as in the sketch below.
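
    The three multi-head attention items above all flow through cudnnSetAttnDescriptor(). The following sketch (head counts, sizes, and dropout descriptors are illustrative assumptions, not values from this release) configures half-precision data and compute, enables projection biases through attnMode, and requests Tensor Core math:

      #include <cudnn.h>

      // Illustrative configuration only; the dropout descriptors are assumed
      // to have been created and initialized elsewhere.
      void configureAttention(cudnnDropoutDescriptor_t attnDropout,
                              cudnnDropoutDescriptor_t postDropout) {
          cudnnAttnDescriptor_t attnDesc;
          cudnnCreateAttnDescriptor(&attnDesc);

          // attnMode replaces the former cudnnAttnQueryMap_t queryMap argument;
          // OR in CUDNN_ATTN_ENABLE_PROJ_BIASES to add biases after the
          // Q, K, V, and O projections (forward path only in 7.6.3).
          unsigned attnMode = CUDNN_ATTN_QUERYMAP_ALL_TO_ONE |
                              CUDNN_ATTN_ENABLE_PROJ_BIASES;

          cudnnSetAttnDescriptor(attnDesc, attnMode,
                                 /*nHeads=*/8,
                                 /*smScaler=*/1.0,
                                 CUDNN_DATA_HALF,       // dataType: true half
                                 CUDNN_DATA_HALF,       // computePrec: half
                                 CUDNN_TENSOR_OP_MATH,  // enable Tensor Cores
                                 attnDropout, postDropout,
                                 /*qSize=*/512, /*kSize=*/512, /*vSize=*/512,
                                 /*qProjSize=*/64, /*kProjSize=*/64, /*vProjSize=*/64,
                                 /*oProjSize=*/512,
                                 /*qoMaxSeqLength=*/128, /*kvMaxSeqLength=*/128,
                                 /*maxBatchSize=*/32, /*maxBeamSize=*/1);
      }

    Here the per-head vector length (qProjSize = 64) is a multiple of 32 and at most 128, which also falls in the improved performance path noted above.
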
  • Added the multiHeadAttention sample code. The sample includes a compact NumPy/Autograd reference model of the multi-head attention block that computes the forward response and all first-order derivatives. The test code demonstrates how to use the multi-head attention API and how to access attention weights and sequence data.
  • Improved the performance of depth-wise convolution for forward, dgrad, and wgrad under the following conditions (see the sketch after this list):
    • Algorithm is algo1
    • Tensor format for the filter is NCHW (wgrad also supports NHWC)
    • Inputs and outputs are in FP16 and computation is in FP32
    • Filter size: 1x1, 3x3, 5x5, 7x7 (dgrad only supports stride 1)
    • Math type is CUDNN_DEFAULT_MATH
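
    A minimal setup sketch for the depth-wise case, with assumed shapes (N = 32, C = K = 64, 56x56 activations, 3x3 filter):

      #include <cudnn.h>

      // Hypothetical depth-wise convolution: one group per channel (C = K = 64).
      void setupDepthwise(void) {
          const int N = 32, C = 64, H = 56, W = 56;

          cudnnTensorDescriptor_t xDesc;
          cudnnCreateTensorDescriptor(&xDesc);
          cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF,
                                     N, C, H, W);  // FP16 inputs and outputs

          cudnnFilterDescriptor_t wDesc;
          cudnnCreateFilterDescriptor(&wDesc);
          // Depth-wise 3x3 filter in NCHW: K = C and c = C/groupCount = 1.
          cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW,
                                     /*k=*/C, /*c=*/1, /*h=*/3, /*w=*/3);

          cudnnConvolutionDescriptor_t convDesc;
          cudnnCreateConvolutionDescriptor(&convDesc);
          cudnnSetConvolution2dDescriptor(convDesc, /*pad_h=*/1, /*pad_w=*/1,
                                          /*stride_h=*/1, /*stride_w=*/1,
                                          /*dilation_h=*/1, /*dilation_w=*/1,
                                          CUDNN_CROSS_CORRELATION,
                                          CUDNN_DATA_FLOAT);  // FP32 computation
          cudnnSetConvolutionGroupCount(convDesc, C);  // depth-wise: groupCount = C
          cudnnSetConvolutionMathType(convDesc, CUDNN_DEFAULT_MATH);
          // Select the algo1 variants when launching, e.g.
          // CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM for forward,
          // CUDNN_CONVOLUTION_BWD_DATA_ALGO_1 for dgrad (stride 1 only),
          // CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1 for wgrad.
      }
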
  • Improved the performance of grouped convolution for cudnnConvolutionBackwardFilter() in the configuration below (a wgrad call sketch follows the last configuration in this section):
    • Algorithm is CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1
    • Math type is CUDNN_DEFAULT_MATH
    • Tensor format for the filter is NCHW
    • Inputs and outputs are in FP16 and computation is in FP32
    • Filter size: 1x1, 3x3, 5x5, 7x7
  • Improved the performance of grouped convolution for cudnnConvolutionForward() in the configuration below (see the sketch after this configuration):
    • Algorithm is CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
    • Math type is CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION
    • Tensor format for the filter is NHWC
    • Inputs and outputs are in FP16 and computation is in FP16 or FP32
    • Per-group C and K equal 4, 8, 16, or 32
    • Filter size: 3x3
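
    A sketch of this configuration with assumed shapes (8 groups, per-group C = K = 8, so total C = K = 64):

      #include <cudnn.h>

      // Hypothetical grouped convolution in the Tensor Core path described above.
      void setupGroupedForward(cudnnHandle_t handle) {
          const int N = 32, C = 64, H = 56, W = 56, groups = 8;

          cudnnTensorDescriptor_t xDesc, yDesc;
          cudnnCreateTensorDescriptor(&xDesc);
          cudnnCreateTensorDescriptor(&yDesc);
          cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NHWC, CUDNN_DATA_HALF,
                                     N, C, H, W);
          cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NHWC, CUDNN_DATA_HALF,
                                     N, C, H, W);  // same spatial size with pad = 1

          cudnnFilterDescriptor_t wDesc;
          cudnnCreateFilterDescriptor(&wDesc);
          // NHWC 3x3 filter; c = C/groups = 8 channels per group.
          cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_HALF, CUDNN_TENSOR_NHWC,
                                     /*k=*/C, /*c=*/C / groups, /*h=*/3, /*w=*/3);

          cudnnConvolutionDescriptor_t convDesc;
          cudnnCreateConvolutionDescriptor(&convDesc);
          cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                          CUDNN_CROSS_CORRELATION,
                                          CUDNN_DATA_FLOAT);  // FP32 compute variant
          cudnnSetConvolutionGroupCount(convDesc, groups);
          cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH);

          size_t workspaceSize = 0;
          cudnnGetConvolutionForwardWorkspaceSize(
              handle, xDesc, wDesc, convDesc, yDesc,
              CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM, &workspaceSize);
          // ...allocate the workspace and call cudnnConvolutionForward()
          // with the same algorithm...
      }
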
  • Improved the performance of grouped convolution for cudnnConvolutionBackwardFilter() in the configuration below (see the wgrad sketch after this list):
    • Algorithm is CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1
    • Math type is CUDNN_TENSOR_OP_MATH or CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION
    • Tensor format for the filter is NHWC
    • Inputs and outputs are in FP16 and computation is in FP32
    • On NVIDIA Volta (compute capability 7.0)
    • Per-group C and K equal 4, 8, 16, or 32
    • Filter size: 1x1, 3x3
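
    For both cudnnConvolutionBackwardFilter() items above, the call itself looks the same; only the descriptors differ. A sketch with assumed, already-initialized descriptors and device buffers:

      #include <cudnn.h>

      // Hypothetical wgrad call; xDesc/dyDesc/convDesc/dwDesc and the buffers
      // are assumed to be configured as in the sketches above (group count,
      // filter layout, and math type already set on convDesc).
      void groupedWgrad(cudnnHandle_t handle,
                        cudnnTensorDescriptor_t xDesc, const void *x,
                        cudnnTensorDescriptor_t dyDesc, const void *dy,
                        cudnnConvolutionDescriptor_t convDesc,
                        cudnnFilterDescriptor_t dwDesc, void *dw,
                        void *workspace, size_t workspaceSize) {
          const float alpha = 1.0f, beta = 0.0f;  // scaling factors are FP32
          cudnnConvolutionBackwardFilter(
              handle, &alpha, xDesc, x, dyDesc, dy, convDesc,
              CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1,  // the improved algorithm
              workspace, workspaceSize, &beta, dwDesc, dw);
      }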

Fixed Issues

The following issues have been fixed in this release:

  • Fixed an issue where cudnnMultiHeadAttnBackwardData produced incorrect results when the K sequence length is longer than 32.
  • Fixed a race condition in cudnnMultiHeadAttnBackwardData that was producing intermittent incorrect results.
  • The function cudnnCTCLoss() produced an incorrect gradient result for labels whose length is smaller than the maximal sequence length in the batch. This issue is fixed in cuDNN 7.6.3.