cuDNN Release Notes v7.0.2

Key Features and Enhancements

This is a patch release of cuDNN 7.0; it includes bug fixes and performance improvements, primarily on Volta.

Algo 1 Convolution Performance Improvements
Performance improvements were made to CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM, CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1, and CUDNN_CONVOLUTION_BWD_DATA_ALGO_1. These improvements consist of new SASS kernels and improved heuristics: the new kernels implement convolutions over a wider range of data sizes and tile sizes, and the improved heuristics take advantage of these new kernels.
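
The sketch below is a minimal illustration, with hypothetical tensor shapes and convolution parameters, of explicitly requesting the improved algo 1 forward path (CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM) and querying the workspace it needs before calling cudnnConvolutionForward(); error checking is omitted for brevity.

  #include <cudnn.h>
  #include <stdio.h>

  int main(void)
  {
      cudnnHandle_t handle;
      cudnnTensorDescriptor_t xDesc, yDesc;
      cudnnFilterDescriptor_t wDesc;
      cudnnConvolutionDescriptor_t convDesc;
      size_t wsSize = 0;
      int n, c, h, w;

      cudnnCreate(&handle);
      cudnnCreateTensorDescriptor(&xDesc);
      cudnnCreateTensorDescriptor(&yDesc);
      cudnnCreateFilterDescriptor(&wDesc);
      cudnnCreateConvolutionDescriptor(&convDesc);

      /* Hypothetical problem size: 32x64x56x56 input, 128 3x3 filters,
       * pad 1, stride 1, no dilation. */
      cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                 32, 64, 56, 56);
      cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                                 128, 64, 3, 3);
      cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                      CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
      cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc,
                                            &n, &c, &h, &w);
      cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                 n, c, h, w);

      /* Query the workspace the algo 1 forward kernels need; the allocated
       * buffer and this size are then passed to cudnnConvolutionForward()
       * together with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM. */
      cudnnStatus_t st = cudnnGetConvolutionForwardWorkspaceSize(
          handle, xDesc, wDesc, convDesc, yDesc,
          CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM, &wsSize);
      printf("algo 1 forward workspace: %zu bytes (%s)\n",
             wsSize, cudnnGetErrorString(st));

      cudnnDestroyConvolutionDescriptor(convDesc);
      cudnnDestroyFilterDescriptor(wDesc);
      cudnnDestroyTensorDescriptor(yDesc);
      cudnnDestroyTensorDescriptor(xDesc);
      cudnnDestroy(handle);
      return 0;
  }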

Known Issues

The following are known issues in this release:

  • cudnnGetConvolutionForwardWorkspaceSize() returns an overflowed size_t value for certain input shapes when used with the CUDNN_CONVOLUTION_*_ALGO_FFT_TILING algorithms (see the check sketched after this list).
  • cudnnPoolingBackward() fails for pooling window size > 256.
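
As a defensive workaround sketch for the workspace overflow above, a caller can treat an implausibly large size reported for the FFT tiling algorithm as a symptom of the bug and fall back to another forward algorithm. The helper name and the 64 GiB cap below are illustrative assumptions, not part of the cuDNN API.

  #include <cudnn.h>
  #include <stdbool.h>

  /* Hypothetical helper: returns true only if the workspace size reported
   * for CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING looks plausible. */
  static bool fft_tiling_workspace_ok(cudnnHandle_t handle,
                                      cudnnTensorDescriptor_t xDesc,
                                      cudnnFilterDescriptor_t wDesc,
                                      cudnnConvolutionDescriptor_t convDesc,
                                      cudnnTensorDescriptor_t yDesc,
                                      size_t *wsSize)
  {
      const size_t kMaxPlausible = (size_t)64 << 30;  /* 64 GiB, arbitrary cap */
      cudnnStatus_t st = cudnnGetConvolutionForwardWorkspaceSize(
          handle, xDesc, wDesc, convDesc, yDesc,
          CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING, wsSize);
      /* An overflowed size_t shows up as an absurdly large request; callers
       * can then fall back to a different forward algorithm. */
      return st == CUDNN_STATUS_SUCCESS && *wsSize < kMaxPlausible;
  }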

Fixed Issues

The following issues have been fixed in this release:

  • Batch normalization with the CUDNN_BATCHNORM_SPATIAL_PERSISTENT mode could run into race conditions in certain scenarios.
  • cuDNN convolution layers using TENSOR_OP_MATH with fp16 inputs and outputs and fp32 compute now use “round to nearest” mode instead of the “round to zero” mode used in 7.0.1. This rounding mode has proven to achieve better results in training. (Enabling TENSOR_OP_MATH is sketched after this list.)
  • Fixed synchronization logic in the CUDNN_CTC_LOSS_ALGO_DETERMINISTIC algo for CTC. The original code would hang in rare cases.
  • Convolution algorithms using TENSOR_OP_MATH returned a workspace size from the *GetWorkspaceSize() functions that was smaller than actually necessary.
  • The results of INT8 convolutions were inaccurate in certain cases when calling cudnnConvolutionForward().
  • cudnnConvolutionForward() called with xDesc’s channel = yDesc’s channel = groupCount could compute incorrect values when vertical padding > 0.
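
For the TENSOR_OP_MATH rounding-mode change above, the following is a minimal sketch of opting a convolution descriptor into Tensor Core math via cudnnSetConvolutionMathType(); descriptor creation and the rest of the convolution setup are assumed to happen elsewhere.

  #include <cudnn.h>
  #include <stdio.h>

  /* Enable Tensor Core math on an existing convolution descriptor. With fp16
   * inputs/outputs and fp32 compute, cuDNN 7.0.2 rounds intermediate results
   * on this path to nearest rather than toward zero. */
  void enable_tensor_op_math(cudnnConvolutionDescriptor_t convDesc)
  {
      cudnnStatus_t st = cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH);
      if (st != CUDNN_STATUS_SUCCESS) {
          fprintf(stderr, "cudnnSetConvolutionMathType: %s\n",
                  cudnnGetErrorString(st));
      }
  }

Note that setting CUDNN_TENSOR_OP_MATH permits, but does not guarantee, the use of Tensor Core kernels for a given problem.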