cuDNN Release 7.2.1

These are the release notes for cuDNN 7.2.1. This release includes fixes from the previous cuDNN 7.x.x releases as well as the following additional changes.

Key Features and Enhancements

The following enhancements have been added to this release:

  • The following new functions are added to provide support for the padding mask for the cudnnRNN* family of functions (a padded I/O sketch follows this list):
    • cudnnSetRNNPaddingMode(): Enables/disables the padded RNN input/output.
    • cudnnGetRNNPaddingMode(): Reads the padding mode status.
    • cudnnCreateRNNDataDescriptor() and cudnnDestroyRNNDataDescriptor(): Creates and destroys, respectively, cudnnRNNDataDescriptor_t, an RNN data descriptor.
    • cudnnSetRNNDataDescriptor() and cudnnGetRNNDataDescriptor(): Initializes and reads, respectively, the RNN data descriptor.
    • cudnnRNNForwardTrainingEx(): An extended version of cudnnRNNForwardTraining() that allows the padded (unpacked) layout for the input/output.
    • cudnnRNNForwardInferenceEx(): An extended version of cudnnRNNForwardInference() that allows the padded (unpacked) layout for the input/output.
    • cudnnRNNBackwardDataEx(): An extended version of cudnnRNNBackwardData() that allows the padded (unpacked) layout for the input/output.
    • cudnnRNNBackwardWeightsEx(): An extended version of cudnnRNNBackwardWeights() that allows the padded (unpacked) layout for the input/output.
  • Added support for cell clipping in the cuDNN LSTM implementation. The following new functions are added (a cell clipping sketch follows this list):
    • cudnnRNNSetClip() and cudnnRNNGetClip(): Sets and retrieves, respectively, the LSTM cell clipping mode.
  • When the input channel size c is a multiple of 32, you can use the new data type CUDNN_DATA_INT8x32 to accelerate your convolution computation (an INT8x32 descriptor sketch follows this list).
    Note: The CUDNN_DATA_INT8x32 data type is only supported by sm_72.
  • Enhanced the family of cudnnFindRNN* functions. The findIntensity input to these functions now enables the user to control the overall runtime of the RNN find algorithms by selecting a percentage of a large Cartesian product space to be searched.
  • A new mode, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION, is added to cudnnMathType_t. The computation time for FP32 tensors can be reduced by selecting this mode (a math-type sketch follows this list).
  • The functions cudnnRNNForwardInference(), cudnnRNNForwardTraining(), cudnnRNNBackwardData(), and cudnnRNNBackwardWeights() will now perform down-conversion of FP32 input/output only when CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION is set.
  • Improved the heuristics for cudnnGet*Algorithm() functions.
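
The padded I/O sketch below is not taken from the release notes; it shows one way the new descriptors might be wired up: an RNN data descriptor is created with the unpacked (padded) layout, and the padding mode is enabled on an existing RNN descriptor. All sizes, sequence lengths, and variable names are illustrative assumptions.

    #include <cudnn.h>

    /* Sketch: configure padded (unpacked) RNN I/O for an already-initialized
       RNN descriptor. All sizes below are illustrative assumptions. */
    void enablePaddedIO(cudnnRNNDescriptor_t rnnDesc, cudnnRNNDataDescriptor_t *xDesc)
    {
        const int maxSeqLength = 20;   /* longest sequence in the mini-batch */
        const int batchSize    = 64;
        const int vectorSize   = 512;  /* input vector size per time step */

        int seqLengthArray[64];        /* per-example sequence lengths */
        for (int i = 0; i < batchSize; ++i)
            seqLengthArray[i] = maxSeqLength - (i % 4);

        float paddingFill = 0.0f;      /* value written into padded positions */

        cudnnCreateRNNDataDescriptor(xDesc);
        cudnnSetRNNDataDescriptor(*xDesc, CUDNN_DATA_FLOAT,
                                  CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED, /* padded layout */
                                  maxSeqLength, batchSize, vectorSize,
                                  seqLengthArray, &paddingFill);

        /* Let the cudnnRNN*Ex() functions consume and produce the padded layout. */
        cudnnSetRNNPaddingMode(rnnDesc, CUDNN_RNN_PADDED_IO_ENABLED);
    }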
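
The cell clipping sketch below assumes an initialized handle and RNN descriptor; the clip bounds are illustrative assumptions.

    #include <cudnn.h>

    /* Sketch: clip LSTM cell-state values into [-10, 10]; the bounds are
       illustrative. handle and rnnDesc are assumed to be initialized. */
    void enableCellClipping(cudnnHandle_t handle, cudnnRNNDescriptor_t rnnDesc)
    {
        cudnnRNNSetClip(handle, rnnDesc,
                        CUDNN_RNN_CLIP_MINMAX,    /* clip to [lclip, rclip] */
                        CUDNN_NOT_PROPAGATE_NAN,
                        -10.0, 10.0);

        /* Read the setting back. */
        cudnnRNNClipMode_t    mode;
        cudnnNanPropagation_t nanOpt;
        double lclip, rclip;
        cudnnRNNGetClip(handle, rnnDesc, &mode, &nanOpt, &lclip, &rclip);
    }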
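
For the INT8x32 descriptor sketch, the input tensor is described with the vectorized CUDNN_TENSOR_NCHW_VECT_C format; the dimensions below are illustrative assumptions (c must be a multiple of 32).

    #include <cudnn.h>

    /* Sketch: describe an INT8x32 convolution input on sm_72. The dimensions
       are illustrative assumptions; c must be a multiple of 32. */
    void describeInt8x32Input(cudnnTensorDescriptor_t xDesc)
    {
        const int n = 1, c = 64, h = 28, w = 28;      /* c % 32 == 0 */
        cudnnSetTensor4dDescriptor(xDesc,
                                   CUDNN_TENSOR_NCHW_VECT_C, /* channels vectorized by 32 */
                                   CUDNN_DATA_INT8x32,
                                   n, c, h, w);
    }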
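
The math-type sketch below shows how the new mode is opted into per descriptor, using the existing setter functions; both descriptors are assumed to be created and configured elsewhere.

    #include <cudnn.h>

    /* Sketch: opt in to down-conversion of FP32 data on Tensor Cores. */
    void allowDownConversion(cudnnConvolutionDescriptor_t convDesc,
                             cudnnRNNDescriptor_t rnnDesc)
    {
        /* Convolutions: permit internal down-conversion of FP32 data. */
        cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);

        /* RNNs: the same opt-in for the cudnnRNN* compute functions. */
        cudnnSetRNNMatrixMathType(rnnDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
    }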

Known Issues and Limitations

The following issues and limitations exist in this release:

  • For FP16 inputs, the functions cudnnGetConvolutionForwardAlgorithm(), cudnnGetConvolutionBackwardDataAlgorithm(), and cudnnGetConvolutionBackwardFilterAlgorithm() will select a slower algorithm.
  • When beta is not equal to zero and the input channel size is greater than 65535, the following cudnnConvolutionBackwardFilter() algorithms may return a CUDNN_STATUS_EXECUTION_FAILED error:
    • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0
    • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1
    • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3
  • In rare cases, when beta is not equal to zero, the function cudnnFindConvolutionBackwardFilterAlgorithm() may not return the fastest algorithm available for cudnnConvolutionBackwardFilter().
  • Grouped convolutions are not supported in the TRUE_HALF_CONFIG (convDesc is CUDNN_DATA_HALF) data type configuration. As a workaround, the PSEUDO_HALF_CONFIG (convDesc is CUDNN_DATA_FLOAT) data type configuration can be used without losing any precision (see the sketch after this list).
  • For the cudnnConvolutionBiasActivationForward() function, if the input cudnnActivationMode_t is set to the enum value CUDNN_ACTIVATION_IDENTITY, then the input cudnnConvolutionFwdAlgo_t must be set to the enum value CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM (also shown in the sketch after this list).
  • When cudnnRNNForward* or cudnnRNNBackward* is run with FP32 input/output on sm_70 or sm_72, with the RNN descriptor's algo field set to CUDNN_RNN_ALGO_PERSIST_STATIC and the math type set to CUDNN_TENSOR_OP_MATH via cudnnSetRNNMatrixMathType(), the results are incorrect.
  • When cudnnRNNForward* or cudnnRNNBackward* is run with FP32 input/output on sm_70 or sm_72, with the RNN descriptor's algo field set to CUDNN_RNN_ALGO_PERSIST_STATIC and the math type set to CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION via cudnnSetRNNMatrixMathType(), the resulting performance is suboptimal.
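
A minimal sketch of the two workarounds referenced above follows; all descriptors are assumed to be created elsewhere, and the pad/stride/dilation values and group count are illustrative assumptions.

    #include <cudnn.h>

    /* Workaround sketch for grouped convolutions with FP16 data: use
       PSEUDO_HALF_CONFIG, i.e. keep the convolution descriptor's compute type
       at CUDNN_DATA_FLOAT. Pad/stride/dilation and group count are illustrative. */
    void pseudoHalfGroupedConv(cudnnConvolutionDescriptor_t convDesc)
    {
        cudnnSetConvolution2dDescriptor(convDesc,
                                        1, 1,   /* pad h, w */
                                        1, 1,   /* stride h, w */
                                        1, 1,   /* dilation h, w */
                                        CUDNN_CROSS_CORRELATION,
                                        CUDNN_DATA_FLOAT);  /* not CUDNN_DATA_HALF */
        cudnnSetConvolutionGroupCount(convDesc, 2);
    }

    /* Identity activation: cudnnConvolutionBiasActivationForward() then
       requires the IMPLICIT_PRECOMP_GEMM forward algorithm. */
    void identityActivationSetup(cudnnActivationDescriptor_t actDesc,
                                 cudnnConvolutionFwdAlgo_t *algo)
    {
        cudnnSetActivationDescriptor(actDesc, CUDNN_ACTIVATION_IDENTITY,
                                     CUDNN_NOT_PROPAGATE_NAN, 0.0);
        *algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
    }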

Fixed Issues

The following issues have been fixed in this release:

  • The cudnnConvolutionBackwardData() function produced incorrect results under these conditions:
    • The algo input is set to CUDNN_CONVOLUTION_BWD_DATA_ALGO_1 in cudnnConvolutionBwdDataAlgo_t, and
    • CUDNN_TENSOR_OP_MATH is selected.

      Under the above conditions, the dgrad computation gave incorrect results when the data was not packed and the data format was NCHW. This is fixed.

  • When cudnnConvolutionFwdAlgo_t was set to CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING, the function cudnnConvolutionForward() was leading to an illegal memory access. This is now fixed.

  • cudnnPoolingBackward() was failing when a large kernel size was used for 'global pooling' with the NHWC I/O layout. This is fixed.
  • The following two issues, which occurred when the RNN math type was set to CUDNN_TENSOR_OP_MATH and the RNN was run on sm_6x or earlier hardware, are fixed:
    • You may have received CUDNN_STATUS_NOT_SUPPORTED when the selected algo was CUDNN_RNN_ALGO_STANDARD or CUDNN_RNN_ALGO_PERSIST_STATIC.
    • You may have received incorrect results when the selected algo was CUDNN_RNN_ALGO_PERSIST_DYNAMIC.
  • If you passed a variable-sequence-length input tensor to cudnnRNNForwardInference(), cudnnRNNForwardTraining(), or cudnnRNNBackwardData() and used CUDNN_RNN_ALGO_PERSIST_STATIC or CUDNN_RNN_ALGO_PERSIST_DYNAMIC, you may have received incorrect results. This case is now checked, and CUDNN_STATUS_NOT_SUPPORTED is returned.