cuDNN Release Notes v7.2.1
Key Features and Enhancements
The following enhancements have been added to this release:
- The following new functions are added to support a padding mask for the
cudnnRNN* family of functions:
- cudnnSetRNNPaddingMode(): Enables/disables the padded RNN input/output.
- cudnnGetRNNPaddingMode(): Reads the padding mode status.
- cudnnCreateRNNDataDescriptor() and cudnnDestroyRNNDataDescriptor(): Creates and destroys, respectively, cudnnRNNDataDescriptor_t, an RNN data descriptor.
- cudnnSetRNNDataDescriptor() and cudnnGetRNNDataDescriptor(): Initializes and reads, respectively, the RNN data descriptor.
- cudnnRNNForwardTrainingEx(): An extended version of cudnnRNNForwardTraining() that allows the padded (unpacked) layout for the input/output.
- cudnnRNNForwardInferenceEx(): An extended version of cudnnRNNForwardInference() that allows the padded (unpacked) layout for the input/output.
- cudnnRNNBackwardDataEx(): An extended version of cudnnRNNBackwardData() that allows the padded (unpacked) layout for the input/output.
- cudnnRNNBackwardWeightsEx(): An extended version of cudnnRNNBackwardWeights() that allows the padded (unpacked) layout for the input/output.
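The new padded-I/O workflow above can be sketched as follows. This is a minimal illustration, not a complete training setup; the dimensions, sequence lengths, and the function name setup_padded_rnn are illustrative assumptions.

```c
#include <cudnn.h>

/* Sketch: enable padded (unpacked) RNN I/O and describe the batch.
 * All sizes and the per-sequence lengths below are illustrative. */
void setup_padded_rnn(cudnnRNNDescriptor_t rnnDesc)
{
    /* Allow padded input/output for the cudnnRNN*Ex() calls. */
    cudnnSetRNNPaddingMode(rnnDesc, CUDNN_RNN_PADDED_IO_ENABLED);

    cudnnRNNDataDescriptor_t xDesc;
    cudnnCreateRNNDataDescriptor(&xDesc);

    const int maxSeqLength = 20;               /* longest sequence in the batch */
    const int batchSize    = 4;
    const int vectorSize   = 512;              /* input vector size             */
    int seqLengthArray[4]  = {20, 17, 10, 5};  /* per-sequence lengths          */
    float paddingFill      = 0.0f;             /* value written into padding    */

    cudnnSetRNNDataDescriptor(xDesc, CUDNN_DATA_FLOAT,
                              CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED,
                              maxSeqLength, batchSize, vectorSize,
                              seqLengthArray, &paddingFill);

    /* xDesc can now be passed to cudnnRNNForwardTrainingEx(),
     * cudnnRNNForwardInferenceEx(), and the backward Ex() calls;
     * release it with cudnnDestroyRNNDataDescriptor() when done. */
}
```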
- Added support for cell clipping in cuDNN LSTM. The following new functions are added:
- cudnnRNNSetClip() and cudnnRNNGetClip(): Sets and retrieves, respectively, the LSTM cell clipping mode.
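A minimal sketch of setting and reading back the clipping mode; the clip bound of 10.0 and the function name enable_cell_clip are illustrative assumptions.

```c
#include <cudnn.h>

/* Sketch: clip LSTM cell values to [-cellClip, cellClip]. */
void enable_cell_clip(cudnnHandle_t handle, cudnnRNNDescriptor_t rnnDesc)
{
    const double cellClip = 10.0;  /* illustrative bound */
    cudnnRNNSetClip(handle, rnnDesc, CUDNN_RNN_CLIP_MINMAX,
                    CUDNN_NOT_PROPAGATE_NAN, -cellClip, cellClip);

    /* Read the setting back with the companion getter. */
    cudnnRNNClipMode_t    mode;
    cudnnNanPropagation_t nanOpt;
    double lclip, rclip;
    cudnnRNNGetClip(handle, rnnDesc, &mode, &nanOpt, &lclip, &rclip);
}
```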
- Added the new data type CUDNN_DATA_INT8x32 to accelerate convolution
computation when the input channel size c is a multiple of 32.
Note: The CUDNN_DATA_INT8x32 data type is supported only on sm_72.
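Describing an INT8x32 tensor might look like the sketch below. The vectorized CUDNN_TENSOR_NCHW_VECT_C layout pairs with this data type; the dimensions and the function name setup_int8x32_tensor are illustrative assumptions.

```c
#include <cudnn.h>

/* Sketch: describe an INT8x32 tensor for convolution on sm_72.
 * The channel count must be a multiple of 32; values are illustrative. */
void setup_int8x32_tensor(cudnnTensorDescriptor_t desc)
{
    const int n = 1, c = 64, h = 28, w = 28;   /* c % 32 == 0 */
    cudnnSetTensor4dDescriptor(desc,
                               CUDNN_TENSOR_NCHW_VECT_C,
                               CUDNN_DATA_INT8x32,
                               n, c, h, w);
}
```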
- Enhanced the family of cudnnFindRNN* functions. The findIntensity input to these functions now enables the user to control the overall runtime of the RNN find algorithms by selecting a percentage of a large Cartesian product space to be searched.
- A new mode CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION is added to cudnnMathType_t. The computation time for FP32 tensors can be reduced by selecting this mode.
- The functions cudnnRNNForwardInference(), cudnnRNNForwardTraining(), cudnnRNNBackwardData(), and cudnnRNNBackwardWeights() will now perform down conversion of FP32 input/output only when CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION is set.
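Opting in to the new mode can be sketched as below; the descriptor variables and the function name allow_conversion are illustrative assumptions.

```c
#include <cudnn.h>

/* Sketch: opt in to FP32 down-conversion on Tensor Cores. */
void allow_conversion(cudnnConvolutionDescriptor_t convDesc,
                      cudnnRNNDescriptor_t rnnDesc)
{
    /* Convolutions: permit Tensor Core math with implicit conversion. */
    cudnnSetConvolutionMathType(convDesc,
                                CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);

    /* RNNs: without this mode, FP32 input/output is not down-converted. */
    cudnnSetRNNMatrixMathType(rnnDesc,
                              CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
}
```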
- Improved the heuristics for cudnnGet*Algorithm() functions.
Known Issues and Limitations
The following issues and limitations exist in this release:
- For FP16 inputs, the functions cudnnGetConvolutionForwardAlgorithm(), cudnnGetConvolutionBackwardDataAlgorithm(), and cudnnGetConvolutionBackwardFilterAlgorithm() will select a slower algorithm than the fastest one available.
- When beta is not equal to zero and the input channel size is greater
than 65535, the following cudnnConvolutionBackwardFilter() algorithms
may return an EXECUTION_FAILED error:
- CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0
- CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1
- CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3
- This is a rare occurrence: When beta is not equal to zero, the function cudnnFindConvolutionBackwardFilterAlgorithm() may not return the fastest algorithm available for cudnnConvolutionBackwardFilter().
- Grouped convolutions are not supported in the TRUE_HALF_CONFIG (convDesc is CUDNN_DATA_HALF) data type configuration. As a workaround, the PSEUDO_HALF_CONFIG (convDesc is CUDNN_DATA_FLOAT) data type configuration can be used without losing any precision.
- For the cudnnConvolutionBiasActivationForward() function, if the input cudnnActivationMode_t is set to the enum value CUDNN_ACTIVATION_IDENTITY, then the input cudnnConvolutionFwdAlgo_t must be set to the enum value CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM.
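The required pairing can be sketched as follows. Only the two interdependent arguments are shown; the remaining setup for cudnnConvolutionBiasActivationForward() is elided, and the function name prepare_identity_activation is an illustrative assumption.

```c
#include <cudnn.h>

/* Sketch: CUDNN_ACTIVATION_IDENTITY must be paired with the
 * IMPLICIT_PRECOMP_GEMM forward algorithm. */
void prepare_identity_activation(cudnnActivationDescriptor_t activationDesc,
                                 cudnnConvolutionFwdAlgo_t *algo)
{
    cudnnSetActivationDescriptor(activationDesc, CUDNN_ACTIVATION_IDENTITY,
                                 CUDNN_NOT_PROPAGATE_NAN, 0.0);

    /* Identity activation constrains the algorithm choice. */
    *algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
}
```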
- When cudnnRNNForward* or cudnnRNNBackward* is run with FP32 input/output on sm_70 or sm_72, with the RNN descriptor's algo field set to CUDNN_RNN_ALGO_PERSIST_STATIC and the math type set to CUDNN_TENSOR_OP_MATH via cudnnSetRNNMatrixMathType(), the results are incorrect.
- When cudnnRNNForward* or cudnnRNNBackward* is run with FP32 input/output on sm_70 or sm_72, with the RNN descriptor's algo field set to CUDNN_RNN_ALGO_PERSIST_STATIC and the math type set to CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION via cudnnSetRNNMatrixMathType(), the resulting performance is suboptimal.
Fixed Issues
The following issues have been fixed in this release:
- The cudnnConvolutionBackwardData() function produced incorrect
results under these conditions:
- The algo input is set to CUDNN_CONVOLUTION_BWD_DATA_ALGO_1 in cudnnConvolutionBwdDataAlgo_t, and
- CUDNN_TENSOR_OP_MATH is selected.
Under the above conditions, the dgrad computation gave incorrect results when the data was not packed and the data format was NCHW. This is fixed.
- When cudnnConvolutionFwdAlgo_t was set to CONVOLUTION_FWD_ALGO_FFT_TILING, the function cudnnConvolutionForward() led to an illegal memory access. This is now fixed.
- cudnnPoolingBackward() was failing when a large kernel size was used for 'global_pooling' with the NHWC I/O layout. This is fixed.
- The following two issues are fixed for the case where the RNN math type is set to CUDNN_TENSOR_OP_MATH and the RNN is run on sm_6x or earlier hardware:
- a. You may have received CUDNN_STATUS_NOT_SUPPORTED when the selected algo is CUDNN_RNN_ALGO_STANDARD or CUDNN_RNN_ALGO_PERSIST_STATIC.
- b. You may have received incorrect results when the selected algo is CUDNN_RNN_ALGO_PERSIST_DYNAMIC.
- If you passed a variable-sequence-length input tensor to cudnnRNNForwardInference(), cudnnRNNForwardTraining(), or cudnnRNNBackwardData() and used CUDNN_RNN_ALGO_PERSIST_STATIC or CUDNN_RNN_ALGO_PERSIST_DYNAMIC, you may have received incorrect results. This case is now checked, and CUDNN_STATUS_NOT_SUPPORTED is returned.