cuDNN Release Notes v7.2.1

Key Features and Enhancements

The following enhancements have been added to this release:

  • The following new functions are added to support the padding mask for the cudnnRNN* family of functions (see the usage sketch after this list):
    • cudnnSetRNNPaddingMode(): Enables/disables the padded RNN input/output.
    • cudnnGetRNNPaddingMode(): Reads the padding mode status.
    • cudnnCreateRNNDataDescriptor() and cudnnDestroyRNNDataDescriptor(): Creates and destroys, respectively, cudnnRNNDataDescriptor_t, an RNN data descriptor.
    • cudnnSetRNNDataDescriptor() and cudnnGetRNNDataDescriptor(): Initializes and reads, respectively, the RNN data descriptor.
    • cudnnRNNForwardTrainingEx(): An extended version of cudnnRNNForwardTraining() that allows the padded (unpacked) layout for the input/output.
    • cudnnRNNForwardInferenceEx(): An extended version of cudnnRNNForwardInference() that allows the padded (unpacked) layout for the input/output.
    • cudnnRNNBackwardDataEx(): An extended version of cudnnRNNBackwardData() that allows the padded (unpacked) layout for the input/output.
    • cudnnRNNBackwardWeightsEx(): An extended version of cudnnRNNBackwardWeights() that allows the padded (unpacked) layout for the input/output.
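
    For illustration, here is a minimal sketch of how these entry points fit together. The helper name, dimensions, and FP32 data type are placeholders; the handle and RNN descriptor are assumed to be configured already, and the remaining cudnnRNNForwardTrainingEx() arguments are elided:

      #include <cudnn.h>

      void forwardWithPadding(cudnnHandle_t handle, cudnnRNNDescriptor_t rnnDesc,
                              int maxSeqLength, int batchSize, int vectorSize,
                              const int seqLengthArray[]) {
          // Enable the padded (unpacked) input/output layout on the RNN descriptor.
          cudnnSetRNNPaddingMode(rnnDesc, CUDNN_RNN_PADDED_IO_ENABLED);

          // Describe the padded input: layout, maximum sequence length, batch size,
          // input vector size, and the length of each sequence in the batch.
          cudnnRNNDataDescriptor_t xDataDesc;
          cudnnCreateRNNDataDescriptor(&xDataDesc);
          float paddingFill = 0.0f;  // value written into the padding region
          cudnnSetRNNDataDescriptor(xDataDesc, CUDNN_DATA_FLOAT,
                                    CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED,
                                    maxSeqLength, batchSize, vectorSize,
                                    seqLengthArray, &paddingFill);

          // The *Ex entry points consume RNN data descriptors; the remaining
          // arguments (hidden/cell state, weights, workspace) follow
          // cudnnRNNForwardTraining() and are elided here:
          // cudnnRNNForwardTrainingEx(handle, rnnDesc, xDataDesc, x, ...);

          cudnnDestroyRNNDataDescriptor(xDataDesc);
      }
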
  • Added support for cell clipping in cuDNN LSTM. The following new functions are added:
    • cudnnRNNSetClip() and cudnnRNNGetClip(): Sets and retrieves, respectively, the LSTM cell clipping mode.
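
    A minimal sketch of enabling cell clipping follows. The clip bounds and helper name are arbitrary example values, and the handle and RNN descriptor are assumed to be created elsewhere:

      #include <cudnn.h>

      void enableCellClipping(cudnnHandle_t handle, cudnnRNNDescriptor_t rnnDesc) {
          // Clamp every LSTM cell-state value to the example range [-10, 10].
          cudnnRNNSetClip(handle, rnnDesc, CUDNN_RNN_CLIP_MINMAX,
                          CUDNN_NOT_PROPAGATE_NAN, /*lclip=*/-10.0, /*rclip=*/10.0);

          // Read the setting back to verify it took effect.
          cudnnRNNClipMode_t clipMode;
          cudnnNanPropagation_t clipNanOpt;
          double lclip, rclip;
          cudnnRNNGetClip(handle, rnnDesc, &clipMode, &clipNanOpt, &lclip, &rclip);
      }
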
  • Added the new data type CUDNN_DATA_INT8x32 to accelerate convolution computation when the input channel size c is a multiple of 32.
    Note: The new data type CUDNN_DATA_INT8x32 is only supported by sm_72.
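
    A minimal sketch of opting in when describing a tensor; the vectorized type is paired with the CUDNN_TENSOR_NCHW_VECT_C format, and the dimensions and helper name below are example values:

      #include <cudnn.h>

      void describeInt8x32Input(cudnnTensorDescriptor_t xDesc) {
          // Example shape: c = 64 is a multiple of 32, as the type requires.
          const int n = 1, c = 64, h = 28, w = 28;
          cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW_VECT_C,
                                     CUDNN_DATA_INT8x32, n, c, h, w);
      }
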
  • Enhanced the family of cudnnFindRNN* functions. The findIntensity input to these functions now enables the user to control the overall runtime of the RNN find algorithms by selecting a percentage of a large Cartesian product space to be searched.
  • A new mode, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION, is added to cudnnMathType_t. The computation time for FP32 tensors can be reduced by selecting this mode.
  • The functions cudnnRNNForwardInference(), cudnnRNNForwardTraining(), cudnnRNNBackwardData(), and cudnnRNNBackwardWeights() will now perform down-conversion of FP32 input/output only when CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION is set.
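
    A minimal sketch of opting in to the new mode, assuming the convolution and RNN descriptors are already created and configured elsewhere (the helper name is illustrative):

      #include <cudnn.h>

      void allowTensorOpConversion(cudnnConvolutionDescriptor_t convDesc,
                                   cudnnRNNDescriptor_t rnnDesc) {
          // Permit FP32 -> FP16 down-conversion so Tensor Cores can be used.
          cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
          cudnnSetRNNMatrixMathType(rnnDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
      }
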
  • Improved the heuristics for cudnnGet*Algorithm() functions.

Known Issues and Limitations

The following issues and limitations exist in this release:

  • For FP16 inputs, the functions cudnnGetConvolutionForwardAlgorithm(), cudnnGetConvolutionBackwardDataAlgorithm(), and cudnnGetConvolutionBackwardFilterAlgorithm() will return a slower algorithm.
  • When beta is not equal to zero and the input channel size is greater than 65535, the below cudnnConvolutionBackwardFilter() algorithms may return an EXECUTION_FAILED error:
    • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0,
    • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1, and
    • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3
  • In rare cases, when beta is not equal to zero, the function cudnnFindConvolutionBackwardFilterAlgorithm() may not return the fastest algorithm available for cudnnConvolutionBackwardFilter().
  • Grouped convolutions are not supported in the TRUE_HALF_CONFIG (convDesc is CUDNN_DATA_HALF) data type configuration. As a workaround, the PSEUDO_HALF_CONFIG (convDesc is CUDNN_DATA_FLOAT) data type configuration can be used without losing any precision; this workaround is sketched below.
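
    A minimal sketch of the workaround, assuming FP16 tensor and filter descriptors are set up elsewhere; the padding, stride, dilation, group-count values, and helper name are illustrative:

      #include <cudnn.h>

      void configureGroupedConvPseudoHalf(cudnnConvolutionDescriptor_t convDesc) {
          // PSEUDO_HALF_CONFIG: the tensors stay FP16, but the convolution
          // descriptor computes in FP32 (CUDNN_DATA_FLOAT) instead of
          // CUDNN_DATA_HALF (TRUE_HALF_CONFIG).
          cudnnSetConvolution2dDescriptor(convDesc,
                                          /*pad_h=*/1, /*pad_w=*/1,
                                          /*u=*/1, /*v=*/1,
                                          /*dilation_h=*/1, /*dilation_w=*/1,
                                          CUDNN_CROSS_CORRELATION,
                                          CUDNN_DATA_FLOAT);
          cudnnSetConvolutionGroupCount(convDesc, /*groupCount=*/2);
      }
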
  • For the cudnnConvolutionBiasActivationForward() function, if the input cudnnActivationMode_t is set to enum value CUDNN_ACTIVATION_IDENTITY, then the input cudnnConvolutionFwdAlgo_t must be set to the enum value CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM.
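
    A minimal sketch of the supported pairing; the activation descriptor is assumed to be created elsewhere, and the helper name is illustrative:

      #include <cudnn.h>

      void setupIdentityFusion(cudnnActivationDescriptor_t activationDesc,
                               cudnnConvolutionFwdAlgo_t *algo) {
          // Identity activation: cudnnConvolutionBiasActivationForward() then
          // computes conv + bias with no nonlinearity applied (coef is unused
          // by this mode).
          cudnnSetActivationDescriptor(activationDesc, CUDNN_ACTIVATION_IDENTITY,
                                       CUDNN_NOT_PROPAGATE_NAN, /*coef=*/0.0);

          // Identity activation requires this specific forward algorithm:
          *algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
      }
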
  • When the user runs cudnnRNNForward* or cudnnRNNBackward* with FP32 input/output on sm_70 or sm_72, with the RNN descriptor's algo field set to CUDNN_RNN_ALGO_PERSIST_STATIC and the math type set to CUDNN_TENSOR_OP_MATH via cudnnSetRNNMatrixMathType(), the results are incorrect.
  • When the user runs cudnnRNNForward* or cudnnRNNBackward* with FP32 input/output on sm_70 or sm_72, with the RNN descriptor's algo field set to CUDNN_RNN_ALGO_PERSIST_STATIC and the math type set to CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION via cudnnSetRNNMatrixMathType(), the resulting performance is suboptimal.

Fixed Issues

The following issues have been fixed in this release:

  • The cudnnConvolutionBackwardData() function produced incorrect results under these conditions:
    • The algo input is set to CUDNN_CONVOLUTION_BWD_DATA_ALGO_1 in cudnnConvolutionBwdDataAlgo_t, and
    • CUDNN_TENSOR_OP_MATH is selected.

      Under the above conditions, the dgrad computation gave incorrect results when the data was not packed and the data format was NCHW. This is fixed.

  • When the cudnnConvolutionFwdAlgo_t input was set to CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING, the function cudnnConvolutionForward() led to an illegal memory access. This is now fixed.

  • cudnnPoolingBackward() failed when a large kernel size was used for 'global_pooling' with the NHWC I/O layout. This is fixed.
  • The below two issues are fixed. If you set the RNN math type to CUDNN_TENSOR_OP_MATH and ran an RNN on sm_6x or earlier hardware:
    • You may have received CUDNN_STATUS_NOT_SUPPORTED when the selected algo was CUDNN_RNN_ALGO_STANDARD or CUDNN_RNN_ALGO_PERSIST_STATIC.
    • You may have received incorrect results when the selected algo was CUDNN_RNN_ALGO_PERSIST_DYNAMIC.
  • If you passed a variable-sequence-length input tensor to cudnnRNNForwardInference(), cudnnRNNForwardTraining(), or cudnnRNNBackwardData() and used CUDNN_RNN_ALGO_PERSIST_STATIC or CUDNN_RNN_ALGO_PERSIST_DYNAMIC, you may have received incorrect results. This case is now checked, and CUDNN_STATUS_NOT_SUPPORTED is returned.