cuDNN Release Notes v7.5.0

Key Features and Enhancements

The following features and enhancements have been added to this release:

  • In cudnnConvolutionForward() for 2D convolutions, for wDesc NCHW, the IMPLICIT_GEMM algorithm (algo 0) now supports the Data Type Configuration of INT8x4_CONFIG, and INT8x4_EXT_CONFIG also.
  • A new set of APIs are added to provide support for Multi-Head Attention computation. The following is a list of the new functions and data types:

    Datatypes:

    • cudnnSeqDataAxis_t
    • cudnnMultiHeadAttnWeightKind_t
    • cudnnSeqDataDescriptor_t
    • cudnnWgradMode_t
    • cudnnAttnQueryMap_t
    • cudnnAttnDescriptor_t

    Functions:

    • cudnnCreateAttnDescriptor
    • cudnnDestroyAttnDescriptor
    • cudnnSetAttnDescriptor
    • cudnnGetAttnDescriptor
    • cudnnGetMultiHeadAttnBuffers
    • cudnnGetMultiHeadAttnWeights
    • cudnnMultiHeadAttnForward
    • cudnnMultiHeadAttnBackwardData
    • cudnnMultiHeadAttnBackwardWeights
    • cudnnSetSeqDataDescriptor
    • cudnnGetSeqDataDescriptor
    • cudnnCreateSeqDataDescriptor
    • cudnnDestroySeqDataDescriptor
  • A new set of APIs for general tensor folding is introduced. The following is a list of the new functions and data types:

    Datatypes:

    • cudnnTensorTransformDescriptor_t
    • cudnnFoldingDirection_t

    Functions:

    • cudnnTransformTensorEx
    • cudnnCreateTensorTransformDescriptor
    • cudnnDestroyTensorTransformDescriptor
    • cudnnInitTransformDest
    • cudnnSetTensorTransformDescriptor
    • cudnnGetTensorTransformDescriptor
  • A new set of APIs, and enhancements for the existing APIs, are introduced for RNNs. The following is the list of the new and enhanced functions and data types:

    Datatypes:

    • cudnnRNNBiasMode_t (new)
    • cudnnRNNMode_t (enhanced)

    Functions:

    • cudnnSetRNNBiasMode (new)
    • cudnnGetRNNBiasMode (new)
    • cudnnGetRNNLinLayerBiasParams (enhanced)
  • All cudnnRNNForward/Backward* functions are enhanced to support FP16 math precision mode when both input and output are in FP16. To switch to FP16 math precision, set the mathPrec parameter in cudnnSetRNNDescriptor to CUDNN_DATA_HALF. To switch to FP32 math precision, set the mathPrec parameter in cudnnSetRNNDescriptor to CUDNN_DATA_FLOAT. This feature is only available for CUDNN_ALGO_STANDARD and for the compute capability 5.3 or higher.
  • Added support for INT8x4 and INT8x32 data type for cudnnPoolingForward. Using these will provide improved performance over scalar data type.

Fixed Issues

The following issues have been fixed in this release:

  • The mathPrec parameter in cudnnSetRNNDescriptor is reserved for controlling math precision in RNN, but was not checked or enforced. This parameter is now strictly enforced. As a result, the following applies:
    • For the input/output in FP16, the parameter mathPrec can be CUDNN_DATA_HALF or CUDNN_DATA_FLOAT.
    • For the input/output in FP32, the parameter mathPrec can only be CUDNN_DATA_FLOAT, and
    • For the input/output in FP64, double type, the parameter mathPrec can only be CUDNN_DATA_DOUBLE.
  • Users upgrading to cuDNN 7.4 may see insufficiently small values returned from the function cudnnGetConvolutionBackwardFilterWorkspaceSize () for dimensions 5 and greater, resulting in a CUDNN_STATUS_EXECUTION_FAILED error message. In cuDNN 7.4, the workaround for this issue is to calculate the workspace by using the formula below:

    Let M be the product of output tensor (gradDesc) dimensions starting at 1.
    Let N be the output tensor dimension 0.
    Let Mp = (M+31)/32
    Let Np = (N+31)/32
    W = 2 * M * N * sizeof(int) is the workspace that should be used.

    This is fixed.

  • In earlier cuDNN versions, when all the conditions below are true:
    • 3-D convolution
    • Batch size > 1
    • Algorithm is "CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1"
    • convDesc's dataType is CUDNN_DATA_HALF,

      then, calls to ​​cudnnConvolutionBackwardFilter() may produce incorrect (and non-deterministic) results. This is fixed in cuDNN 7.5.0.
  • In cuDNN 7.4.2, for some cases the 3D convolution resulted in a reduced performance on Turing GPUs, compared to the previous cuDNN releases. This is fixed.
  • For int8x32 datatype, the function cudnnSetTensor4dDescriptorEx erroneously returns CUDNN_STATUS_BAD_PARAM. Now it is fixed in cuDNN 7.5 so it no longer returns bad param.
  • In cuDNN 7.4.1 and 7.4.2, when cudnnBatchNormMode_t is set to CUDNN_BATCHNORM_SPATIAL_PERSISTENT and the input/output tensors are in NHWC format and of CUDNN_DATA_HALF datatype, then, on Windows only, the cudnnBatchNormalization*Ex functions are supported only with the device in TCC mode. See Tesla Compute Cluster Mode for Windows .

    Starting with cuDNN 7.5.0, the following checks are added for the driver mode on Windows. If on Windows and not in TCC mode:

    • The functions will fallback to a slower implementation if bnOps in the cudnnBatchNormalization*Ex function is set to CUDNN_BATCHNORM_OPS_BN.
    • If bnOps is set to CUDNN_BATCHNORM_OPS_BN_ACTIVATION, or CUDNN_BATCHNORM_OPS_BN_ADD_ACTIVATION, the CUDNN_STATUS_NOT_SUPPORTED is returned.
  • In cuDNN 7.4.2, in some cases the cudnnConvolutionBackwardData() function, when used with NHWC tensor format, resulted in the “disallowed mismatches” error. This is fixed.
  • In some cases, using cudnnConvolutionBiasActivationForward() with GroupCount() > 1 and xDesc's data type is CUDNN_DATA_HALF will produce incorrect results for all groups except the first. This is fixed.
  • When using cuDNN 7.3.1 on Quadro P4000, when calling the cudnnConvolutionForward() function with CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED algorithm, there was a small chance of seeing intermittent inaccurate results. This is fixed.
  • When cudnnConvolutionForward() is called with these settings: Datatype is CUDNN_DATA_INT8x4, Convolution is 2D, architecture is sm_61, filter size is larger than 8x8, then incorrect result and potential illegal memory access error occurs. This is fixed.
  • For sm_72 and sm_75, the function cudnnConvolutionBiasActivationForward(), when used with INT8x32, failed to run. This is fixed.
  • In the function cudnnSetRNNDataDescriptor , if API logging is turned on, the seqLengthArray field in the log may not display the correct number of array elements. This is fixed.
  • For the batchNorm functions cudnnBatchNormalization{Backward|BackwardEx|ForwardInference|ForwardTraining|ForwardTrainingEx}, the value of epsilon is required to be greater or equal to CUDNN_BN_MIN_EPSILON which was defined in the cudnn.h file to the value 1e-5. This threshold value is now lowered to 0.0 to allow a wider range of epsilon value. However, users should still choose the epsilon value carefully, since a too small a value of epsilon may cause batchNormalization to overflow when the input data's standard deviation is close to 0.
  • Some Grouped Convolutions (particularly those used in Depthwise-Separable convolutions) may return INTERNAL_ERROR if they have all inputs/outputs as NHWC-packed and do not match one of the following criteria:
    • filter_height = 1, filter_width = 1, vertical_conv_stride = 1, horizontal_conv_stride = 1
    • filter_height = 3, filter_width = 3, vertical_conv_stride = 1, horizontal_conv_stride = 1
    • filter_height = 3, filter_width = 3, vertical_conv_stride = 2, horizontal_conv_stride = 2

Known Issues

The following issues and limitations exist in this release:

  • The RNN persist-static algorithm returns incorrect results for GRU problems in backwards mode, when the hidden size is greater than 1024. Due to this, RNN persist-static algorithm is disabled in cuDNN 7.5.0. Users with such GRU problems are advised to use the standard or persist-dynamic RNN algorithms. See cudnnRNNAlgo_t(). This note applies to all previous cuDNN 7 releases.
  • The function cudnnConvolutionBackwardFilter(), when used with CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1, returns the error "Uninitialized __global__ memory read of size 4".