cuDNN Release 8.x.x

cuDNN Release 8.2.4

These are the cuDNN 8.2.4 release notes. This release includes fixes from the previous cuDNN v8.1.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.x.x.

Known Issues

  • Users of the static library requiring best possible convolution performance should use whole-archive linking. This will come at a cost to binary size which will require resolution in future releases, either through static sub-libraries or relaxing the whole-archive linkage requirement altogether.
  • Some convolution models are experiencing lower performance on RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • Compared to version 8.0.5, legacy convolution APIs have increased CPU computational costs. On x86, this has been measured to be as high as 10 microseconds.
  • cudnnAddTensor() does not support all tensor shapes even though the cuDNN documentation says otherwise.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that fall outside the dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • Compared to cuDNN 7.6.5, there are known performance regressions on various convolutional models using INT8 data types on NVIDIA Volta GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, etc., has not yet been documented.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • The documentation for cudnnReorderFilterAndBias() needs some corrections for clarity.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • NVIDIA Turing users of cuDNN can observe intermittent illegal memory access errors for some convolution workloads.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • In a multi-GPU setting, with complex scheduling, cuDNN can segfault. It is not clear that this is a cuDNN issue, but the issue is under active investigation so that the known limitations section of this document can be updated in a future release.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • There is a known 60% performance regression for ResNet-50 on the GTX 1080 when run using fp16 data with large batch sizes (over 128).
  • cuDNN may contain a small memory leak related to the usage of dlopen() within the library; this is not confirmed but currently under investigation. The possible leak does not affect Windows users or users of the static library.
  • On Pascal and Maxwell architectures, users of cuDNN's 8.0 backend engine 34 for forward convolution can witness illegal memory access when this engine is specifically selected outside of heuristic query. Heuristics users of these architectures will not witness this issue, as happened in previous versions. The possibility of the illegal memory access affects all previous versions of cuDNN 8.0 and will be fixed in a future release.
  • When using the cuDNN CTC Loss API function, the computed gradients array is not zero initialized. When sequence lengths are exceeded, some gradient entries are returned uninitialized.
  • The internal CUDA streams inside cuDNN 8.3.0 will have the same priority as the user stream that is set by cudnnSetStream(), instead of the default priority. The exception is a user stream with the highest priority, in which case the internal streams will have priority (highest - 1). This is true only when the user stream is NOT in capture mode (cudaStreamCaptureStatusActive); otherwise the behavior does not change.
  • Fusion engine operation mode 16 is not currently supported on Windows; this will be fixed in a future release.
  • The functional support criteria of cuDNN's convolution kernels are not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when the forward convolution spatial dimensions are less than the filter size and the padding is nonzero and sufficient to extend the spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
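
The padding case in the last item above can be detected ahead of time. The following is a minimal sketch, not part of cuDNN: the helper names are hypothetical, padding is assumed to be symmetric, and the second helper ignores dilation. It flags shapes whose unpadded spatial size is smaller than the filter but whose padded size is not, which is the situation where a kernel may report the problem as unsupported even though the convolution is mathematically valid.

```python
def conv_output_extent(in_size, filter_size, pad, stride=1, dilation=1):
    """Convolution output size along one spatial axis (standard formula)."""
    effective_filter = dilation * (filter_size - 1) + 1
    return (in_size + 2 * pad - effective_filter) // stride + 1


def relies_on_padding(in_size, filter_size, pad):
    """True when the unpadded input is smaller than the filter, but symmetric
    padding (an assumption of this sketch) extends it to or beyond the filter."""
    return in_size < filter_size and pad > 0 and in_size + 2 * pad >= filter_size


def shape_relies_on_padding(in_hw, filter_hw, pad_hw):
    """Apply the check to each spatial axis of a 2D problem."""
    return any(relies_on_padding(i, f, p)
               for i, f, p in zip(in_hw, filter_hw, pad_hw))
```

For example, a 2x2 input with a 3x3 filter and padding of 1 still has a positive output extent along each axis, yet it is exactly the kind of shape this known issue describes, so an application might route it to a different engine or enlarge the input explicitly.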

Limitations

  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location, otherwise the samples can crash.
  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section in the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of the cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, i.e., we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at minimum 4-byte alignment in the cuDNN library, including for INT8 data.
  • Some computational options in cuDNN require increased tensor alignment in order to perform well. As always, cuDNN recommends that users align tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
  • On K80 GPUs, when cudnnConvolutionForward() is used with the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types, a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory which might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users could opt out of texture loading using the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who wish to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 which is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library prior to opening graph capture.
  • Starting in cuDNN version 8.1.0, we are no longer shipping the libfreeimg static library with the MNIST sample. Users can follow the instructions in the readme.txt file to download and compile the library separately and link with the MNIST sample.
  • For pre-Volta devices, users should align all buffers to at least 4 bytes; this applies to half-precision data as well.
  • Users of cuDNN need to add the cuBLAS dependencies to the linker command explicitly to resolve the undefined symbols from the cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
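
For the RNN and multi-head attention determinism limitation above, the following minimal sketch shows one way a Python launcher could pin cuBLAS to a single buffer size. The helper names are hypothetical. The sketch assumes the variable is read when cuBLAS initializes, so it must be set before the process creates its first cuBLAS handle, and it assumes the format is a sequence of :size_in_KiB:count pairs, as the example values in the text suggest.

```python
import os


def parse_workspace_config(value):
    """Split a string such as ':16:8:4096:2' into (size_kib, count) pairs."""
    fields = [field for field in value.split(":") if field]
    if not fields or len(fields) % 2 != 0:
        raise ValueError(f"expected :size:count pairs, got {value!r}")
    return [(int(fields[i]), int(fields[i + 1]))
            for i in range(0, len(fields), 2)]


def has_single_buffer_size(value):
    """True when every buffer in the configuration has the same size."""
    return len({size for size, _ in parse_workspace_config(value)}) == 1


def pin_single_buffer_size(config=":4096:2"):
    """Export a single-size configuration for libraries loaded later on."""
    if not has_single_buffer_size(config):
        raise ValueError(f"{config!r} mixes buffer sizes")
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = config


pin_single_buffer_size(":4096:2")  # two buffers of 4 MB each
```

Under these assumptions, the default :16:8:4096:2 parses to two distinct sizes and is rejected by pin_single_buffer_size(), while :16:8 and :4096:2 are accepted.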

cuDNN Release 8.2.2

These are the cuDNN 8.2.2 release notes. This release includes fixes from the previous cuDNN v8.1.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Experimental runtime fusion heuristics are now supported to facilitate an intelligent and efficient heuristic recommendation based on the predicted execution time for the runtime fusion engine. The current coverage is limited to fusion patterns involving a convolution forward operation in FP16 mixed precision configuration on NVIDIA Ampere GPUs. We will continue to expand the support and improve the prediction accuracy in future releases.
  • The cuDNN runtime fusion now supports pure pointwise fusion or pointwise plus reduction fusion. It supports FP16/FP32 as IO type and FP32 as compute type.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.x.x.

Fixed Issues

The following issues have been fixed in this release:
  • For platforms that ship a compiler version older than GCC 6 by default, linking to static cuDNN using the default compiler is not supported.
  • There was a 15% performance regression for inference on the PyTorch WaveGlow model on the NVIDIA Turing architecture. This regression has been fixed.
  • The convolve_common_engine_int8_NHWC kernel had an undesired FP32-to-INT32 truncation before outputting the FP32 result directly. This issue has been fixed in this release.
  • In previous cuDNN versions, cudnnRNNBackwardData(), cudnnRNNBackwardDataEx(), or cudnnRNNBackwardData_v8() could return CUDNN_STATUS_INTERNAL_ERROR when CUDNN_RNN_ALGO_PERSIST_STATIC and CUDNN_LSTM were selected. This issue occurred mainly on smaller GPUs, such as Turing with 36 or 48 SMs and smaller hiddenSize values. This issue has been fixed in this release.
  • NVIDIA Turing GTX 16xx users of cuDNN would observe invalid values in convolution output. This issue has been fixed in this release.
  • Users would experience NCHW transformations causing a floating point exception and the CPU reference code producing incorrect results for tensor format CUDNN_TENSOR_NCHW_VECT_C. Corner cases in the convolution sample code have been fixed in this release.

Known Issues

  • Users of the static library requiring best possible convolution performance should use whole-archive linking. This will come at a cost to binary size which will require resolution in future releases, either through static sub-libraries or relaxing the whole-archive linkage requirement altogether.
  • Some convolution models are experiencing lower performance on RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • Compared to version 8.0.5, legacy convolution APIs have increased CPU computational costs. On x86, this has been measured to be as high as 10 microseconds.
  • cudnnAddTensor() does not support all tensor shapes even though the cuDNN documentation says otherwise.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that fall outside the dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • Compared to cuDNN 7.6.5, there are known performance regressions on various convolutional models using INT8 data types on NVIDIA Volta GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, etc., has not yet been documented.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • The documentation for cudnnReorderFilterAndBias() needs some corrections for clarity.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • NVIDIA Turing users of cuDNN can observe intermittent illegal memory access errors for some convolution workloads.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • In a multi-GPU setting, with complex scheduling, cuDNN can segfault. It is not clear that this is a cuDNN issue, but the issue is under active investigation so that the known limitations section of this document can be updated in a future release.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • There is a known 60% performance regression for ResNet-50 on the GTX 1080 when run using fp16 data with large batch sizes (over 128).
  • cuDNN may contain a small memory leak related to the usage of dlopen() within the library; this is not confirmed but currently under investigation. The possible leak does not affect Windows users or users of the static library.
  • On Pascal and Maxwell architectures, users of cuDNN's 8.0 backend engine 34 for forward convolution can witness illegal memory access; this affects all previous versions of cuDNN 8.0 and will be fixed in a future release.
  • When using the cuDNN CTC Loss API function, the computed gradients array is not zero initialized. When sequence lengths are exceeded, some gradient entries are returned uninitialized.

Limitations

  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location, otherwise the samples can crash.
  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section in the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of the cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, i.e., we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at minimum 4-byte alignment in the cuDNN library, including for INT8 data.
  • Some computational options in cuDNN require increased tensor alignment in order to perform well. As always, cuDNN recommends that users align tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
  • On K80 GPUs, when cudnnConvolutionForward() is used with the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types, a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory which might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users could opt out of texture loading using the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who wish to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 which is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library prior to opening graph capture.
  • Starting in cuDNN version 8.1.0, we are no longer shipping the libfreeimg static library with the MNIST sample. Users can follow the instructions in the readme.txt file to download and compile the library separately and link with the MNIST sample.
  • For pre-Volta devices, users should align all buffers to at least 4 bytes; this applies to half-precision data as well.
  • Users of cuDNN need to add the cuBLAS dependencies to the linker command explicitly to resolve the undefined symbols from the cuDNN static libraries.
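
The INT8x32 Tensor Core limitation above is easier to read as a pair of predicates. This is a minimal sketch with hypothetical function names. It keeps the text's pairing of W with R and H with S, and it reads the v8.0.x rule as the v7.6 rule with the equality cases removed, which makes both comparisons strict.

```python
def int8x32_supported_v76(w, h, r, s, dilation_w=1, dilation_h=1):
    """Condition quoted for cuDNN v7.6: both bounds hold, equality allowed."""
    return w >= (r - 1) * dilation_w and h >= (s - 1) * dilation_h


def int8x32_supported_v80(w, h, r, s, dilation_w=1, dilation_h=1):
    """Reading of the v8.0.x rule: the same bounds, but equality on either
    axis is no longer supported, so both comparisons become strict."""
    return w > (r - 1) * dilation_w and h > (s - 1) * dilation_h


def lost_support_in_v80(w, h, r, s, dilation_w=1, dilation_h=1):
    """Shapes accepted under the v7.6 condition but not the v8.0.x one."""
    args = (w, h, r, s, dilation_w, dilation_h)
    return int8x32_supported_v76(*args) and not int8x32_supported_v80(*args)
```

For instance, with W = H = 4, R = S = 3, and a dilation of 2 on both axes, (R-1) * dilationW equals 4, so the shape satisfies the v7.6 condition only through equality and is reported by lost_support_in_v80().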

Deprecated Features

The following features are deprecated in cuDNN 8.2.2:
  • Support for Ubuntu 16.04 has been deprecated in cuDNN 8.2.2 for CUDA 11.4. For a list of supported operating systems, refer to the cuDNN Support Matrix.
  • Support for RHEL7 for ppc64le has been deprecated in cuDNN 8.2.2 for CUDA 11.4. For a list of supported operating systems, refer to the cuDNN Support Matrix.

cuDNN Release 8.2.1

These are the cuDNN 8.2.1 release notes. This release includes fixes from the previous cuDNN v8.1.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • The cuDNN runtime fusion engine now supports generating Tensor Core kernels with input tensors of:
    • Bfloat16 type and compute precision of FP32 (requires compute capability 8.0 or above). For Bfloat16 support, convolution input/output channels are required to be a multiple of 8.
    • INT8 type in NHWC layout and compute precision of INT32 (requires compute capability 7.5 or above). For INT8 support, the convolution input/output channels are required to be a multiple of 16, and unlike the NCHW_VECT_C kernels, filter and bias reordering is not required.

    In the fused pointwise/reduction operations, FP32 is the compute precision supported.

  • The cuDNN runtime fusion engine has added experimental Tensor Core kernel generation support for NVIDIA Volta (compute capability 7.0) and NVIDIA Xavier (compute capability 7.2). The supported input tensor data type is FP16, compute precision is FP32, and the supported layout is NHWC. However, reduction fusion is not yet supported and we are working on further generalizing the support.
  • Equations in the documentation are now supported in Chrome.
  • The backend API now supports fused convolution-scale-bias-activation with per-channel-scaling by matching the operation graph.
  • cudnnPoolingBackward() allows both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.
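
The channel and compute capability requirements listed above for Bfloat16 and INT8 runtime fusion kernels can be checked before building an operation graph. The sketch below is illustrative only: the table and function names are hypothetical, the values are transcribed from the two sub-items above, and the NHWC layout requirement for INT8 is not checked.

```python
# (minimum compute capability, required channel multiple) per input type,
# transcribed from the release note text above.
FUSION_REQUIREMENTS = {
    "bfloat16": ((8, 0), 8),
    "int8": ((7, 5), 16),
}


def meets_fusion_requirements(dtype, compute_capability, in_channels, out_channels):
    """Check one convolution against the requirements for its input type.

    compute_capability is a (major, minor) pair, for example (8, 0).
    """
    min_capability, multiple = FUSION_REQUIREMENTS[dtype]
    return (tuple(compute_capability) >= min_capability
            and in_channels % multiple == 0
            and out_channels % multiple == 0)
```

A caller could use meets_fusion_requirements("int8", (7, 5), 32, 48) as a cheap pre-check and pad the channel count up to the next multiple when it fails.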

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.x.x.

Fixed Issues

The following issues have been fixed in this release:
  • In some cases, NVIDIA Ampere users of cuDNN 8.1 calling cudnnGetConvolutionBackwardFilterAlgorithm_v7() could receive a workspace size that was insufficient for the computation with cudnnConvolutionBackwardFilter(). This issue has been fixed in this release.
  • Many convolution models were experiencing lower performance on RTX 3090 compared to 2080 Ti. This included ResNet-50 with up to 2x performance difference and ResNeXt up to 10x performance difference. Many of these performance issues have been fixed in this release.
  • Compared to cuDNN version 8.0.5, there was a known 8% performance regression on the SSD ResNet-50 model on the NVIDIA Ampere architecture. This issue has been fixed in this release.
  • L4T users of cuDNN could observe CUDNN_STATUS_EXECUTION_FAILED errors in some cases when performing convolutions using CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM. This issue has been fixed in this release.
  • In cuDNN 8.2.0, if the user runs a Bi-directional RNN network with dropout enabled, the user may see non-deterministic outputs. This issue has been fixed in 8.2.1.
  • There was a known 18% performance regression for inference on the PyTorch ResNet-50 v1.5 model on the NVIDIA Turing architecture. This issue has been fixed in this release.
  • Known regressions in the algorithm selection heuristics for certain layers in cuDNN 8 have been fixed on Volta and Pascal platforms.
  • In older versions of cuDNN, when calling the API cudnnSetDropoutDescriptor(), a kernel launched by this API used to require a substantial amount of GPU memory for the stack. The memory is released when the kernel finishes and the stack size is changed back in a way that’s not thread safe. Starting in the 8.2.1 release, the extra memory is no longer required, and as a result, the thread safety concern is no longer present.
  • In cuDNN 8.1.1, compared to cuDNN 8.1.0, there was a known regression in performance of the runtime fusion engine for convolution fused with ReLU in the epilog. This was caused by the generalized support for parameterized ReLU. This issue has been fixed since the 8.2.0 release.
  • Since cuDNN 8.0.4 until 8.2.0, certain SKUs of V100 GPU may encounter CUDNN_STATUS_EXECUTION_FAILED status or unspecified launch failure in a subsequent call to cudaDeviceSynchronize() when running RNN with cell mode of CUDNN_LSTM and CUDNN_RNN_ALGO_PERSIST_STATIC algorithm. This issue has been fixed in this release.
  • Between cuDNN 8.1.0 and 8.2.0, if the user runs cudnnRNN*() API under CUDA compute sanitizer with CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H algorithm, users may see errors like Invalid __global__ read reported by the CUDA compute sanitizer. This issue has been fixed in this release.
  • Compared to cuDNN 8.0.0 preview, there was a known ~12% performance regression on vgg16 when run on Jetson Nano and TX2. This issue has been fixed in this release.
  • Compared to cuDNN 7.6, there was a significant performance regression on Darknet when run on Jetson Nano. This issue has been fixed in this release.

Known Issues

  • Users of the static library requiring best possible convolution performance should use whole-archive linking. This will come at a cost to binary size which will require resolution in future releases, either through static sub-libraries or relaxing the whole-archive linkage requirement altogether.
  • Some convolution models are experiencing lower performance on RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • Compared to version 8.0.5, legacy convolution APIs have increased CPU computational costs. On x86, this has been measured to be as high as 10 microseconds.
  • cudnnAddTensor() does not support all tensor shapes even though the cuDNN documentation says otherwise.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that fall outside the dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • Compared to cuDNN 7.6.5, there are known performance regressions on various convolutional models using INT8 data types on NVIDIA Volta GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior and accumulator data types, has not yet been documented.
  • It is possible, starting in cuDNN v7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 15% performance regression for inference on the PyTorch WaveGlow model on the NVIDIA Turing architecture.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • The documentation for cudnnReorderFilterAndBias() needs some corrections for clarity.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • NVIDIA Turing GTX 16xx users of cuDNN can observe invalid values in convolution output.
  • NVIDIA Turing users of cuDNN can observe intermittent illegal memory access errors for some convolution workloads.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • In a multi-GPU setting, with complex scheduling, cuDNN can segfault. It is not clear that this is a cuDNN issue, but the issue is under active investigation so that the known limitations section of this document can be updated in a future release.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • There is a known 60% performance regression for ResNet-50 on the GTX 1080 when run using fp16 data with large batch sizes (over 128).

Limitations

  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location, otherwise the samples can crash.
  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section in the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by the two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of the cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run performantly. As always, cuDNN recommends that users align tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
  • On K80 GPUs, when cudnnConvolutionForward() is used with the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types, a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory which might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users can opt out of using texture with the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who wish to continue to use texture-based loads can adopt the new backend API and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 which is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library prior to opening graph capture.
  • Starting in cuDNN version 8.1.0, we are no longer shipping the libfreeimg static library with the MNIST sample. Users can follow the instructions in the readme.txt file to download and compile the library separately and link with the MNIST sample.
  • For pre-Volta devices, users should align all buffers to at least 4 bytes; this applies to half-precision data as well.
  • Users of cuDNN need to add the cuBLAS dependencies to the linker command explicitly to resolve the undefined symbols from the cuDNN static libraries.
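
The CUBLAS_WORKSPACE_CONFIG values discussed in the RNN and multi-head attention limitation above follow a :SIZE:COUNT pattern, with SIZE given in KB. The following Python sketch parses such a value and reports whether it uses a single buffer size. The helper names are hypothetical and are not part of cuDNN or cuBLAS, and the sketch assumes the variable must be present in the environment before the process initializes cuBLAS.

```python
import os


def parse_workspace_config(value):
    """Parse a CUBLAS_WORKSPACE_CONFIG string such as ':16:8:4096:2'.

    Returns a list of (size_kb, count) pairs.
    """
    fields = value.split(":")
    # A well-formed value starts with ':', so the first field is empty and
    # the remaining fields come in (size, count) pairs.
    if len(fields) < 3 or fields[0] != "" or len(fields) % 2 == 0:
        raise ValueError(f"malformed workspace config: {value!r}")
    numbers = [int(field) for field in fields[1:]]
    return list(zip(numbers[0::2], numbers[1::2]))


def uses_single_buffer_size(value):
    """True when every buffer in the configuration has the same size."""
    sizes = {size for size, _ in parse_workspace_config(value)}
    return len(sizes) == 1


def total_workspace_kb(value):
    """Total memory, in KB, requested by the configuration."""
    return sum(size * count for size, count in parse_workspace_config(value))


def configure_single_size_workspace(config=":4096:2", environ=os.environ):
    """Store a single-size configuration in the given environment mapping.

    Assumption: this runs before anything in the process initializes cuBLAS.
    """
    if not uses_single_buffer_size(config):
        raise ValueError("configuration mixes buffer sizes")
    environ["CUBLAS_WORKSPACE_CONFIG"] = config
    return config
```

For example, uses_single_buffer_size(":16:8:4096:2") is False for the default configuration named above, while both :16:8 (128 KB in total) and :4096:2 (8192 KB in total) pass the check.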

Deprecated Features

The following features are deprecated in cuDNN 8.2.1:
  • Support for Ubuntu 16.04 will be deprecated in cuDNN 8.2.2 for CUDA 11.4. For a list of supported operating systems, refer to the cuDNN Support Matrix.
  • Support for RHEL7 for ppc64le will be deprecated in cuDNN 8.2.2 for CUDA 11.4. For a list of supported operating systems, refer to the cuDNN Support Matrix.

cuDNN Release 8.2.0

These are the cuDNN 8.2.0 release notes. This release includes fixes from the previous cuDNN v8.1.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Convolution with the backend API now supports tensor with more than 2**31 elements. The size and stride of each tensor dimension are still limited to 32 bit values.
  • Convolution Heuristics Generalization has been improved for several GPUs. These improvements can be observed in the legacy API and version 8 API. In the version 8 API, these improvements are available in both CUDNN_HEUR_MODE_INSTANT and CUDNN_HEUR_MODE_B.
  • The cuDNN runtime fusion engine now supports generating Tensor Core based fusion kernels in the following scenarios:
    • When there is a scale+bias+relu pattern in the graph fused to the x input of a convolution forward operation.
    • When the graph contains 3D convolution forward, backward data, or backward filter operation.
    • When the graph contains a convolution backward data operation with non-unit convolution strides.

    We are working on further generalization of this support.

  • cuDNN C++ frontend has released the 0.2 version which adds more general support to activation forward and backward operations, matMul operation, and contains various bug fixes and clean ups. A few runtime fusion samples have also been added. For more information, refer to GitHub: cuDNN frontend.
  • The new RNN ALGO_STANDARD implementation and heuristics tuning provide a significant speed-up (up to 100%), especially when the overall problem size is small (hidden size, batch size, and the number of timesteps).
  • The RNN dropout implementation has been heavily optimized. The new implementation brings significant speed-up to all RNN algorithms when dropout is enabled.
  • cuDNN RNN has moved to calling cuBLASLt on newer architectures (compute capability >= 7.0). As a result, the CUBLAS_WORKSPACE_CONFIG workaround for cuBLAS non-deterministic behavior is no longer needed on those architectures. In addition, under repeated CUDA graph capture, cuBLASLt no longer allocates workspace repeatedly like cuBLAS.
  • In the cuDNN v8 backend API, a new CUDNN_ATTR_ENGINE_BEHAVIOR_NOTE attribute has been added. Users can query the engine behaviors via this attribute similar to the numerical behaviors queried through the CUDNN_ATTR_ENGINE_NUMERICAL_NOTE attribute. Currently, the engine behavior note only shows whether an engine does runtime compilation or not. More behaviors may be added in future releases.
  • cuDNN API logging for the v8 backend API has been significantly improved. More detailed information can now be printed from the backend data structures, for example, tensors, operations, engines, and execution plans. We hope this can improve the development and debugging experience of cuDNN. Refer to this link for more instructions on how to enable API logging.
  • cuDNN now supports SWISH activation in both forward and backward directions. It can be configured for use with cudnnActivationForward() and cudnnActivationBackward() by using the enumerant CUDNN_ACTIVATION_SWISH with cudnnSetActivationDescriptor(). The SWISH activation's parameter, commonly known as beta, can be set using the newly added API function cudnnSetActivationDescriptorSwishBeta() and queried with cudnnGetActivationDescriptorSwishBeta().
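
To illustrate the role of the beta parameter in the SWISH activation, the following Python sketch evaluates the commonly used definition swish(x) = x * sigmoid(beta * x) and its derivative on the host. This is a reference formula only, not cuDNN code. The function names are hypothetical, and it is an assumption that the cuDNN kernels follow this exact definition.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x, beta=1.0):
    """Forward pass: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def swish_grad(x, beta=1.0):
    """Derivative of swish with respect to x.

    d/dx [x * s(beta*x)] = s + beta * x * s * (1 - s), with s = sigmoid(beta*x).
    """
    s = sigmoid(beta * x)
    return s + beta * x * s * (1.0 - s)
```

A backward pass would multiply swish_grad(x, beta) by the incoming gradient. With this definition, larger values of beta move the function closer to ReLU, and beta = 0 reduces it to x / 2.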

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.x.x.

Fixed Issues

The following issues have been fixed in this release:
  • There was a performance regression in certain use cases comparing RTX 3090 using cuDNN version 8.x to RTX 2080 Ti using cuDNN version 7.x. This regression has been fixed in this release.
  • There was a performance regression in the runtime engine for convolution fused with ReLU in the epilog in cuDNN 8.1.1. This regression has been fixed in this release.
  • Compared to cuDNN version 7.6.5, there was a performance regression in certain grouped ConvolutionBackwardFilter cases on the NVIDIA Volta GPU architecture. This regression has been fixed in this release.
  • CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT returned an internal error when the number of channels in the filter was greater than or equal to 65536. This issue has been fixed in this release.
  • Compared to cuDNN version 8.0.2, there was a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case. This issue has been fixed in this release.
  • Although the overall cuDNN library size has improved in cuDNN 8.1.0 with CUDA Toolkit 11.2 and greater, as compared to cuDNN 8.0.x, the cuDNN library remains large. We have attempted to moderate the severity of this issue in this release.
  • Many convolution models were experiencing lower performance on RTX 3090 compared to 2080 Ti. This included ResNet-50 with up to 2x performance difference, ResNeXt up to 10x performance difference and U-Net up to 3x performance difference. The performance issues have been fixed in this release.
  • The ResNet-50 native FP32 inference issues have been fixed on Volta, Turing, and NVIDIA Ampere GPU architectures.
  • We’ve eliminated anonymous structs in cuDNN public headers cudnn_cnn_infer.h, cudnn_cnn_train.h, and cudnn_ops_infer.h to allow forward struct declarations. The following five typedef-s were updated: cudnnConvolutionFwdAlgoPerf_t, cudnnConvolutionBwdDataAlgoPerf_t, cudnnConvolutionBwdFilterAlgoPerf_t, cudnnAlgorithm_t, and cudnnDebug_t.
  • cudnnActivationForward() could generate illegal memory access errors for tensors of more than 2**30 elements in the previous version of cuDNN 8. This issue has been fixed in this release.
  • In previous releases, cudnnRNNBackwardWeights(), cudnnRNNBackwardWeightsEx(), and cudnnRNNBackwardWeights_v8() may generate wrong and non-deterministic results when dropout is enabled. A stream dependency issue has been fixed in the current release so users will no longer observe this issue.
  • The heuristics in cudnnConvolutionBackwardFilter() have been improved for generalized cases. We have observed several convolution cases with up to ~100x performance improvements compared to cuDNN version 8.1.
  • Compared to cuDNN 8.0.4, there was a known ~6% performance regression on ONNX-WaveGlow when run on Titan RTX. This issue has been fixed in this release.
  • Compared to cuDNN 7.6, there were known performance regressions up to 2x on select configurations for AlexNet-like models on Turing GPUs. This issue has been fixed in this release.

Known Issues

  • Users of the static library requiring best possible convolution performance should use whole-archive linking. This will come at a cost to binary size which will require resolution in future releases, either through static sub-libraries or relaxing the whole-archive linkage requirement altogether.
  • Compared to version 8.0.5, legacy convolution APIs have increased CPU computational costs. On x86, this has been measured to be as high as 10 microseconds.
  • cudnnAddTensor() does not support all broadcast-able tensor shapes even though the cuDNN documentation says otherwise.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Compared to cuDNN 8.0.0 Preview, there is a known ~12% performance regression on vgg16 when run on Jetson Nano & TX2.
  • Compared to cuDNN 8.0.4, there is a known ~6% performance regression on ONNX-WaveGlow when run on Titan RTX.
  • Compared to cuDNN 7.6, there is a significant performance regression on Darknet when run on Jetson Nano.
  • Convolutions (ConvolutionForward, ConvolutionBackwardData and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on Turing GPUs.
  • L4T users of cuDNN may observe CUDNN_STATUS_EXECUTION_FAILED errors in some cases when performing convolutions using CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM. This issue is being investigated.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • Compared to cuDNN 7.6.5, there are known performance regressions on various convolutional models using INT8 data types on NVIDIA Volta GPUs.
  • Compared to cuDNN 8.1.0, there is a known regression in performance of the runtime fusion engine for convolution fused with ReLU in the epilog. This is caused by the generalized support for parameterized ReLU. Further optimizations are being worked on.
  • The numeric behavior of INT8 operations, including saturation behavior and accumulator data types, has not yet been documented. This is being worked on and will be resolved in a future release.
  • It is possible, starting in cuDNN v7.6, to leak memory when computing common convolution operations in rare cases.
  • There is a known 15% performance regression for inference on the PyTorch WaveGlow model on the NVIDIA Turing architecture.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • There is a known 18% performance regression for inference on the PyTorch ResNet-50 v1.5 model on the NVIDIA Turing architecture.
  • The documentation for cudnnReorderFilterAndBias() needs some corrections for clarity. This will be fixed in a future release.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.

Limitations

  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location, otherwise the samples can crash.
  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section in the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by the two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of the cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run performantly. As always, cuDNN recommends that users align tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
  • On K80 GPUs, when cudnnConvolutionForward() is used with the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types, a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory which might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users can opt out of using texture with the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who wish to continue to use texture-based loads can adopt the new backend API and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 which is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library prior to opening graph capture.
  • Starting in cuDNN version 8.1.0, we are no longer shipping the libfreeimg static library with the MNIST sample. Users can follow the instructions in the readme.txt file to download and compile the library separately and link with the MNIST sample.
  • For pre-Volta devices, users should align all buffers to at least 4 bytes; this applies to half-precision data as well.
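
Several of the limitations above reduce to simple arithmetic that an application can check on the host before calling cuDNN. The following Python sketch covers three of them: the 16-byte alignment recommendation, the INT8x32 Tensor Core dimension constraint, and the 2^29 input-size limit for the convolution forward engine with global index 1. All function and constant names are hypothetical and are not part of the cuDNN API. The INT8x32 check pairs W with R and H with S exactly as written in the limitation, and assumes that excluding the equality cases from the v7.6 condition yields the cuDNN v8.0.x condition.

```python
MAX_CHW_ENGINE_1 = 2 ** 29  # 536,870,912, the limit quoted in the limitation


def align_up(offset, boundary=16):
    """Round a byte offset up to the next multiple of `boundary` (a power of two)."""
    return (offset + boundary - 1) & ~(boundary - 1)


def is_aligned(address, boundary=16):
    """True when `address` is a multiple of `boundary`."""
    return address % boundary == 0


def plan_offsets(sizes, boundary=16):
    """Place buffers of the given byte sizes back to back on aligned offsets."""
    offsets, cursor = [], 0
    for size in sizes:
        cursor = align_up(cursor, boundary)
        offsets.append(cursor)
        cursor += size
    return offsets


def int8x32_dims_supported(h, w, r, s, dilation_h=1, dilation_w=1):
    """Strict form of the v7.6 condition, with the equality cases excluded."""
    return w > (r - 1) * dilation_w and h > (s - 1) * dilation_h


def engine1_input_supported(channels, height, width):
    """The limitation applies when channels * height * width exceeds 2^29."""
    return channels * height * width <= MAX_CHW_ENGINE_1
```

For example, plan_offsets([10, 100, 3]) returns [0, 16, 128]. The offsets are only useful if the base pointer of the allocation is itself aligned to at least 16 bytes.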

cuDNN Release 8.1.1

These are the cuDNN 8.1.1 release notes. This release includes fixes from the previous cuDNN v8.0.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • The runtime fusion engine now supports the canonical NCHW/KCRS/NKPQ format for describing a tensor, in addition to the version 8 format that has the explicit group dimension NGCHW/GKCRS/NGKPQ that is already supported.
  • The runtime fusion engine now supports NVIDIA Ampere architecture cards with compute capability 86 (i.e. GA10x) in addition to compute capability 80 (GA100) and compute capability 75 (Tu10x).
  • The runtime fusion engine fully supports fusing configurable ReLU (a generalization of ReLU, clipped ReLU, and leaky ReLU), Tanh, Sigmoid, configurable Elu, Gelu, configurable Softplus and configurable Swish forward and backward activations into the epilog of a convolution forward, a convolution backward data or a matrix multiplication operation.
  • The runtime fusion engine now fully supports [N, H, W, C] to [1, 1, 1, C] reduction and [N, H, W, C] to [N, H, W, 1] reduction on the output of a convolution forward or a convolution backward data operation, and [1, M, N] to [1, M, 1] and [1, M, N] to [1, 1, N] reduction in matrix multiplication operations. For the convolution backward filter operation, [N, H, W, C] to [1, H, W, C] reduction and [N, H, W, C] to [N, 1, 1, 1] reduction are supported. The supported reduction operators are CUDNN_REDUCE_TENSOR_ADD, CUDNN_REDUCE_TENSOR_MIN, and CUDNN_REDUCE_TENSOR_MAX.
  • API logging in cudnnBackendExecute() has been greatly improved to print out the internal information in descriptors like operation graphs, engines, and execution plans.
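
The reduction shapes listed above can be read as "every output dimension is either kept or collapsed to 1". The following Python sketch shows that semantics on the host for a flat, row-major tensor. It is a reference illustration only, with a hypothetical function name, and it says nothing about how the runtime fusion engine implements the reduction. Python's operator.add, min, and max stand in for the ADD, MIN, and MAX reduction operators.

```python
import operator
from itertools import product


def reduce_tensor(data, in_shape, out_shape, op=operator.add):
    """Reduce a flat row-major tensor from in_shape to out_shape.

    Each output dimension must be 1 (reduced) or equal to the input dimension.
    """
    if len(in_shape) != len(out_shape):
        raise ValueError("shapes must have the same rank")
    for in_dim, out_dim in zip(in_shape, out_shape):
        if out_dim not in (1, in_dim):
            raise ValueError("output dims must be 1 or match the input")

    def flat_index(index, shape):
        # Row-major linear offset of a multi-dimensional index.
        offset = 0
        for idx, dim in zip(index, shape):
            offset = offset * dim + idx
        return offset

    out_size = 1
    for dim in out_shape:
        out_size *= dim
    out = [None] * out_size

    for index in product(*(range(dim) for dim in in_shape)):
        value = data[flat_index(index, in_shape)]
        # Reduced dimensions always map to position 0 in the output.
        out_index = tuple(0 if o == 1 else i for i, o in zip(index, out_shape))
        k = flat_index(out_index, out_shape)
        out[k] = value if out[k] is None else op(out[k], value)
    return out
```

For a [1, 2, 2, 3] tensor holding the values 0 through 11, the [N, H, W, C] to [1, 1, 1, C] reduction with addition gives [18, 22, 26], and the [N, H, W, C] to [N, H, W, 1] reduction with max gives [2, 5, 8, 11].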

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.x.x.

Fixed Issues

The following issues have been fixed in this release:
  • cudnnConvolutionBackwardData() could in some cases on the Turing and Volta architectures perform operations on the GPU resulting in illegal memory access. This issue was fixed in version 8.0 and subsequent releases.
  • When running a convolution forward, convolution backward data/weights, or a matrix multiplication fusion with pointwise and reduction operations with engine index 0, the runtime fusion engine used to be allowed to run on the CUDA Toolkit 10.2. However, not all the features it relies upon are supported in the CUDA Toolkit 10.2. For better stability and targeted optimizations, the engine now requires CUDA Toolkit 11.2 update 1. We have blocked the engine from running in cuDNN built against CUDA Toolkit 10.2. See the Limitations section for more details.
  • The supported check in the runtime fusion engine has been improved to return proper error codes in currently unsupported operation graphs, such as:
    • an operation graph that contains more than one convolution or matrix multiplication operation
    • convolutions with compute type that is not FP32
    • grouped convolutions
    • when the convolution mode is not CUDNN_CROSS_CORRELATION
  • When running a convolution backward data/weights fusion with pointwise and reduction operations with engine index 0, the runtime fusion engine could launch kernels with more shared memory specified than necessary, causing sub-optimal performance. This issue has been fixed in this release.
  • Execution of a plan for convolution forward operation graph, with engine global index 1 returned CUDNN_STATUS_INTERNAL_ERROR when the filter format is NHWC and padding was larger than zero. This issue has been fixed in this release.
  • Compared to cuDNN version 8.0.5, there was a known performance regression of 10-50% on Tacotron2 and WaveGlow models. This issue has been fixed in this release.
  • cudnnConvolutionBiasActivationForward() did not invoke TF32 Tensor Core kernels when the math type in the convolution descriptor was set to CUDNN_DEFAULT_MATH. This led to suboptimal performance under this math mode. This issue has been fixed in this release.
  • Fixed an issue where the version 8 graph API’s execution plan descriptor may internally refer to a descriptor outside of the data structure, which can cause unexpected errors when the external descriptors have been destroyed. Now all the information is recorded within the data structure.
  • There was a performance regression where NHWC was slower than NCHW on 3D convolution up to 40% on V100 and NVIDIA A100 GPUs. This issue has been fixed in this release.
  • There was a performance regression in certain use cases comparing RTX 3090 using cuDNN version 8.x to RTX 2080 Ti using cuDNN version 7.x. This regression has been fixed in this release.
  • Compared to cuDNN version 7.6, there were known performance regressions up to 2x on select configurations for AlexNet-like models on Volta and Ampere GPUs. These regressions have been fixed in this release.
  • When calling: the API will crash when called with CUDNN_RNN_ALGO_PERSIST_DYNAMIC algo but the cudnnPersistentRNNPlan_t was not created. This has been fixed in this release.

Known Issues

  • Users of the static library requiring best possible convolution performance should use whole-archive linking. This will come at a cost to binary size which will require resolution in future releases, either through static sub-libraries or relaxing the whole-archive linkage requirement altogether.
  • The ResNet-50 native FP32 inference issues have been fixed on Volta and Turing. A few performance regressions exist in the NVIDIA Ampere GPU architecture.
  • Compared to version 8.0.5, legacy convolution APIs have increased CPU computational costs. On x86, this has been measured to be as high as 10 microseconds.
  • cudnnAddTensor() does not support all broadcast-able tensor shapes even though the cuDNN documentation says otherwise.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Compared to cuDNN 8.0.0 Preview, there is a known ~12% performance regression on vgg16 when run on Jetson Nano & TX2.
  • Compared to cuDNN 8.0.4, there is a known ~6% performance regression on ONNX-WaveGlow when run on Titan RTX.
  • Compared to cuDNN 7.6, there is a significant performance regression on Darknet when run on Jetson Nano.
  • For pre-Volta devices, users should align all buffers to at least 4 bytes; this applies to half-precision data as well.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case. We are not aware of any popular model which uses this unique use case.
  • Although the overall cuDNN library size has improved in cuDNN 8.1.0 with CUDA Toolkit 11.2 and greater as compared to cuDNN 8.0.x, the cuDNN library remains large; future releases will attempt to moderate the severity of this issue.
  • CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT returns an internal error when the number of channels in the filter is greater than or equal to 65536.
  • Executing a plan for a convolution forward operation graph with engine global index 1 returns CUDNN_STATUS_INTERNAL_ERROR when the filter format is NHWC and the padding is greater than zero.
  • Convolutions (ConvolutionForward, ConvolutionBackwardData and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on Turing GPUs.
  • cudnnActivationForward() can cause an illegal memory access CUDA error for tensors with more than 2**30 elements.
  • L4T users of cuDNN may observe CUDNN_STATUS_EXECUTION_FAILED errors in some cases when performing convolutions using CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM. This issue is being investigated.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • Compared to cuDNN 7.6.5, there are known performance regressions on various convolutional models using INT8 data types on NVIDIA Volta GPUs.
  • Compared to cuDNN 8.1.0, there is a known regression in performance of the runtime fusion engine for convolution fused with ReLU in the epilog. This is caused by the generalized support for parameterized ReLU. Further optimizations are in progress.
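
Two of the known issues above are triggered by simple size thresholds, so an application can detect them before calling cuDNN. The Python sketch below is illustrative only: known_issue_warnings is a hypothetical helper that is not part of cuDNN, and the thresholds are copied directly from the list above.

```python
# Thresholds quoted in the Known Issues list above.
FFT_FILTER_CHANNEL_LIMIT = 65536    # CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT fails at or above this
ACTIVATION_ELEMENT_LIMIT = 2 ** 30  # cudnnActivationForward() may fault above this


def known_issue_warnings(filter_channels, activation_elements):
    """Return a list of warnings for the two size-related known issues.

    Hypothetical helper: it only compares sizes and never calls cuDNN.
    """
    warnings = []
    if filter_channels >= FFT_FILTER_CHANNEL_LIMIT:
        warnings.append("avoid CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT")
    if activation_elements > ACTIVATION_ELEMENT_LIMIT:
        warnings.append("cudnnActivationForward() may hit an illegal memory access")
    return warnings


print(known_issue_warnings(filter_channels=64, activation_elements=1 << 20))
print(known_issue_warnings(filter_channels=65536, activation_elements=(1 << 30) + 1))
```

An empty list means neither threshold is reached; it says nothing about the other known issues.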

Limitations

  • The runtime fusion engine was only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1; it now requires the NVRTC from CUDA 11.2 update 1. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location, otherwise the samples can crash.
  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section in the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by the two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. Users can eliminate the non-deterministic behavior of the cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory, while the second creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run performantly. As always, cuDNN recommends that users align tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
  • On K80 GPUs, when cudnnConvolutionForward() is used with the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types, a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory which might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users could opt out of using texture through the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who wish to continue to use texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 which is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library prior to opening graph capture.
  • Starting in cuDNN version 8.1.0, we are no longer shipping the libfreeimage static library with the MNIST sample. Users can follow the instructions in the readme.txt file to download and compile the library separately and link it with the MNIST sample.
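
The single-buffer-size setting of CUBLAS_WORKSPACE_CONFIG described in the list above can be sketched as follows. This is a minimal Python sketch, not part of cuDNN or cuBLAS: the helper names parse_workspace_config and use_single_buffer_size are hypothetical, the sizes are read as KiB (so 4096 corresponds to the 4 MB buffers mentioned above), and it assumes the variable has to be in the process environment before cuBLAS is initialized, for example before importing a framework that loads it.

```python
import os


def parse_workspace_config(config):
    """Parse ':SIZE:COUNT[:SIZE:COUNT...]' into a list of (size_kib, count) pairs."""
    fields = [field for field in config.split(":") if field]
    if not fields or len(fields) % 2 != 0:
        raise ValueError(f"malformed workspace config: {config!r}")
    values = [int(field) for field in fields]
    return list(zip(values[0::2], values[1::2]))


def use_single_buffer_size(config=":4096:2"):
    """Export a config that has exactly one buffer size and return its parsed form."""
    pairs = parse_workspace_config(config)
    if len({size for size, _ in pairs}) != 1:
        raise ValueError("expected exactly one buffer size")
    # Assumption: this runs before any library that initializes cuBLAS is loaded.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = config
    return pairs


# Default in cuBLAS 10.2 and 11.0 according to the notes: two buffer sizes.
print(parse_workspace_config(":16:8:4096:2"))  # [(16, 8), (4096, 2)]
# Two buffers of 4 MB each, a single size.
print(use_single_buffer_size(":4096:2"))       # [(4096, 2)]
```

Choosing between :16:8 and :4096:2 is a trade-off between memory footprint and the kernels cuBLAS can select; the notes do not recommend one over the other.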

cuDNN Release 8.1.0

These are the cuDNN 8.1.0 release notes. This release includes fixes from the previous cuDNN v8.0.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • A preview of the cuDNN runtime operation fusion capabilities is included in this release. This feature is exposed as a new backend engine in the version 8.0 graph API. With runtime op fusion, the engine can generate and compile fused tensor-core kernels on the fly for the specified operation graph during the execution plan finalization stage. Some of the operation graph patterns supported in this preview are a convolution or matrix multiplication operation with an arbitrary combination of one or more pointwise operations, and reduction operations fused onto the output tensor. Examples include, but are not limited to, conv-bias-leaky_relu and gemm-bias-gelu. This feature is supported on GPUs with compute capability 7.5 and 8.0. The current implementation supports FP16 I/O with FP32 compute or FP32 (TF32) I/O with FP32 compute. In this release, support for this feature is restricted to Linux on x86-64. We will continue to work on this feature to provide additional support and improved performance. In the meantime, we welcome your feedback. Email: cudnn@nvidia.com
  • We’ve released our C++ frontend via GitHub, which implements a series of classes wrapping the v8 backend C API. The user only needs to include a few headers to cover everything from graph construction and heuristics queries to execution. The frontend also implements a significantly improved autotuning feature that can accurately time the executions from a list of functionally equivalent implementations and return the fastest implementation.
  • Heuristics have been improved for TF32 and PSEUDO_HALF (with Tensor Core enabled) convolutions. On one known model, performance was improved 1.3x (when not auto-tuning). On select cases in several models, we have seen performance improvements up to ~50x.
  • Added support for PSEUDO_BFLOAT16_CONFIG on NVIDIA Ampere GPU architecture for CNNs. While most of the algos/engines which are available for PSEUDO_HALF_CONFIG are available for PSEUDO_BFLOAT16_CONFIG, a few are not available. The available engines for PSEUDO_BFLOAT16_CONFIG can achieve at least 90% performance of PSEUDO_HALF_CONFIG for layers in standard models. There is a known limitation for layers having 3 or 4 channels for the filter and convolution of stride 2, such as the first layer of ResNet.
  • EfficientNet performance has improved. Depthwise convolution is now optimized for the NHWC layout in cuDNN 8.1.0. For EfficientNet, we see an average 2.9x speed-up for 5x5 layers and a 1.7x speed-up for 3x3 layers.
  • Added support for TF32 engines to compute operation graphs that match the fused conv-bias-activation pattern. TF32 kernels are also supported in the cudnnConvolutionBiasActivationForward() API.
  • Added support for a new RNN algo CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H, specialized for small hidden sizes. It is expected to be faster than other algos for those small hidden sizes.
  • The cuDNN build against CUDA Toolkit 11.2 is now backward compatible with earlier CUDA 11 drivers, including 450, 455, in addition to the 460 driver.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.x.x.

Fixed Issues

The following issues have been fixed in this release:
  • Kernel calc_bias_diff_nhwc_packed had a known functional issue since 8.0.2. It occurred when running cudnnConvolutionBackwardBias() (bgrad) with NHWC/NDHWC packed tensors and even C && (C >= 6). This issue has been fixed in this release.
  • Kernel convolve_common_engine_int8 could cause accuracy degradation when running cudnnConvolutionBiasActivationForward() with INT8 in cuDNN version 8.0.5 because results were not rounded when converted from single precision to INT8. This issue has been fixed in this release.
  • On some Turing GPUs, when users are running persistent RNN with hiddenSize greater than or equal to 768 but less than 1024, users may get incorrect results and see CUDA error 719, cudaErrorLaunchFailure the next time they call cudaDeviceSynchronize(). This bug has been fixed in this release.
  • Calling cudnnConvolutionBiasActivationForward() or executing a cuDNN backend plan for fused convolution-bias-activation operation graphs could lead to a memory leak. This issue has been fixed in this release.
  • On Windows, calling API cudnnGetFoldedConvBackwardDataDescriptors() results in failure to find symbols. This issue has been present in all versions since cuDNN version 8.0.0 and is fixed in this release.
  • The backend convolution operation had external dependencies on the user-created backend tensor descriptors even after finalization. Deleting the tensor descriptors could cause the operation to seg-fault when constructing the operation graph. This bug affected all versions since cuDNN version 8.0.0 and has been fixed in 8.1.0.
  • Many convolution models were experiencing lower performance on RTX 3090 compared to 2080 Ti. This included ResNet-50 with up to 2x performance difference, ResNeXt up to 10x performance difference and U-Net up to 3x performance difference. The performance issues have been fixed in this release.

Known Issues

  • Users of the static library requiring best possible convolution performance should use whole-archive linking. This will come at a cost to binary size which will require resolution in future releases, either through static sub-libraries or relaxing the whole-archive linkage requirement altogether.
  • The ResNet-50 native FP32 inference issues have been fixed on Volta and Turing. A few performance regressions remain on the NVIDIA Ampere GPU architecture.
  • Compared to version 8.0.5, legacy convolution APIs have increased CPU computational costs. On x86, this has been measured to be as high as 10 microseconds.
  • cudnnAddTensor() does not support all broadcast-able tensor shapes even though the cuDNN documentation says otherwise.
  • Users have reported that in RNN training with non-zero dropout rate, and if the RNN network is unidirectional, the output of cudnnRNNBackwardWeights() may be non-deterministic. We are still investigating this issue.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Compared to cuDNN 8.0.0 Preview, there is a known ~12% performance regression on vgg16 when run on Nano & TX2.
  • Compared to cuDNN 8.0.4, there is a known ~6% performance regression on ONNX-WaveGlow when run on Titan RTX.
  • Compared to cuDNN 7.6, there is a significant performance regression on Darknet when run on Nano.
  • For pre-Volta devices, users should align all buffers to at least 4 bytes; this applies to half-precision data as well.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case. We are not aware of any popular model which uses this unique use case.
  • Although the overall cuDNN library size has improved in cuDNN 8.1.0 with CUDA Toolkit 11.2 and greater as compared to cuDNN 8.0.x, the cuDNN library remains large; future releases will attempt to moderate the severity of this issue.
  • CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT returns an internal error when the number of channels in the filter is greater than or equal to 65536.
  • Executing a plan for a convolution forward operation graph with engine global index 1 returns CUDNN_STATUS_INTERNAL_ERROR when the filter format is NHWC and the padding is greater than zero.
  • Convolutions (ConvolutionForward, ConvolutionBackwardData and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on the Turing, Volta, and NVIDIA Ampere GPU architectures.

Limitations

  • Samples must be installed in a writable location, otherwise the samples can crash.
  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section in the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by the two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. Users can eliminate the non-deterministic behavior of the cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory, while the second creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run performantly. As always, cuDNN recommends that users align tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
  • On K80 GPUs, when cudnnConvolutionForward() is used with the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types, a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory which might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users could opt out of using texture through the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who wish to continue to use texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 which is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library prior to opening graph capture.
  • Starting in cuDNN version 8.1.0, we are no longer shipping the libfreeimage static library with the MNIST sample. Users can follow the instructions in the readme.txt file to download and compile the library separately and link it with the MNIST sample.
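
The pointer alignment limitations above (a 4-byte minimum for tensor and filter pointers, with 16-byte boundaries recommended) can be checked before a call. The Python sketch below is illustrative only: alignment_status and align_up are hypothetical helpers, and the addresses are made-up integers standing in for device pointers.

```python
MIN_ALIGNMENT = 4           # minimum required for tensor and filter pointers
RECOMMENDED_ALIGNMENT = 16  # sufficient for any computational option


def alignment_status(address):
    """Classify an address as 'recommended', 'minimum', or 'unsupported'."""
    if address % RECOMMENDED_ALIGNMENT == 0:
        return "recommended"
    if address % MIN_ALIGNMENT == 0:
        return "minimum"
    return "unsupported"


def align_up(offset, alignment=RECOMMENDED_ALIGNMENT):
    """Round a non-negative offset up to the next multiple of a power-of-two alignment."""
    return (offset + alignment - 1) & ~(alignment - 1)


base = 0x7F0000001000  # made-up, 16-byte-aligned base address
print(alignment_status(base))      # recommended
print(alignment_status(base + 4))  # minimum
print(alignment_status(base + 2))  # unsupported
print(align_up(100))               # 112
```

When several tensors are carved out of one allocation, rounding each sub-buffer offset with align_up keeps every tensor on the recommended boundary, at the cost of a few padding bytes.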

cuDNN Release 8.0.5

These are the cuDNN 8.0.5 release notes. This release includes fixes from the previous cuDNN v8.0.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • RNN now supports zero-length sequences within the batch when the RNN data layout is CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED or CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED. For more information, see cudnnSetRNNDataDescriptor().
  • Users can now set the environment variable CUDNN_CONV_WSCAP_DBG to a value in MiB to limit the workspace size returned by cudnnConvolutionForwardGetWorkspaceSize(), cudnnConvolutionBackwardDataGetWorkspaceSize(), and cudnnConvolutionBackwardFilterGetWorkspaceSize(). Limiting the workspace might result in performance loss.
  • Significant performance improvements were made for RTX 3090 for many models on many configurations.
  • Performance improvements were made:
    • For EfficientNet when run using NHWC FP16 Tensor Core configurations on V100 and A100 GPU architectures.
    • For PilotNet, AH-Net, MobileNet V3 on V100 and A100 GPU architectures.
    • For various 3-D convolution cases on RTX 8000.
  • Support for the 3D NDHWC layout was added in cudnnConvolutionBackwardFilter().
  • Added instructions for installing cuDNN using the Package Manager for Linux and RHEL users. For step-by-step instructions, see Package Manager Installation in the cuDNN Installation Guide.
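
The workspace cap set through CUDNN_CONV_WSCAP_DBG, mentioned in the list above, might be applied as in the following Python sketch. The helper name cap_conv_workspace is hypothetical. The notes only state that the value is given in MiB, so two things here are assumptions: that the variable must be in the process environment before cuDNN is loaded, and that a cap below 1 MiB is not meaningful (the sketch rejects it rather than guessing how cuDNN treats 0).

```python
import os

BYTES_PER_MIB = 1024 * 1024


def cap_conv_workspace(limit_bytes):
    """Set CUDNN_CONV_WSCAP_DBG to the whole number of MiB that fits in limit_bytes."""
    limit_mib = limit_bytes // BYTES_PER_MIB
    if limit_mib < 1:
        # Assumption: sub-MiB caps are rejected instead of being exported as 0.
        raise ValueError("workspace cap must be at least 1 MiB")
    # Assumption: this runs before any library that loads cuDNN is imported.
    os.environ["CUDNN_CONV_WSCAP_DBG"] = str(limit_mib)
    return limit_mib


print(cap_conv_workspace(512 * BYTES_PER_MIB))  # 512
```

Because a smaller workspace can exclude faster algorithms, it is worth measuring performance with and without the cap.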

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.x.x.

Fixed Issues

The following issues have been fixed in this release:
  • cudnnBackendFinalize(descriptor), where descriptor is of type CUDNN_BACKEND_ENGINE_DESCRIPTOR or CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR, might result in a hang if the operation graph has a backward filter operation and the user links against libcudnn.so (cudnn64.dll on Windows). This issue has been fixed in this release.
  • Call to cudnnConvolutionBiasActivationForward() might result in a memory leak in release 8.0.1. This issue has been fixed.
  • Performance regression on the U-Net Industrial network on Volta for certain batch sizes has been fixed.
  • cudnnRNN*() with LSTM mode may produce incorrect results on the cy outputs when clipping is enabled on all GPUs. This issue also exists in previous cuDNN releases since version 7.2.1. This issue has been fixed in this release.
  • cudnnRNNForward* with LSTM mode may produce incorrect results in case of clipping when CUDNN_RNN_ALGO_PERSIST_STATIC is used. This issue also exists in previous cuDNN releases since version 7.2.1. This issue has been fixed in this release.
  • In previous cuDNN versions, cudnnRNNBackwardData() or cudnnRNNBackwardDataEx() could produce non-deterministic outputs when running configurations such as hiddenSize=128 or less, LSTM cell type, and FP32 with CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION. This issue has been fixed in this release.
  • Compared to cuDNN 7.6, there was a known ~6% performance regression on Inception V3 and ResNet-50 models when run using NHWC FP16 configurations on various Turing and Titan V architectures. This issue has been fixed in this release.
  • Compared to cuDNN v8.0.3, there was a known ~18% performance regression on the U-Net Industrial model when run using NCHW TF32 configurations on V100 and A100 GPU architectures. This issue has been fixed in this release.
  • Updated: November 25, 2020

    When calling cudnnConvolutionBiasActivationForward() with INT8x4 or INT8x32 IO tensors, it could result in CUDNN_STATUS_BAD_PARAM in 8.0.4. This issue has been fixed in this release.

Known Issues

  • When using cudnnRNN* APIs with problem sizes (input size, hidden size) that are not multiples of 16 for FP16 tensors or multiples of 8 for FP32 tensors, users encountered a return status of CUDNN_STATUS_EXECUTION_FAILED in cuDNN built against CUDA 11.0. This issue has been fixed with cuDNN built against CUDA 11.1.
  • The ResNet-50 native FP32 inference issues have been fixed on Volta and Turing. A few performance regressions remain on the NVIDIA Ampere GPU architecture.
  • cudnnAddTensor() does not support all broadcast-able tensor shapes even though the cuDNN documentation says otherwise.
  • Users have reported that in RNN training with non-zero dropout rate, and if the RNN network is unidirectional, the output of cudnnRNNBackwardWeights() may be non-deterministic. We are still investigating this issue.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Compared to cuDNN 8.0.0 Preview, there is a known ~12% performance regression on vgg16 when run on Nano & TX2.
  • Compared to cuDNN 8.0.4, there is a known ~6% performance regression on ONNX-WaveGlow when run on Titan RTX.
  • Compared to cuDNN 7.6, there is a significant performance regression on Darknet when run on Nano.

Limitations

  • Samples must be installed in a writable location, otherwise the samples can crash.
  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section in the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by the two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. Users can eliminate the non-deterministic behavior of the cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory, while the second creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run performantly. As always, cuDNN recommends that users align tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
  • On K80 GPUs, when cudnnConvolutionForward() is used with the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types, a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory which might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users could opt out of texture loading using the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who wish to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 which is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library prior to opening graph capture.
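
The size limit on the backend convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 noted above can be checked before building an execution plan. The following Python sketch only mirrors the documented arithmetic; the helper name engine1_supports_input is hypothetical and is not part of cuDNN, and the sketch does not query the library.

```python
# Limit quoted in the note above: the product channels * height * width
# of the input image must not exceed 2^29.
ENGINE1_MAX_CHW = 1 << 29  # 536,870,912


def engine1_supports_input(channels: int, height: int, width: int) -> bool:
    """Return True when channels * height * width does not exceed 2^29.

    Hypothetical helper: it reproduces the documented limit only and does
    not consult cuDNN.
    """
    return channels * height * width <= ENGINE1_MAX_CHW


print(engine1_supports_input(64, 1024, 1024))    # 2^26 elements -> True
print(engine1_supports_input(1024, 1024, 1024))  # 2^30 elements -> False
```

A check of this kind could be used to fall back to a different engine index before finalizing an engine descriptor.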

cuDNN Release 8.0.4

This is the cuDNN 8.0.4 release notes. This release includes fixes from the previous cuDNN v8.0.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

The following features and enhancements have been added to this release:
GA102 support with improved convolution performance
Now includes convolution heuristics targeting the NVIDIA GA102 GPU. (not applicable for Jetson platforms)
RNN API v8 sample
A new RNN sample illustrating the usage of the new RNN version 8 API has been added. The sample's workflow consists of several routines that create RNN descriptors, create RNN data descriptors, set up the weight space, and run the compute routines. The sample takes several input parameters that set up different RNN configurations and input data specifications (data type, cell mode, bias mode, and so on).
RNN functional and performance improvements
ARM Server Base System Architecture (SBSA)
Added support for ARM SBSA for Linux.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.x.x.

Limitations

  • Samples must be installed in a writable location, otherwise the samples can crash.
  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN 8.0.4 library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section in the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multi-head attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, i.e., we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN 8.0.4 now require increased alignment on tensors in order to run performantly. As always, cuDNN recommends users to align tensors to 128-bit boundaries which will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.4 compared to cuDNN v7.6.
  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.4 users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
  • On K80 GPUs when cudnnConvolutionForward() is used with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory which might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users could opt out of texture loading using the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who wish to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 which is 2^29.
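
The CUBLAS_WORKSPACE_CONFIG values discussed above (:16:8, :4096:2, and the default :16:8:4096:2) are pairs of a buffer size in KB and a buffer count. The Python sketch below parses such a string and reports whether it uses a single buffer size, which is the condition the limitation above gives for deterministic multi-stream behavior. The helper names are hypothetical. The final assignment assumes the variable is set before anything in the process initializes cuBLAS.

```python
import os
from typing import List, Tuple


def parse_workspace_config(config: str) -> List[Tuple[int, int]]:
    """Split ':SIZE:COUNT[:SIZE:COUNT...]' into (size_kb, count) pairs."""
    fields = [field for field in config.split(":") if field]
    if not fields or len(fields) % 2 != 0:
        raise ValueError(f"expected ':SIZE:COUNT' pairs, got {config!r}")
    return [(int(fields[i]), int(fields[i + 1])) for i in range(0, len(fields), 2)]


def has_single_buffer_size(config: str) -> bool:
    """True when every buffer in the configuration has the same size."""
    return len({size for size, _ in parse_workspace_config(config)}) == 1


print(parse_workspace_config(":16:8:4096:2"))  # [(16, 8), (4096, 2)]
print(has_single_buffer_size(":16:8:4096:2"))  # False: two buffer sizes
print(has_single_buffer_size(":4096:2"))       # True: one buffer size

# Assumption: this must run before cuBLAS is initialized in this process.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:2"
```

Setting the variable in the launching shell instead of in the program avoids any dependence on initialization order.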

Deprecated Features

The following features are deprecated in cuDNN 8.0.4:
  • Support for Ubuntu 18.04 ppc64le builds will be dropped post cuDNN 8.0.4.

Fixed Issues

  • cudnnConvolutionBackwardFilter() and cudnnGetConvolutionBackwardFilterWorkspaceSize() can result in a segmentation fault in multi-threaded usage due to a race condition. This issue has been fixed in this release.
  • The libfreeimage.a library in the RHEL 8 ppc64le RPM package was for the wrong architecture. This issue has been fixed in this release.
  • In previous cuDNN versions, cudnnRNNBackwardData() or cudnnRNNBackwardDataEx() may return CUDNN_STATUS_INTERNAL_ERROR, NaN-s, or non-deterministic finite values when CUDNN_RNN_ALGO_PERSIST_STATIC was selected. These issues occurred mainly on smaller GPUs, such as Turing with 30 or 36 SMs and smaller hiddenSize values. Most of those issues have been fixed in this release. However, configurations such as hiddenSize=128, LSTM, FP32 with CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION may still output non-deterministic results.
  • There was an issue in upgrading the cuDNN version using the RPM and Debian packages in the 8.0.3 version. This issue has been fixed in this release.
  • The ResNet-50 native FP32 inference issues have been fixed on Volta and Turing. A few performance regressions remain on the NVIDIA Ampere GPU architecture.
  • cuDNN exhibited performance regressions for GoogLeNet and U-Net on V100. This issue has been fixed in this release.
  • cuDNN exhibited performance regressions for VGG16 on GA100. This issue has been fixed in this release.
  • The performance regressions across Tacotron2 and WaveGlow seen on the Turing architecture have been fixed.
  • The performance regressions in the FastPitch network seen on the Volta and Turing architectures have been fixed.
  • The cuDNN API unconditionally triggers CUDA context initialization. This causes unnecessary host-side performance overhead. This is an issue that was introduced in cuDNN version 8.0.2. This issue has been fixed in this release.
  • Some ResNet-50 and SSD mixed precision inference use-cases may have performance regressions compared to cuDNN 7.6 on V100. V-Net 3D models might have performance regressions on Turing based architectures. This issue has been fixed in this release.
  • Previous cuDNN 8 releases exhibited performance regressions when compared to version 7.6, for some important convolutional networks on the Pascal GPU architecture. In particular, the performance regressions of ResNet-50 seen previously on Pascal with cuDNN versions 8.0.3 and earlier, are fixed with this release.
  • cudnnConvolutionBiasActivationForward() could result in incorrect results when the alpha2 value is zero and the device buffer zData contains NaN. This issue has been fixed in this release.
  • When using cudnnRNN*Ex() APIs, if the layout of RNN data is CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED or CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED, and if the batch size is larger than 6144 on Volta or NVIDIA Ampere A100 GPUs, or larger than 4096 on Turing GPUs, CUDNN_STATUS_EXECUTION_FAILED would be returned. This issue has been fixed in this release. cuDNN supports arbitrary batch size.
  • When the user upgraded from cuDNN 8.0.2 to 8.0.3 through the Debian or RPM package, users had to manually uninstall the old libcudnn8-doc package before they installed libcudnn8-samples_*.deb/rpm, otherwise a file conflict could happen. This has been fixed and is no longer the case in the 8.0.4 release.
  • Performance regressions on Turing, Volta and Pascal architectures for True Half convolutions have been resolved.
  • When using cudnnRNN* APIs with problem sizes (input size, hidden size) that are not multiples of 16 for FP16 tensors or multiples of 8 for FP32 tensors, users encountered a return status of CUDNN_STATUS_EXECUTION_FAILED in cuDNN built against CUDA 11.0. This issue has been fixed with cuDNN built against CUDA 11.1.

Known Issues

  • When using cudnnRNN* APIs with the problem sizes (input size, hidden size) not being multiples of 16 for FP16 tensors or multiples of 8 for FP32 tensors, users encountered a return status of CUDNN_STATUS_EXECUTION_FAILED. This issue affects earlier cuDNN 8.0.1 Preview and cuDNN 8.0.2 releases built against CUDA 11.0.
  • There is a known minor performance regression on small batch sizes for ResNet-50 native FP32 inference that exists on the NVIDIA Ampere GPU architecture.
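
For builds affected by the cudnnRNN* size issue above, a caller can test whether a configuration falls into the problematic range before launching it. The Python sketch below is illustrative only: the helper names are hypothetical, and it assumes the issue applies when either the input size or the hidden size is not a multiple of the stated value.

```python
# Multiples quoted in the known issue above.
REQUIRED_MULTIPLE = {"fp16": 16, "fp32": 8}


def is_affected(input_size: int, hidden_size: int, dtype: str) -> bool:
    """True when either size is not a multiple of the value for dtype.

    Assumption: one non-conforming size is enough to trigger the issue.
    """
    multiple = REQUIRED_MULTIPLE[dtype]
    return input_size % multiple != 0 or hidden_size % multiple != 0


def round_up(size: int, dtype: str) -> int:
    """Smallest multiple of the required value that is >= size."""
    multiple = REQUIRED_MULTIPLE[dtype]
    return -(-size // multiple) * multiple


print(is_affected(100, 128, "fp16"))  # True: 100 is not a multiple of 16
print(round_up(100, "fp16"))          # 112
print(round_up(100, "fp32"))          # 104
```

Whether padding a size up to the next multiple is acceptable depends on the model; upgrading to a build against CUDA 11.1 removes the need for it.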

cuDNN Release 8.0.3

This is the cuDNN 8.0.3 release notes. This release includes fixes from the previous cuDNN v8.0.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

cuDNN Backend API
Documentation for the cuDNN Backend API has been included in this release. Users specify the computational case, set up an execution plan for it, and execute the computation via numerous descriptors. The typical use pattern for a descriptor with attributes consists of the following sequence of API calls:
  1. cudnnBackendCreateDescriptor() creates a descriptor of a specified type.
  2. cudnnBackendSetAttribute() sets the values of a settable attribute for the descriptor. All required attributes must be set before the next step.
  3. cudnnBackendFinalize() finalizes the descriptor.
  4. cudnnBackendGetAttribute() gets the values of an attribute from a finalized descriptor.

For more information, refer to the cuDNN Backend API section in the cuDNN API Reference.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.x.x.

Limitations

  • Samples must be installed in a writable location, otherwise the samples can crash.
  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN 8.0.3 library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section in the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multi-head attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, i.e., we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN 8.0.3 now require increased alignment on tensors in order to run performantly. As always, cuDNN recommends users to align tensors to 128-bit boundaries which will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.3 compared to cuDNN v7.6.
  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.3 users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
  • On K80 GPUs when cudnnConvolutionForward() is used with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory which might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users could opt out of texture loading using the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who wish to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 which is 2^29.
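
The change in the INT8x32 Tensor Core parameter range described above amounts to replacing >= with > in both comparisons. The Python sketch below encodes the two conditions exactly as written in the limitation, including its pairing of W with R and H with S. The function name and the cudnn_major parameter are hypothetical and exist only for illustration.

```python
def int8x32_params_supported(w: int, h: int, r: int, s: int,
                             dilation_w: int, dilation_h: int,
                             cudnn_major: int = 8) -> bool:
    """Evaluate the documented INT8x32 size condition for a cuDNN major version.

    v7.6:   W >= (R-1) * dilationW and H >= (S-1) * dilationH
    v8.0.x: the equality cases are no longer supported, so both
            comparisons become strict.
    """
    w_bound = (r - 1) * dilation_w
    h_bound = (s - 1) * dilation_h
    if cudnn_major >= 8:
        return w > w_bound and h > h_bound
    return w >= w_bound and h >= h_bound


# W == (R-1) * dilationW: accepted under the v7.6 rule, rejected under v8.0.x.
print(int8x32_params_supported(2, 5, 3, 3, 1, 1, cudnn_major=7))  # True
print(int8x32_params_supported(2, 5, 3, 3, 1, 1, cudnn_major=8))  # False
```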

Fixed Issues

  • For cudnnConvolutionBackwardFilter, the 3D convolution table, wDesc: _NCHW, _ALGO_1 and FFT_TILING had incorrect data fields. This has been fixed in this release.
  • In prior versions of cuDNN, cudnnPoolingForward() with pooling mode CUDNN_POOLING_MAX might return incorrect result when one of the spatial dimensions has negative padding and the output tensor is larger than the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim(). This issue has been fixed in this release.
  • In cudnnPoolingForward() with average-pooling, when the output tensor data is INT8 type, it is possible for some pixels result to be off by 1. Note that cudnnPoolingForward() rounds to the nearest-even integer. This issue has been fixed in this release.
  • The performance of cudnnConvolutionBiasActivationForward() has been improved for INT8x4 use cases on Volta and Turing, INT8x32 use cases on Turing, and FP32 and pseudo-FP16 use cases on the Volta, Turing, and Ampere GPU architectures.
  • We have updated our public headers to fully reflect the documented dependencies between the 6 sub-libraries.
  • There were libcudnn_ops/cnn/adv_infer/train_static.a binaries in the cuDNN Debian and tgz packages. Users were advised not to link against those and link against libcudnn_static.a instead. Those binaries have been removed from the release packages.
  • On Volta and Pascal architectures, performance regressions were present for various TRUE_HALF convolutions. This has been fixed in this release.
  • In prior versions of cuDNN, API functions cudnnGetConvolution*Algorithm_v7() return a workspace size in the result for algo1 that is inconsistent with the result of the corresponding cudnnGet*Workspace() calls if the math type of the convolution descriptor is set to CUDNN_FMA_MATH. This issue has been fixed in this release.
  • The new RNN APIs: cudnnRNNForward(), cudnnRNNBackwardData_v8(), and cudnnRNNBackwardWeights_v8() were available as a preview in the cuDNN 8.0.2 release. They no longer hold preview status.
  • When using cudnnRNN*Ex() APIs, if the user planned to use CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED or CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED as the layout of the RNN data descriptors, the user would have had to call cudnnSetRNNPaddingMode() to set the mode to CUDNN_RNN_PADDED_IO_ENABLED after initializing an RNNDescriptor but before calling cudnnGetRNNWorkspaceSize(). Not doing this would result in CUDNN_STATUS_EXECUTION_FAILED. We’ve added internal checks to return CUDNN_STATUS_BAD_PARAM to prevent hitting EXECUTION_FAILED.
  • When cudnnBatchNormalizationForwardTrainingEx() is called with NHWC tensors with pseudo-half configuration, under rare occasions the kernel would produce incorrect results, including possible NaNs in the results. This has been fixed in this release. This issue affects earlier releases since 7.4.1.
  • Fused convolution-scale-bias-activation with per-channel α1 and α2 scaling gives incorrect results when the reorder type in the convolution descriptor is set to CUDNN_NO_REORDER. This is an issue in cuDNN version 8.0.2. This issue has been fixed in this release.
  • On NVIDIA Ampere GA100, cudnnConvolutionBackwardData() for Tensor Core enabled problems with half input and output could, in rare cases, produce incorrect results; the same could happen for users of cudnnBackendExecute() using the engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX 57 for backward data. This has been fixed in this release. (not applicable for Jetson platforms)
  • There was a performance regression in MaskRCNN inference with automatic mixed precision on V100. This has been fixed in this release.
  • Two dimensional forward convolutions using algo1 may segfault when the filter size is large. For example, we’ve observed this issue when the filter width and height are more than or equal to 363. This has been fixed in this release.
  • For some 3D spatial non-Tensor-Core convolutions on Maxwell, Pascal, Volta, and Turing architectures, cudnnBackwardFilter() can return incorrect results when the convolution width padding exceeds the value (filterWidth - 1)/2. Likewise, users of cudnnBackendExecute() can experience the same issue when using the engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX 32 for backward filter. The issue affecting cudnnBackwardFilter() has been fixed in this release. With cudnnBackendFinalize(), an engine descriptor with CUDNN_ATTR_ENGINE_GLOBAL_INDEX 32 and a backward filter operation that satisfies the above condition will return CUDNN_STATUS_NOT_SUPPORTED.
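
The average-pooling fix above notes that cudnnPoolingForward() rounds to the nearest-even integer. As an illustration of that tie-breaking rule only, and not of the cuDNN kernel, the Python sketch below averages a window of integer values and rounds the mean with the built-in round(), which also resolves ties toward the even neighbor. The function name average_pool_window is hypothetical; padding and INT8 saturation are not modeled.

```python
from typing import Sequence


def average_pool_window(window: Sequence[int]) -> int:
    """Average a window of integers and round ties to the nearest even integer."""
    mean = sum(window) / len(window)
    # Python's round() uses round-half-to-even for values exactly halfway
    # between two integers.
    return round(mean)


print(average_pool_window([1, 2]))  # mean 1.5 -> 2
print(average_pool_window([2, 3]))  # mean 2.5 -> 2
print(average_pool_window([3, 4]))  # mean 3.5 -> 4
```

A reference implementation that rounds halves away from zero would give 3 for the second window, which is one way a unit test can end up off by 1 against the library.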

Known Issues

  • Occasionally, inaccurate results were observed in outputs of the cudnnRNNBackwardWeights() and cudnnRNNBackwardWeightsEx() functions when the RNN cell type was GRU and the NVIDIA Ampere GPU architecture was used with FP32 I/O and mathType of CUDNN_DEFAULT_MATH or CUDNN_TENSOR_OP_MATH. Users may switch to CUDNN_FMA_MATH as a temporary workaround. This issue is being investigated.

  • cudnnRNN*() with LSTM mode may produce inaccurate results on the cy outputs when clipping is enabled on all GPUs. This issue exists in previous cuDNN releases as well.

  • On Volta and Pascal architectures, performance regressions may be present for various TRUE_HALF convolutions.

  • When using cudnnRNN* APIs with problem sizes (input size, hidden size) that are not multiples of 16 for FP16 tensors or multiples of 8 for FP32 tensors, users may encounter a return status of CUDNN_STATUS_EXECUTION_FAILED. This issue also affects the earlier cuDNN 8.0.1 Preview and cuDNN 8.0.2 releases.
  • Some ResNet-50 and SSD mixed precision inference use-cases may have performance regressions compared to cuDNN 7.6 on V100. V-Net 3D models might have performance regressions on Turing based architectures.

  • When using cudnnRNN*Ex() APIs, if the user used CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED or CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED as the layout of the RNN data descriptors, and if the batch size is larger than 6144 on Volta or NVIDIA Ampere A100 GPUs, or larger than 4096 on Turing GPUs, CUDNN_STATUS_EXECUTION_FAILED would be returned.

  • Documentation of the Backend API is not complete. The CUDNN_BACKEND_OPERATION_GEN_STATS_DESCRIPTOR and CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR descriptor types will be documented in a future release.

  • The conv_sample_v8.0 sample is not included in the Debian and RPM packages. This will be fixed in a future release.

  • The libfreeimage.a library in the RHEL 8 ppc64le RPM is for the wrong architecture. This will be fixed in a future release.

  • When the user is upgrading from cuDNN 8.0.2 to 8.0.3 through the Debian or RPM package, before installing libcudnn8-samples_*.deb/rpm, users should manually uninstall the old libcudnn8-doc package, otherwise a file conflict may happen.

cuDNN Release 8.0.2

This is the cuDNN 8.0.2 release notes and first GA release of cuDNN 8.x. This release includes fixes from the previous cuDNN v8.0.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

cuDNN 8.0.1 Preview and 8.0.0 Preview

The key features mentioned in cuDNN 8.0.1 Preview and 8.0.0 Preview are now GA quality in this release.

Added new API functions to the documentation

cudnnRNNBackwardData_v8() and cudnnRNNBackwardWeights_v8() are now documented in the cudnn_adv_train.so Library. For a list of functions and data types that were added in this release, see API Changes For cuDNN 8.0.2.

TF32 performance
  • TF32 performance for 3D convolutions and deconvolutions is significantly better, by up to 3.9x, compared to cuDNN 8.0.1.
  • TF32 performance for grouped convolutions on A100 improved by up to 1.5x compared to cuDNN 8.0.1 on ResNext convolution layers, and by up to 3x compared to V100 with cuDNN v7.6. (not applicable for Jetson platforms)

The above performance improvements were measured using only cuDNN operations. The observed performance improvements will depend on a number of factors, such as non-cuDNN operations, kernel run time, and model architecture type.

Performance improvements

This release includes performance improvements on all architectures for 2D and 3D grouped convolutions compared with version 7.6. Additionally, we improved kernel selection heuristics on several known Deep Learning GitHub Examples (also known as model scripts).

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.x.x.

Limitations

  • Samples must be installed in a writable location, otherwise the samples can crash.

  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN 8.0.2 library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section in the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multi-head attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, i.e., we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.

  • Some computational options in cuDNN 8.0.2 now require increased alignment on tensors in order to run performantly. As always, cuDNN recommends users to align tensors to 128-bit boundaries which will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.2 compared to cuDNN v7.6.

  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.2 users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.

  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.

  • On K80 GPUs when cudnnConvolutionForward() is used with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types a silent error might occur when the output width Q is 1 and both height and width padding are zero.

  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory which might noticeably impact performance.
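
The three steps above can be sketched without a GPU as follows. This is a minimal pure-Python illustration of the data flow only: the helper names (to_float, from_float_to_int8, run_via_float, double) are hypothetical, plain lists stand in for tensors, and the round-and-saturate behavior on the way back to INT8 is an assumption of this sketch rather than a statement about cudnnTransformTensor(). In a real application, each conversion would be a cudnnTransformTensor() call between descriptors of different data types.

```python
from typing import Callable, List

INT8_MIN, INT8_MAX = -128, 127


def to_float(values: List[int]) -> List[float]:
    """Step 1: convert native INT8 inputs to a widely supported type (float)."""
    return [float(v) for v in values]


def from_float_to_int8(values: List[float]) -> List[int]:
    """Step 3: convert float outputs back to INT8.

    Rounding to nearest and saturating to [-128, 127] is an assumption
    made for this sketch.
    """
    return [max(INT8_MIN, min(INT8_MAX, round(v))) for v in values]


def run_via_float(op: Callable[[List[float]], List[float]],
                  int8_input: List[int]) -> List[int]:
    """Run `op`, which only accepts floats, on INT8 data via temporary buffers."""
    tmp_in = to_float(int8_input)        # temporary float buffer (step 1)
    tmp_out = op(tmp_in)                 # float-only operation (step 2)
    return from_float_to_int8(tmp_out)   # extra round trip to memory (step 3)


def double(xs: List[float]) -> List[float]:
    """Hypothetical stand-in for an API that does not accept integer types."""
    return [2.0 * x for x in xs]


result = run_via_float(double, [-100, -3, 0, 50, 100])
```

The two temporary lists correspond to the extra buffers mentioned in the note, and the final conversion is the additional round trip to memory.
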
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.

  • In prior versions of cuDNN, some convolution algorithms could use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users could opt out of texture usage with the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who wish to continue to use texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
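
The INT8x32 Tensor Core parameter restriction listed above can be expressed as a small pre-flight check. This is a sketch under stated assumptions: the function name and its arguments are hypothetical, the pairing of W with R and H with S follows the wording of the limitation, and the cuDNN 8.0.x rule is derived by removing the equality cases from the v7.6 rule.

```python
def int8x32_params_supported(w: int, h: int, r: int, s: int,
                             dilation_w: int = 1, dilation_h: int = 1,
                             cudnn_major: int = 8) -> bool:
    """Return True if the sizes satisfy the INT8x32 Tensor Core restriction."""
    bound_w = (r - 1) * dilation_w
    bound_h = (s - 1) * dilation_h
    if cudnn_major <= 7:
        # cuDNN v7.6: W >= (R-1)*dilationW && H >= (S-1)*dilationH
        return w >= bound_w and h >= bound_h
    # cuDNN v8.0.x: the equality cases are no longer supported,
    # so both comparisons become strict.
    return w > bound_w and h > bound_h


# A case that sits exactly on the boundary in W.
on_boundary_v7 = int8x32_params_supported(4, 8, 3, 3, dilation_w=2, cudnn_major=7)
on_boundary_v8 = int8x32_params_supported(4, 8, 3, 3, dilation_w=2, cudnn_major=8)
```

With W = 4, R = 3, and dilationW = 2, the bound is exactly 4, so the configuration passes the v7.6 rule and fails the v8.0.x rule.
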

Fixed Issues

The following issues have been fixed in this release:
  • The implementation of cudnnLRNCrossChannelBackward() for even-sized normalization windows was incorrect in all previous releases. This issue has been fixed in this release.

  • There isn’t a dedicated API to query the supported or the most performant algo for cudnnConvolutionBiasActivationForward() in cuDNN. Querying it via cudnnGetConvolutionForwardAlgorithm_v7() is not recommended. Instead, we recommend using the cuDNN version 8 backend API. The number of supported engines can be queried using enum CUDNN_ATTR_OPERATIONGRAPH_ENGINE_GLOBAL_COUNT from an operation graph descriptor via cudnnBackendGetAttribute().

  • A memcheck error may have occurred on cuDNN version 7.x builds when calling cudnnConvolutionBackwardFilter() on Volta or Turing GPUs. This issue has been fixed in this release.

  • Various convolutions which exhibited sub-optimal performance on GA100 GPU are now achieving ideal performance. (not applicable for Jetson platforms)

  • cudnnCnnTrainVersionCheck() and cudnnCnnInferVersionCheck() were missing in past releases. This issue has been fixed in this release.

  • Documentation of the new RNN APIs and deprecations was not complete. The cudnnRNNBackwardData_v8() and cudnnRNNBackwardWeights_v8() functions have been added in this release.

  • cuDNN 8.0.1 built with Windows and CUDA 11.0 RC had reduced performance on 2D, 3D, and grouped convolutions compared to Linux. This issue has been fixed in this release. (not applicable for Jetson platforms)

  • There was a known issue in cuDNN 8.0.1 when linking statically to cuDNN and using the library's 3D algo1 backward filter convolutions. Users would see the library emit an internal error or incorrectly state that a shared library was missing. This issue has been fixed in this release.

  • When using an RPM file on RedHat for installation, upgrading from cuDNN v7 to cuDNN v8 directly or indirectly via TensorRT 7.1.3 would cause installation errors. This issue has been fixed in this release.

  • The implementation of cudnnLRNCrossChannelBackward() was inconsistent with the implementation of cudnnLRNCrossChannelForward() and returned incorrect results when the normalization window was even. This issue has been fixed in this release.

  • RNN APIs in cuDNN v8.0.1, compiled with CUDA 11.0, used an incorrect default down-conversion on GPUs with CUDA SM version SM80 (NVIDIA Ampere GPU family) when supplied input data and weights have the CUDNN_DATA_FLOAT type and cudnnMathType_t set via cudnnSetRNNMatrixMathType() is CUDNN_DEFAULT_MATH or CUDNN_TENSOR_OP_MATH. Instead of using the default TF32 computation when Tensor Cores are used, a down-conversion to FP16 (half-precision) was performed; same as in the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION mode. This introduced a lower dynamic range of intermediate data but possibly faster execution. To disable the automatic down-conversion of CUDNN_DATA_FLOAT weights and data in RNN APIs, the user needed to set the environmental variable NVIDIA_TF32_OVERRIDE to 0 (notice this would have disabled the use of TF32 in the entire library, which might have a performance impact on CNNs that are not affected by this issue). Another workaround was to assign the CUDNN_FMA_MATH mode to the cudnnMathType_t argument in cudnnSetRNNMatrixMathType(). Due to this, the A100 GPU TF32 feature was not accessible for RNNs in cuDNN v8.0.1. This issue has been fixed in this release. (not applicable for Jetson platforms)

  • cuDNN convolution APIs could return CUDNN_STATUS_EXECUTION_FAILED when the number of input or output channels equaled or exceeded 2097152. This issue existed in all previous cuDNN 8.0.x releases and has been fixed in this release.

  • Since version 8.0.0 Preview, cudnnConvolutionForward(), cudnnConvolutionBackwardData(), and cudnnConvolutionBackwardFilter() erroneously returned CUDNN_STATUS_INTERNAL_ERROR when the workspace size argument value was less than the required workspace size as returned by their respective cudnnGetWorkspace() API. This issue has been fixed and CUDNN_STATUS_BAD_PARAMS is returned as documented.

Known Issues

  • In this release, the performance of cudnnConvolutionBiasActivationForward() for true-half use cases on Pascal and for INT8x4 use cases on Volta and Turing is still lower than in version 7.6. In addition, FP32 and pseudo-FP16 performance on Volta, Turing, and the NVIDIA Ampere GPU architecture is still not fully optimized.

  • The new RNN APIs: cudnnRNNForward(), cudnnRNNBackwardData_v8(), and cudnnRNNBackwardWeights_v8() are available as a preview in the cuDNN 8.0.2 release.

  • Occasionally, inaccurate results were observed in outputs of the cudnnRNNBackwardWeights() and cudnnRNNBackwardWeightsEx() functions when the RNN cell type was GRU and the NVIDIA Ampere GPU architecture was used with FP32 I/O and mathType of CUDNN_DEFAULT_MATH or CUDNN_TENSOR_OP_MATH. Users may switch to CUDNN_FMA_MATH as a temporary workaround. This issue is being investigated.

  • cudnnRNN*() with LSTM mode may produce inaccurate results on the cy outputs when clipping is enabled on all GPUs. This issue exists in previous cuDNN releases as well.

  • On Volta and Pascal architectures, performance regressions may be present for TRUE_HALF convolution backward filter.

  • When using cudnnRNN*Ex() APIs, if the user uses CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED or CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED as the layout of the RNN data descriptors, and if the batch size is larger than 6144 on Volta or NVIDIA Ampere A100 GPUs, or larger than 4096 on Turing GPUs, CUDNN_STATUS_EXECUTION_FAILED may be returned.

  • Currently, there are libcudnn_ops/cnn/adv_infer/train_static.a binaries in the cuDNN Debian and tgz packages. Users are advised not to link against those and link against libcudnn_static.a instead. Those binaries will be removed from the release packages in the next release.

  • When using cudnnRNN*Ex() APIs, if the user plans to use CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED or CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED as the layout of the RNN data descriptors, the user should call cudnnSetRNNPaddingMode() to set the mode to CUDNN_RNN_PADDED_IO_ENABLED after initializing an RNNDescriptor but before calling cudnnGetRNNWorkspaceSize(). Not doing this may result in CUDNN_STATUS_EXECUTION_FAILED.

  • Updated: August 24, 2020

    Fused convolution-scale-bias-activation with per-channel α1 and α2 scaling gives incorrect results when the reorder type in the convolution descriptor is set to CUDNN_NO_REORDER.

  • Updated: August 24, 2020

    When using cudnnRNN* APIs with problem sizes (input size, hidden size) that are not multiples of 16 for FP16 tensors or multiples of 8 for FP32 tensors, users may encounter a return status of CUDNN_STATUS_EXECUTION_FAILED.

  • Updated: August 24, 2020

    For some 3D spatial non-Tensor-Core convolutions on Maxwell, Pascal, Volta, and Turing architectures, cudnnBackwardFilter() can return incorrect results when the convolution width padding exceeds the value (filterWidth - 1)/2. Likewise, users of cudnnBackendExecute() can experience the same issue when using the engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX 32 for backward filter. The issue affecting cudnnBackwardFilter() has been fixed in this release. With cudnnBackendFinalize(), an engine descriptor with CUDNN_ATTR_ENGINE_GLOBAL_INDEX 32 and a backward filter operation that satisfies the above condition will return CUDNN_STATUS_NOT_SUPPORTED.
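
The RNN-related known issues above (the batch-size thresholds for unpacked layouts and the required multiples for input and hidden sizes) can be collected into a pre-flight check that runs before any cuDNN call. This is a minimal sketch: the function name, the architecture keys, and the data-type keys are hypothetical labels chosen for illustration, and the numeric thresholds are taken directly from the text above.

```python
from typing import List

# Batch-size thresholds quoted in the known issue, keyed by GPU architecture.
UNPACKED_BATCH_LIMIT = {"volta": 6144, "ampere_a100": 6144, "turing": 4096}
# Required multiples of input size and hidden size, keyed by tensor data type.
SIZE_MULTIPLE = {"fp16": 16, "fp32": 8}


def rnn_preflight_warnings(input_size: int, hidden_size: int, dtype: str,
                           batch_size: int, arch: str,
                           unpacked_layout: bool) -> List[str]:
    """List the known issues that a given RNN configuration may run into."""
    warnings: List[str] = []
    multiple = SIZE_MULTIPLE[dtype]
    for name, size in (("input size", input_size), ("hidden size", hidden_size)):
        if size % multiple != 0:
            warnings.append(
                f"{name} {size} is not a multiple of {multiple} for {dtype}")
    limit = UNPACKED_BATCH_LIMIT.get(arch)
    if unpacked_layout and limit is not None and batch_size > limit:
        warnings.append(
            f"batch size {batch_size} exceeds {limit} with an unpacked layout on {arch}")
    return warnings


risky = rnn_preflight_warnings(100, 512, "fp16", 5000, "turing", True)
clean = rnn_preflight_warnings(128, 256, "fp32", 5000, "volta", True)
```

An empty list means none of the listed conditions apply; a non-empty list names the conditions under which CUDNN_STATUS_EXECUTION_FAILED may be returned.
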

cuDNN Release 8.0.1 Preview

Attention: This is the cuDNN 8.0.1 Preview release. This Preview release is for early testing and feedback; for production use of cuDNN, continue to use cuDNN 7.6.5. This release is subject to change based on ongoing performance tuning and functional testing. For feedback on the new backend API and deprecations, email cudnn@nvidia.com.

These release notes are applicable to JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

  • Added new kernels to improve the performance of fusion.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.0.1.

Limitations

  • Samples must be installed in a writable location, otherwise the samples can crash.

  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN 8.0.1 library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section in the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multi-head attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.
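
The CUBLAS_WORKSPACE_CONFIG values discussed above follow a :SIZE:COUNT pattern, with sizes in KB. The sketch below parses such a string and checks whether it declares a single buffer size, which is the condition described above for deterministic multi-stream behavior. The helper names are hypothetical, and the sketch assumes the variable must be present in the environment before the process initializes cuBLAS.

```python
import os
from typing import List, Tuple


def parse_workspace_config(config: str) -> List[Tuple[int, int]]:
    """Parse ':SIZE_KB:COUNT[:SIZE_KB:COUNT...]' into (size_kb, count) pairs."""
    fields = config.strip().lstrip(":").split(":")
    if len(fields) % 2 != 0 or not all(f.isdigit() for f in fields):
        raise ValueError(f"malformed CUBLAS_WORKSPACE_CONFIG: {config!r}")
    numbers = [int(f) for f in fields]
    return list(zip(numbers[0::2], numbers[1::2]))


def has_single_buffer_size(config: str) -> bool:
    """True when every buffer declared in the configuration has the same size."""
    return len({size for size, _ in parse_workspace_config(config)}) == 1


default_pairs = parse_workspace_config(":16:8:4096:2")   # two sizes: 16 KB and 4 MB
chosen = ":4096:2"                                        # two buffers of 4 MB each

# Assumption: set this before any library that uses cuBLAS is initialized.
if has_single_buffer_size(chosen):
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = chosen
```

The same effect can be had by exporting the variable in the shell that launches the application.
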

  • Some data types are not widely supported by all cuDNN APIs. For example, CUDNN_DATA_INT8x4 is not supported by many functions. In such cases, support is available by using cudnnTransformTensor() to transform the tensors from the desired type to a type supported by the API. For example, a user is able to transform input tensors from CUDNN_DATA_INT8x4 to CUDNN_DATA_INT8, run the desired API and then transform output tensors from CUDNN_DATA_INT8 to CUDNN_DATA_INT8x4. Note that this transformation will incur an extra round trip to memory.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.

  • Some computational options in cuDNN 8.0.1 now require increased alignment on tensors in order to run performantly. As always, cuDNN recommends users to align tensors to 128-bit boundaries which will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.1 compared to cuDNN v7.6.

  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.1 users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.

  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.

  • On K80 GPUs when cudnnConvolutionForward() is used with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types a silent error might occur when the output width Q is 1 and both height and width padding are zero.
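
For the _ALGO_0 grouped-convolution limitation above, the suggested workaround is to pad at least one dimension to an even value. The sketch below shows only the shape arithmetic: a product of integers is odd exactly when every factor is odd, so making any one dimension even is sufficient. The function name is hypothetical, and which dimension to pad is an arbitrary choice made here (the last one supplied); the tensor data itself would also have to be padded to match.

```python
from math import prod
from typing import Dict


def pad_to_even_product(dims: Dict[str, int]) -> Dict[str, int]:
    """Return a copy of `dims` whose product is even, padding at most one dimension."""
    if not dims:
        raise ValueError("at least one dimension is required")
    padded = dict(dims)
    if prod(padded.values()) % 2 == 0:
        return padded                      # already even, nothing to do
    last = list(padded)[-1]                # arbitrary choice for this sketch
    padded[last] += 1                      # an odd value plus one is even
    return padded


odd_case = pad_to_even_product({"C": 3, "H": 5, "W": 7})
even_case = pad_to_even_product({"C": 4, "H": 5, "W": 7})
```

For a 3D convolution, the D dimension would simply be included in the dictionary alongside C, H, and W.
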

Fixed Issues

The following issues have been fixed in this release:

  • The dimA and strideA parameters in cudnnSetTensorNdDescriptor() do not document the tensor layout. The documentation has been updated to include this information.

  • cuDNN 8.0.0 Preview will not work with GA10x NVIDIA Ampere GPU architectures. This has been fixed in 8.0.1 Preview.

  • cuDNN 8.0.0 Preview removed a restriction on convolution backward filter for output filter with odd products of dimensions (N*C*D*H*W) for a kernel in algo0 for pre-Volta GPUs. This can potentially lead to an illegal memory access error. This restriction is restored in cuDNN 8.0.1 Preview. cuDNN will use a kernel that does not have this restriction for this computation case.

  • Fixed performance issues for pre-Volta architectures for convolutions (except when the compute type is half).

  • Mitigated the performance regression to less than 10% end-to-end.

Known Issues

  • On pre-Volta, there are significant performance issues on convolution layers when the compute type is half.

  • Sub-optimal performance is present in this release for all INT8 convolutions for all GPUs.

  • The performance of cudnnConvolutionBiasActivationForward() is slower than v7.6 in most cases. This is being actively worked on and performance optimizations will be available in the upcoming releases.

  • There are some peer-to-peer documentation links that are broken within the cuDNN API Reference. These links will be fixed in the next release.

  • cudnnCnnTrainVersionCheck() and cudnnCnnInferVersionCheck() are missing in this release and will be added in the GA release.

  • Documentation of RNN new APIs and deprecations is not complete. The cudnnRNNBackwardData_v8() and cudnnRNNBackwardWeights_v8() functions will be implemented in the next release.

  • cuDNN 8.0.1 Preview built with Windows and CUDA 11.0 RC has reduced performance on 2D, 3D, and grouped convolutions compared to Linux.

  • There is a known issue in cuDNN 8.0.1 when linking statically to cuDNN and using the library's 3D algo1 backward filter convolutions. Users will see the library emit an internal error or incorrectly state that a shared library is missing. This is a bug that will be fixed in a future release.

  • When using an RPM file on RedHat for installation, installing cuDNN v8 directly or via TensorRT 7.1.3 will enable users to build their application with cuDNN v8. However, in order for the user to compile an application with cuDNN v7 after cuDNN v8 is installed, the user will need to perform the following steps:
    1. Issue sudo mv /usr/include/cudnn.h /usr/include/cudnn_v8.h.
    2. Issue sudo ln -s /etc/alternatives/libcudnn /usr/include/cudnn.h.
    3. Switch to cuDNN v7 by issuing sudo update-alternatives --config libcudnn and choose cuDNN v7 from the list.

    Steps 1 and 2 are required for the user to be able to switch between v7 and v8 installations. After steps 1 and 2 are performed once, step 3 can be used repeatedly and the user can choose the appropriate cuDNN version to work with. For more information, refer to the Installing From An RPM File and Upgrading From v7 To v8 sections in the cuDNN Installation Guide.

  • When the FFT Tiled algo (that is, CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING in forward convolution or CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING for backward data) is used for 3D convolution, an intermittent silent failure might happen due to an incorrect stream used for kernel execution. In some cases, this might be manifested as undefined values seen in the output.

  • The implementation of cudnnLRNCrossChannelBackward() is inconsistent with the implementation of cudnnLRNCrossChannelForward() and returns incorrect results when the normalization window is even. This will be fixed in a future release.

  • RNN APIs in cuDNN v8.0.1, compiled with CUDA 11.0, use an incorrect default down-conversion on GPUs with CUDA SM version SM80 (NVIDIA Ampere GPU family) when supplied input data and weights have the CUDNN_DATA_FLOAT type and cudnnMathType_t set via cudnnSetRNNMatrixMathType() is CUDNN_DEFAULT_MATH or CUDNN_TENSOR_OP_MATH. Instead of using the default TF32 computation when Tensor Cores are used, a down-conversion to FP16 (half-precision) is performed; same as in the CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION mode. This introduces a lower dynamic range of intermediate data but possibly faster execution. To disable the automatic down-conversion of CUDNN_DATA_FLOAT weights and data in RNN APIs, set the environmental variable NVIDIA_TF32_OVERRIDE to 0 (notice this will disable the use of TF32 in the entire library, which might have a performance impact on CNNs that are not affected by this issue). Another workaround is to assign the CUDNN_FMA_MATH mode to the cudnnMathType_t argument in cudnnSetRNNMatrixMathType(). Due to this, the A100 TF32 feature is not accessible for RNNs in cuDNN v8.0.1.

  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory which might noticeably impact performance.
  • Updated: August 24, 2020

    cuDNN convolution APIs may return CUDNN_STATUS_EXECUTION_FAILED when the number of input or output channels equals or exceeds 2097152.

  • Updated: August 24, 2020

    When using cudnnRNN* APIs with problem sizes (input size, hidden size) that are not multiples of 16 for FP16 tensors or multiples of 8 for FP32 tensors, users may encounter a return status of CUDNN_STATUS_EXECUTION_FAILED.

cuDNN Release 8.0.0 Preview

Attention: This is the cuDNN 8.0.0 Preview release. This Preview release is for early testing and feedback; for production use of cuDNN, continue to use cuDNN 7.6.5. This release is subject to change based on ongoing performance tuning and functional testing. For feedback on the new backend API and deprecations, email cudnn@nvidia.com.

These release notes are applicable to JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

Note: cuDNN 8.0.0 passed GA quality testing and validation for TensorRT and JetPack users.

For previous cuDNN documentation, see the cuDNN Archived Documentation.

Key Features and Enhancements

The following features and enhancements have been added to this release:

cuDNN library
  • The cuDNN library has been split into the following libraries:
    • cudnn_ops_infer - This entity contains the routines related to cuDNN context creation and destruction, tensor descriptor management, tensor utility routines, and the inference portion of common machine learning algorithms such as batch normalization, softmax, dropout, etc.

    • cudnn_ops_train - This entity contains common training routines and algorithms, such as batch normalization, softmax, dropout, etc. The cudnn_ops_train library depends on cudnn_ops_infer.

    • cudnn_cnn_infer - This entity contains all routines related to convolutional neural networks needed at inference time. The cudnn_cnn_infer library depends on cudnn_ops_infer.

    • cudnn_cnn_train - This entity contains all routines related to convolutional neural networks needed during training time. The cudnn_cnn_train library depends on cudnn_ops_infer, cudnn_ops_train, and cudnn_cnn_infer.

    • cudnn_adv_infer - This entity contains all other features and algorithms. This includes RNNs, CTC loss, and multi-head attention. The cudnn_adv_infer library depends on cudnn_ops_infer.

    • cudnn_adv_train - This entity contains all the training counterparts of cudnn_adv_infer. The cudnn_adv_train library depends on cudnn_ops_infer, cudnn_ops_train, and cudnn_adv_infer.

    • cudnn - This is an optional shim layer between the application layer and the cuDNN code. This layer opportunistically opens the correct library for the API at runtime.

  • cuDNN does not support mixing sub-library versions. If there is a mismatch in the cuDNN version numbers in the cuDNN sub-library header files, the build will crash. The versions need to match on the major number and minor number, as well as the patch level.

  • The cuDNN sub-libraries must be installed under a single directory.

Multiple dynamic libraries
In order to link against a subset of cuDNN, you need to know which subset of the API you are using and then link against the appropriate cuDNN sub-components. The cuDNN sub-components are as follows:
  • cudnn_ops_infer.so
  • cudnn_ops_train.so
  • cudnn_cnn_infer.so
  • cudnn_cnn_train.so
  • cudnn_adv_infer.so
  • cudnn_adv_train.so
cuDNN linking options
There are two different linking options:
  • Linking against individual sub-libraries: Users who link against individual sub-libraries must be able to identify the API exposed by each cuDNN sub-library. Users also need to know the hierarchy of the different cuDNN sub-libraries. Each .so or .a needs to be specified explicitly in the user’s linking command, as well as any external dependencies cuDNN require. For more information, refer to the Limitations section below.

  • Linking against the full cuDNN (compatibility option): This would allow users to use -lcudnn. libcudnn.so is provided as a shim layer that would open the appropriate cuDNN sub-library for any particular cuDNN API call. While libcudnn.a is largely unchanged, it is a statically linked file for all of cuDNN.
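
When linking against individual sub-libraries, the dependency hierarchy listed under Key Features and Enhancements determines which libraries must appear on the link line. The sketch below derives that set, with dependents placed before their dependencies, which is the order traditional single-pass linkers expect for static archives. The function names are hypothetical, the -l flag spelling assumes the usual lib<name>.so or lib<name>.a naming, and external dependencies of cuDNN are not covered.

```python
from typing import Dict, List

# Direct dependencies as stated in the sub-library descriptions.
DEPENDS_ON: Dict[str, List[str]] = {
    "cudnn_ops_infer": [],
    "cudnn_ops_train": ["cudnn_ops_infer"],
    "cudnn_cnn_infer": ["cudnn_ops_infer"],
    "cudnn_cnn_train": ["cudnn_ops_infer", "cudnn_ops_train", "cudnn_cnn_infer"],
    "cudnn_adv_infer": ["cudnn_ops_infer"],
    "cudnn_adv_train": ["cudnn_ops_infer", "cudnn_ops_train", "cudnn_adv_infer"],
}


def link_order(needed: List[str]) -> List[str]:
    """Return `needed` plus all dependencies, dependents before dependencies."""
    post_order: List[str] = []

    def visit(lib: str) -> None:
        if lib in post_order:
            return
        for dep in DEPENDS_ON[lib]:
            visit(dep)
        post_order.append(lib)             # appended only after its dependencies

    for lib in needed:
        visit(lib)
    return post_order[::-1]                # reverse: dependents come first


def linker_flags(needed: List[str]) -> str:
    """Format the ordered libraries as -l flags for a link command."""
    return " ".join(f"-l{lib}" for lib in link_order(needed))


cnn_training_flags = linker_flags(["cudnn_cnn_train"])
```

For an application that only uses CNN training routines, this yields four of the six sub-libraries and omits the two cudnn_adv libraries.
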

cuDNN loading options
For users who want a smaller memory footprint, there are 2 ways of loading the library.
  • Cherry-pick loading: Each sub-library is loaded only when accessed. This will cause the first reference to that sub-library to take a long time but will ensure the user isn’t loading more libraries than they need.

  • All access loading: All available cuDNN sub-libraries are loaded early during runtime.

New API functions

For a list of functions and data types that were added in this release, see API Changes For cuDNN 8.0.0.

General Support of CUDA Graph Capture
CUDA Graphs are now supported for all functions in this release, with the following restrictions:
  • CUDA Toolkit 10.2 or higher is required
  • cuDNN 8.0.0 graphs are captured via the CUDA graph-capture APIs
  • any non-default use of textures by users of cuDNN needs to be disabled prior to capture

cuDNN 8.0.0 does not at this time offer API support to add operations to an existing CUDA graph directly; however, the captured graph may be added to an existing graph through the existing CUDA Graphs API.

Regarding texture usage, cuDNN 8.0.0 by default will not enable texture usage; expert users may enable texture usage where allowed, but that usage will prevent a successful CUDA Graph capture until disabled. In order for cuDNN 8.0.0 to be graph-capture compatible library-wide, the cuDNN 8.0.0 CTC API was updated as described elsewhere.

The usual restrictions for CUDA Graphs apply in addition to these restrictions here.

New APIs for convolution

A new set of API functions provides a brand new approach to cuDNN that offers more fine-grained control of performance, numerical properties, and so on, for convolution. Using this API, users directly access various engines that compute convolution forward propagation, backward data, backward filter, and generic support for fusion, starting with limited support in this cuDNN 8.0.0 release and expanding in follow-up releases. Each engine has performance tuning knobs such as GEMM tiling and split-K. Users can use this API to fine-tune their network by querying cuDNN’s heuristics, or doing their own, to find the optimal engine configuration with which cuDNN computes each network layer.

NVIDIA Ampere GPU architecture support (not applicable for Jetson platforms)
  • Added support for A100 GPU based on NVIDIA Ampere architecture.
  • cuDNN 8.0.0 has seen significant improvements when using A100 GPUs compared to Volta V100 with cuDNN 7.6.
  • Added support for Tensor Float 32 (TF32) for 1D and 2D convolutions. Full support for TF32 will come in future releases such as grouped convolutions and 3D convolutions in addition to further performance tuning.
  • Increased performance for the legacy Tensor Cores (mixed precision) for 1D, 2D, 3D, and grouped convolutions.
Turing and Volta architecture improvements
  • New kernels for Tensor Cores and heuristics update for 1D convolution resulting in performance improvements for speech networks such as Jasper and Tacotron2 and WaveGlow, in addition to support for grouped 1D conv (QuartzNet).
  • Added 3D convolutions support of NHWC and improved heuristics and kernels for Tensor Cores in NCHW resulting in performance improvements for VNet, UNet-Medical and UNet-Industrial. Additionally, FP16 3D convolutions are supported as well.
  • Better utilization of Tensor Cores and heuristics for grouped convolutions result in improvements for ResNext.
  • More tuning for vision networks like ResNet-50 ([MXNet] [PyTorch] [TensorFlow]) and SSD ([PyTorch] [TensorFlow]) with new updated heuristics.
Operation fusion

Operation fusion can be achieved via the backend API. The general workflow is similar to running unfused operations, except that instead of creating a single operation Operation Graph, the user may specify a multi-operation Operation Graph. For more information, see Operation Fusion Via The Backend API in the cuDNN Developer Guide.

Depthwise convolution extension

We’ve extended the fprop and dgrad NHWC depthwise kernels to support more combinations (filter sizes/strides) such as 5x5/1x1, 5x5/2x2, 7x7/1x1, 7x7/2x2 (in addition to what we already have, 1x1/1x1, 3x3/1x1, 3x3/2x2), which provides good performance.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the cuDNN Support Matrix for 8.0.0.

Limitations

  • Samples must be installed in a writable location, otherwise the samples can crash.

  • RNN and multi-head attention API calls may exhibit non-deterministic behavior when the cuDNN 8.0.0 library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in the Results Reproducibility section in the cuBLAS Library User Guide, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multi-head attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • Some data types are not widely supported by all cuDNN APIs. For example, CUDNN_DATA_INT8x4 is not supported by many functions. In such cases, support is available by using cudnnTransformTensor() to transform the tensors from the desired type to a type supported by the API. For example, a user is able to transform input tensors from CUDNN_DATA_INT8x4 to CUDNN_DATA_INT8, run the desired API and then transform output tensors from CUDNN_DATA_INT8 to CUDNN_DATA_INT8x4. Note that this transformation will incur an extra round trip to memory.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.

  • Some computational options in cuDNN 8.0.0 now require increased alignment on tensors in order to run performantly. As always, cuDNN recommends users to align tensors to 128-bit boundaries which will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.0 compared to cuDNN v7.6.

  • For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.0 users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
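
The two alignment figures in the limitations above (a 4-byte minimum for tensor and filter pointers, and a recommended 128-bit boundary) can be checked with simple integer arithmetic on the pointer value. This is a minimal sketch with hypothetical helper names; it treats an address as a plain integer and says nothing about how a particular allocator aligns its results.

```python
MIN_ALIGNMENT_BYTES = 4                  # minimum stated for tensor and filter pointers
RECOMMENDED_ALIGNMENT_BYTES = 128 // 8   # 128-bit boundary, i.e. 16 bytes


def alignment_status(address: int) -> str:
    """Classify an address against the minimum and recommended alignment."""
    if address % RECOMMENDED_ALIGNMENT_BYTES == 0:
        return "recommended"
    if address % MIN_ALIGNMENT_BYTES == 0:
        return "minimum"
    return "unsupported"


def align_up(offset: int, alignment: int = RECOMMENDED_ALIGNMENT_BYTES) -> int:
    """Round `offset` up to the next multiple of `alignment`."""
    return (offset + alignment - 1) // alignment * alignment


# Carving a second tensor out of one allocation: round the byte offset up
# so the sub-buffer starts on a 128-bit boundary.
second_tensor_offset = align_up(100)
```

Rounding sub-buffer offsets up in this way keeps every tensor carved from a single allocation on the recommended boundary, provided the base address is itself aligned.
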

Deprecated Features

The following features are deprecated in cuDNN 8.0.0:
  • Support for Ubuntu 14.04 has been deprecated in this release. Upgrade to 16.04 or 18.04 for continued support.

  • Support for Mac OS X has been deprecated in this release. Operating systems that are currently supported are Linux and Windows.

  • cuDNN version 8 introduces a new API deprecation policy to enable a faster pace of innovation. A streamlined, two-step, deprecation policy will be used for all API changes starting with cuDNN version 8. For details about this new deprecation policy, see Backward Compatibility And Deprecation Policy in the cuDNN Developer Guide.

  • Removed and deprecated API changes. For a list of removed and deprecated APIs, see API Changes For cuDNN 8.0.0.

Fixed Issues

The following issues have been fixed in this release:

  • There is a known issue in that cudnnDestroy() does not destroy all that cudnnCreate() created. Calling cudnnDestroy() after cudnnCreate() has a memory leak in some tests of about 1.6 MB on host memory. This issue has been fixed in cuDNN 8.0.0.

  • Starting in cuDNN 7.6.1, when using the experimental multi-head attention API, it is possible that the forward and backward paths produce different results for the BERT model, when the batch size is greater than one and/or the number of heads is greater than one. This issue has been fixed in cuDNN 8.0.0.

  • The description of cudnnSetCTCLossDescriptorEx() is not clear. This issue has been fixed in cuDNN 8.0.0.

  • Documentation affecting 1x1 convolution functions is not clear, for example cudnnFindConvolutionBackwardDataAlgorithm(). This issue has been fixed in cuDNN 8.0.0.

  • cuDNN forward convolution with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM does not propagate NaNs in weights. This issue has been fixed in cuDNN 8.0.0.

  • The documentation lacked mathematical definitions of the operations in cuDNN. Full mathematical descriptions are now included for the convolution functions.

  • The functions cudnnGetConvolutionForwardAlgorithm_v7() and cudnnGetConvolutionForwardWorkspaceSize() may return CUDNN_STATUS_SUCCESS while the execution of the same convolution returns CUDNN_STATUS_NOT_SUPPORTED. Similar issues may also happen for convolutionBackwardData() and convolutionBackwardFilter(). This issue is present in cuDNN 7.2.2 library and later versions. This has been fixed in cuDNN 8.0.0.

  • Algorithms returned by cudnnGetConvolution*Algorithm() may, in some limited use cases, fail to execute when they are actually run. This is a cuDNN library-wide issue and applies to convolution forward, convolution backward data, and convolution backward filter operations. This issue is also present in versions prior to cuDNN 8.0.0 EA.

  • cuDNN does not support CUDA graphs. When launching a CUDA graph constructed via a stream capture that includes a cudnnConvolutionForward() operation, you may see a cudaErrorLaunchFailure error. This is because CUDA graphs were not supported. The user can proceed.

  • There was a known performance drop in 3D convolutions for some cases on Turing GPUs since cuDNN 7.4.2. This has been fixed on T4. (not applicable for Jetson platforms)

  • There are rare cases where cudnnConvolution* returns CUDNN_STATUS_NOT_SUPPORTED even though cudnn*GetWorkspaceSize returns success for the given algorithm. This has been fixed in cuDNN 8.0.0.

  • In previous versions of cuDNN, CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM did not propagate NaN values in some cases. This is fixed in the current release. Users desiring the old behavior can configure ReLU activation and set the floor to be -Inf.

  • The multiHeadAttention sample code was added to the cuDNN 7.6.3 release. The sample code includes a simple NumPy/Autograd reference model of the multi-head attention block that computes the forward response and all derivatives. The test code demonstrates how to use the multi-head attention API, access attention weights, and sequence data.

  • Updated: July 22, 2020

    In version 7.6.x, cudnnConvolutionBackwardData() with PSEUDO_HALF_CONFIG with CUDNN_TENSOR_OP_MATH or FLOAT_CONFIG with CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION returns incorrect results in 3D convolution when the filter size of the w dimension is 1 and padding of the w dimension is 0. This issue has been fixed in this release.

Known Issues

  • Performance regressions on V100 are observed in this release on SSD inference use cases if not using TensorRT.

  • There are significant performance regressions on pre-Volta GPUs and some Turing GPUs based on the TU102 architecture. This performance regression is not applicable to T4, JetPack, and Tegra.

  • Sub-optimal performance is present in this release for all INT8 convolutions for all GPUs.

  • cudnnConvolutionBiasActivationForward() is slower than in v7.6 in most cases. This is being actively worked on, and performance optimizations will be available in upcoming releases.

  • On K80 GPUs, when cudnnConvolutionForward() is used with the CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half input/output data types, a silent error might occur.

  • There are some peer-to-peer documentation links that are broken within the cuDNN API Reference. These links will be fixed in the next release.

  • cudnnCnnTrainVersionCheck() and cudnnCnnInferVersionCheck() are missing in this release and will be added in the GA release.

  • Documentation of RNN new APIs and deprecations is not complete. The cudnnRNNBackwardData_v8() and cudnnRNNBackwardWeights_v8() functions will be implemented in the next release.

  • cuDNN 8.0.0 Preview will not work with GA10x NVIDIA Ampere GPU architectures. This will be fixed in the next release.

  • cuDNN 8.0.0 Preview build with Windows and CUDA 11.0 RC has reduced performance on 2D, 3D, and grouped convolutions compared to Linux.

  • Updated: June 12, 2020

    There is a known issue in cuDNN 8.0.0 when linking statically to cuDNN and using the library's 3D algo1 backward filter convolutions. Users will see the library emit an internal error or incorrectly state that a shared library is missing. This is a bug that will be fixed in a future release.

  • Updated: June 25, 2019

    When using an RPM file on RedHat for installation, installing cuDNN v8 directly or via TensorRT 7.1.3 lets users build their applications with cuDNN v8. However, to compile an application with cuDNN v7 after cuDNN v8 is installed, the user needs to perform the following steps:
    1. Issue sudo mv /usr/include/cudnn.h /usr/include/cudnn_v8.h.
    2. Issue sudo ln -s /etc/alternatives/libcudnn /usr/include/cudnn.h.
    3. Switch to cuDNN v7 by issuing sudo update-alternatives --config libcudnn and choose cuDNN v7 from the list.

    Steps 1 and 2 are required for the user to be able to switch between v7 and v8 installations. After steps 1 and 2 are performed once, step 3 can be used repeatedly and the user can choose the appropriate cuDNN version to work with. For more information, refer to the Installing From An RPM File and Upgrading From v7 To v8 sections in the cuDNN Installation Guide.

  • Updated: July 22, 2020

    cudnnConvolutionForward(), cudnnConvolutionBackwardData(), and cudnnConvolutionBackwardFilter() erroneously return CUDNN_STATUS_INTERNAL_ERROR when the workspace size argument value is less than the required workspace size as returned by their respective cudnnGetWorkspace() API.

  • Updated: August 24, 2020

    cuDNN convolution APIs may return CUDNN_STATUS_EXECUTION_FAILED when the number of input or output channels equals or exceeds 2097152.
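
The last two known issues above (an undersized workspace and a channel count of 2097152 or more) both surface as generic error statuses, so an application may prefer to detect them before calling into cuDNN. The following is a minimal sketch in plain C under stated assumptions: conv_preflight and the preflight_result enum are hypothetical names, not cuDNN APIs, and required_bytes is assumed to be the value obtained from the matching workspace-size query, for example cudnnGetConvolutionForwardWorkspaceSize().

```c
#include <assert.h>
#include <stddef.h>

/* Channel count (2^21) at or above which the note above reports
 * CUDNN_STATUS_EXECUTION_FAILED. */
#define CONV_CHANNEL_LIMIT 2097152L

/* Hypothetical result codes for the application-side check. */
typedef enum {
    PREFLIGHT_OK = 0,
    PREFLIGHT_WORKSPACE_TOO_SMALL,
    PREFLIGHT_TOO_MANY_CHANNELS
} preflight_result;

/* Hypothetical helper: validate the arguments of a convolution call.
 * workspace_bytes is the size of the buffer the caller will pass in;
 * required_bytes is the size reported by the workspace-size query. */
static preflight_result conv_preflight(size_t workspace_bytes,
                                       size_t required_bytes,
                                       long in_channels,
                                       long out_channels)
{
    /* Channel counts are expected to be positive. */
    assert(in_channels > 0 && out_channels > 0);

    if (workspace_bytes < required_bytes) {
        return PREFLIGHT_WORKSPACE_TOO_SMALL;
    }
    if (in_channels >= CONV_CHANNEL_LIMIT ||
        out_channels >= CONV_CHANNEL_LIMIT) {
        return PREFLIGHT_TOO_MANY_CHANNELS;
    }
    return PREFLIGHT_OK;
}
```

Returning a distinct code for each condition lets the caller report a specific cause instead of CUDNN_STATUS_INTERNAL_ERROR or CUDNN_STATUS_EXECUTION_FAILED.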