cuDNN Release Notes

NVIDIA CUDA Deep Neural Network (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. It provides highly tuned implementations of routines arising frequently in DNN applications. These release notes describe the key features, software enhancements and improvements, and known issues for the NVIDIA cuDNN 8.9.7 and earlier releases.

1. cuDNN Release 8.x.x

1.1. cuDNN Release 8.9.7

These are the NVIDIA cuDNN 8.9.7 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Expanded support for FP16 and BF16 fused flash attention by adding Grouped Query Attention (GQA) for NVIDIA Hopper and NVIDIA Ampere GPUs. Added an additional reduction kernel for the gradient calculation of key and value tensors that have fewer heads than the query tensor. For more information, refer to the NVIDIA cuDNN Developer Guide.

Fixed Issues

The following issues have been fixed in this release:
  • When using the runtime fusion engine to compile convolution backward data related fusions, incorrect results could be produced if the convolution operation had both stride != 1 and dilation != 1. Support checks have been added so that CUDNN_STATUS_NOT_SUPPORTED is returned instead.
  • Running convolution backward data with batch size > 65535 and stride > 1 could fail with CUDNN_STATUS_NOT_SUPPORTED during execution. This issue has been fixed.

Known Issues

  • For the ConvBiasAct operation, CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 39 may return CUDNN_STATUS_EXECUTION_FAILED when running on a system with CUDA Toolkit version 11.0.3 through 11.1. Upgrading the CUDA Toolkit version to 11.1 Update 1 or later should resolve this issue.
  • ZLIB version 1.2.13 is statically linked into the cuDNN Windows dynamic libraries. Changing to the static linkage of ZLIB for other platforms will be addressed in future cuDNN releases.
  • A race condition in memory write accesses was flagged by the "compute-sanitizer" tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. Extensive testing on H100, with different clock speeds and computational loads, did not reveal any impact on functional results; the results were always identical and correct. This issue is currently being investigated.
  • cuDNN's usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated to avoid the most common issues; however, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, the known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operate on boolean tensors. This issue has been fixed in this release.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
  • Use of cudnnFusedOpsExecute() on NVIDIA Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights may see a performance regression when moving from cuDNN 8.8 to cuDNN 8.9.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet, with up to a 6x performance difference, and UNet and Tacotron, each with up to a 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the output dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim() (a sizing sketch follows this list).
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not been documented yet.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • There is a known regression when running some convolutions with filter size 1x1. The severity varies depending on which version of the CUDA Toolkit is used.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.
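
The following is a minimal sketch illustrating the output sizing referenced in the average-pooling item above. It assumes a 2D NCHW float problem; the window, padding, stride, and any input dimensions are placeholder values, not taken from these release notes, and error checking is omitted. The point is simply to size the output tensor from cudnnGetPooling2dForwardOutputDim() so that no output pixels fall outside the recommended extent.

    #include <cudnn.h>

    /* Sketch: size the pooling output from cuDNN's recommended dimensions. */
    void setup_avg_pooling(cudnnTensorDescriptor_t xDesc,        /* already describes the input */
                           cudnnPoolingDescriptor_t *poolDescOut,
                           cudnnTensorDescriptor_t *yDescOut) {
        cudnnPoolingDescriptor_t poolDesc;
        cudnnTensorDescriptor_t yDesc;
        int n, c, h, w;

        cudnnCreatePoolingDescriptor(&poolDesc);
        cudnnSetPooling2dDescriptor(poolDesc,
                                    CUDNN_POOLING_AVERAGE_COUNT_INCLUDE_PADDING,
                                    CUDNN_NOT_PROPAGATE_NAN,
                                    /*windowH=*/3, /*windowW=*/3,
                                    /*padH=*/1, /*padW=*/1,
                                    /*strideH=*/2, /*strideW=*/2);

        /* Query the output extent cuDNN recommends for this configuration ... */
        cudnnGetPooling2dForwardOutputDim(poolDesc, xDesc, &n, &c, &h, &w);

        /* ... and create the output tensor with exactly those dimensions. */
        cudnnCreateTensorDescriptor(&yDesc);
        cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

        *poolDescOut = poolDesc;
        *yDescOut = yDesc;
    }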

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
  • When using the cuDNN static library, you will need to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. Refer to the NVIDIA cuDNN Support Matrix for the exact supported CUDA versions.
  • In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
  • Within the cuDNN version 8 backend API, the following engines are known not to be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 1. Engines That Are Not Thread-Safe (CUDNN_ATTR_ENGINE_GLOBAL_INDEX values per operation)
    A100
    • convolution forward: 15, 20, 21, 22, 23, 25, 27, 34, 35, 36, 37, 38, 39, 40, 41, 43, 45, 46, 47, 48, 57, 59, 60
    • convolution backward data: 1, 2, 3, 19, 20, 22, 23, 24, 25, 26, 27, 28, 32, 33, 34, 35, 36, 40, 43, 46, 49, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 62, 64, 65
    • convolution backward filter: 8, 9, 10, 13, 15, 21, 22, 23, 24, 25, 29, 30, 31, 33, 34, 37, 38, 39, 40, 45, 46, 47, 48, 49, 50, 55, 56, 61
    • cudnnConvolutionBiasActivationForward: 4005, 4006, 4007, 4010, 4017, 4020, 4023, 4024, 4025, 4026, 4027, 4028, 4029, 4032, 4033, 4037, 4038
    V100
    • convolution forward: 7, 8, 9, 10, 11, 12, 16, 17, 26, 29, 30, 31, 32, 33, 34, 42, 43, 49, 57
    • convolution backward data: 1, 2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 16, 19, 21, 25, 26, 29, 37, 38, 41, 44, 47, 62
    • convolution backward filter: 2, 3, 6, 7, 11, 14, 21, 32, 35, 36, 42, 43, 44, 51, 52, 61
    • cudnnConvolutionBiasActivationForward: 4001, 4002, 4008, 4015, 4016, 4017, 4018, 4019, 4020, 4021, 4030
    T4
    • convolution forward: 9, 12, 13, 14, 15, 16, 18, 19, 24, 26, 30, 31, 34, 42, 43, 50, 57
    • convolution backward data: 1, 2, 3, 8, 9, 12, 13, 15, 18, 19, 21, 25, 26, 29, 30, 31, 37, 39, 42, 44, 45, 48, 62
    • convolution backward filter: 5, 12, 21, 26, 27, 28, 32, 43, 44, 51, 53, 54, 61
    • cudnnConvolutionBiasActivationForward: 4003, 4004, 4007, 4009, 4015, 4017, 4018, 4019, 4020, 4022, 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The functional support criteria of cuDNN's convolution kernels are not required to consider padding. As a result, users can see an unexpected lack of problem support when the forward convolution spatial dimensions are less than the filter size, even though padding is nonzero and sufficient to extend the spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping; however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backward data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backward data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals or exceeds 2^31. This issue has been present in all previous releases of cuDNN, and exercising this use case with the engine would produce incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of the cuDNN RNN and multihead attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2 (a CUBLAS_WORKSPACE_CONFIG sketch follows this list).

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory, while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends aligning tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based loads for performance improvements, particularly on older hardware architectures. Users could opt out of textures using the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions (a knob-setting sketch follows this list).
  • In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912, that is, 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library before beginning graph capture (a graph-capture sketch follows this list).
  • Users of cuDNN must explicitly add the cuBLAS dependencies to the linker command to resolve the undefined symbols from the cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead.
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this is specifying the ops library after cnn on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.
  • The fused attention and flash attention runtime engines have been disabled for NVRTC 11.8 due to compiler limitations.
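
The following is a minimal sketch of the CUBLAS_WORKSPACE_CONFIG setting described in the RNN and multihead attention limitation above. It assumes a POSIX system (setenv); on Windows the variable would be set in the process environment by other means. The value :4096:2 is one of the two documented single-size configurations; :16:8 works equally well.

    #include <stdlib.h>   /* setenv */
    #include <cudnn.h>

    int main(void) {
        /* Pick a single cuBLAS workspace buffer size (here, two 4 MB buffers)
           before any cuDNN/cuBLAS work is issued, so kernel selection does not
           depend on which buffer happens to be free. */
        setenv("CUBLAS_WORKSPACE_CONFIG", ":4096:2", /*overwrite=*/1);

        cudnnHandle_t handle;
        cudnnCreate(&handle);
        /* ... set up and run RNN or multihead attention work as usual ... */
        cudnnDestroy(handle);
        return 0;
    }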
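
The sketch below outlines, at a high level, how the CUDNN_KNOB_TYPE_USE_TEX knob mentioned above can be attached to an engine configuration through the backend API. It assumes an already-finalized engine descriptor obtained from heuristics or built by global index; error handling is omitted, and whether the knob is accepted depends on the particular engine.

    #include <cudnn.h>

    /* Sketch: request texture-based loads (knob value 1) on an engine config. */
    void set_use_tex_knob(cudnnBackendDescriptor_t engine,       /* finalized ENGINE descriptor */
                          cudnnBackendDescriptor_t *engcfgOut) {
        cudnnBackendDescriptor_t choice, engcfg;
        cudnnBackendKnobType_t knobType = CUDNN_KNOB_TYPE_USE_TEX;
        int64_t knobValue = 1;

        /* Describe the knob choice: which knob, and what value. */
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_KNOB_CHOICE_DESCRIPTOR, &choice);
        cudnnBackendSetAttribute(choice, CUDNN_ATTR_KNOB_CHOICE_KNOB_TYPE,
                                 CUDNN_TYPE_KNOB_TYPE, 1, &knobType);
        cudnnBackendSetAttribute(choice, CUDNN_ATTR_KNOB_CHOICE_KNOB_VALUE,
                                 CUDNN_TYPE_INT64, 1, &knobValue);
        cudnnBackendFinalize(choice);

        /* Attach the knob choice to an engine config built from the engine. */
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &engcfg);
        cudnnBackendSetAttribute(engcfg, CUDNN_ATTR_ENGINECFG_ENGINE,
                                 CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &engine);
        cudnnBackendSetAttribute(engcfg, CUDNN_ATTR_ENGINECFG_KNOB_CHOICES,
                                 CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &choice);
        cudnnBackendFinalize(engcfg);

        *engcfgOut = engcfg;   /* use with an execution plan as usual */
    }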
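
For the graph-capture limitation above, the ordering matters: the sub-library version check loads the kernels before capture begins. The sketch below assumes the captured work uses only the ops inference sub-library; call the corresponding check for whichever sub-libraries your captured work touches.

    #include <cuda_runtime.h>
    #include <cudnn.h>

    void capture_cudnn_work(cudnnHandle_t handle, cudaStream_t stream, cudaGraph_t *graph) {
        /* Load the sub-library's kernels before capture begins; module loading
           inside a capture region is not supported. */
        cudnnOpsInferVersionCheck();

        cudnnSetStream(handle, stream);
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        /* ... enqueue cuDNN inference calls on `stream` here ... */
        cudaStreamEndCapture(stream, graph);
    }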

1.2. cuDNN Release 8.9.6

These are the NVIDIA cuDNN 8.9.6 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • cuDNN now supports RMS normalization: the forward training and backward passes with CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3, and the forward inference pass with CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 4.
  • Expanded support of FP16 and BF16 flash attention by adding the gradient for relative positional encoding on NVIDIA Hopper GPUs.
  • FP16 and BF16 fused flash attention engine fprop performance has been improved for NVIDIA Hopper and Ampere GPUs giving a speed-up of up to 20% over cuDNN 8.9.5.
  • H100 only: For input fusion of matrix multiplications, tensor A now supports an arbitrary input type with scalar, row, and column broadcast or full tensor operations, and the main-loop fusion graph can contain branches. Inputs will be automatically cast to the compute type, and results will be converted to the data type specified by the user.

    We recommend setting intermediate data to float to avoid unnecessary type conversions. Tensors A and B no longer need to use the same data type.
    Note: The A/B inputs to the matrix multiplication operation should have matching data types. Otherwise, an explicit CUDNN_POINTWISE_IDENTITY operation is needed to perform the data type conversion after main-loop fusion.
  • Updated the runtime fusion engine heuristics on the NVIDIA Hopper architecture for FP16 and BF16 Matmul operations and for INT8 Convolution and Matmul operations to support a new batched GEMM kernel, improving both fusion performance and compilation time.

Fixed Issues

The following issues have been fixed in this release:
  • Fixed a race condition where a kernel hang was observed in the ConvBNfprop pattern matching engine on the NVIDIA Hopper architecture when multiple threads were executing concurrently and the system was under heavy load.
  • Fixed the behavior note for the CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 instance normalization engine.
  • On NVIDIA Hopper architectures, incorrect results were possible in the backpropagation of FP16 and BF16 flash attention with dropout enabled. This issue has been fixed.
  • A possible race condition when multiple threads executed the same execution plan in BF16 and FP16 flash attention has been fixed.
  • On Hopper architectures, FP16 and BF16 fused flash attention training required the gradient layout and the input layout to be the same. This issue has been fixed.
  • Fixed typos in the sample code of Use Cases in the cuDNN API Reference documentation.

Known Issues

  • For the ConvBiasAct operation, CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 39 may return CUDNN_STATUS_EXECUTION_FAILED when running on a system with CUDA Toolkit version 11.0.3 through 11.1. Upgrading the CUDA Toolkit version to 11.1 Update 1 or later should resolve this issue.
  • ZLIB version 1.2.13 is statically linked into the cuDNN Windows dynamic libraries. Changing to the static linkage of ZLIB for other platforms will be addressed in future cuDNN releases.
  • A race condition in memory write accesses was flagged by the "compute-sanitizer" tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. Extensive testing on H100, with different clock speeds and computational loads, did not reveal any impact on functional results; the results were always identical and correct. This issue is currently being investigated.
  • cuDNN's usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated to avoid the most common issues; however, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, the known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operate on boolean tensors. This issue has been fixed in this release.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
  • Use of cudnnFusedOpsExecute() on NVIDIA Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights may see a performance regression when moving from cuDNN 8.8 to cuDNN 8.9.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet, with up to a 6x performance difference, and UNet and Tacotron, each with up to a 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the output dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not been documented yet.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • There is a known regression when running some convolutions with filter size 1x1. The severity varies depending on which version of the CUDA Toolkit is used.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
  • When using the cuDNN static library, you will need to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. Refer to the NVIDIA cuDNN Support Matrix for the exact supported CUDA versions.
  • In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
  • Within the cuDNN version 8 backend API, the following engines are known not to be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 2. Engines That Are Not Thread-Safe (CUDNN_ATTR_ENGINE_GLOBAL_INDEX values per operation)
    A100
    • convolution forward: 15, 20, 21, 22, 23, 25, 27, 34, 35, 36, 37, 38, 39, 40, 41, 43, 45, 46, 47, 48, 57, 59, 60
    • convolution backward data: 1, 2, 3, 19, 20, 22, 23, 24, 25, 26, 27, 28, 32, 33, 34, 35, 36, 40, 43, 46, 49, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 62, 64, 65
    • convolution backward filter: 8, 9, 10, 13, 15, 21, 22, 23, 24, 25, 29, 30, 31, 33, 34, 37, 38, 39, 40, 45, 46, 47, 48, 49, 50, 55, 56, 61
    • cudnnConvolutionBiasActivationForward: 4005, 4006, 4007, 4010, 4017, 4020, 4023, 4024, 4025, 4026, 4027, 4028, 4029, 4032, 4033, 4037, 4038
    V100
    • convolution forward: 7, 8, 9, 10, 11, 12, 16, 17, 26, 29, 30, 31, 32, 33, 34, 42, 43, 49, 57
    • convolution backward data: 1, 2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 16, 19, 21, 25, 26, 29, 37, 38, 41, 44, 47, 62
    • convolution backward filter: 2, 3, 6, 7, 11, 14, 21, 32, 35, 36, 42, 43, 44, 51, 52, 61
    • cudnnConvolutionBiasActivationForward: 4001, 4002, 4008, 4015, 4016, 4017, 4018, 4019, 4020, 4021, 4030
    T4
    • convolution forward: 9, 12, 13, 14, 15, 16, 18, 19, 24, 26, 30, 31, 34, 42, 43, 50, 57
    • convolution backward data: 1, 2, 3, 8, 9, 12, 13, 15, 18, 19, 21, 25, 26, 29, 30, 31, 37, 39, 42, 44, 45, 48, 62
    • convolution backward filter: 5, 12, 21, 26, 27, 28, 32, 43, 44, 51, 53, 54, 61
    • cudnnConvolutionBiasActivationForward: 4003, 4004, 4007, 4009, 4015, 4017, 4018, 4019, 4020, 4022, 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The functional support criteria of cuDNN's convolution kernels are not required to consider padding. As a result, users can see an unexpected lack of problem support when the forward convolution spatial dimensions are less than the filter size, even though padding is nonzero and sufficient to extend the spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping; however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backward data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backward data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals or exceeds 2^31. This issue has been present in all previous releases of cuDNN, and exercising this use case with the engine would produce incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends aligning tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based loads for performance improvements, particularly on older hardware architectures. Users could opt out of textures using the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912, that is, 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library before beginning graph capture.
  • Users of cuDNN must explicitly add the cuBLAS dependencies to the linker command to resolve the undefined symbols from the cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead.
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this is specifying the ops library after cnn on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.
  • The fused attention and flash attention runtime engines have been disabled for NVRTC 11.8 due to compiler limitations.

1.3. cuDNN Release 8.9.5

These are the NVIDIA cuDNN 8.9.5 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Announcements

  • cuDNN for CUDA 12.x no longer supports Ubuntu 18.04.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Layer normalization forward pass and backward pass performance has been improved on NVIDIA Ampere and NVIDIA Hopper GPUs for popular large language models (LLM), providing a speedup of nearly 2x-40x over cuDNN 8.9.4.
  • SM carveout is now supported on NVIDIA Hopper, whereby you can set the number of SMs to target for parallel execution. This feature allows expert users to reserve SMs for concurrent execution on a separate CUDA stream.
  • FP16 and BF16 fused flash attention engine bprop performance has been improved for NVIDIA Hopper GPUs giving a speedup of nearly 2x-3x over cuDNN 8.9.4.
  • For small batch sizes, FP8 fused flash attention is 25% - 50% faster compared to cuDNN 8.9.4.

Fixed Issues

The following issues have been fixed in this release:
  • For NVIDIA Hopper architectures, we fixed a memory leak in the DgradDreluBNBwdWeight, ConvBNwgrad, and ConvBNfprop pattern matching engines.
  • A race condition that could cause numerically inaccurate results when computing batch normalization in a multi-GPU setting has been corrected. This issue affected all previous 8.9 releases for batch norm engines with CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0.
  • On NVIDIA Ampere and Hopper architectures, incorrect results were possible in the backpropagation of FP16 and BF16 flash attention with dropout enabled. This issue has been fixed.
  • On NVIDIA Ampere and Hopper architectures, incorrect results were possible when using variable sequence lengths and when the sequence length was not a multiple of 128. This issue has been fixed.
  • For NVIDIA Hopper architectures, for DgradDreluBNBwdWeight, ConvBNwgrad, and ConvBNfprop matched engines, static CGA was deprecated and dynamic CGA is supported. You can only use CUDNN_KNOB_TYPE_TILE_CGA_M and CUDNN_KNOB_TYPE_TILE_CGA_N knobs to set CGA size. The CUDNN_KNOB_TYPE_TILE_CGA knobs are deprecated for DgradDreluBNBwdWeight, ConvBNwgrad, and ConvBNfprop engines.
  • The documentation of cudnnDestroy has been updated to reflect the proper ordering of this call with respect to destroying the CUDA context.
  • FP8 fused flash attention was slower than FP16 fused flash attention for small batch sizes. This issue has been fixed.

Known Issues

  • On Hopper architectures, FP16 and BF16 fused flash attention training needs the layout of the gradient and the input layout to be the same.
  • ZLIB version 1.2.13 is statically linked into the cuDNN Windows dynamic libraries. Changing to the static linkage of ZLIB for other platforms will be addressed in future cuDNN releases.
  • A race condition in memory write accesses was flagged by the "compute-sanitizer" tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. Extensive testing on H100, with different clock speeds and computational loads, did not reveal any impact on functional results; the results were always identical and correct. This issue is currently being investigated.
  • cuDNN's usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated to avoid the most common issues; however, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, the known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operate on boolean tensors. This issue has been fixed in this release.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
  • Use of cudnnFusedOpsExecute() on NVIDIA Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights may see a performance regression when moving from cuDNN 8.8 to cuDNN 8.9.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet, with up to a 6x performance difference, and UNet and Tacotron, each with up to a 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the output dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not been documented yet.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • There is a known regression when running some convolutions with filter size 1x1. The severity varies depending on which version of the CUDA Toolkit is used.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
  • When using the cuDNN static library, you will need to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. Refer to the NVIDIA cuDNN Support Matrix for the exact supported CUDA versions.
  • In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
  • Within the cuDNN version 8 backend API, the following engines are known not to be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 3. Engines That Are Not Thread-Safe (CUDNN_ATTR_ENGINE_GLOBAL_INDEX values per operation)
    A100
    • convolution forward: 15, 20, 21, 22, 23, 25, 27, 34, 35, 36, 37, 38, 39, 40, 41, 43, 45, 46, 47, 48, 57, 59, 60
    • convolution backward data: 1, 2, 3, 19, 20, 22, 23, 24, 25, 26, 27, 28, 32, 33, 34, 35, 36, 40, 43, 46, 49, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 62, 64, 65
    • convolution backward filter: 8, 9, 10, 13, 15, 21, 22, 23, 24, 25, 29, 30, 31, 33, 34, 37, 38, 39, 40, 45, 46, 47, 48, 49, 50, 55, 56, 61
    • cudnnConvolutionBiasActivationForward: 4005, 4006, 4007, 4010, 4017, 4020, 4023, 4024, 4025, 4026, 4027, 4028, 4029, 4032, 4033, 4037, 4038
    V100
    • convolution forward: 7, 8, 9, 10, 11, 12, 16, 17, 26, 29, 30, 31, 32, 33, 34, 42, 43, 49, 57
    • convolution backward data: 1, 2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 16, 19, 21, 25, 26, 29, 37, 38, 41, 44, 47, 62
    • convolution backward filter: 2, 3, 6, 7, 11, 14, 21, 32, 35, 36, 42, 43, 44, 51, 52, 61
    • cudnnConvolutionBiasActivationForward: 4001, 4002, 4008, 4015, 4016, 4017, 4018, 4019, 4020, 4021, 4030
    T4
    • convolution forward: 9, 12, 13, 14, 15, 16, 18, 19, 24, 26, 30, 31, 34, 42, 43, 50, 57
    • convolution backward data: 1, 2, 3, 8, 9, 12, 13, 15, 18, 19, 21, 25, 26, 29, 30, 31, 37, 39, 42, 44, 45, 48, 62
    • convolution backward filter: 5, 12, 21, 26, 27, 28, 32, 43, 44, 51, 53, 54, 61
    • cudnnConvolutionBiasActivationForward: 4003, 4004, 4007, 4009, 4015, 4017, 4018, 4019, 4020, 4022, 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The functional support criteria of cuDNN's convolution kernels are not required to consider padding. As a result, users can see an unexpected lack of problem support when the forward convolution spatial dimensions are less than the filter size, even though padding is nonzero and sufficient to extend the spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping; however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backward data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backward data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals or exceeds 2^31. This issue has been present in all previous releases of cuDNN, and exercising this use case with the engine would produce incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends aligning tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based loads for performance improvements, particularly on older hardware architectures. Users could opt out of textures using the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912, that is, 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library before beginning graph capture.
  • Users of cuDNN must explicitly add the cuBLAS dependencies to the linker command to resolve the undefined symbols from the cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead.
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this is specifying the ops library after cnn on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.
  • The fused attention and flash attention runtime engines have been disabled for NVRTC 11.8 due to compiler limitations.

1.4. cuDNN Release 8.9.4

These are the NVIDIA cuDNN 8.9.4 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • FP16 and BF16 fused flash attention engine fprop performance has been improved for NVIDIA Hopper GPUs giving a speedup of nearly 2x-3x over cuDNN 8.9.3.

Fixed Issues

The following issues have been fixed in this release:
  • Fixed an illegal memory access (IMA) in the grouped_direct kernel for tensor sizes greater than 2^32 / [size of input element data type].
  • Fixed the large overhead of enabling dropout in the FP16 and BF16 fused flash attention engine for training.
  • Fixed the output for the dx tensor in FP8 backwards data grouped convolutions. Previously, the values for only the first group were written, with zeros written elsewhere.

Known Issues

  • On NVIDIA Ampere and Hopper architectures, incorrect results are possible when using variable sequence lengths and when the sequence length is not a multiple of 128.
  • ZLIB version 1.2.13 is statically linked into the cuDNN Windows dynamic libraries. Changing to the static linkage of ZLIB for other platforms will be addressed in future cuDNN releases.
  • For NVIDIA Hopper architectures, for DgradDreluBNBwdWeight, ConvBNwgrad, and ConvBNfprop matched engines, static CGA is deprecated and dynamic CGA is supported. You can only use CUDNN_KNOB_TYPE_TILE_CGA_M and CUDNN_KNOB_TYPE_TILE_CGA_N knobs to set CGA size. The CUDNN_KNOB_TYPE_TILE_CGA knobs are deprecated for DgradDreluBNBwdWeight, ConvBNwgrad, and ConvBNfprop engines.
  • A race condition in memory write accesses was flagged by the "compute-sanitizer" tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. Extensive testing on H100, with different clock speeds and computational loads, did not reveal any impact on functional results; the results were always identical and correct. This issue is currently being investigated.
  • The FP8 fused flash attention is known to be slower than FP16 fused flash attention for small batch sizes.
  • cuDNN's usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors. This issue has been fixed in this release.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
  • Use of cudnnFusedOpsExecute() on NVIDIA Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights may see a performance regression when moving from cuDNN 8.8 to cuDNN 8.9.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the output dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not yet been documented.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
  • When using the cuDNN static library, you will need to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. Refer to the NVIDIA cuDNN Support Matrix for the exact supported CUDA versions.
  • In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
  • Within the cuDNN version 8 backend API, the following engines are known not to be thread-safe when executed simultaneously by multiple threads sharing the same execution plan (a serialization sketch is provided at the end of this Limitations list):
    Table 4. Engines That Are Not Thread-Safe
    Engine indices (CUDNN_ATTR_ENGINE_GLOBAL_INDEX) listed by GPU and operation:
    A100
    • convolution forward: 15, 20, 21, 22, 23, 25, 27, 34, 35, 36, 37, 38, 39, 40, 41, 43, 45, 46, 47, 48, 57, 59, 60
    • convolution backward data: 1, 2, 3, 19, 20, 22, 23, 24, 25, 26, 27, 28, 32, 33, 34, 35, 36, 40, 43, 46, 49, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 62, 64, 65
    • convolution backward filter: 8, 9, 10, 13, 15, 21, 22, 23, 24, 25, 29, 30, 31, 33, 34, 37, 38, 39, 40, 45, 46, 47, 48, 49, 50, 55, 56, 61
    • cudnnConvolutionBiasActivationForward: 4005, 4006, 4007, 4010, 4017, 4020, 4023, 4024, 4025, 4026, 4027, 4028, 4029, 4032, 4033, 4037, 4038
    V100
    • convolution forward: 7, 8, 9, 10, 11, 12, 16, 17, 26, 29, 30, 31, 32, 33, 34, 42, 43, 49, 57
    • convolution backward data: 1, 2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 16, 19, 21, 25, 26, 29, 37, 38, 41, 44, 47, 62
    • convolution backward filter: 2, 3, 6, 7, 11, 14, 21, 32, 35, 36, 42, 43, 44, 51, 52, 61
    • cudnnConvolutionBiasActivationForward: 4001, 4002, 4008, 4015, 4016, 4017, 4018, 4019, 4020, 4021, 4030
    T4
    • convolution forward: 9, 12, 13, 14, 15, 16, 18, 19, 24, 26, 30, 31, 34, 42, 43, 50, 57
    • convolution backward data: 1, 2, 3, 8, 9, 12, 13, 15, 18, 19, 21, 25, 26, 29, 30, 31, 37, 39, 42, 44, 45, 48, 62
    • convolution backward filter: 5, 12, 21, 26, 27, 28, 32, 43, 44, 51, 53, 54, 61
    • cudnnConvolutionBiasActivationForward: 4003, 4004, 4007, 4009, 4015, 4017, 4018, 4019, 4020, 4022, 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX =1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location; otherwise, they can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups. A minimal sketch of selecting one configuration programmatically is shown below.
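
    The following is a minimal sketch (not an official recommendation) that pins cuBLAS to one workspace buffer size before the first cuDNN or cuBLAS handle is created; setting the variable in the shell before launching the application works equally well, and the surrounding program is illustrative only.

      // Minimal sketch: select a single cuBLAS workspace buffer size before any
      // handle is created so that RNN and multihead attention results are
      // reproducible across streams.
      #include <stdlib.h>   // setenv (POSIX)
      #include <cudnn.h>

      int main() {
          // Two buffers of 4 MB each; ":16:8" (eight 16 KB buffers) also works.
          setenv("CUBLAS_WORKSPACE_CONFIG", ":4096:2", 1);

          cudnnHandle_t handle;
          if (cudnnCreate(&handle) != CUDNN_STATUS_SUCCESS) return 1;
          // ... build and run RNN / multihead attention work here ...
          cudnnDestroy(handle);
          return 0;
      }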

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. cuDNN recommends aligning tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use a texture-based load structure for performance improvements, particularly on older hardware architectures. Users can opt out of texture-based loads by using the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue to use texture-based loads can adopt the new backend API and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must explicitly add the cuBLAS dependencies to the linker command to resolve undefined symbols from the cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this means specifying the ops library after the cnn library on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.
  • The fused attention and flash attention runtime engines have been disabled for NVRTC 11.8 due to compiler limitations.
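
As referenced in the thread-safety limitation above, the following is a minimal sketch, not an official workaround, of serializing execution of a shared plan. It assumes plan and variantPack are finalized backend descriptors owned by the caller; giving each thread its own execution plan (and cuDNN handle) avoids the issue entirely.

    // Minimal sketch: guard cudnnBackendExecute() with a mutex when several
    // threads must share one execution plan built on a non-thread-safe engine.
    #include <mutex>
    #include <cudnn.h>

    static std::mutex g_sharedPlanMutex;

    cudnnStatus_t executeSharedPlan(cudnnHandle_t handle,
                                    cudnnBackendDescriptor_t plan,
                                    cudnnBackendDescriptor_t variantPack) {
        std::lock_guard<std::mutex> lock(g_sharedPlanMutex);  // one thread at a time
        return cudnnBackendExecute(handle, plan, variantPack);
    }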

1.5. cuDNN Release 8.9.3

These are the NVIDIA cuDNN 8.9.3 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Major improvements in the FP16/BF16 fused flash attention engine:
    • The supported layouts of the Q, K, V, and O tensors and their gradients have been generalized. Now, the only requirement is that the d dimension must have stride 1; all other dimensions can have arbitrary strides.
    • The engine now supports both self attention and cross attention.
    • The engine now allows online generation of padding mask and causal mask; each one can be turned on/off independently.
    • The engine now supports an extra additive bias passed in as a tensor after the first batch MatMul and before the Softmax, which can be used to pass in customized masks (for example, Alibi).
    • The engine now supports multi-query attention, where the head-count dimension of the K and V tensors is 1.
    • The engine now supports (padded) variable sequence length computation.
    • The engine now supports head dimensions to be 64 or 128.
    • The engine now supports arbitrary sequence length, batch size, and headcount.
    • The engine now supports dropout being either enabled or disabled.
    • The engine now allows the dropout mask to be passed in as a tensor rather than being generated inside the kernel.
    • The engine now allows the mask before the Softmax to be an input tensor created by the user, rather than being generated online.
  • Improved CUDNN_HEUR_MODE_A heuristic recommendations for depthwise convolution on NVIDIA Hopper GPUs.
  • Improved error reporting during cuDNN handle creation. For more information on how to enable error reporting, refer to the NVIDIA cuDNN Developer Guide.

Fixed Issues

The following issues have been fixed in this release:
  • Use of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 17 for fused convolutions with bias and ReLU could generate incorrect results on the Maxwell GPU architectures. This issue is now fixed in 8.9.3.
  • Running Scale-Bias-Add-Relu-Conv-GenStats (refer to ConvBNfprop) with CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 could generate incorrect results on NVIDIA Ampere GPUs. This applies whether or not dual tensors are set in the cited pattern. This pattern is no longer supported on NVIDIA Ampere GPUs.
  • A bug in FP16/BF16 fused flash attention for large problem sizes (that is, when b * h * s * d > 512 * 64 * the number of SMs) that resulted in incorrect values being computed has been fixed.

Known Issues

  • A race condition in memory write accesses was flagged by the "compute-sanitizer" tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. Extensive testing on H100, with different clock speeds and computational loads, did not reveal any impact on functional results that were always identical and correct. This issue is currently being investigated.
  • The FP8 fused flash attention is known to be slower than FP16 fused flash attention for small batch sizes.
  • cuDNN's usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors. This issue has been fixed in this release.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
  • Use of cudnnFusedOpsExecute() on NVIDIA Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights may see a performance regression when moving from cuDNN 8.8 to cuDNN 8.9.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the output dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not yet been documented.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
  • When using the cuDNN static library, you will need to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. Refer to the NVIDIA cuDNN Support Matrix for the exact supported CUDA versions.
  • In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
  • Within the cuDNN version 8 backend API, the following engines are known not to be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 5. Engines That Are Not Thread-Safe
    Engine indices (CUDNN_ATTR_ENGINE_GLOBAL_INDEX) listed by GPU and operation:
    A100
    • convolution forward: 36, 38, 45, 46, 47, 48
    • convolution backward data: 1, 2, 3, 19, 22, 25, 26, 28, 40, 46, 51, 56, 57, 58, 59, 60, 65
    • convolution backward filter: 9, 10, 21, 23, 33, 37, 47, 48, 49, 50
    • cudnnConvolutionBiasActivationForward: 4024, 4026, 4032, 4033
    V100
    • convolution forward: 8, 9, 10, 12, 16, 26, 30, 31, 34, 42, 49
    • convolution backward data: 1, 2, 3, 8, 12, 13, 19, 21, 25, 26, 29, 37, 38, 44
    • convolution backward filter: 6, 7, 21, 35, 36, 43, 44, 51, 52
    • cudnnConvolutionBiasActivationForward: 4002, 4015, 4008, 4018, 4019, 4020, 4030
    T4
    • convolution forward: 9, 12, 13, 14, 16, 18, 26, 30, 31, 34, 42, 50
    • convolution backward data: 1, 2, 3, 8, 9, 12, 13, 15, 18, 19, 21, 25, 26, 29, 30, 37, 39, 44, 45
    • convolution backward filter: 12, 21, 27, 28, 43, 44, 51, 53
    • cudnnConvolutionBiasActivationForward: 4015, 4003, 4004, 4009, 4018, 4019, 4020, 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX =1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location; otherwise, they can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. cuDNN recommends aligning tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use a texture-based load structure for performance improvements, particularly on older hardware architectures. Users can opt out of texture-based loads by using the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue to use texture-based loads can adopt the new backend API and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library before opening graph capture (see the sketch at the end of this list).
  • Users of cuDNN must explicitly add the cuBLAS dependencies to the linker command to resolve undefined symbols from the cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this means specifying the ops library after the cnn library on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.
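
As referenced in the graph-capture limitation above, the following is a minimal sketch assuming the cuDNN work is captured into a CUDA graph on a dedicated stream; the helper name and its arguments are illustrative only.

    // Minimal sketch: force the cudnn_ops kernels to load (via the version
    // check) before stream capture begins, so module loading does not occur
    // inside the captured region.
    #include <cuda_runtime.h>
    #include <cudnn.h>

    void captureCudnnWork(cudnnHandle_t handle, cudaStream_t stream) {
        cudnnOpsInferVersionCheck();   // loads the cudnn_ops_infer kernels
        cudnnSetStream(handle, stream);

        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        // ... enqueue cuDNN calls on `stream` here ...
        cudaStreamEndCapture(stream, &graph);
        // Instantiate and launch with cudaGraphInstantiate()/cudaGraphLaunch().
    }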

1.6. cuDNN Release 8.9.2

These are the NVIDIA cuDNN 8.9.2 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Improvements in the fused flash attention runtime fusion engine:
    • Added new performance knobs for the engine.
    • Updated heuristics for the engine for both fprop and bprop.
  • Extended the fused (non-flash) attention engine to accept the dropout mask as an input rather than generating it inside the kernel.
  • Improved the performance of the runtime fusion engine for the NVIDIA Hopper architecture and improved the compilation time.
  • Updated runtime fusion engine heuristics for FP16 and FP8 convolution fprop, dgrad, and Matmul operations to support new kernels on the NVIDIA Hopper architecture.
  • Relaxed the limitation that only one reduction operation is allowed in the DAG of g2 of Generic Runtime Fusion Engines.

Fixed Issues

The following issues have been fixed in this release:
  • Use of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 15 for fused convolutions with bias and ReLU could generate incorrect results on the Pascal GPU architectures for cuDNN 8.9 releases. This issue is now fixed in 8.9.2. Users of cudnnConvolutionBiasActivationForward would have been similarly affected.
  • Additional fixes were implemented in cudnnRNNBackwardWeights_v8() to harden the process of transferring the variable sequence length array from the RNN data descriptor to device memory. As in cuDNN 8.9.1, the const int32_t devSeqLengths[] argument is no longer used and can be set to NULL by the user.

Known Issues

  • The FP8 fused flash attention is known to be slower than FP16 fused flash attention for small batch sizes.
  • cuDNN's usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors. This issue has been fixed in this release.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
  • Use of cudnnFusedOpsExecute() on NVIDIA Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights may see a performance regression when moving from cuDNN 8.8 to cuDNN 8.9.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the output dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not yet been documented.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
  • There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
  • When using the cuDNN static library, you will need to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. Refer to the NVIDIA cuDNN Support Matrix for the exact supported CUDA versions.
  • In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
  • Within the cuDNN version 8 backend API, the following engines are known not to be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 6. Engines That Are Not Thread-Safe
    Engine indices (CUDNN_ATTR_ENGINE_GLOBAL_INDEX) listed by GPU and operation:
    A100
    • convolution forward: 36, 38, 45, 46, 47, 48
    • convolution backward data: 1, 2, 3, 19, 22, 25, 26, 28, 40, 46, 51, 56, 57, 58, 59, 60, 65
    • convolution backward filter: 9, 10, 21, 23, 33, 37, 47, 48, 49, 50
    • cudnnConvolutionBiasActivationForward: 4024, 4026, 4032, 4033
    V100
    • convolution forward: 8, 9, 10, 12, 16, 26, 30, 31, 34, 42, 49
    • convolution backward data: 1, 2, 3, 8, 12, 13, 19, 21, 25, 26, 29, 37, 38, 44
    • convolution backward filter: 6, 7, 21, 35, 36, 43, 44, 51, 52
    • cudnnConvolutionBiasActivationForward: 4002, 4015, 4008, 4018, 4019, 4020, 4030
    T4
    • convolution forward: 9, 12, 13, 14, 16, 18, 26, 30, 31, 34, 42, 50
    • convolution backward data: 1, 2, 3, 8, 9, 12, 13, 15, 18, 19, 21, 25, 26, 29, 30, 37, 39, 44, 45
    • convolution backward filter: 12, 21, 27, 28, 43, 44, 51, 53
    • cudnnConvolutionBiasActivationForward: 4015, 4003, 4004, 4009, 4018, 4019, 4020, 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX =1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location; otherwise, they can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. cuDNN recommends aligning tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use a texture-based load structure for performance improvements, particularly on older hardware architectures. Users can opt out of texture-based loads by using the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue to use texture-based loads can adopt the new backend API and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions (see the sketch at the end of this list).
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must explicitly add the cuBLAS dependencies to the linker command to resolve undefined symbols from the cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this means specifying the ops library after the cnn library on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.
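
As referenced in the texture-load limitation above, the following is a minimal sketch assuming engcfg is an engine config descriptor for an engine that supports texture-based loads; it only shows how a knob choice is attached, not a complete plan-building flow, and error handling is abbreviated.

    // Minimal sketch: request texture-based loads by attaching a
    // CUDNN_KNOB_TYPE_USE_TEX knob choice to an engine config descriptor.
    #include <cudnn.h>

    cudnnStatus_t enableTextureLoads(cudnnBackendDescriptor_t engcfg) {
        cudnnBackendDescriptor_t knob;
        cudnnStatus_t st = cudnnBackendCreateDescriptor(
            CUDNN_BACKEND_KNOB_CHOICE_DESCRIPTOR, &knob);
        if (st != CUDNN_STATUS_SUCCESS) return st;

        cudnnBackendKnobType_t knobType = CUDNN_KNOB_TYPE_USE_TEX;
        int64_t knobValue = 1;  // 1 = use texture-based load instructions

        cudnnBackendSetAttribute(knob, CUDNN_ATTR_KNOB_CHOICE_KNOB_TYPE,
                                 CUDNN_TYPE_KNOB_TYPE, 1, &knobType);
        cudnnBackendSetAttribute(knob, CUDNN_ATTR_KNOB_CHOICE_KNOB_VALUE,
                                 CUDNN_TYPE_INT64, 1, &knobValue);
        st = cudnnBackendFinalize(knob);
        if (st != CUDNN_STATUS_SUCCESS) return st;

        // Attach the knob choice to the engine config before finalizing it.
        return cudnnBackendSetAttribute(engcfg, CUDNN_ATTR_ENGINECFG_KNOB_CHOICES,
                                        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &knob);
    }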

1.7. cuDNN Release 8.9.1

These are the NVIDIA cuDNN 8.9.1 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Improved library-wide error reporting coverage by providing the triggered error condition for CUDNN_STATUS_BAD_PARAM and CUDNN_STATUS_NOT_SUPPORTED in the error and warning level logging. The majority of error logs now include not just the error code, but also the error "reason" (that is, the condition that failed). Refer to Error Reporting And API Logging for instructions to enable the error reporting feature; a minimal sketch of enabling it appears after this list.
  • Expanded support of fused flash attention inference and training to sequence lengths of multiples of 64 and hidden dimension of 64 and 128.
  • Expanded support of fused attention to accept a general mask as an input for training, added backward pass for relative positional encoding, and added support for sequence lengths of multiples of 64 (not multiples of 128) up to 512.
  • Fine-tuned runtime fusion heuristics for NVIDIA Ada architecture for FP16 and INT8 convolution forward propagation.
  • HEUR_MODE_A supports BnAddRelu and DReluForkDBn patterns. The support is currently limited to operation graphs with single node multi-GPU batch norms without any pointwise node fusions.
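
As referenced in the error-reporting item above, the following is a minimal sketch that enables error- and warning-level logging through the CUDNN_LOGERR_DBG, CUDNN_LOGWARN_DBG, and CUDNN_LOGDEST_DBG environment variables; setting them in the shell before launching the application works equally well.

    // Minimal sketch: enable error- and warning-level cuDNN logging and direct
    // it to stderr so the error "reason" (the failed condition) is visible.
    #include <stdlib.h>   // setenv (POSIX)
    #include <cudnn.h>

    int main() {
        setenv("CUDNN_LOGERR_DBG", "1", 1);        // error-level reporting
        setenv("CUDNN_LOGWARN_DBG", "1", 1);       // warning-level reporting
        setenv("CUDNN_LOGDEST_DBG", "stderr", 1);  // log destination

        cudnnHandle_t handle;
        cudnnStatus_t st = cudnnCreate(&handle);
        // Failing calls now log the condition that triggered
        // CUDNN_STATUS_BAD_PARAM or CUDNN_STATUS_NOT_SUPPORTED.
        if (st == CUDNN_STATUS_SUCCESS) cudnnDestroy(handle);
        return 0;
    }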

Fixed Issues

The following issues have been fixed in this release:
  • Corrected the engine config returned by heuristics for FP8 convolution.
  • cudnnGetConvolutionForwardAlgorithm_v7(), cudnnGetConvolutionBackwardDataAlgorithm_v7(), and cudnnGetConvolutionBackwardFilterAlgorithm_v7() would exhibit a memory leak on NVIDIA Hopper GPUs. This issue has been fixed in this release.
  • For packed NCHW tensors using the FP16 datatype, cuDNN attempted to run an optimized kernel if the values of N, C, H, and W were even. In cuDNN versions before 8.4, it was possible that incorrect values were generated if odd values for the strides of N or C were used. This issue has been fixed in this release.
  • cuDNN 8.9.1 added tensor alignment checks to instance norm and layer norm engines to prevent IMA issues.
  • Starting in cuDNN 8.9.1, the const int32_t devSeqLengths[] argument in cudnnRNNForward(), cudnnRNNBackwardData_v8(), and cudnnRNNBackwardWeights_v8() APIs will be ignored. All three functions will source variable sequence length arrays from RNN data descriptors, configured through the seqLengthArray parameter of cudnnSetRNNDataDescriptor(). The user does not need to transfer this array to device memory; the operation will be performed automatically by RNN APIs. This refinement simplifies the usage of cuDNN RNN APIs. It is also a workaround for random crashes in multi-GPU RNN training on TensorFlow. Replacing earlier versions of cuDNN 8.x shared libraries with cuDNN 8.9.1 will eliminate those crashes without forcing the user to switch the TensorFlow version. The cause of intermittent corruptions of devSeqLengths[], fed to RNN APIs, is still being investigated.
  • Compared to cuDNN 7.6, there were known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs. This issue has been fixed in this release.

Known Issues

  • cuDNN's usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors. This issue has been fixed in this release.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
  • Use of cudnnFusedOpsExecute() on NVIDIA Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights may see a performance regression when moving from cuDNN 8.8 to cuDNN 8.9.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the output dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not yet been documented.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
  • There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
  • When using the cuDNN static library, you will need to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. Refer to the NVIDIA cuDNN Support Matrix for the exact supported CUDA versions.
  • In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
  • Within the cuDNN version 8 backend API, the following engines are known not to be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 7. Engines That Are Not Thread-Safe
    Engine indices (CUDNN_ATTR_ENGINE_GLOBAL_INDEX) listed by GPU and operation:
    A100
    • convolution forward: 36, 38, 45, 46, 47, 48
    • convolution backward data: 1, 2, 3, 19, 22, 25, 26, 28, 40, 46, 51, 56, 57, 58, 59, 60, 65
    • convolution backward filter: 9, 10, 21, 23, 33, 37, 47, 48, 49, 50
    • cudnnConvolutionBiasActivationForward: 4024, 4026, 4032, 4033
    V100
    • convolution forward: 8, 9, 10, 12, 16, 26, 30, 31, 34, 42, 49
    • convolution backward data: 1, 2, 3, 8, 12, 13, 19, 21, 25, 26, 29, 37, 38, 44
    • convolution backward filter: 6, 7, 21, 35, 36, 43, 44, 51, 52
    • cudnnConvolutionBiasActivationForward: 4002, 4015, 4008, 4018, 4019, 4020, 4030
    T4
    • convolution forward: 9, 12, 13, 14, 16, 18, 26, 30, 31, 34, 42, 50
    • convolution backward data: 1, 2, 3, 8, 9, 12, 13, 15, 18, 19, 21, 25, 26, 29, 30, 37, 39, 44, 45
    • convolution backward filter: 12, 21, 27, 28, 43, 44, 51, 53
    • cudnnConvolutionBiasActivationForward: 4015, 4003, 4004, 4009, 4018, 4019, 4020, 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The functional support criteria of cuDNN's convolution kernels do not necessarily take padding into account. As a result, users may see an unexpected lack of problem support when forward convolution spatial dimensions are smaller than the filter size, even when padding is nonzero and sufficient to extend the spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping; however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backward data (part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backward data (part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals or exceeds 2^31. This issue has been present in all previous releases of cuDNN, and exercising this use case with the engine produces incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location; otherwise, they can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory, while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups. A minimal sketch of setting this variable programmatically appears after this list.

  • The tensor pointers and the filter pointers require at least 4-byte alignment, including for INT8 data, in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends aligning tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based loads for performance improvements, particularly on older hardware architectures. Users could opt out of texture loading with the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue using texture-based loads can adopt the new backend API and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product channels * height * width of the input image exceeds 536,870,912 (2^29).
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library before opening graph capture.
  • Users of cuDNN must add the cuBLAS dependencies to the linker command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this means specifying the ops library after the cnn library on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.
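
As a companion to the CUBLAS_WORKSPACE_CONFIG note above, the following is a minimal sketch, assuming a POSIX environment, of pinning cuBLAS to a single workspace buffer size programmatically before the cuDNN handle is created; the :4096:2 value is illustrative, and exporting the variable from the launching shell works equally well.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cudnn.h>

    int main(void) {
        /* Select a single cuBLAS workspace buffer size (two 4 MB buffers here)
           before any handles are created so that cuDNN RNN and multihead
           attention calls behave deterministically across CUDA streams. */
        setenv("CUBLAS_WORKSPACE_CONFIG", ":4096:2", 1);

        cudnnHandle_t handle;
        cudnnStatus_t status = cudnnCreate(&handle);
        if (status != CUDNN_STATUS_SUCCESS) {
            fprintf(stderr, "cudnnCreate failed: %s\n", cudnnGetErrorString(status));
            return 1;
        }

        /* ... set up RNN or multihead attention descriptors and run as usual ... */

        cudnnDestroy(handle);
        return 0;
    }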

1.8. cuDNN Release 8.9.0

These are the NVIDIA cuDNN 8.9.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Added support for training and inference of transformer models using Flash Attention in the cuDNN runtime fusion engine. For more information, refer to the flash fused multi-head attention fprop and bprop patterns listed in the NVIDIA cuDNN Developer Guide.
  • Added FP8 fused multi-head attention training and inference support targeting BERT on NVIDIA Hopper GPUs.
  • The runtime fusion engine has been split into four engines (with engine id 0, 1, 2, and 3) according to functional coverage and device architecture.
    • Engine id 0 targets general runtime fusion for the NVIDIA Volta and NVIDIA Turing architectures.
    • Engine id 1 targets general runtime fusion for the NVIDIA Ampere architecture.
    • Engine id 2 targets general runtime fusion for the NVIDIA Hopper architecture.
    • Engine id 3 targets multihead attention related fusions for the NVIDIA Ampere and NVIDIA Hopper architectures.

    This split has enabled the heuristics to have more control over kernel selection. Knob information is now more precise for each split engine. As a result, internal testing coverage has also improved.

  • A new runtime fusion engine (with engine id 4) has been introduced to enable up to 4x better compilation time and better epilogue fusion efficiency. It’s currently recommended by heuristics for a subset of patterns and problem sizes. We expect the coverage to grow further in future releases.
  • Improved FP8 Dgrad heuristics to have generally good out-of-the-box support for both the CUDNN_DATA_FLOAT and CUDNN_DATA_FAST_FLOAT_FOR_FP8 compute types.
  • Enabled a new knob CUDNN_KNOB_TYPE_SPLIT_K_SLC in the runtime fusion engines that can be used to control Split-K (on SM 7.x and SM 8.x) or Segment-K (on SM 9.x) in convolution or GEMM kernels. A minimal usage sketch appears after this feature list.
  • Added TF32 heuristics for runtime fusion for the NVIDIA Hopper architecture.
  • Added support for the DgradDreluBNBwdWeight fusion pattern on NVIDIA Hopper GPUs. For more information, refer to the DgradDreluBNBwdWeight pattern listed in the NVIDIA cuDNN Developer Guide.
  • Two new patterns were added to the ConvBNFprop fusion patterns. These are supported on NVIDIA Hopper GPUs. For more information, refer to the ConvBNFprop pattern listed in the NVIDIA cuDNN Developer Guide.
  • Added a new runtime compiled engine with some fusion capabilities to support single GPU and single node multi-GPU batch normalization. Previously, this engine was statically compiled, but it’s now a runtime compiled engine in cuDNN 8.9. For more information, refer to the BNAddRelu pattern listed in the NVIDIA cuDNN Developer Guide.
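
As referenced in the CUDNN_KNOB_TYPE_SPLIT_K_SLC item above, the following is a minimal, hedged sketch of attaching a knob choice to an engine config through the backend API; the engcfg descriptor (a CUDNN_BACKEND_ENGINECFG_DESCRIPTOR obtained from heuristics or assembled by the application) and the split factor of 2 are assumptions of the example, and error checking is omitted.

    /* Hedged sketch: create a knob choice for CUDNN_KNOB_TYPE_SPLIT_K_SLC and
       attach it to an existing, not yet finalized engine config (engcfg). */
    cudnnBackendDescriptor_t knob;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_KNOB_CHOICE_DESCRIPTOR, &knob);

    cudnnBackendKnobType_t knobType = CUDNN_KNOB_TYPE_SPLIT_K_SLC;
    int64_t knobValue = 2;  /* illustrative split factor */
    cudnnBackendSetAttribute(knob, CUDNN_ATTR_KNOB_CHOICE_KNOB_TYPE,
                             CUDNN_TYPE_KNOB_TYPE, 1, &knobType);
    cudnnBackendSetAttribute(knob, CUDNN_ATTR_KNOB_CHOICE_KNOB_VALUE,
                             CUDNN_TYPE_INT64, 1, &knobValue);
    cudnnBackendFinalize(knob);

    /* Attach the knob choice before the engine config is finalized. */
    cudnnBackendSetAttribute(engcfg, CUDNN_ATTR_ENGINECFG_KNOB_CHOICES,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &knob);
    cudnnBackendFinalize(engcfg);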

Fixed Issues

The following issues have been fixed in this release:
  • When cuDNN executed FP8 matrix multiplication and convolution kernels with compute type CUDNN_DATA_FLOAT, the numerical precision was lower than expected. This issue has been fixed in this release.
  • Some engines incorrectly returned a success status when calling cudnnBackendFinalize() with a backend engine descriptor for a bfloat16 problem on hardware that did not support the bfloat16 type. Attempted execution of the problem with cudnnBackendExecute() would return the expected status CUDNN_STATUS_ARCH_MISMATCH. This issue has been fixed in this release.

Known Issues

  • cudnnGetConvolutionForwardAlgorithm_v7(), cudnnGetConvolutionBackwardDataAlgorithm_v7(), and cudnnGetConvolutionBackwardFilterAlgorithm_v7() exhibit a memory leak on NVIDIA Hopper GPUs.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
  • Use of cudnnFusedOpsExecute() on NVIDIA Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBNBwdWeight may see a performance regression when moving from cuDNN 8.8 to cuDNN 8.9.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated in this release to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operate on boolean tensors.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • For packed NCHW tensors using the FP16 datatype, cuDNN attempts to run an optimized kernel if the values of N, C, H, and W are even. In cuDNN versions before 8.4, it is possible that incorrect values are generated if odd values for the strides of N or C are used.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to a 6x performance difference, and UNet and Tacotron each with up to a 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the extent recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not been documented yet.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
  • There is a known regression when running some convolutions with filter size 1x1. The severity varies depending on the version of the CUDA Toolkit in use.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
  • When using the cuDNN static library, you must build your application with the same major.minor version of the CUDA Toolkit that was used to build cuDNN. Refer to the NVIDIA cuDNN Support Matrix for the exact supported CUDA versions.
  • In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0, and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits. A minimal sketch of querying an engine's behavior notes to detect such runtime-compiled engines appears after this list.
  • Within the cuDNN version 8 backend API, the following engines are known not to be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 8. Engines That Are Not Thread-Safe (listed by CUDNN_ATTR_ENGINE_GLOBAL_INDEX)
    A100
    • Convolution forward: 36, 38, 45, 46, 47, 48
    • Convolution backward data: 1, 2, 3, 19, 22, 25, 26, 28, 40, 46, 51, 56, 57, 58, 59, 60, 65
    • Convolution backward filter: 9, 10, 21, 23, 33, 37, 47, 48, 49, 50
    • cudnnConvolutionBiasActivationForward: 4024, 4026, 4032, 4033
    V100
    • Convolution forward: 8, 9, 10, 12, 16, 26, 30, 31, 34, 42, 49
    • Convolution backward data: 1, 2, 3, 8, 12, 13, 19, 21, 25, 26, 29, 37, 38, 44
    • Convolution backward filter: 6, 7, 21, 35, 36, 43, 44, 51, 52
    • cudnnConvolutionBiasActivationForward: 4002, 4015, 4008, 4018, 4019, 4020, 4030
    T4
    • Convolution forward: 9, 12, 13, 14, 16, 18, 26, 30, 31, 34, 42, 50
    • Convolution backward data: 1, 2, 3, 8, 9, 12, 13, 15, 18, 19, 21, 25, 26, 29, 30, 37, 39, 44, 45
    • Convolution backward filter: 12, 21, 27, 28, 43, 44, 51, 53
    • cudnnConvolutionBiasActivationForward: 4015, 4003, 4004, 4009, 4018, 4019, 4020, 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The functional support criteria of cuDNN's convolution kernels do not necessarily take padding into account. As a result, users may see an unexpected lack of problem support when forward convolution spatial dimensions are smaller than the filter size, even when padding is nonzero and sufficient to extend the spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping; however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backward data (part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backward data (part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals or exceeds 2^31. This issue has been present in all previous releases of cuDNN, and exercising this use case with the engine produces incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location; otherwise, they can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory, while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at least 4-byte alignment, including for INT8 data, in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends aligning tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based loads for performance improvements, particularly on older hardware architectures. Users could opt out of texture loading with the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue using texture-based loads can adopt the new backend API and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product channels * height * width of the input image exceeds 536,870,912 (2^29).
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library before opening graph capture.
  • Users of cuDNN must add the cuBLAS dependencies to the linker command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this means specifying the ops library after the cnn library on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.
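
As referenced in the limitation above about runtime fusion engines, the following is a minimal sketch of querying the behavior notes of a finalized engine descriptor to detect runtime-compiled engines; the engine descriptor (engine) is assumed to have been created and finalized elsewhere, and error checking is omitted.

    /* Hedged sketch: check whether an engine relies on runtime compilation. */
    cudnnBackendBehaviorNote_t notes[CUDNN_BEHAVIOR_NOTE_TYPE_COUNT];
    int64_t noteCount = 0;
    cudnnBackendGetAttribute(engine, CUDNN_ATTR_ENGINE_BEHAVIOR_NOTE,
                             CUDNN_TYPE_BEHAVIOR_NOTE,
                             CUDNN_BEHAVIOR_NOTE_TYPE_COUNT, &noteCount, notes);

    int usesRuntimeCompilation = 0;
    for (int64_t i = 0; i < noteCount; ++i) {
        if (notes[i] == CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) {
            usesRuntimeCompilation = 1;  /* engine will invoke NVRTC */
        }
    }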

1.9. cuDNN Release 8.8.1

These are the NVIDIA cuDNN 8.8.1 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Fixed Issues

  • In cuDNN 8.8.0, the runtime fusion engine failed with NVRTC from CUDA 12.1. This bug is fixed in cuDNN 8.8.1, but the limitation still remains that the runtime fusion engine is not forward compatible beyond CUDA 12.1. This limitation will be addressed in a future release.

Known Issues

  • Some engines will incorrectly return a success status when calling cudnnBackendFinalize() with a backend engine descriptor for a bfloat16 problem on hardware that does not support the bfloat16 type. Attempted execution of the problem with cudnnBackendExecute() will return the expected status CUDNN_STATUS_ARCH_MISMATCH.
  • Use of cudnnFusedOpsExecute() on NVIDIA Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • The cuDNN 8.7.0 library may exhibit some slowdowns in wgrad calculation for EfficientDet, EfficientNet, Mask R-CNN, ResNet, ResNeXt, and SSD layers when built with CUDA Toolkit 10.2.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated in this release to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operate on boolean tensors.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • For packed NCHW tensors using the FP16 datatype, cuDNN attempts to run an optimized kernel if the values of N, C, H, and W are even. In cuDNN versions before 8.4, it is possible that incorrect values are generated if odd values for the strides of N or C are used.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to a 6x performance difference, and UNet and Tacotron each with up to a 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the extent recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not been documented yet.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
  • There is a known regression when running some convolutions with filter size 1x1. The severity varies depending on the version of the CUDA Toolkit in use.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.
  • On Windows 10, the knob settings of CUDNN_CONVOLUTION_CUTLASS_ANALYTIC_16816_NHWC_ENGINE and CUDNN_CONVOLUTION_IMPLICIT_PRECOMPUTED_GEMM_CUTLASS_16816_NHWC_ENGINE are incorrect. Therefore, these two engines cannot be configured properly. It will impact the performance of convolution cases that use them on a Windows system.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • In cuDNN 8.8.1 for CUDA 12.x, runtime fusion engines will only work with NVRTC from CUDA Toolkit 12.0 and 12.1. They are not forward compatible with future CUDA 12.x Toolkits.
  • Within the cuDNN version 8 backend API, the following engines are known not to be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 9. Engines That Are Not Thread-Safe (listed by CUDNN_ATTR_ENGINE_GLOBAL_INDEX)
    A100
    • Convolution forward: 36, 38, 45, 46, 47, 48
    • Convolution backward data: 1, 2, 3, 19, 22, 25, 26, 28, 40, 46, 51, 56, 57, 58, 59, 60, 65
    • Convolution backward filter: 9, 10, 21, 23, 33, 37, 47, 48, 49, 50
    • cudnnConvolutionBiasActivationForward: 4024, 4026, 4032, 4033
    V100
    • Convolution forward: 8, 9, 10, 12, 16, 26, 30, 31, 34, 42, 49
    • Convolution backward data: 1, 2, 3, 8, 12, 13, 19, 21, 25, 26, 29, 37, 38, 44
    • Convolution backward filter: 6, 7, 21, 35, 36, 43, 44, 51, 52
    • cudnnConvolutionBiasActivationForward: 4002, 4015, 4008, 4018, 4019, 4020, 4030
    T4
    • Convolution forward: 9, 12, 13, 14, 16, 18, 26, 30, 31, 34, 42, 50
    • Convolution backward data: 1, 2, 3, 8, 9, 12, 13, 15, 18, 19, 21, 25, 26, 29, 30, 37, 39, 44, 45
    • Convolution backward filter: 12, 21, 27, 28, 43, 44, 51, 53
    • cudnnConvolutionBiasActivationForward: 4015, 4003, 4004, 4009, 4018, 4019, 4020, 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The functional support criteria of cuDNN's convolution kernels do not necessarily take padding into account. As a result, users may see an unexpected lack of problem support when forward convolution spatial dimensions are smaller than the filter size, even when padding is nonzero and sufficient to extend the spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping; however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backward data (part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backward data (part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals or exceeds 2^31. This issue has been present in all previous releases of cuDNN, and exercising this use case with the engine produces incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location; otherwise, they can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory, while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at least 4-byte alignment, including for INT8 data, in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends aligning tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based loads for performance improvements, particularly on older hardware architectures. Users could opt out of texture loading with the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue using texture-based loads can adopt the new backend API and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product channels * height * width of the input image exceeds 536,870,912 (2^29).
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library before opening graph capture. A minimal sketch of this ordering appears after this list.
  • Users of cuDNN must add the cuBLAS dependencies to the linker command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this means specifying the ops library after the cnn library on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.
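
As referenced in the graph capture limitation above, the following is a minimal sketch of the recommended ordering, in which the ops sub-library kernels are loaded before capture begins; the stream and the cuDNN work captured on it are assumptions of the example, and error handling is abbreviated.

    /* Hedged sketch: preload sub-library kernels before CUDA graph capture. */
    cudnnStatus_t status = cudnnOpsInferVersionCheck();
    if (status != CUDNN_STATUS_SUCCESS) {
        /* handle the error before attempting capture */
    }

    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    /* ... enqueue the cuDNN work to be captured on 'stream' ... */
    cudaStreamEndCapture(stream, &graph);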

1.10. cuDNN Release 8.8.0

These are the NVIDIA cuDNN 8.8.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Announcements

  • cuDNN 8.7.0 was the last release supporting NVIDIA Kepler (SM 3.x) devices. Support for these devices has been removed in cuDNN 8.8.0.
  • cuDNN 8.8.0 changed the linking procedure of NVRTC in the static build. cuDNN 8.8.0 requires NVRTC to be statically linked in the static build rather than the previous dynamic linking. This will also remove support for static linking of cuDNN with CUDA toolkits < 11.5. Refer to the updated instructions in the NVIDIA cuDNN Installation Guide. There were no changes for the linking procedure in the dynamic build.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • This release adds CUDA 12 support.
  • Added three precompiled engines for instance normalization: instance normalization forward training engine, instance normalization forward inference engine, and instance normalization backward engine.
  • Added three precompiled engines for layer normalization: layer normalization forward training engine, layer normalization forward inference engine, and layer normalization backward engine.
  • Added further fusion patterns possible in Mha-Fprop fusions, which target causal masking, relative positional embedding bias, and cross attention giving cuDNN full support for the T5 model. For more information, refer to the Mha-Fprop fusion pattern in the NVIDIA cuDNN Developer Guide.
  • Added support for Mha-Bprop fusions using the runtime fusion engine, targeting MHA training. For more information, refer to the Mha-Bprop fusion pattern in the NVIDIA cuDNN Developer Guide. The cuDNN implementation provides a speedup of ~2x-3x in BERT and T5 patterns in training over native unfused PyTorch. Samples can also be found in the cuDNN frontend repository.
  • Double precision is supported for the CTC loss in the cuDNN 8.x API by cudnnCTCLoss_v8(), cudnnSetCTCLossDescriptor_v8(), and cudnnGetCTCLossWorkspaceSize_v8(), and in all other cuDNN API versions by cudnnCTCLoss(), cudnnSetCTCLossDescriptor(), and cudnnGetCTCLossWorkspaceSize(). A minimal double-precision descriptor setup sketch appears after this feature list.
  • CUDNN_HEUR_MODE_A supports the following new operation graphs: ConvBNfprop, ConvBNwgrad, and DgradDreluBNBwdWeight.
  • The dBNapply and DualdBNapply patterns are now generally supported by the runtime fusion engine, and heuristics query for these two patterns is now supported as well. This provides more flexible data type support and a 5% speedup compared to the original precompiled specialized engines for these patterns across a wide range of workloads.
  • Improved performance for 1D NHWC depthwise convolution.
  • Improved performance for Tensor Core accelerated FP32 operations on NVIDIA Hopper.
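
As referenced in the CTC loss item above, the following is a minimal sketch of configuring a double-precision CTC loss descriptor with the v8 API; the normalization mode, NaN propagation mode, and maxLabelLength value shown are illustrative, and error checking is omitted.

    /* Hedged sketch: a CTC loss descriptor with double-precision compute. */
    cudnnCTCLossDescriptor_t ctcDesc;
    cudnnCreateCTCLossDescriptor(&ctcDesc);
    cudnnSetCTCLossDescriptor_v8(ctcDesc,
                                 CUDNN_DATA_DOUBLE,                /* compType */
                                 CUDNN_LOSS_NORMALIZATION_SOFTMAX, /* normMode */
                                 CUDNN_NOT_PROPAGATE_NAN,          /* gradMode */
                                 256);                             /* maxLabelLength */
    /* ... query cudnnGetCTCLossWorkspaceSize_v8() and call cudnnCTCLoss_v8() ... */
    cudnnDestroyCTCLossDescriptor(ctcDesc);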

Fixed Issues

  • Use of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for convolution, backward data, and backward filter batch normalization fusions resulted in a performance regression in cuDNN v8.7 on NVIDIA Ampere architecture. This has been improved upon in this release.
  • For NCHW spatially packed input tensors, cudnnBatchNormalizationForwardInference() with mode CUDNN_BATCHNORM_SPATIAL or CUDNN_BATCHNORM_SPATIAL_PERSISTENT can now support sizes up to 2147483136 (2^31-512) in the spatial dimension (D)HW. With the same modes and tensor layout, cudnnBatchNormalizationBackward() can now support the same spatial dimension size for some cases.
  • Use of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for batch normalization forwards training and batch normalization backwards could generate illegal memory access errors for large tensors. This is fixed in this release by limiting the input tensor size to be less than or equal to 2^31.
  • Documentation was updated for the API function cudnnReorderFilterAndBias() to list other possible return statuses and what would cause them to be returned.
  • The cuDNN static builds no longer dynamically load NVRTC and therefore require static linking of NVRTC instead.
  • Fixed incorrect RNN results on Pascal family GPUs with compute capability 6.0 or 6.1, in the CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H algorithm, in the RNN forward and backward APIs, when the input data minibatch was divisible by eight and for certain hidden size values.

Known Issues

  • Some engines will incorrectly return a success status when calling cudnnBackendFinalize() with a backend engine descriptor for a bfloat16 problem on hardware that does not support the bfloat16 type. Attempted execution of the problem with cudnnBackendExecute() will return the expected status CUDNN_STATUS_ARCH_MISMATCH.
  • Use of cudnnFusedOpsExecute() on NVIDIA Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • The cuDNN 8.7.0 library may exhibit some slowdowns in wgrad calculation for EfficientDet, EfficientNet, Mask R-CNN, ResNet, ResNeXt, and SSD layers when built with CUDA Toolkit 10.2.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated in this release to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operate on boolean tensors.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • For packed NCHW tensors using the FP16 datatype, cuDNN attempts to run an optimized kernel if the values of N, C, H, and W are even. In cuDNN versions before 8.4, it is possible that incorrect values are generated if odd values for the strides of N or C are used.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to a 6x performance difference, and UNet and Tacotron each with up to a 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the extent recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not been documented yet.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
  • There is a known regression when running some convolutions with filter size 1x1. The severity varies depending on the version of the CUDA Toolkit in use.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.
  • On Windows 10, the knob settings of CUDNN_CONVOLUTION_CUTLASS_ANALYTIC_16816_NHWC_ENGINE and CUDNN_CONVOLUTION_IMPLICIT_PRECOMPUTED_GEMM_CUTLASS_16816_NHWC_ENGINE are incorrect. Therefore, these two engines cannot be configured properly. It will impact the performance of convolution cases that use them on a Windows system.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • cuDNN 8.8.0 runtime fusion will only work with NVRTC from CUDA Toolkit 12.0. It is not forward compatible with CUDA Toolkit 12.1 and later.
  • Within the cuDNN version 8 backend API, the following engines are known not to be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 10. Engines That Are Not Thread-Safe (listed by CUDNN_ATTR_ENGINE_GLOBAL_INDEX)
    A100
    • Convolution forward: 36, 38, 45, 46, 47, 48
    • Convolution backward data: 1, 2, 3, 19, 22, 25, 26, 28, 40, 46, 51, 56, 57, 58, 59, 60, 65
    • Convolution backward filter: 9, 10, 21, 23, 33, 37, 47, 48, 49, 50
    • cudnnConvolutionBiasActivationForward: 4024, 4026, 4032, 4033
    V100
    • Convolution forward: 8, 9, 10, 12, 16, 26, 30, 31, 34, 42, 49
    • Convolution backward data: 1, 2, 3, 8, 12, 13, 19, 21, 25, 26, 29, 37, 38, 44
    • Convolution backward filter: 6, 7, 21, 35, 36, 43, 44, 51, 52
    • cudnnConvolutionBiasActivationForward: 4002, 4015, 4008, 4018, 4019, 4020, 4030
    T4
    • Convolution forward: 9, 12, 13, 14, 16, 18, 26, 30, 31, 34, 42, 50
    • Convolution backward data: 1, 2, 3, 8, 9, 12, 13, 15, 18, 19, 21, 25, 26, 29, 30, 37, 39, 44, 45
    • Convolution backward filter: 12, 21, 27, 28, 43, 44, 51, 53
    • cudnnConvolutionBiasActivationForward: 4015, 4003, 4004, 4009, 4018, 4019, 4020, 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The functional support criteria of cuDNN's convolution kernels do not necessarily take padding into account. As a result, users may see an unexpected lack of problem support when forward convolution spatial dimensions are smaller than the filter size, even when padding is nonzero and sufficient to extend the spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping; however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backward data (part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backward data (part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals or exceeds 2^31. This issue has been present in all previous releases of cuDNN, and exercising this use case with the engine produces incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location; otherwise, they can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory, while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at least 4-byte alignment, including for INT8 data, in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends aligning tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based loads for performance improvements, particularly on older hardware architectures. Users could opt out of texture loading with the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue using texture-based loads can adopt the new backend API and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product channels * height * width of the input image exceeds 536,870,912 (2^29).
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library before starting graph capture.
  • Users of cuDNN must add the cuBLAS dependencies to the linker command explicitly to resolve the undefined symbols from the cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures; pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead for the affected APIs.
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, because it is caused by CUDA backward compatibility checking whether a function is usable with that CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and does not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this means specifying the ops library after the cnn library on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.
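
The following is a minimal sketch, not taken from cuDNN itself, of applying the CUBLAS_WORKSPACE_CONFIG setting described earlier in this list programmatically. It assumes a POSIX environment where setenv() is available and must run before the first cuDNN/cuBLAS handle is created.

    // Minimal sketch: pin cuBLAS to a single workspace configuration so that
    // cuDNN RNN and multihead attention calls select kernels deterministically.
    #include <cstdlib>
    #include <cudnn.h>

    int main() {
        // ":4096:2" = two 4 MB workspaces; ":16:8" (eight 16 KB workspaces) also works.
        // Assumption: POSIX setenv(); on other platforms, set the variable in the
        // process environment before launch instead.
        setenv("CUBLAS_WORKSPACE_CONFIG", ":4096:2", /*overwrite=*/1);

        cudnnHandle_t handle;
        cudnnCreate(&handle);
        // ... build descriptors and call cudnnRNNForward()/cudnnMultiHeadAttnForward() ...
        cudnnDestroy(handle);
        return 0;
    }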

Deprecated and Removed Features

The following features are deprecated or removed in cuDNN 8.8.0:
  • cuDNN 8.7.0 was the last release to support NVIDIA Kepler (SM 3.x) devices. Kepler support has been removed in cuDNN 8.8.0.

1.11. cuDNN Release 8.7.0

These are the NVIDIA cuDNN 8.7.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

  • Added the cudnnRngDistribution_t enumeration and the CUDNN_BACKEND_OPERATION_RNG_DESCRIPTOR and CUDNN_BACKEND_RNG_DESCRIPTOR descriptor types to the Backend API. This new operation lets a cuDNN graph create a tensor from a probability distribution, which can then be used as an input to other operations; for example, it can be used as a mask in dropout. Currently, it has limited support through the runtime fusion engine in particular patterns; support will be extended in future versions. For more information, refer to the NVIDIA cuDNN Developer Guide and the NVIDIA cuDNN Backend API documentation.
  • Added support for MatMul-MatMul fusions via the runtime fusion engine, targeting MHA inference. For more information, refer to the MatMul-MatMul fusion pattern in the NVIDIA cuDNN Developer Guide. The cuDNN implementation provides a speedup of ~4x-4.5x in BERT and T5 patterns in inference over native unfused PyTorch.
  • Added native NVIDIA Hopper support for matrix multiplication and its fusions in FP16 mixed precision (FP16 I/O with FP32 compute), which improves performance of matmul ops on Hopper compared to cuDNN 8.6.0.
  • Added FP8 input support for convolution backward data and backward weights operations, with two possible compute precision types CUDNN_DATA_FLOAT and CUDNN_DATA_FAST_FLOAT_FOR_FP8 (faster but lower precision).
  • cudnnPoolingBackward() now allows both the x and y data pointers (together with the related tensor descriptor handles) to be NULL for average pooling, which can reduce memory footprint and bandwidth (see the sketch after this list).
  • Added support for the ConvBNfprop and ConvBNwgrad fusion patterns on NVIDIA Hopper GPUs. For more information, refer to the ConvBNfprop and ConvBNwgrad patterns listed in the NVIDIA cuDNN Developer Guide.
  • Additional tensor layout support was added for the forward and backward resampling modes that were introduced in cuDNN 8.6.0.
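
As a sketch of the cudnnPoolingBackward() behavior noted above, the helper below (a hypothetical function name, with descriptors assumed to be created elsewhere) passes NULL for the x and y tensors of an average-pooling layer:

    #include <cudnn.h>

    // Sketch: backward average pooling without the forward x/y tensors.
    // Assumes poolDesc uses an average-pooling mode and dyDesc/dxDesc describe the gradients.
    cudnnStatus_t avgPoolBackward(cudnnHandle_t handle,
                                  cudnnPoolingDescriptor_t poolDesc,
                                  cudnnTensorDescriptor_t dyDesc, const void* dy,
                                  cudnnTensorDescriptor_t dxDesc, void* dx) {
        const float alpha = 1.0f, beta = 0.0f;
        // x, y, and their descriptors may be NULL for avg-pooling, saving memory and bandwidth.
        return cudnnPoolingBackward(handle, poolDesc, &alpha,
                                    /*yDesc=*/NULL, /*y=*/NULL,
                                    dyDesc, dy,
                                    /*xDesc=*/NULL, /*x=*/NULL,
                                    &beta, dxDesc, dx);
    }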

Fixed Issues

  • The backend engine 6000 was not respecting the CUDNN_KNOB_TYPE_TILE_SIZE knob that was passed by the user and it had runtime failures on Windows. Both of these issues have been fixed in this release.
  • In CUDA graph capture mode, CUDA streams internal to cuDNN were not guaranteed to have the same priority as the user stream that is set by cudnnSetStream(). This issue is fixed in cuDNN 8.7.0, but requires CUDA 11.8 or later (see the sketch after this list).
  • On Turing, Volta, Kepler, and Maxwell GPUs, 3D convolutions used to exhibit some slowdowns when the padding size was larger than the filter size. 2D convolutions used to encounter an illegal memory access error when the padding size was larger than the filter size and the horizontal stride was larger than 1. These issues have been fixed in this release.
  • The performance of the runtime fusion engine was suboptimal on Windows. This has been fixed in this release.
  • Users of cuDNN's CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING could see CUDNN_STATUS_BAD_PARAM returned for a problem that should otherwise be supported by that choice of algo. This has been fixed in this release.
  • cudnnDropoutForward() and cudnnDropoutBackward() would return incorrect results when input and/or output tensors have overlapping strides. This issue has been fixed in this release.
  • Users of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for batch normalization forwards training and batch normalization backwards could obtain incorrect results when the batch size was greater than 1 and when channel count was not evenly divisible by 8. These values of CUDNN_ATTR_ENGINE_GLOBAL_INDEX correspond to newly added multi-GPU batch normalization support within cuDNN 8.5. Use of single-GPU batch normalization was unaffected by this issue. This limitation has been fixed in this release.
  • In cuDNN 8.5 built with CUDA 11.x, all RNN APIs started to use internal CUDA streams of the same priority as the user stream passed through the cudnnSetStream() function. This update introduced a bug: when the internal heuristic decided to transpose RNN weights in cudnnRNNForward() or the corresponding deprecated functions (cudnnRNNForwardInference(), cudnnRNNForwardTraining(), cudnnRNNForwardInferenceEx(), and cudnnRNNForwardTrainingEx()), and cudnnRNNAlgo_t was set to CUDNN_RNN_ALGO_STANDARD, the matrix transposition was performed in the synchronous stream 0 instead of a CUDA stream of the same priority as the user stream. This caused serial execution of transpose kernels and excessive synchronization in the initial stage of the forward API. The bug affected cuDNN 8.5 and 8.6 built with both CUDA 11.x and CUDA 10.2. This issue has been fixed in this release.
  • cudnnNormalizationForwardTraining() did not support BFLOAT16. If an input tensor used BFLOAT16, the API would return BAD_PARAM. This issue has been fixed in this release.
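
The stream-priority fix above only applies to the stream the application hands to cuDNN. The sketch below (assumptions: CUDA 11.8 or later and cuDNN 8.7.0 or later; error checking omitted) creates a prioritized user stream and attaches it with cudnnSetStream():

    #include <cuda_runtime.h>
    #include <cudnn.h>

    int main() {
        int leastPriority = 0, greatestPriority = 0;
        cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

        // Create the user stream with the desired priority.
        cudaStream_t stream;
        cudaStreamCreateWithPriority(&stream, cudaStreamNonBlocking, greatestPriority);

        cudnnHandle_t handle;
        cudnnCreate(&handle);
        cudnnSetStream(handle, stream);  // internal cuDNN streams now match this priority

        // ... enqueue cuDNN work (optionally under CUDA graph capture) on 'stream' ...

        cudnnDestroy(handle);
        cudaStreamDestroy(stream);
        return 0;
    }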

Known Issues

  • Use of cudnnFusedOpsExecute() on Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • The cuDNN 8.7.0 library may exhibit some slowdowns in wgrad calculation for EfficientDet, EfficientNet, Mask R-CNN, ResNet, ResNeXt, and SSD layers when it was built with CUDA Toolkit 10.2.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated in this release to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • For packed NCHW tensors using the FP16 datatype, cuDNN attempts to run an optimized kernel if the values of N, C, H, and W are even. In cuDNN versions before 8.4, it is possible that incorrect values are generated if odd values for the strides of N or C are used.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the output dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not yet been documented.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
  • There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.
  • On Windows 10, the knob settings of CUDNN_CONVOLUTION_CUTLASS_ANALYTIC_16816_NHWC_ENGINE and CUDNN_CONVOLUTION_IMPLICIT_PRECOMPUTED_GEMM_CUTLASS_16816_NHWC_ENGINE are incorrect. Therefore, these two engines cannot be configured properly. It will impact the performance of convolution cases that use them on a Windows system.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Within the cuDNN version 8 backend API, the following engines are known to not be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 11. Engines That Are Not Thread-Safe
    CUDNN_ATTR_ENGINE_GLOBAL_INDEX values, listed in order across the convolution forward, convolution backward data, convolution backward filter, and cudnnConvolutionBiasActivationForward columns:
    A100: 36, 38, 45, 46, 47, 48, 1, 2, 3, 19, 22, 25, 26, 28, 40, 46, 51, 56, 57, 58, 59, 60, 65, 9, 10, 21, 23, 33, 37, 47, 48, 49, 50, 4024, 4026, 4032, 4033
    V100: 8, 9, 10, 12, 16, 26, 30, 31, 34, 42, 49, 1, 2, 3, 8, 12, 13, 19, 21, 25, 26, 29, 37, 38, 44, 6, 7, 21, 35, 36, 43, 44, 51, 52, 4002, 4015, 4008, 4018, 4019, 4020, 4030
    T4: 9, 12, 13, 14, 16, 18, 26, 30, 31, 34, 42, 50, 1, 2, 3, 8, 9, 12, 13, 15, 18, 19, 21, 25, 26, 29, 30, 37, 39, 44, 45, 12, 21, 27, 28, 43, 44, 51, 53, 4015, 4003, 4004, 4009, 4018, 4019, 4020, 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The cuDNN static builds load NVRTC dynamically when using the runtime fusion engine.
  • The functional support criteria of cuDNN's convolution kernels do not need to consider padding. Users of cuDNN can therefore see an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and the padding is nonzero but is sufficient to extend spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping; however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backward data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backward data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals or exceeds 2^31. This issue has been present in all previous releases of cuDNN, and exercising the use case for this engine would show incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph capture, users should call the sub-library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub-library before starting graph capture (see the sketch after this list).
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this is specifying the ops library after cnn on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.
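
For the graph-capture item above, the following is a minimal sketch; it assumes the inference ops sub-library is the one exercised during capture and omits error checking:

    #include <cuda_runtime.h>
    #include <cudnn.h>

    int main() {
        cudnnHandle_t handle;
        cudnnCreate(&handle);

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudnnSetStream(handle, stream);

        // Load the cudnn_ops_infer kernels up front; lazy loading inside capture can fail.
        cudnnOpsInferVersionCheck();

        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        // ... enqueue cuDNN inference calls on 'stream' ...
        cudaStreamEndCapture(stream, &graph);

        cudaGraphDestroy(graph);
        cudaStreamDestroy(stream);
        cudnnDestroy(handle);
        return 0;
    }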

1.12. cuDNN Release 8.6.0

These are the NVIDIA cuDNN 8.6.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Added support for the NVIDIA Hopper™ (H100) architecture.
  • Added support for FP8 on H100 using the runtime fusion engine. Support is currently limited to convolution forward. We will release code examples in the cuDNN frontend GitHub repository shortly.
  • Added support for the NVIDIA Ada Lovelace architecture.
  • Added support for the following new resampling modes:
    • Resampling forward: CUDNN_RESAMPLE_AVGPOOL_EXCLUDE_PADDING
    • Resampling backward: CUDNN_RESAMPLE_AVGPOOL_EXCLUDE_PADDING, CUDNN_RESAMPLE_AVGPOOL_INCLUDE_PADDING, and CUDNN_RESAMPLE_MAXPOOL

Fixed Issues

The following issues have been fixed in this release:
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX 58 for forward convolution, 63 for backward data, and 62 for backward filter used to falsely advertise the Tensor Core numerical note on SM 7.2 and SM 7.5 when running FP32 input, FP32 output, and FP32 accumulation convolutions. They are fixed in this release and correctly advertise non-Tensor Core numerical notes (see the sketch after this list for querying numerical notes).
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX 58 for forward convolution, 63 for backward data, and 62 for backward filter used to allow tensor alignments of less than 16 bytes. To honor the advertised Tensor Core property, they have been fixed to require 16-byte alignment.
  • With the cuDNN version 8 backend API, CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 for forward convolution is not thread-safe when being executed simultaneously with multi-threads that share the same execution plan. This issue has been fixed in this release.
  • cuDNN 8.5.0 introduced the approximated version of GELU for both forward and backward paths as new pointwise modes. While CUDNN_POINTWISE_GELU_APPROX_TANH_FWD introduced a performance improvement over CUDNN_POINTWISE_GELU_FWD, CUDNN_POINTWISE_GELU_APPROX_TANH_BWD was showing regressions compared to CUDNN_POINTWISE_GELU_BWD. This has now been addressed, and the approximated version of backward GELU now shows slight improvements.
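
The numerical notes mentioned in the first item can be inspected through the backend API. The sketch below is a hypothetical helper: 'opGraph' is assumed to be a finalized operation graph descriptor, and error checking is omitted. It builds an engine by its global index and reads its numerical notes:

    #include <cudnn.h>

    void queryEngineNotes(cudnnBackendDescriptor_t opGraph, int64_t globalIndex) {
        cudnnBackendDescriptor_t engine;
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINE_DESCRIPTOR, &engine);
        cudnnBackendSetAttribute(engine, CUDNN_ATTR_ENGINE_OPERATION_GRAPH,
                                 CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &opGraph);
        cudnnBackendSetAttribute(engine, CUDNN_ATTR_ENGINE_GLOBAL_INDEX,
                                 CUDNN_TYPE_INT64, 1, &globalIndex);
        cudnnBackendFinalize(engine);

        cudnnBackendNumericalNote_t notes[CUDNN_NUMERICAL_NOTE_TYPE_COUNT];
        int64_t count = 0;
        cudnnBackendGetAttribute(engine, CUDNN_ATTR_ENGINE_NUMERICAL_NOTE,
                                 CUDNN_TYPE_NUMERICAL_NOTE,
                                 CUDNN_NUMERICAL_NOTE_TYPE_COUNT, &count, notes);
        // notes[0..count) now lists properties such as CUDNN_NUMERICAL_NOTE_TENSOR_CORE.

        cudnnBackendDestroyDescriptor(engine);
    }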

Known Issues

  • On Turing, Volta, Kepler, and Maxwell GPUs, 3D convolutions may exhibit some slowdowns when the padding size is larger than the filter size. 2D convolutions may encounter an illegal memory access error when the padding size is larger than the filter size and the horizontal stride is larger than 1. This issue will be resolved in the next release.
  • The performance of the runtime fusion engine is suboptimal on Windows.
  • Use of cudnnFusedOpsExecute() on Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • The cuDNN 8.6.0 library may exhibit some slowdowns in wgrad calculation for EfficientDet, EfficientNet, Mask R-CNN, ResNet, ResNeXt, and SSD layers when it was built with CUDA Toolkit 10.2.
  • cudnnNormalizationForwardTraining() does not currently support BFLOAT16. If an input tensor uses BFLOAT16, the API will return BAD_PARAM.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in the next CUDA Toolkit update.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated in this release to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • For packed NCHW tensors using the FP16 datatype, cuDNN attempts to run an optimized kernel if the values of N, C, H, and W are even. In cuDNN versions before 8.4, it is possible that incorrect values are generated if odd values for the strides of N or C are used.
  • Users of cuDNN's CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING may see CUDNN_STATUS_BAD_PARAM returned for a problem that should otherwise be supported by that choice of algo.
  • cudnnDropoutForward() and cudnnDropoutBackward() will return incorrect results when input or output tensors have overlapping strides.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • In CUDA graph capture mode, CUDA streams internal to cuDNN are not guaranteed to have the same priority as the user stream that is set by cudnnSetStream().
  • cudnnPoolingBackward() enables both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
  • There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Within the cuDNN version 8 backend API, the following engines are known to not be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 12. Engines That Are Not Thread-Safe
    CUDNN_ATTR_ENGINE_GLOBAL_INDEX values, listed in order across the Fprop, Dgrad, and Wgrad columns:
    A100: 36, 38, 45, 46, 47, 48, 1, 2, 3, 19, 22, 25, 26, 28, 40, 46, 51, 56, 57, 58, 59, 60, 65, 9, 10, 21, 23, 33, 37, 47, 48, 49, 50
    V100: 8, 9, 10, 12, 16, 26, 30, 31, 34, 42, 49, 1, 2, 3, 8, 12, 13, 19, 21, 25, 26, 29, 37, 38, 44, 6, 7, 21, 35, 36, 43, 44, 51, 52
    T4: 9, 12, 13, 14, 16, 18, 26, 30, 31, 34, 42, 50, 1, 2, 3, 8, 9, 12, 13, 15, 18, 19, 21, 25, 26, 29, 30, 37, 39, 44, 45, 12, 21, 27, 28, 43, 44, 51, 53
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The cuDNN static builds load NVRTC dynamically when using the runtime fusion engine.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX =1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • Versions of cuDNN before the 8.0 release series do not support the NVIDIA Ampere Architecture and will generate incorrect results if used on that architecture. Furthermore, if used, training operations can succeed with a NaN loss for every epoch.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use a texture-based load structure for performance improvements, particularly on older hardware architectures. Users could opt out of textures using the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions (see the sketch after this list).
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for batch normalization forwards training and batch normalization backwards may obtain incorrect results when batch size is greater than 1 and when channel count is not evenly divisible by 8. These values of CUDNN_ATTR_ENGINE_GLOBAL_INDEX correspond to newly added multi-GPU batch normalization support within cuDNN 8.5. Use of single-GPU batch normalization is unaffected by this issue. cuDNN will be revised to reject incorrectly supported multi-GPU batch normalization problems in a future release.
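
For the texture-load knob mentioned earlier in this list, the following is a minimal sketch with a hypothetical helper name; 'cfg' is assumed to be an engine config descriptor created and finalized elsewhere, and error checking is omitted:

    #include <cudnn.h>

    void enableTextureLoads(cudnnBackendDescriptor_t cfg) {
        cudnnBackendDescriptor_t knob;
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_KNOB_CHOICE_DESCRIPTOR, &knob);

        cudnnBackendKnobType_t type = CUDNN_KNOB_TYPE_USE_TEX;
        int64_t value = 1;  // 1 = use texture-based loads
        cudnnBackendSetAttribute(knob, CUDNN_ATTR_KNOB_CHOICE_KNOB_TYPE,
                                 CUDNN_TYPE_KNOB_TYPE, 1, &type);
        cudnnBackendSetAttribute(knob, CUDNN_ATTR_KNOB_CHOICE_KNOB_VALUE,
                                 CUDNN_TYPE_INT64, 1, &value);
        cudnnBackendFinalize(knob);

        // Attach the knob choice to the engine config before finalizing the config.
        cudnnBackendSetAttribute(cfg, CUDNN_ATTR_ENGINECFG_KNOB_CHOICES,
                                 CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &knob);
    }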

1.13. cuDNN Release 8.5.0

These are the NVIDIA cuDNN 8.5.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Achieved a 30% reduction in library size by removing unused kernels. The cuDNN 8.5.0 library size is 850 MB, down from 1.2 GB in the 8.4.x releases.
  • Four new pointwise modes were added:
    • CUDNN_POINTWISE_GELU_APPROX_TANH_FWD and CUDNN_POINTWISE_GELU_APPROX_TANH_BWD, which are used for approximating GELU in the forward and backward pass, respectively (see the sketch after this list).
    • CUDNN_POINTWISE_ERF, which can be used to piecewise create the GELU operator.
    • CUDNN_POINTWISE_IDENTITY, which can be used for explicitly converting between formats.
  • Improved graph API runtime compilation support:
    • Added support for performant adaptive pooling for NHWC layout supporting flexible I/O datatypes. It also supports large tensors with more than 4 trillion elements.
    • Added support for passing host scalars by value as B tensor in the pointwise operations.
    • Added support for generating code for more broadcasting patterns in the pointwise operations.
    • The CPU overhead associated with the subgraph execution has been reduced by 30 - 40%.
    • NVIDIA Ampere Architecture INT8 conv fusion heuristics have been updated to recommend more performant kernel configs for smaller problem sizes.
  • Added support for error reporting in the RNN APIs.
  • When using cuDNN builds against CUDA 11.x with cuBLAS version >= 11.6 U1, all kernels are now guaranteed to be launched in streams whose priorities match the user stream that is set by cudnnSetStream().
  • Documented operation specific constraints for the runtime fusion engine in the newly added Operation Specific Constraints for the Runtime Fusion Engine section.
  • Double precision support for the CTC loss.
  • Added support for Ubuntu 22.04 on x86_64 and AArch64 ARM. For more information, refer to the supported Linux versions of cuDNN section.
  • Added support for CUDA 11.7. For more information, refer to the GPU, CUDA Toolkit, and CUDA Driver Requirements section.
  • Added the cudnnBackendNormFwdPhase_t, cudnnBackendNormMode_t, CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR, CUDNN_BACKEND_OPERATION_NORM_BACKWARD_DESCRIPTOR, cudnnSignalMode_t, CUDNN_BACKEND_OPERATION_CONCAT_DESCRIPTOR, and CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR enumerations and descriptor types to the Backend API. These new operations help a cuDNN graph communicate and/or synchronize with another cuDNN graph, possibly on a peer GPU. For more information, refer to the NVIDIA cuDNN Backend API documentation.
  • Added new data structure cudnnFraction_t to the Backend API. This more precisely describes the size ratio between the I/O images under fractional up/downsampling and adaptive pooling use cases. For more information, refer to the NVIDIA cuDNN Backend API documentation.
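
As a sketch of how the new GELU pointwise mode can be used from the backend API (a hypothetical helper; error checking is omitted, and the returned descriptor would still need to be attached to a pointwise operation descriptor):

    #include <cudnn.h>

    cudnnBackendDescriptor_t makeGeluTanhFwd() {
        cudnnBackendDescriptor_t pw;
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_POINTWISE_DESCRIPTOR, &pw);

        cudnnPointwiseMode_t mode = CUDNN_POINTWISE_GELU_APPROX_TANH_FWD;
        cudnnDataType_t prec = CUDNN_DATA_FLOAT;
        cudnnBackendSetAttribute(pw, CUDNN_ATTR_POINTWISE_MODE,
                                 CUDNN_TYPE_POINTWISE_MODE, 1, &mode);
        cudnnBackendSetAttribute(pw, CUDNN_ATTR_POINTWISE_MATH_PREC,
                                 CUDNN_TYPE_DATA_TYPE, 1, &prec);
        cudnnBackendFinalize(pw);
        return pw;
    }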

Fixed Issues

  • For packed NCHW tensors using the FP16 data-type, cuDNN attempted to run an optimized kernel if the values of N, C, H, and W were even. In cuDNN versions prior to 8.5, it was possible that incorrect values were generated if odd values for N or C were used. Starting in cuDNN 8.5, if an odd value for N or C is specified, cuDNN runs with an unoptimized kernel.
  • cuDNN was not enforcing the CUDNN_ATTR_EXECUTION_PLAN_HANDLE attribute for the CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR. It is now enforced in cuDNN 8.5.0: cudnnBackendFinalize() returns CUDNN_STATUS_BAD_PARAM if the handle attribute is not set (see the sketch after this list).
  • Running depthwise convolutions in NHWC layout with CUDNN_CONVOLUTION mode and batch size >= 8 could produce incorrect results with cuDNN 8.1 and later. This has been fixed in this release.
  • Unsupported cases of the folding transform were erroring out with BAD_PARAM; these now return the correct error code of NOT_SUPPORTED.
  • Improved runtime fusion heuristics for INT8 convolution, correcting small problem sizes.
  • Fixed an issue to ensure CUDNN_HEUR_MODE_B redirects to CUDNN_HEUR_MODE_A when unsupported. Frontend version 0.6.3 has been updated with a similar change to redirect CUDNN_HEUR_MODE_B to CUDNN_HEUR_MODE_A in older cuDNN versions when CUDNN_HEUR_MODE_B is not supported.
  • Fixed an issue in the runtime fusion engine where successive broadcasting patterns (for example, scalars broadcasting into vectors, then broadcasting into tensors) are not handled correctly and may produce wrong results.
  • In the build for CUDA 11.x, we fixed a couple of issues where some cuDNN internal streams were not guaranteed to match the priority of the stream set by cudnnSetStream. Now, all internal streams have that guarantee, except in the case of CUDA graph capture mode.
  • It was previously suggested that users of the static library who require the best possible convolution performance use whole-archive linking with the cnn_infer and cnn_train static sub-libraries. This is no longer needed; however, whole-archive linking comes at a cost to the binary size of the application. This linkage requirement will be relaxed in a future release.
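
A minimal sketch of the now-enforced handle attribute follows (a hypothetical helper; 'cfg' is assumed to be a finalized engine config descriptor, and error checking is omitted):

    #include <cudnn.h>

    cudnnBackendDescriptor_t makePlan(cudnnHandle_t handle, cudnnBackendDescriptor_t cfg) {
        cudnnBackendDescriptor_t plan;
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR, &plan);

        // The handle attribute is mandatory as of 8.5.0.
        cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
                                 CUDNN_TYPE_HANDLE, 1, &handle);
        cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
                                 CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &cfg);
        cudnnBackendFinalize(plan);  // CUDNN_STATUS_BAD_PARAM if the handle is missing
        return plan;
    }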

Performance Results

The following table shows the average speed-up of unique cuDNN 3D convolution calls for each network on V100 and A100 GPUs that satisfy the conditions in the Recommended Settings section of the cuDNN Developer Guide. The end-to-end training performance will depend on a number of factors, such as framework overhead, kernel run time, and model architecture type.
Table 13. cuDNN version 8.5.0 compared to 8.4.1
Model | Batch size | A100 8.5.0 vs V100 8.4.1 (FP16) | A100 8.5.0 vs V100 8.4.1 (FP32) | V100 8.5.0 vs V100 8.4.1 (FP16) | V100 8.5.0 vs V100 8.4.1 (FP32)
V-Net (3D-Image segmentation) | 2 | 1.1x | 2.9x | 1.0x | 1.0x
V-Net (3D-Image segmentation) | 8 | 1.4x | 3.4x | 1.0x | 1.0x
V-Net (3D-Image segmentation) | 16 | 1.6x | 3.8x | 1.0x | 1.1x
V-Net (3D-Image segmentation) | 32 | 1.8x | 3.7x | 1.0x | 1.0x
3D-UNet (3D-Image Segmentation) | 2 | 2.1x | 6.0x | 1.0x | 1.2x
3D-UNet (3D-Image Segmentation) | 4 | 2.1x | 5.7x | 1.0x | 1.4x

Known Issues

  • cudnnNormalizationForwardTraining() does not currently support BFLOAT16. If an input tensor uses BFLOAT16, the API will return BAD_PARAM.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in the next CUDA Toolkit update.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated in this release to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • For packed NCHW tensors using the FP16 datatype, cuDNN attempts to run an optimized kernel if the values of N, C, H, and W are even. In cuDNN versions before 8.4, it is possible that incorrect values are generated if odd values for the strides of N or C are used. This issue will be resolved in a future release.
  • Users of cuDNN's CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING may see CUDNN_STATUS_BAD_PARAM returned for a problem that should otherwise be supported by that choice of algo.
  • cudnnDropoutForward() and cudnnDropoutBackward() will return incorrect results when input or output tensors have overlapping strides.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • In CUDA graph capture mode, CUDA streams internal to cuDNN are not guaranteed to have the same priority as the user stream that is set by cudnnSetStream().
  • The functional support criteria of cuDNN's convolution kernels do not need to consider padding. Users of cuDNN can therefore see an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and the padding is nonzero but sufficient to extend spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • cudnnPoolingBackward() enables both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
  • There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the NVIDIA cuDNN Support Matrix.

Limitations

  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1025 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX =1001 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • Versions of cuDNN before the 8.0 release series do not support the NVIDIA Ampere Architecture and will generate incorrect results if used on that architecture. Furthermore, if used, training operations can succeed with a NaN loss for every epoch.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 11000 and 12000 may obtain incorrect results when batch size is greater than 1 and when channel count is not evenly divisible by 8. These values of CUDNN_ATTR_ENGINE_GLOBAL_INDEX correspond to newly added multi-GPU batch normalization support within cuDNN 8.5. Use of single-GPU batch normalization is unaffected by this issue. cuDNN will be revised to reject incorrectly supported multi-GPU batch normalization problems in a future release.

1.14. cuDNN Release 8.4.1

These are the NVIDIA cuDNN 8.4.1 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Improved runtime subgraph compilation support
    • Added support for CUDNN_BACKEND_OPERATION_RESAMPLE_FWD for the CUDNN_ATTR_RESAMPLE_MODE set to CUDNN_RESAMPLE_AVGPOOL and CUDNN_RESAMPLE_MAXPOOL through the runtime fusion engine. It can achieve up to 3x speed up compared to the legacy cudnnPoolingForward() API. Pointwise fusions to the output of this operation are also supported. Documentation about the patterns supported can be found in the Supported Graph Patterns section.
    • Newly added micro tile sizes for pointwise fusions that provide significantly improved performance on smaller problem sizes.
  • The NVIDIA cuDNN Developer Guide now includes an expanded section on supported patterns of the Graph API. It takes a systematic approach to explain which graph patterns are supported, along with various graphical examples, and details on some of the restrictions.

Fixed Issues

  • A buffer shared between threads caused segmentation faults, and there was previously no way to use a per-thread buffer to avoid them. The buffer has been moved into the cuDNN handle. Ensure that each thread uses its own cuDNN handle, because the buffer in the handle is intended for a single thread and cannot be shared between two threads; a sketch follows this list.
  • Fixed operation graph logging under cudnnBackendExecuteGraphVisualize() section upon calling cudnnBackendExecute() on generic fusion patterns. Added logging for CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR and CUDNN_BACKEND_MATMUL_DESCRIPTOR. Fixed logging for pointwise mode to show the enum value name.
  • Users specifying backend engines 58, 1063, 2062, and 4039 using CUDNN_ATTR_ENGINE_GLOBAL_INDEX with 1x1 convolutions and tensors with more than two GB elements (2G) would see CUDNN_STATUS_EXECUTION_FAILED in cuDNN 8.3.x. This issue has been fixed in this release.
  • cuDNN returned CUDNN_STATUS_EXECUTION_FAILED from cudnnConvolutionForward(), cudnnConvolutionBiasActivationForward(), or cudnnConvolutionBackwardData() when computing convolutions with large spatial dimensions and batch sizes. This issue has been fixed. Such problems instead return CUDNN_STATUS_NOT_SUPPORTED where applicable.
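
As a minimal sketch of the per-thread handle guidance in the first fixed issue above (the helper name and the use of C11 thread-local storage are assumptions for illustration, not part of the cuDNN API):

    #include <cudnn.h>

    /* One cuDNN handle per thread: the handle owns the per-thread buffer
       mentioned above, so it must not be shared between threads. */
    static _Thread_local cudnnHandle_t tls_cudnn_handle = NULL;

    cudnnHandle_t getThreadCudnnHandle(void)
    {
        if (tls_cudnn_handle == NULL) {
            if (cudnnCreate(&tls_cudnn_handle) != CUDNN_STATUS_SUCCESS) {
                return NULL;
            }
        }
        return tls_cudnn_handle;  /* destroy with cudnnDestroy() when the thread exits */
    }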

Known Issues

  • A compiler bug in NVRTC in CUDA version 11.7 and earlier was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated in this release to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, the known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operate on boolean tensors.
  • cuDNN is not enforcing the CUDNN_ATTR_EXECUTION_PLAN_HANDLE attribute for the CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR. This issue will be fixed in a future release.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • For packed NCHW tensors using the FP16 datatype, cuDNN attempts to run an optimized kernel if the values of N, C, H, and W are even. In cuDNN versions before 8.4, it is possible that incorrect values are generated if odd values for the strides of N or C are used. This issue will be resolved in a future release.
  • Users of cuDNN's CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING may see CUDNN_STATUS_BAD_PARAM returned for a problem that should otherwise be supported by that choice of algo.
  • cudnnDropoutForward() and cudnnDropoutBackward() will return incorrect results when input or output tensors have overlapping strides.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the output dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not been documented yet.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • CUDA streams internal to cuDNN are not guaranteed to have the same priority as the user stream that is set by cudnnSetStream(). We recently discovered some issues that break our ability to document exceptions to this clearly.
  • The functional support criteria of cuDNN's convolution kernels are not required to consider padding. Users of cuDNN can see an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size while padding is nonzero but sufficient to extend spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • cudnnPoolingBackward() allows both the x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling, which can save memory footprint and bandwidth; a sketch follows this list.
  • Users of the static library requiring the best possible convolution performance should use whole-archive linking with the cnn_infer and cnn_train static sub libraries. This will come at a cost to the binary size of the application. This linkage requirement will be relaxed in a future release.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
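
The avg-pooling note above can be sketched as follows; the wrapper is hypothetical and assumes the caller has already created the handle, the pooling descriptor, and the gradient tensor descriptors:

    #include <cudnn.h>

    /* For average pooling, the data gradient does not depend on the forward
       x/y values, so their pointers and descriptors may be passed as NULL. */
    cudnnStatus_t avgPoolBackwardWithoutXY(cudnnHandle_t handle,
                                           cudnnPoolingDescriptor_t poolDesc,
                                           cudnnTensorDescriptor_t dyDesc, const void *dy,
                                           cudnnTensorDescriptor_t dxDesc, void *dx)
    {
        const float alpha = 1.0f, beta = 0.0f;
        return cudnnPoolingBackward(handle, poolDesc, &alpha,
                                    NULL, NULL,           /* yDesc, y: not needed here */
                                    dyDesc, dy,           /* gradient w.r.t. the output */
                                    NULL, NULL,           /* xDesc, x: not needed here */
                                    &beta, dxDesc, dx);   /* gradient w.r.t. the input  */
    }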

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the NVIDIA cuDNN Support Matrix.

Limitations

  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1025 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1001 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • Versions of cuDNN before the 8.0 release series do not support the NVIDIA Ampere Architecture and will generate incorrect results if used on that architecture. Furthermore, if used, training operations can succeed with a NaN loss for every epoch.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2 (a sketch follows this list).

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based loads for performance improvements, particularly on older hardware architectures. Users could opt out of texture loads with the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 (2^29).
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must explicitly add the cuBLAS dependencies to the linker command to resolve undefined symbols from the cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
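
A minimal sketch of the CUBLAS_WORKSPACE_CONFIG workaround described above, assuming a POSIX environment (setenv()); the variable must be set before the first cuBLAS or cuDNN initialization in the process, and :16:8 is the other configuration quoted above:

    #include <stdlib.h>

    int main(void)
    {
        /* Pin cuBLAS to a single workspace size so cuDNN RNN and multihead
           attention results stay deterministic across streams. */
        setenv("CUBLAS_WORKSPACE_CONFIG", ":4096:2", 1 /* overwrite */);

        /* ... create the cuDNN handle and run RNN / multihead attention here ... */
        return 0;
    }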

1.15. cuDNN Release 8.4.0

These are the NVIDIA cuDNN 8.4.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • API additions
    • Added API and support for the GEN_INDEX capability. CUDNN_POINTWISE_GEN_INDEX returns the position of an element in an input tensor along a given axis. This operation is similar to NumPy’s mesh grid operation as it returns a tensor with the index of all elements calculated according to the specified axis in the original tensor dimensions.
    • Added API and support for the BINARY_SELECT capability. CUDNN_POINTWISE_BINARY_SELECT is similar to a ternary operation and selects between two input elements based on a predicate element. Reference semantics for both new pointwise modes are sketched after this list.
    • Experimentally supports serialization of execution plans to or from a string representation to enable the user to avoid recompilation of the fusion kernels. This feature only supports the runtime fusion engine currently. Generalized support for additional engines is planned for future releases.
  • Runtime fusion engine improvements
    • Previous versions of the runtime fusion engine only supported a minimum 128-bit alignment for tensors in all the operations. From this release onwards, the minimum alignment requirement has been relaxed down to 32 bit for input tensors in matrix multiplication and convolution for NVIDIA Ampere Architecture GPUs. For output tensors in any operation and input tensors for pointwise operations, the minimum alignment requirement has been relaxed down to 8 bit.
    • Added support for ARM servers.
  • Documentation improvements for the following functions:
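
As plain-C reference semantics (not the backend API) for the two pointwise modes added above; the 1-D shapes and helper names are illustrative assumptions:

    #include <stddef.h>

    /* CUDNN_POINTWISE_GEN_INDEX: each output element holds its index along the
       selected axis; for a 1-D tensor the axis is the element position itself. */
    static void genIndex1d(float *out, size_t n)
    {
        for (size_t i = 0; i < n; ++i) {
            out[i] = (float)i;
        }
    }

    /* CUDNN_POINTWISE_BINARY_SELECT: a ternary select that picks a[i] when the
       predicate element is nonzero and b[i] otherwise. */
    static void binarySelect1d(float *out, const float *a, const float *b,
                               const unsigned char *pred, size_t n)
    {
        for (size_t i = 0; i < n; ++i) {
            out[i] = pred[i] ? a[i] : b[i];
        }
    }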

Fixed Issues

  • Users of cuDNN’s CUDNN_ATTR_ENGINE_GLOBAL_INDEX when set to 3000 previously could experience a floating point exception when the filter size (filter width * filter height) is greater than or equal to 32. This issue is fixed in this release.
  • Users of cuDNN's CUDNN_ATTR_ENGINE_GLOBAL_INDEX when set to 58, 1063, or 2062 may now use the knob count CUDNN_KNOB_TYPE_WORKSPACE to set the allowable workspace of these engines.
  • The documentation of cudnnNormalizationForwardInference() and cudnnBatchNormalizationForwardInference() has been improved for clarity.
  • Previous versions of cuDNN may produce wrong results when used to compute a matrix multiplication or fusions containing a matrix multiplication on NVIDIA Ampere Architecture based GPUs. This issue has been fixed in this release.

Known Issues

  • Users of cuDNN's CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING may see CUDNN_STATUS_BAD_PARAM returned for a problem that should otherwise be supported by that choice of algo.
  • cudnnConvolutionForward(), cudnnConvolutionBiasActivationForward(), and cudnnConvolutionBackwardData() may generate illegal memory address errors on the NVIDIA Volta and NVIDIA Turing architectures. This issue existed in previous 8.3 releases as well.
  • cudnnDropoutForward() and cudnnDropoutBackward() will return incorrect results when input or output tensors have overlapping strides.
  • Users specifying backend engines 58, 1063, 2062, and 4039 using CUDNN_ATTR_ENGINE_GLOBAL_INDEX with 1x1 convolutions and tensors with more than two GB elements (2G) will see CUDNN_STATUS_EXECUTION_FAILED in cuDNN 8.3.x.
  • cuDNN may return CUDNN_STATUS_EXECUTION_FAILED from cudnnConvolutionForward(), cudnnConvolutionBiasActivationForward(), or cudnnConvolutionBackwardData() when computing convolutions with large spatial dimensions and batch sizes. This issue will be addressed in a future release so that such problems will instead return CUDNN_STATUS_NOT_SUPPORTED where applicable.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1025 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine results in incorrect results.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the output dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not been documented yet.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • CUDA streams internal to cuDNN are not guaranteed to have the same priority as the user stream that is set by cudnnSetStream(). We recently discovered some issues that break our ability to document exceptions to this clearly.
  • The functional support criteria of cuDNN's convolution kernels are not required to consider padding. Users of cuDNN can see an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size while padding is nonzero but sufficient to extend spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • cudnnPoolingBackward() enables both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.
  • Users of the static library requiring the best possible convolution performance should use whole-archive linking with the cnn_infer and cnn_train static sub libraries. This will come at a cost to the binary size of the application. This linkage requirement will be relaxed in a future release.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the NVIDIA cuDNN Support Matrix.

Limitations

  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1001 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • Versions of cuDNN before the 8.0 release series do not support the NVIDIA Ampere Architecture and will generate incorrect results if used on that architecture. Furthermore, if used, training operations can succeed with a NaN loss for every epoch.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based loads for performance improvements, particularly on older hardware architectures. Users could opt out of texture loads with the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions. A sketch of setting this knob through the backend API follows this list.
  • In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 (2^29).
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must explicitly add the cuBLAS dependencies to the linker command to resolve undefined symbols from the cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
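
A heavily abbreviated sketch of the texture knob item above, showing how CUDNN_KNOB_TYPE_USE_TEX might be attached to an engine config through the backend API; error handling, descriptor cleanup, and the construction of the engine config itself are omitted, and whether a given engine exposes this knob must be queried at run time:

    #include <stdint.h>
    #include <cudnn.h>

    /* Attach CUDNN_KNOB_TYPE_USE_TEX = 1 to an engine config that already has
       its CUDNN_ATTR_ENGINECFG_ENGINE attribute set and is not yet finalized. */
    cudnnStatus_t enableTextureLoads(cudnnBackendDescriptor_t engcfg)
    {
        cudnnBackendDescriptor_t knob;
        cudnnStatus_t status =
            cudnnBackendCreateDescriptor(CUDNN_BACKEND_KNOB_CHOICE_DESCRIPTOR, &knob);
        if (status != CUDNN_STATUS_SUCCESS) {
            return status;
        }

        cudnnBackendKnobType_t knobType = CUDNN_KNOB_TYPE_USE_TEX;
        int64_t knobValue = 1;
        cudnnBackendSetAttribute(knob, CUDNN_ATTR_KNOB_CHOICE_KNOB_TYPE,
                                 CUDNN_TYPE_KNOB_TYPE, 1, &knobType);
        cudnnBackendSetAttribute(knob, CUDNN_ATTR_KNOB_CHOICE_KNOB_VALUE,
                                 CUDNN_TYPE_INT64, 1, &knobValue);
        cudnnBackendFinalize(knob);

        /* Hand the knob choice to the engine config before finalizing it. */
        return cudnnBackendSetAttribute(engcfg, CUDNN_ATTR_ENGINECFG_KNOB_CHOICES,
                                        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &knob);
    }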

1.16. cuDNN Release 8.3.3

These are the NVIDIA cuDNN 8.3.3 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Various improvements were made in the runtime fusion engine:
    • Added heuristics for convolution + x fusion and matmul + x fusion for NVIDIA Volta and NVIDIA Turing architectures.
    • Updated the heuristics for matmul + x fusion for NVIDIA Ampere Architecture.
    • Small performance improvement for the matmul + x fusion.
    • Compilation time reduction.
  • Improved the performance for NHWC INT8 max pooling.
  • Updated the list of supported enums in the following data type references:
  • Updated and migrated the content from the Best Practices For Using cuDNN 3D Convolutions to the NVIDIA cuDNN Developer Guide. The Best Practices document has been deprecated.

Fixed Issues

The following issues have been fixed in this release:
  • Fixed an issue when fusing a pointwise operation with a scalar (that is, a [1, 1, 1, 1]-shaped tensor) at the output of a matmul or a convolution. When the output is of integer type, the results could be inaccurate or wrong (due to float-to-INT8 truncation). After the fix, results are properly rounded to nearest with clamping.
  • Fixed an issue inside the batch norm finalize descriptor where an implementation detail was erroneously logged. Such unexpected access could intermittently cause a segmentation fault.
  • Convolution batch norm fusion engines invoked through the graph API only worked with cudnnBackendTensor descriptors with dimensions specified in “n,c,g,h,w” format. This has been fixed, and cudnnBackendTensors with dimensions specified in either the “n,c,h,w” or the “n,c,g,h,w” format can now be passed.
  • Fixed a numerical overflow issue in the computation of the softplus activation function in the runtime fusion engine that resulted in log(exp(x)) being computed as infinity for sufficiently large positive values of x; a note on the numerically stable formulation follows this list.
  • In previous releases, cudnnTransformFilter() and cudnnTransformTensorEx() could produce wrong values at some pixels when doing a folding transform. This has been fixed in the current release.
  • Documentation has been updated for the pooling forward and backward API functions. The documentation now discusses which data types and vectorizations are supported for the tensor descriptor arguments (this information was previously incomplete). For more information, refer to cudnnPoolingForward() and cudnnPoolingBackward().
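
For reference, the softplus overflow fixed above comes from evaluating exp(x) directly in log(1 + exp(x)); a numerically stable formulation (plain C, illustrative only, not cuDNN's internal implementation) avoids it:

    #include <math.h>

    /* softplus(x) = log(1 + exp(x)). Evaluating exp(x) first overflows for large
       positive x, which is how log(exp(x)) ends up as infinity. The algebraically
       equivalent form below stays finite: max(x, 0) + log1p(exp(-|x|)). */
    static float softplusStable(float x)
    {
        return fmaxf(x, 0.0f) + log1pf(expf(-fabsf(x)));
    }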

Known Issues

  • cudnnConvolutionForward(), cudnnConvolutionBiasActivationForward(), and cudnnConvolutionBackwardData() may generate illegal memory address errors on the NVIDIA Volta and NVIDIA Turing architectures. This issue existed in previous 8.3 releases as well.
  • cudnnDropoutForward() and cudnnDropoutBackward() will return incorrect results when input or output tensors have overlapping strides.
  • Users specifying backend engines 58, 1063, 2062, and 4039 using CUDNN_ATTR_ENGINE_GLOBAL_INDEX with 1x1 convolutions and tensors with more than two GB elements (2G) will see CUDNN_STATUS_EXECUTION_FAILED in cuDNN 8.3.x.
  • cuDNN may return CUDNN_STATUS_EXECUTION_FAILED from cudnnConvolutionForward(), cudnnConvolutionBiasActivationForward(), or cudnnConvolutionBackwardData() when computing convolutions with large spatial dimensions and batch sizes. This issue will be addressed in a future release so that such problems will instead return CUDNN_STATUS_NOT_SUPPORTED where applicable.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1025 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine results in incorrect results.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the output dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not been documented yet.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • CUDA streams internal to cuDNN are not guaranteed to have the same priority as the user stream that is set by cudnnSetStream(). We recently discovered some issues that break our ability to document exceptions to this clearly.
  • The functional support criteria of cuDNN's convolution kernels are not required to consider padding. Users of cuDNN can see an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size while padding is nonzero but sufficient to extend spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • cudnnPoolingBackward() enables both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.
  • Users of the static library requiring the best possible convolution performance should use whole-archive linking with the cnn_infer and cnn_train static sub libraries. This will come at a cost to the binary size of the application. This linkage requirement will be relaxed in a future release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the NVIDIA cuDNN Support Matrix.

Limitations

  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1001 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • Versions of cuDNN before the 8.0 release series do not support the NVIDIA Ampere Architecture and will generate incorrect results if used on that architecture.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends aligning tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN; a sketch of aligning sub-allocations follows this list. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based loads for performance improvements, particularly on older hardware architectures. Users could opt out of texture loads with the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 (2^29).
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must explicitly add the cuBLAS dependencies to the linker command to resolve undefined symbols from the cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
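
One common place where the 16-byte alignment recommendation above matters is when sub-allocating several tensors out of a single device buffer; the helper below is an illustrative sketch, not a cuDNN API:

    #include <stddef.h>

    /* Round a byte offset up to the next 16-byte boundary so every tensor carved
       out of one shared device allocation keeps the recommended alignment
       (the base pointer returned by cudaMalloc is already well aligned). */
    static size_t alignUp16(size_t offset)
    {
        return (offset + 15u) & ~(size_t)15u;
    }

    /* Example: place two tensors back to back inside one workspace buffer. */
    static void planOffsets(size_t bytesA, size_t *offsetA, size_t *offsetB)
    {
        *offsetA = 0;
        *offsetB = alignUp16(bytesA);  /* second tensor starts on a 16-byte boundary */
    }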

Deprecated Features

The following features are deprecated in cuDNN 8.3.3:
  • We are deprecating the reporting of performance results in the Best Practices For Using cuDNN 3D Convolutions and will instead update these Release Notes if there is anything interesting to report release-over-release. Starting with cuDNN 8.4.0, this section will be removed. For past performance tables, refer to the NVIDIA cuDNN Archives.
  • Updated and migrated the content from the Best Practices For Using cuDNN 3D Convolutions to the NVIDIA cuDNN Developer Guide. The Best Practices document has been deprecated.

1.17. cuDNN Release 8.3.2

These are the NVIDIA cuDNN 8.3.2 Release Notes. This release includes fixes from the previous cuDNN v8.1.x releases as well as the following additional changes. These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • In the runtime fusion engine, pointwise fusion for batched matmul was extended to support operations with full tensor in the epilog.

Announcements

Fixed Issues

  • cuDNN multihead attention produced incorrect results when the postDropout feature was enabled. The issue has been fixed in this release.
  • Running convBiasAct in CUDNN_CROSS_CORRELATION mode could result in incorrect results if the GroupedDirect engine is selected.
  • Documentation has been updated for pooling forward and backward API functions, to update which data types and vectorizations are supported for the tensor descriptor arguments. For more information, refer to the cudnnPoolingForward() and cudnnPoolingBackward().
  • The documentation in the Reproducibility section in the NVIDIA cuDNN Developer Guide has been improved upon for clarity.
  • The documentation for the CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR, CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR, and CUDNN_BACKEND_OPERATION_RESAMPLE_BWD_DESCRIPTOR in the NVIDIA cuDNN API Reference has been improved upon for clarity.
  • Use of CUDNN_TENSOR_NCHW_VECT_C with cudnnReorderFilterAndBias() could generate incorrect results when the reordered filter data was used incorrectly within cuDNN. Direct use of cudnnConvolutionForward() or cudnnConvolutionBiasActivationForward() without cudnnReorderFilterAndBias() was unaffected by this issue.
  • Compared to cuDNN 8.3.0, there was an overall ~5% regression on convBiasAct layers on PG199/PG189. The maximum performance regression was around 3x for a select few cases. This issue has been fixed in this release.

Known Issues

  • cuDNN may return CUDNN_STATUS_EXECUTION_FAILED from cudnnConvolutionForward(), cudnnConvolutionBiasActivationForward(), or cudnnConvolutionBackwardData() when computing convolutions with large spatial dimensions and batch sizes. This issue will be addressed in a future release so that such problems will instead return CUDNN_STATUS_NOT_SUPPORTED where applicable.
  • Versions of cuDNN before the 8.0 release series do not support the NVIDIA Ampere Architecture and will generate incorrect results if used on that architecture.
  • Data gradient backendEngine 25 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine results in incorrect results.
  • The documentation for cudnnReorderFilterAndBias() needs some corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the output dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not been documented yet.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • CUDA streams internal to cuDNN are not guaranteed to have the same priority as the user stream that is set by cudnnSetStream(). We recently discovered some issues that break our ability to document exceptions to this clearly.
  • The functional support criteria of cuDNN's convolution kernels are not required to consider padding. Users of cuDNN can see an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size while padding is nonzero but sufficient to extend spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • cudnnPoolingBackward() allows both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.
  • Users of the static library requiring the best possible convolution performance should use whole-archive linking with the cnn_infer and cnn_train static sub libraries. This will come at a cost to the binary size of the application. This linkage requirement will be relaxed in a future release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the NVIDIA cuDNN Support Matrix.

Limitations

  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples can crash unless they are installed in a writable location.
  • RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms could use texture-based loads for performance improvements, particularly on older hardware architectures. Users could opt out of texture loads with the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 (2^29).
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must explicitly add the cuBLAS dependencies to the linker command to resolve undefined symbols from the cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.

Deprecated Features

The following features are deprecated in cuDNN 8.3.2:
  • We are deprecating the reporting of performance results in the Best Practices For Using cuDNN 3D Convolutions and will instead update these Release Notes if there is anything interesting to report release-over-release. Starting with cuDNN 8.4.0, this section will be removed. For past performance tables, refer to the NVIDIA cuDNN Archives > Best Practices For Using cuDNN 3D Convolutions.

1.18. cuDNN Release 8.3.1

These are the NVIDIA cuDNN 8.3.1 Release Notes. This release includes fixes from the previous cuDNN v8.1.x releases as well as the following additional changes. These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Announcements

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • In the runtime fusion engine:
    • Pointwise logical and comparison operators are now supported, including CUDNN_POINTWISE_LOGICAL_AND, CUDNN_POINTWISE_LOGICAL_OR, CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_CMP_EQ, CUDNN_POINTWISE_CMP_NEQ, CUDNN_POINTWISE_CMP_GT, CUDNN_POINTWISE_CMP_GE, CUDNN_POINTWISE_CMP_LT, and CUDNN_POINTWISE_CMP_LE. As part of this feature, support for loading/storing/computing with boolean tensors has also been added.
    • Batch support was added for the matmul operation, including batch broadcasting: the same matrix A or B can be broadcast across the batch.
    • Leading-dimension support (reflected in the strides of the tensors) was added for the matmul operation, so matmul can now be computed with unpacked tensors; an indexing sketch follows this list.
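
The batch-broadcast and leading-dimension items above amount to how the tensor strides are laid out; the plain-C sketch below only illustrates the indexing convention (row-major, a zero batch stride for a broadcast operand, and a leading dimension larger than the row length for an unpacked matrix) and is not the cuDNN API:

    #include <stddef.h>

    /* Element (b, i, j) of a row-major batched matrix described by explicit
       strides. batchStride == 0 reuses the same matrix for every batch entry
       (broadcast of operand A or B); ld larger than the number of columns
       describes an unpacked matrix. */
    static float elemAt(const float *data, size_t batchStride, size_t ld,
                        size_t b, size_t i, size_t j)
    {
        return data[b * batchStride + i * ld + j];
    }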

Fixed Issues

The following issues have been fixed in this release:
  • cudnnConvolutionBiasActivationForward() could in some cases silently apply a ReLU operation when Identity was requested. This issue has been fixed in this release.
  • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT_TILING, CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING, and CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING could exhibit illegal memory access in cuDNN v8 releases. This issue has been fixed in this release.
  • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 was wrongly marked with numerical note CUDNN_NUMERICAL_NOTE_REDUCED_PRECISION_REDUCTION when output data type is float or double. This issue has been fixed in this release.
  • There was an error in the documentation for determinism of cudnnConvolutionBackwardFilter() by algo. This issue has been corrected in this release.
  • When the user selected algo0 (CUDNN_RNN_ALGO_STANDARD) in cudnnRNNBackwardData_v8() or invoked the legacy functions, such as cudnnRNNBackwardDataEx() and cudnnRNNBackwardData(), and the number of RNN layers was more than eight in a unidirectional model or more than four in a bidirectional model, some internal streams used to parallelize computations could be default streams (also known as stream 0), most likely degrading computational performance. This issue has been fixed in this release.
  • Calling cudnnSoftmaxForward() with CUDNN_SOFTMAX_MODE_CHANNEL mode and N==1 in NCHW layout would result in incorrect results in cuDNN 8.3.0. This has been fixed in this release.

Known Issues

  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet (up to a 6x performance difference), UNet (up to 1.6x), and Tacotron (up to 1.6x).
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the output dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not yet been documented.
  • In rare cases, cuDNN versions from 7.6 up to, but not including, 8.1.1 can leak memory when computing common convolution operations.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • In general, the internal CUDA streams inside cuDNN will have the same priority as the user stream that is set by cudnnSetStream() (instead of always having default priority); a usage sketch follows this list. There are two exceptions:
    1. When the user stream is in capture mode (that is, cudaStreamCaptureStatusActive==1), the cuDNN-owned streams will still have default priority, and
    2. RNN functions cudnnRNNForward(), cudnnRNNBackwardData_v8(), cudnnRNNBackwardWeights_v8(), and their legacy counterparts still use default priority CUDA streams or higher priority streams to launch concurrent and cooperative grids.
  • The functional support criteria of cuDNN's convolution kernels are not required to consider padding. Users of cuDNN can therefore see an unexpected lack of problem support when the forward convolution spatial dimensions are smaller than the filter size, even when the nonzero padding is sufficient to extend the spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • cudnnPoolingBackward() allows both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.
  • Users of the static library requiring the best possible convolution performance should use whole-archive linking with the cnn_infer and cnn_train static sub libraries. This will come at a cost to the binary size of the application. This linkage requirement will be relaxed in a future release.
  • Compared to cuDNN 8.3.0, there is an overall ~5% regression on convBiasAct layers on PG199/PG189. The maximum performance regression is around 3x for a select few cases.
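
As noted in the stream-priority item above, the internal cuDNN streams follow the priority of the user stream supplied through cudnnSetStream(). The following is a minimal sketch of supplying a non-default-priority user stream; the priority value is whatever the device reports as its highest, and error checking is omitted.

    #include <cuda_runtime.h>
    #include <cudnn.h>

    void use_priority_stream(cudnnHandle_t handle) {
        /* Query the allowed priority range; numerically lower values mean
           higher priority for CUDA streams. */
        int least = 0, greatest = 0;
        cudaDeviceGetStreamPriorityRange(&least, &greatest);

        /* Create a high-priority, non-blocking user stream. */
        cudaStream_t stream;
        cudaStreamCreateWithPriority(&stream, cudaStreamNonBlocking, greatest);

        /* Internal cuDNN streams generally inherit this priority, subject to
           the two exceptions listed above (graph capture and the RNN APIs). */
        cudnnSetStream(handle, stream);

        /* ... launch cuDNN work on this handle ... */

        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
    }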

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the NVIDIA cuDNN Support Matrix.

Limitations

  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples can crash unless they are installed in a writable location.
  • RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2 (a sketch of setting this variable programmatically follows this list).

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require a minimum of 4-byte alignment, including for INT8 data, in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends aligning tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use a texture-based load structure for performance improvements, particularly on older hardware architectures. Users could opt out of texture loading via the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue using texture-based loads can adopt the new backend API and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 (2^29).
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linker command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs are the batch normalization training and backward functions when the CUDNN_BATCHNORM_SPATIAL_PERSISTENT mode is selected.
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
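
As mentioned in the RNN and multihead attention determinism limitation above, a single cuBLAS workspace configuration can be selected through the CUBLAS_WORKSPACE_CONFIG environmental variable. A minimal sketch of setting it programmatically before the library is initialized is shown below; it assumes a POSIX environment (setenv) and omits error checking.

    #include <stdlib.h>
    #include <cudnn.h>

    int main(void) {
        /* Select a single workspace size (two buffers of 4 MB each) so that
           cuBLAS kernel selection, and therefore cuDNN RNN and multihead
           attention results, stay deterministic in multi-stream setups.
           The variable must be set before the first cuBLAS/cuDNN call. */
        setenv("CUBLAS_WORKSPACE_CONFIG", ":4096:2", /*overwrite=*/1);

        cudnnHandle_t handle;
        cudnnCreate(&handle);

        /* ... run RNN or multihead attention APIs ... */

        cudnnDestroy(handle);
        return 0;
    }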

1.19. cuDNN Release 8.3.0

These are the NVIDIA cuDNN 8.3.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes. These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Announcements

  • cuDNN version 8.3.0 depends on cuBLAS as a shared library dependency.
  • The cuDNN version 8.3.0 libcudnn_static.a deliverable is replaced with the following:
    • libcudnn_ops_infer_static.a
    • libcudnn_ops_train_static.a
    • libcudnn_cnn_infer_static.a
    • libcudnn_cnn_train_static.a
    • libcudnn_adv_infer_static.a
    • libcudnn_adv_train_static.a
  • cuDNN version 8.3.0 depends on zlib as a shared library dependency. Refer to the zlib instructions in the NVIDIA cuDNN Installation Guide.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • WSL 2 support is released as a preview feature in cuDNN 8.3.0.
  • Various improvements were made in our multihead attention API:
    • Added HSH support - the FP16 data type with FP32 math precision. This allows FP16 mixed-precision Tensor Core performance to be achieved without sacrificing accuracy.
    • Added support for bias gradient computation. Before cuDNN 8.3.0, bias was supported only for inference.
    • Multihead attention has two dropout layers active in training mode. The first dropout operation is applied directly to the softmax output. The second dropout operation alters the multihead attention output, just before the point where residual connections are added. Before cuDNN 8.3.0, only the first dropout layer was supported.
    • Significant performance improvement out of the box (no changes are required from users) for both multihead attention forward and backward paths.
  • Various improvements were made in our runtime fusion engine:
    • The cuDNN runtime fusion engine now supports resample operations of upsample and downsample. Support has been added for 2x2 average pooling with stride 2 and upsampling by a factor of 2 using bilinear interpolation. The supported data type is FP32 and the compute data type is FP32. The resample operations can also be fused with other operations provided the input to the resample operation is located in global memory.
    • The cuDNN runtime fusion engine is now generalized to accurately obey the intermediate storage data type users specify in the operation graph. The supported data types include INT8, BF16, FP16, INT32, and FP32. As a general rule, we recommend using FP32 as the intermediate storage type, which provides balanced numerical precision and performance.
    • The cuDNN runtime fusion engine now supports batched matmul operation with row/column broadcast or row/column reduction operations in the epilog.
    • The cuDNN runtime fusion engine now does numeric clamping while converting from a data type with a larger dynamic range to one with a relatively smaller dynamic range to avoid numeric overflows at all times.
    • The cuDNN runtime fusion engine extends the support for broadcast pointwise operations in the epilogue to now include those between a tensor and a scalar value as well.
    • Extended general fusion heuristic support to convolution forward and backward data and weight gradient operation patterns, with FP16, TF32, INT8 I/O data types, to ensure a good heuristic selection to improve out-of-the-box performance.
  • Added more detailed error reporting that is accessible from the existing API log or the logging callback function. Error and warning severity levels have been added to the error reporting. The environment variables CUDNN_LOGERR_DBG and CUDNN_LOGWARN_DBG can be used to enable these severity levels respectively. Within these severity levels, the error or warning message will now include a traceback of the conditions that triggered the error as hints for troubleshooting purposes.
  • Engine heuristics now support a new mode called HEUR_MODE_FALLBACK, which returns a list of engine configurations that can run most convolution problems, although without a performance guarantee. Use this mode when none of the engines suggested by the regular heuristics are supported (a query sketch follows this list).
  • In prior cuDNN versions, certain engines required reordered filters for int8x32 format, but there was no way to disambiguate whether the filter was reordered. Engines that require reordered filters now have a new behavior note CUDNN_BEHAVIOR_NOTE_REQUIRES_FILTER_REORDER which specifies the tensors must be reordered before being passed to the engine.
  • RNN functions cudnnRNNBackwardData_v8(), cudnnRNNBackwardDataEx(), and cudnnRNNBackwardData() have been improved to internally invoke the cooperative group API cudaLaunchCooperativeKernel() to launch GPU kernels when threads must synchronize across all CUDA® thread blocks of a grid. Starting in CUDA 11.2, the cudaLaunchCooperativeKernel() function is able to run multiple cooperative grids concurrently in multiple streams. This feature has been used in the CUDNN_RNN_ALGO_PERSIST_STATIC and CUDNN_RNN_ALGO_PERSIST_DYNAMIC algorithms to improve computational performance. Launching these types of kernels using cudaLaunchCooperativeKernel() is more robust in preventing potential deadlocks in rare scenarios when multiple cooperative grids are launched concurrently.

    cuDNN 8.3 compiled with CUDA 10.2 must still rely on a regular method of launching kernels. Deadlocks are mitigated by employing higher priority CUDA streams. Currently, cuDNN RNN APIs still use higher priority streams, however, in future cuDNN versions, priorities of auxiliary streams will match the priority of the user stream defined by the cudnnSetStream() call. Future cuDNN versions will also use the cudaLaunchCooperativeKernel() API to launch cooperative grids in forward RNN functions such as cudnnRNNForward().
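
    A minimal sketch of querying the fallback heuristic mode mentioned above through the backend API is shown below. The operation graph is assumed to have been built and finalized already; the maximum number of requested configurations is illustrative, and error checking is omitted.

      #include <stdint.h>
      #include <cudnn.h>

      /* Query fallback engine configurations for an already-finalized
         operation graph opGraph. */
      void query_fallback_engines(cudnnBackendDescriptor_t opGraph) {
          cudnnBackendDescriptor_t heur;
          cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR, &heur);
          cudnnBackendSetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH,
                                   CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &opGraph);

          cudnnBackendHeurMode_t mode = CUDNN_HEUR_MODE_FALLBACK;
          cudnnBackendSetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_MODE,
                                   CUDNN_TYPE_HEUR_MODE, 1, &mode);
          cudnnBackendFinalize(heur);

          /* Retrieve up to maxCfgs engine configurations; each descriptor must
             be created before cudnnBackendGetAttribute() can fill it in. */
          enum { maxCfgs = 8 };
          cudnnBackendDescriptor_t cfgs[maxCfgs];
          for (int i = 0; i < maxCfgs; ++i) {
              cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &cfgs[i]);
          }
          int64_t returned = 0;
          cudnnBackendGetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
                                   CUDNN_TYPE_BACKEND_DESCRIPTOR, maxCfgs,
                                   &returned, cfgs);

          /* Try the returned configurations in order: finalize an execution plan
             with each one and move on when CUDNN_STATUS_NOT_SUPPORTED is hit. */
          (void)returned;
      }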

Fixed Issues

The following issues have been fixed in this release:
  • cudnnAddTensor() did not support some tensor shapes that were previously specified as supported. This issue has been fixed in this release.
  • When using the cuDNN CTC Loss API function, the computed gradients array was not zero initialized. This meant it was possible the gradients array returned a mix of valid values and uninitialized values. This issue has been fixed in this release.
  • Compared to version 8.0.5, legacy convolution APIs increased CPU computational costs. On x86, this was measured to be as high as 10 microseconds. This issue has been fixed in this release.
  • cudnnSetStream() API was generating errors when graph capture was enabled. This issue is fixed in this release.
  • There was a known 60% performance regression for ResNet-50 on the GTX 1080 when run using FP16 data with large batch sizes (over 128). This regression has been fixed in this release.
  • In some cases, cudnnConvolutionBackwardFilter() generates numerically imprecise results when used with algo set to CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1. This issue is most frequently encountered with three-dimensional spatial tensors. Users of the backend API may explicitly avoid backend engine 2032 or consider the numerical notes of engines and reject any marked as offering reduced precision reduction (CUDNN_NUMERICAL_NOTE_REDUCED_PRECISION_REDUCTION).
  • For cudnnConvolutionBiasActivationForward(), there was previously a restriction on aliasing device memory pointers labeled X and Z in the documentation of that function. This restriction has been relaxed so that X may alias Z by pointing to the same device memory location if desired. Note that the restriction against aliasing the pointers labeled X and Y remains.
  • cuDNN may be observed to contain a small leak related to the use of dlopen. Currently, this is believed to be a false positive when indicated by valgrind. Should this thinking change, the known issues of this document will reflect that understanding in subsequent releases.
  • Previously, on NVIDIA Pascal and Maxwell architectures, users of cuDNN 8.x's backend engine 34 with CUDNN_CONVOLUTION mode set for forward convolution could witness illegal memory access when this engine is specifically selected outside of a heuristic query. This issue has been fixed in this release.
  • Previously, on K80 GPUs when cudnnConvolutionForward() is used with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half I/O data types a silent error might occur when the output width Q is 1 and both height and width padding are zero; this particular case will now be rejected by cuDNN as not supported in this and all other successive releases for this GPU architecture.
  • cuDNN does not package libfreeimg as a static library for users of cuDNN's MNIST sample code. The included readme.txt file contains instructions on where to locate this dependency and how to compile and link this sample.
  • The parameters section for cudnnBatchNormalizationForwardInference() has been updated to reflect a correct *y description.
  • Compared to cuDNN 7.6.5, there was a known performance regression on various convolutional models using INT8 data types on NVIDIA Volta GPUs. This issue has been fixed in this release.
  • A memory leak as well as a possible delayed memory deallocation in the cuDNN runtime fusion engine have been fixed.
  • In previous releases of cuDNN 8, user applications might crash in rare instances due to large stack allocation requirements; this issue is fixed in this release by preferring heap allocation in cases where large stack allocations were previously occurring.
  • Some dgrad batchnorm fusion engines were previously not supported on Windows. We now support this starting in cuDNN 8.3.0.
  • The documentation for cudnnReorderFilterAndBias() needed some corrections for clarity. The topic has been updated in this release.

Known Issues

  • When the user selects algo0 (CUDNN_RNN_ALGO_STANDARD) in cudnnRNNBackwardData_v8() or invokes the legacy functions, such as cudnnRNNBackwardDataEx(), cudnnRNNBackwardData(), and the number of RNN layers is more than eight in a unidirectional model or more than four in a bidirectional model, then some internal streams used to parallelize computations may be default streams (aka stream 0). RNN algo0 dgrad APIs will not crash and the numerical results will be correct but the computational performance will likely be affected in those cases.
  • Users of the static library requiring best possible convolution performance should use whole-archive linking. This will come at a cost to binary size that will require resolution in future releases, either through static sub libraries or relaxing the whole-archive linkage requirement altogether.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet (up to a 6x performance difference), UNet (up to 1.6x), and Tacotron (up to 1.6x).
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor that lie outside the output dimensions recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not yet been documented.
  • In rare cases, cuDNN versions from 7.6 up to, but not including, 8.1.1 can leak memory when computing common convolution operations.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • The internal CUDA streams inside cuDNN 8.3.0 will have the same priority as the user stream that is set by cudnnSetStream() (instead of always having default priority). There are two limitations:
    1. When the user stream is in capture mode (that is, cudaStreamCaptureStatusActive==1), the cuDNN-owned streams will still have default priority, and
    2. RNN functions cudnnRNNForward(), cudnnRNNBackwardData_v8(), cudnnRNNBackwardWeights_v8(), and their legacy counterparts still use default priority CUDA streams or higher priority streams to launch concurrent and cooperative grids.
  • The functional support criteria of cuDNN's convolution kernels are not required to consider padding. Users of cuDNN can therefore see an unexpected lack of problem support when the forward convolution spatial dimensions are smaller than the filter size, even when the nonzero padding is sufficient to extend the spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • cudnnPoolingBackward() allows both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling, which can reduce memory footprint and bandwidth (a sketch follows this list).
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Calling cudnnSoftmaxForward() with CUDNN_SOFTMAX_MODE_CHANNEL mode and N==1 in NCHW layout may result in incorrect results. This will be fixed in the next release.
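
A sketch of the average-pooling backward call with NULL x and y arguments mentioned above is shown below. It assumes the pooling descriptor is configured with an average-pooling mode and that the gradient descriptors and device buffers already exist; error checking is omitted.

    #include <stddef.h>
    #include <cudnn.h>

    /* For average pooling, the input (x) and output (y) values are not needed
       to compute dx, so both pointers and their descriptors may be NULL. */
    void avg_pooling_backward(cudnnHandle_t handle,
                              cudnnPoolingDescriptor_t poolDesc, /* CUDNN_POOLING_AVG_* */
                              cudnnTensorDescriptor_t dyDesc, const void *dy,
                              cudnnTensorDescriptor_t dxDesc, void *dx) {
        const float alpha = 1.0f, beta = 0.0f;
        cudnnPoolingBackward(handle, poolDesc,
                             &alpha,
                             /*yDesc=*/NULL, /*y=*/NULL,
                             dyDesc, dy,
                             /*xDesc=*/NULL, /*x=*/NULL,
                             &beta,
                             dxDesc, dx);
    }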

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the NVIDIA cuDNN Support Matrix.

Limitations

  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples can crash unless they are installed in a writable location.
  • RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require a minimum of 4-byte alignment, including for INT8 data, in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends aligning tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory for the temporary buffers. Further, this will introduce an additional round trip to memory that might noticeably impact performance. A sketch of these conversion steps follows this list.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use a texture-based load structure for performance improvements, particularly on older hardware architectures. Users could opt out of texture loading via the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who want to continue using texture-based loads can adopt the new backend API and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 (2^29).
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linker command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs are the batch normalization training and backward functions when the CUDNN_BATCHNORM_SPATIAL_PERSISTENT mode is selected.
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
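
A minimal sketch of the integer-type conversion workaround described above is shown below: the INT8 input is converted to FLOAT with cudnnTransformTensor(), the desired API is run on the float copy, and the result is converted back. The float descriptor and temporary buffers are assumed to have been created elsewhere; error checking is omitted.

    #include <cudnn.h>

    /* Steps 1 and 3 of the workaround: convert between CUDNN_DATA_INT8 and
       CUDNN_DATA_FLOAT tensors. The two descriptors describe the same shape
       but different data types. */
    void int8_float_roundtrip(cudnnHandle_t handle,
                              cudnnTensorDescriptor_t int8Desc,  const void *int8In,
                              cudnnTensorDescriptor_t floatDesc, void *floatBuf,
                              void *int8Out) {
        const float alpha = 1.0f, beta = 0.0f;

        /* 1. INT8 -> FLOAT: make a float copy of the input. */
        cudnnTransformTensor(handle, &alpha, int8Desc, int8In,
                             &beta, floatDesc, floatBuf);

        /* 2. Run the desired cuDNN API on floatBuf with CUDNN_DATA_FLOAT
              descriptors (not shown). */

        /* 3. FLOAT -> INT8: convert the result back to the desired type. */
        cudnnTransformTensor(handle, &alpha, floatDesc, floatBuf,
                             &beta, int8Desc, int8Out);
    }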

1.20. cuDNN Release 8.2.4

These are the NVIDIA cuDNN 8.2.4 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes. These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the