NVIDIA cuDNN Documentation
Release Notes (PDF) - Last updated July 10, 2023

cuDNN Release Notes

NVIDIA CUDA Deep Neural Network (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. It provides highly tuned implementations of routines arising frequently in DNN applications. These release notes describe the key features, software enhancements and improvements, and known issues for the NVIDIA cuDNN 8.9.3 and earlier releases.

1. cuDNN Release 8.x.x

1.1. cuDNN Release 8.9.3

These are the NVIDIA cuDNN 8.9.3 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Major improvements in the FP16/BF16 fused flash attention engine:
    • The supported layout of Q, K, V and O tensors and their gradients have been generalized. Now, the only requirement is that the d dimension must have stride 1. All other dimensions can have arbitrary strides.
    • The engine now supports both self attention and cross attention.
    • The engine now allows online generation of padding mask and causal mask; each one can be turned on/off independently.
    • The engine now supports an extra additive bias passed in as a tensor after the first batch MatMul and before the Softmax, which can be used to pass in customized masks (for example, Alibi).
    • The engine now supports multi-query attention where the headcount dimension of K and V tensors are 1
    • The engine now supports (padded) variable sequence length computation.
    • The engine now supports head dimensions to be 64 or 128.
    • The engine now supports arbitrary sequence length, batch size, and headcount.
    • The engine now supports both dropouts turned on and off.
    • The engine now supports the dropout mask to be accepted as a tensor rather than being generated inside the kernel.
    • The engine now supports the mask before Softmax to be an input tensor created by the user, rather than being online generated.
  • Improved CUDNN_HEUR_MODE_A heuristic recommendations for depthwise convolution on NVIDIA Hopper GPUs.
  • Improved error reporting during cuDNN handle creation. For more information on how to enable error reporting, refer to the NVIDIA cuDNN Developer Guide.

Fixed Issues

The following issues have been fixed in this release:
  • Use of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 17 for fused convolutions with bias and ReLU could generate incorrect results on the Maxwell GPU architectures. This issue is now fixed in 8.9.3.
  • Running Scale-Bias-Add-Relu-Conv-GenStats (refer to ConvBNfprop) with CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 could generate incorrect results on NVIDIA Ampere GPUs. This applies to whether dual tensors are set or not in the cited pattern. This pattern is no longer supported on NVIDIA Ampere GPUs.
  • A bug in FP16/BF16 fused flash attention for large problem sizes, that is, when b * h * s * d > 512 * 64 * number of SMs, which was resulting in incorrect values being computed, has been fixed.

Known Issues

  • A race condition in memory write accesses was flagged by the "compute-sanitizer" tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. Extensive testing on H100, with different clock speeds and computational loads, did not reveal any impact on functional results that were always identical and correct. This issue is currently being investigated.
  • The FP8 fused flash attention is known to be slower than FP16 fused flash attention for small batch sizes.
  • cuDNN's usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors. This issue has been fixed in this release.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
  • Use of cudnnFusedOpsExecute() on NVIDIA Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights may see a performance regression when moving from cuDNN 8.8 to cuDNN 8.9.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
  • When using the cuDNN static library, you will need to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. Refer to the NVIDIA cuDNN Support Matrix for the exact supported CUDA versions.
  • In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
  • Within the cuDNN version 8 backend API, the following engines are known not to be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 1. Engines That Are Not Thread-Safe
    CUDNN_ATTR_ENGINE_GLOBAL_INDEX convolution forward convolution backward data convolution backward filter cudnnConvolutionBiasActivationForward
    A100
    • 36
    • 38
    • 45
    • 46
    • 47
    • 48
    • 1
    • 2
    • 3
    • 19
    • 22
    • 25
    • 26
    • 28
    • 40
    • 46
    • 51
    • 56
    • 57
    • 58
    • 59
    • 60
    • 65
    • 9
    • 10
    • 21
    • 23
    • 33
    • 37
    • 47
    • 48
    • 49
    • 50
    • 4024
    • 4026
    • 4032
    • 4033
    V100
    • 8
    • 9
    • 10
    • 12
    • 16
    • 26
    • 30
    • 31
    • 34
    • 42
    • 49
    • 1
    • 2
    • 3
    • 8
    • 12
    • 13
    • 19
    • 21
    • 25
    • 26
    • 29
    • 37
    • 38
    • 44
    • 6
    • 7
    • 21
    • 35
    • 36
    • 43
    • 44
    • 51
    • 52
    • 4002
    • 4015
    • 4008
    • 4018
    • 4019
    • 4020
    • 4030
    T4
    • 9
    • 12
    • 13
    • 14
    • 16
    • 18
    • 26
    • 30
    • 31
    • 34
    • 42
    • 50
    • 1
    • 2
    • 3
    • 8
    • 9
    • 12
    • 13
    • 15
    • 18
    • 19
    • 21
    • 25
    • 26
    • 29
    • 30
    • 37
    • 39
    • 44
    • 45
    • 12
    • 21
    • 27
    • 28
    • 43
    • 44
    • 51
    • 53
    • 4015
    • 4003
    • 4004
    • 4009
    • 4018
    • 4019
    • 4020
    • 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX =1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this is specifying the ops library after cnn on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.

1.2. cuDNN Release 8.9.2

These are the NVIDIA cuDNN 8.9.2 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Improvements in the fused flash attention runtime fusion engine:
    • Added new performance knobs for the engine.
    • Updated heuristics for the engine for both fprop and bprop.
  • Extended fused (non-flash) attention engine to accept dropout mask as an input rather than being generated inside the kernel.
  • Improved the performance of the runtime fusion engine for the NVIDIA Hopper architecture and improved the compilation time.
  • Updated runtime fusion engine heuristics for FP16 and FP8 convolution fprop, dgrad, and Matmul operations to support new kernels on the NVIDIA Hopper architecture.
  • Relaxed the limitation that only one reduction operation is allowed in the DAG of g2 of Generic Runtime Fusion Engines.

Fixed Issues

The following issues have been fixed in this release:
  • Use of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 15 for fused convolutions with bias and ReLU could generate incorrect results on the Pascal GPU architectures for cuDNN 8.9 releases. This issue is now fixed in 8.9.2. Users of cudnnConvolutionBiasActivationForward would have been similarly affected.
  • Additional fixes were implemented in cudnnRNNBackwardWeights_v8() to harden the process of transferring the variable sequence length array from the RNN data descriptor to device memory. As in cuDNN 8.9.1, the const int32_t devSeqLengths[] argument is no longer used and can be set to NULL by the user.

Known Issues

  • The FP8 fused flash attention is known to be slower than FP16 fused flash attention for small batch sizes.
  • cuDNN's usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors. This issue has been fixed in this release.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
  • Use of cudnnFusedOpsExecute() on NVIDIA Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights may see a performance regression when moving from cuDNN 8.8 to cuDNN 8.9.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
  • There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
  • When using the cuDNN static library, you will need to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. Refer to the NVIDIA cuDNN Support Matrix for the exact supported CUDA versions.
  • In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
  • Within the cuDNN version 8 backend API, the following engines are known not to be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 2. Engines That Are Not Thread-Safe
    CUDNN_ATTR_ENGINE_GLOBAL_INDEX convolution forward convolution backward data convolution backward filter cudnnConvolutionBiasActivationForward
    A100
    • 36
    • 38
    • 45
    • 46
    • 47
    • 48
    • 1
    • 2
    • 3
    • 19
    • 22
    • 25
    • 26
    • 28
    • 40
    • 46
    • 51
    • 56
    • 57
    • 58
    • 59
    • 60
    • 65
    • 9
    • 10
    • 21
    • 23
    • 33
    • 37
    • 47
    • 48
    • 49
    • 50
    • 4024
    • 4026
    • 4032
    • 4033
    V100
    • 8
    • 9
    • 10
    • 12
    • 16
    • 26
    • 30
    • 31
    • 34
    • 42
    • 49
    • 1
    • 2
    • 3
    • 8
    • 12
    • 13
    • 19
    • 21
    • 25
    • 26
    • 29
    • 37
    • 38
    • 44
    • 6
    • 7
    • 21
    • 35
    • 36
    • 43
    • 44
    • 51
    • 52
    • 4002
    • 4015
    • 4008
    • 4018
    • 4019
    • 4020
    • 4030
    T4
    • 9
    • 12
    • 13
    • 14
    • 16
    • 18
    • 26
    • 30
    • 31
    • 34
    • 42
    • 50
    • 1
    • 2
    • 3
    • 8
    • 9
    • 12
    • 13
    • 15
    • 18
    • 19
    • 21
    • 25
    • 26
    • 29
    • 30
    • 37
    • 39
    • 44
    • 45
    • 12
    • 21
    • 27
    • 28
    • 43
    • 44
    • 51
    • 53
    • 4015
    • 4003
    • 4004
    • 4009
    • 4018
    • 4019
    • 4020
    • 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX =1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this is specifying the ops library after cnn on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.

1.3. cuDNN Release 8.9.1

These are the NVIDIA cuDNN 8.9.1 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Improved library-wide error reporting coverage by providing the triggered error condition for CUDNN_STATUS_BAD_PARAM and CUDNN_STATUS_NOT_SUPPORTED in the error and warning level logging. The majority of error logs now include not just the error code, but also the error "reason" (that is, the condition that failed). Refer to Error Reporting And API Logging for instructions to enable the error reporting feature.
  • Expanded support of fused flash attention inference and training to sequence lengths of multiples of 64 and hidden dimension of 64 and 128.
  • Expanded support of fused attention to accept a general mask as an input for training, added backward pass for relative positional encoding, and added support for sequence lengths of multiples of 64 (not multiples of 128) up to 512.
  • Fine-tuned runtime fusion heuristics for NVIDIA Ada architecture for FP16 and INT8 convolution forward propagation.
  • HEUR_MODE_A supports BnAddRelu and DReluForkDBn patterns. The support is currently limited to operation graphs with single node multi-GPU batch norms without any pointwise node fusions.

Fixed Issues

The following issues have been fixed in this release:
  • Corrected the engine config returned by heuristics for FP8 convolution.
  • cudnnGetConvolutionForwardAlgorithm_v7(), cudnnGetConvolutionBackwardDataAlgorithm_v7(), and cudnnGetConvolutionBackwardFilterAlgorithm_v7() would exhibit a memory leak on NVIDIA Hopper GPUs. This issue has been fixed in this release.
  • For packed NCHW tensors using the FP16 datatype, cuDNN attempted to run an optimized kernel if the values of N, C, H, and W were even. In cuDNN versions before 8.4, it was possible that incorrect values were generated if odd values for the strides of N or C were used. This issue has been fixed in this release.
  • cuDNN 8.9.1 added tensor alignment checks to instance norm and layer norm engines to prevent IMA issues.
  • Starting in cuDNN 8.9.1, the const int32_t devSeqLengths[] argument in cudnnRNNForward(), cudnnRNNBackwardData_v8(), and cudnnRNNBackwardWeights_v8() APIs will be ignored. All three functions will source variable sequence length arrays from RNN data descriptors, configured through the seqLengthArray parameter of cudnnSetRNNDataDescriptor(). The user does not need to transfer this array to device memory; the operation will be performed automatically by RNN APIs. This refinement simplifies the usage of cuDNN RNN APIs. It is also a workaround for random crashes in multi-GPU RNN training on TensorFlow. Replacing earlier versions of cuDNN 8.x shared libraries with cuDNN 8.9.1 will eliminate those crashes without forcing the user to switch the TensorFlow version. The cause of intermittent corruptions of devSeqLengths[], fed to RNN APIs, is still being investigated.
  • Compared to cuDNN 7.6, there were known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs. This issue has been fixed in this release.

Known Issues

  • cuDNN's usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors. This issue has been fixed in this release.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
  • Use of cudnnFusedOpsExecute() on NVIDIA Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights may see a performance regression when moving from cuDNN 8.8 to cuDNN 8.9.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
  • There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
  • When using the cuDNN static library, you will need to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. Refer to the NVIDIA cuDNN Support Matrix for the exact supported CUDA versions.
  • In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
  • Within the cuDNN version 8 backend API, the following engines are known not to be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 3. Engines That Are Not Thread-Safe
    CUDNN_ATTR_ENGINE_GLOBAL_INDEX convolution forward convolution backward data convolution backward filter cudnnConvolutionBiasActivationForward
    A100
    • 36
    • 38
    • 45
    • 46
    • 47
    • 48
    • 1
    • 2
    • 3
    • 19
    • 22
    • 25
    • 26
    • 28
    • 40
    • 46
    • 51
    • 56
    • 57
    • 58
    • 59
    • 60
    • 65
    • 9
    • 10
    • 21
    • 23
    • 33
    • 37
    • 47
    • 48
    • 49
    • 50
    • 4024
    • 4026
    • 4032
    • 4033
    V100
    • 8
    • 9
    • 10
    • 12
    • 16
    • 26
    • 30
    • 31
    • 34
    • 42
    • 49
    • 1
    • 2
    • 3
    • 8
    • 12
    • 13
    • 19
    • 21
    • 25
    • 26
    • 29
    • 37
    • 38
    • 44
    • 6
    • 7
    • 21
    • 35
    • 36
    • 43
    • 44
    • 51
    • 52
    • 4002
    • 4015
    • 4008
    • 4018
    • 4019
    • 4020
    • 4030
    T4
    • 9
    • 12
    • 13
    • 14
    • 16
    • 18
    • 26
    • 30
    • 31
    • 34
    • 42
    • 50
    • 1
    • 2
    • 3
    • 8
    • 9
    • 12
    • 13
    • 15
    • 18
    • 19
    • 21
    • 25
    • 26
    • 29
    • 30
    • 37
    • 39
    • 44
    • 45
    • 12
    • 21
    • 27
    • 28
    • 43
    • 44
    • 51
    • 53
    • 4015
    • 4003
    • 4004
    • 4009
    • 4018
    • 4019
    • 4020
    • 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX =1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this is specifying the ops library after cnn on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.

1.4. cuDNN Release 8.9.0

These are the NVIDIA cuDNN 8.9.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Added support for transformer models training and inference using Flash Attention in cuDNN runtime fusion engine. For more information, refer to the flash fused multi-head attention fprop and bprop patterns listed in the NVIDIA cuDNN Developer Guide.
  • Added support for FP8 fused-multi-head attention training and inference support targeting BERT on NVIDIA Hopper GPUs.
  • The runtime fusion engine has been split into four engines (with engine id 0, 1, 2, and 3) according to functional coverage and device architecture.
    • Engine id 0 targets general runtime fusion for the NVIDIA Volta and NVIDIA Turing architectures.
    • Engine id 1 targets general runtime fusion for the NVIDIA Ampere architecture.
    • Engine id 2 targets general runtime fusion for the NVIDIA Hopper architecture.
    • Engine id 3 targets multihead attention related fusions for the NVIDIA Ampere and NVIDIA Hopper architectures.

    This split has enabled the heuristics to have more control over kernel selection. Knob information is now more precise for each splitted engine. As a result, internal testing coverage has also improved.

  • A new runtime fusion engine (with engine id 4) has been introduced to enable up to 4x better compilation time and better epilogue fusion efficiency. It’s currently recommended by heuristics for a subset of patterns and problem sizes. We expect the coverage to grow further in future releases.
  • Improved FP8 Dgrad heuristics to have generally good out-of-the-box support for both CUDNN_DATA_FLOAT and CUDNN_DATA_FAST_FLOAT_FOR_FP8 compute type.
  • Enabled a new knob CUDNN_KNOB_TYPE_SPLIT_K_SLC in runtime fusion engines that can be used to control Split-K in SM 7.X, SM 8.X or Segment-K in SM 9.X convolution or GEMM kernels.
  • Added TF32 heuristics for runtime fusion for the NVIDIA Hopper architecture.
  • Added support for the DgradDreluBNBwdWeight fusion pattern on NVIDIA Hopper GPUs. For more information, refer to the DgradDreluBNBwdWeight pattern listed in the NVIDIA cuDNN Developer Guide.
  • Two new patterns were added to the ConvBNFprop fusion patterns. These are supported on NVIDIA Hopper GPU’s. For more information, refer to the ConvBNFprop pattern listed in the NVIDIA cuDNN Developer Guide.
  • Added a new runtime compiled engine with some fusion capabilities to support single GPU and single node multi-GPU batch normalization. Previously, this engine was statically compiled, but it’s now a runtime compiled engine in cuDNN 8.9. For more information, refer to the BNAddRelu pattern listed in the NVIDIA cuDNN Developer Guide.

Fixed Issues

The following issues have been fixed in this release:
  • When cuDNN executed FP8 matrix multiplication and convolution kernels with compute type CUDNN_DATA_FLOAT, the numerical precision was lower than expected. This issue has been fixed in this release.
  • Some engines incorrectly returned a success status when calling cudnnBackendFinalize() with a backend engine descriptor for a bfloat16 problem on hardware that did not support the bfloat16 type. Attempted execution of the problem with cudnnBackendExecute() would return the expected status CUDNN_STATUS_ARCH_MISMATCH. This issue has been fixed in this release.

Known Issues

  • cudnnGetConvolutionForwardAlgorithm_v7(), cudnnGetConvolutionBackwardDataAlgorithm_v7(), and cudnnGetConvolutionBackwardFilterAlgorithm_v7() exhibit a memory leak on NVIDIA Hopper GPUs.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
  • Use of cudnnFusedOpsExecute() on NVIDIA Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights may see a performance regression when moving from cuDNN 8.8 to cuDNN 8.9.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated in this release to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • For packed NCHW tensors using the FP16 datatype, cuDNN attempts to run an optimized kernel if the values of N, C, H, and W are even. In cuDNN versions before 8.4, it is possible that incorrect values are generated if odd values for the strides of N or C are used.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
  • There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
  • When using the cuDNN static library, you will need to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. Refer to the NVIDIA cuDNN Support Matrix for the exact supported CUDA versions.
  • In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
  • Within the cuDNN version 8 backend API, the following engines are known not to be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 4. Engines That Are Not Thread-Safe
    CUDNN_ATTR_ENGINE_GLOBAL_INDEX convolution forward convolution backward data convolution backward filter cudnnConvolutionBiasActivationForward
    A100
    • 36
    • 38
    • 45
    • 46
    • 47
    • 48
    • 1
    • 2
    • 3
    • 19
    • 22
    • 25
    • 26
    • 28
    • 40
    • 46
    • 51
    • 56
    • 57
    • 58
    • 59
    • 60
    • 65
    • 9
    • 10
    • 21
    • 23
    • 33
    • 37
    • 47
    • 48
    • 49
    • 50
    • 4024
    • 4026
    • 4032
    • 4033
    V100
    • 8
    • 9
    • 10
    • 12
    • 16
    • 26
    • 30
    • 31
    • 34
    • 42
    • 49
    • 1
    • 2
    • 3
    • 8
    • 12
    • 13
    • 19
    • 21
    • 25
    • 26
    • 29
    • 37
    • 38
    • 44
    • 6
    • 7
    • 21
    • 35
    • 36
    • 43
    • 44
    • 51
    • 52
    • 4002
    • 4015
    • 4008
    • 4018
    • 4019
    • 4020
    • 4030
    T4
    • 9
    • 12
    • 13
    • 14
    • 16
    • 18
    • 26
    • 30
    • 31
    • 34
    • 42
    • 50
    • 1
    • 2
    • 3
    • 8
    • 9
    • 12
    • 13
    • 15
    • 18
    • 19
    • 21
    • 25
    • 26
    • 29
    • 30
    • 37
    • 39
    • 44
    • 45
    • 12
    • 21
    • 27
    • 28
    • 43
    • 44
    • 51
    • 53
    • 4015
    • 4003
    • 4004
    • 4009
    • 4018
    • 4019
    • 4020
    • 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX =1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this is specifying the ops library after cnn on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.

1.5. cuDNN Release 8.8.1

These are the NVIDIA cuDNN 8.8.1 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Fixed Issues

  • In cuDNN 8.8.0, the runtime fusion engine failed with NVRTC from CUDA 12.1. This bug is fixed in cuDNN 8.8.1, but the limitation still remains that the runtime fusion engine is not forward compatible beyond CUDA 12.1. This limitation will be addressed in a future release.

Known Issues

  • Some engines will incorrectly return a success status when calling cudnnBackendFinalize() with a backend engine descriptor for a bfloat16 problem on hardware that does not support the bfloat16 type. Attempted execution of the problem with cudnnBackendExecute() will return the expected status CUDNN_STATUS_ARCH_MISMATCH.
  • Use of cudnnFusedOpsExecute() on NVIDIA Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • The cuDNN 8.7.0 library may exhibit some slowdowns in wgrad calculation for EfficientDet, EfficientNet, Mask R-CNN, ResNet, ResNeXt, and SSD layers when it was built with CUDA Toolkit 10.2.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated in this release to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • For packed NCHW tensors using the FP16 datatype, cuDNN attempts to run an optimized kernel if the values of N, C, H, and W are even. In cuDNN versions before 8.4, it is possible that incorrect values are generated if odd values for the strides of N or C are used.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
  • There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.
  • On Windows 10, the knob settings of CUDNN_CONVOLUTION_CUTLASS_ANALYTIC_16816_NHWC_ENGINE and CUDNN_CONVOLUTION_IMPLICIT_PRECOMPUTED_GEMM_CUTLASS_16816_NHWC_ENGINE are incorrect. Therefore, these two engines cannot be configured properly. It will impact the performance of convolution cases that use them on a Windows system.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • In cuDNN 8.8.1 for CUDA 12.x, runtime fusion engines will only work with NVRTC from CUDA Toolkit 12.0 and 12.1. It is not forward compatible with future CUDA 12.x Toolkits.
  • Within the cuDNN version 8 backend API, the following engines are known not to be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 5. Engines That Are Not Thread-Safe
    CUDNN_ATTR_ENGINE_GLOBAL_INDEX convolution forward convolution backward data convolution backward filter cudnnConvolutionBiasActivationForward
    A100
    • 36
    • 38
    • 45
    • 46
    • 47
    • 48
    • 1
    • 2
    • 3
    • 19
    • 22
    • 25
    • 26
    • 28
    • 40
    • 46
    • 51
    • 56
    • 57
    • 58
    • 59
    • 60
    • 65
    • 9
    • 10
    • 21
    • 23
    • 33
    • 37
    • 47
    • 48
    • 49
    • 50
    • 4024
    • 4026
    • 4032
    • 4033
    V100
    • 8
    • 9
    • 10
    • 12
    • 16
    • 26
    • 30
    • 31
    • 34
    • 42
    • 49
    • 1
    • 2
    • 3
    • 8
    • 12
    • 13
    • 19
    • 21
    • 25
    • 26
    • 29
    • 37
    • 38
    • 44
    • 6
    • 7
    • 21
    • 35
    • 36
    • 43
    • 44
    • 51
    • 52
    • 4002
    • 4015
    • 4008
    • 4018
    • 4019
    • 4020
    • 4030
    T4
    • 9
    • 12
    • 13
    • 14
    • 16
    • 18
    • 26
    • 30
    • 31
    • 34
    • 42
    • 50
    • 1
    • 2
    • 3
    • 8
    • 9
    • 12
    • 13
    • 15
    • 18
    • 19
    • 21
    • 25
    • 26
    • 29
    • 30
    • 37
    • 39
    • 44
    • 45
    • 12
    • 21
    • 27
    • 28
    • 43
    • 44
    • 51
    • 53
    • 4015
    • 4003
    • 4004
    • 4009
    • 4018
    • 4019
    • 4020
    • 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX =1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this is specifying the ops library after cnn on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.

1.6. cuDNN Release 8.8.0

These are the NVIDIA cuDNN 8.8.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Announcements

  • cuDNN 8.7.0 was the last release supporting NVIDIA Kepler (SM 3.x) devices. It has been removed in cuDNN 8.8.0.
  • cuDNN 8.8.0 changed the linking procedure of NVRTC in the static build. cuDNN 8.8.0 requires NVRTC to be statically linked in the static build rather than the previous dynamic linking. This will also remove support for static linking of cuDNN with CUDA toolkits < 11.5. Refer to the updated instructions in the NVIDIA cuDNN Installation Guide. There were no changes for the linking procedure in the dynamic build.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • This release adds CUDA 12 support.
  • Added three precompiled engines for instance normalization: instance normalization forward training engine, instance normalization forward inference engine, and instance normalization backward engine.
  • Added three precompiled engines for layer normalization: layer normalization forward training engine, layer normalization forward inference engine, and layer normalization backward engine.
  • Added further fusion patterns possible in Mha-Fprop fusions, which target causal masking, relative positional embedding bias, and cross attention giving cuDNN full support for the T5 model. For more information, refer to the Mha-Fprop fusion pattern in the NVIDIA cuDNN Developer Guide.
  • Added support for Mha-Bprop fusions using the runtime fusion engine, targeting MHA training. For more information, refer to the Mha-Bprop fusion pattern in the NVIDIA cuDNN Developer Guide. The cuDNN implementation provides a speedup of ~2x-3x in BERT and T5 patterns in training over native unfused PyTorch. Samples can also be found in the cuDNN frontend repository.
  • Double precision is supported for the CTC loss in the cuDNN 8.x API by cudnnCTCLoss_v8(), cudnnSetCTCLossDescriptor_v8() and cudnnGetCTCLossWorkspaceSize_v8(), and in all other cuDNN API versions by cudnnCTCLoss(), cudnnSetCTCLossDescriptor() and cudnnGetCTCLossWorkspaceSize().
  • CUDNN_HEUR_MODE_A supports the following new operation graphs: ConvBNfprop, ConvBNwgrad, and DgradDreluBNBwdWeight.
  • The dBNapply and DualdBNapply patterns are now generally supported by the runtime fusion engine. Heuristics query for these two patterns are also now supported. This provides more flexible support in data types and also provides a 5% speed up compared to the original pre-compiled specialized engines for these patterns across a wide range of workloads.
  • Improved performance for 1D NHWC depthwise convolution.
  • Improved performance for Tensor Core accelerated FP32 operations on NVIDIA Hopper.

Fixed Issues

  • Use of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for convolution, backward data, and backward filter batch normalization fusions resulted in a performance regression in cuDNN v8.7 on NVIDIA Ampere architecture. This has been improved upon in this release.
  • For NCHW spatially packed input tensors, cudnnBatchNormalizationForwardInference() with mode CUDNN_BATCHNORM_SPATIAL or CUDNN_BATCHNORM_SPATIAL_PERSISTENT can now support up to size 2147483136 (2^31-512) in the spatial dimension (D)HW. For NCHW spatially packed input tensors, cudnnBatchNormalizationBackward() with mode CUDNN_BATCHNORM_SPATIAL or CUDNN_BATCHNORM_SPATIAL_PERSISTENT can now support up to size 2147483136 (2^31-512) in the spatial dimension (D)HW for some cases.
  • Use of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for batch normalization forwards training and batch normalization backwards could generate illegal memory access errors for large tensors. This is fixed in this release by limiting the input tensor size to be less than or equal to 2^31.
  • Documentation was updated for the API function cudnnReorderFilterAndBias() to list other possible return statuses and what would cause them to be returned.
  • The cuDNN static builds no longer dynamically load NVRTC, and therefore requires static linking of NVRTC instead.
  • Fixed incorrect RNN results on Pascal family GPUs with 6.0 or 6.1 compute capability, also in the CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H algorithm, in the RNN forward and backward APIs, when the input data minibatch was divisible by eight, and for certain hidden size values.

Known Issues

  • Some engines will incorrectly return a success status when calling cudnnBackendFinalize() with a backend engine descriptor for a bfloat16 problem on hardware that does not support the bfloat16 type. Attempted execution of the problem with cudnnBackendExecute() will return the expected status CUDNN_STATUS_ARCH_MISMATCH.
  • Use of cudnnFusedOpsExecute() on NVIDIA Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • The cuDNN 8.7.0 library may exhibit some slowdowns in wgrad calculation for EfficientDet, EfficientNet, Mask R-CNN, ResNet, ResNeXt, and SSD layers when it was built with CUDA Toolkit 10.2.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated in this release to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • For packed NCHW tensors using the FP16 datatype, cuDNN attempts to run an optimized kernel if the values of N, C, H, and W are even. In cuDNN versions before 8.4, it is possible that incorrect values are generated if odd values for the strides of N or C are used.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
  • There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.
  • On Windows 10, the knob settings of CUDNN_CONVOLUTION_CUTLASS_ANALYTIC_16816_NHWC_ENGINE and CUDNN_CONVOLUTION_IMPLICIT_PRECOMPUTED_GEMM_CUTLASS_16816_NHWC_ENGINE are incorrect. Therefore, these two engines cannot be configured properly. It will impact the performance of convolution cases that use them on a Windows system.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • cuDNN 8.8.0 runtime fusion will only work with NVRTC from CUDA ToolKit 12.0. It is not forward compatible with CUDA Toolkit 12.1 and later.
  • Within the cuDNN version 8 backend API, the following engines are known not to be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 6. Engines That Are Not Thread-Safe
    CUDNN_ATTR_ENGINE_GLOBAL_INDEX convolution forward convolution backward data convolution backward filter cudnnConvolutionBiasActivationForward
    A100
    • 36
    • 38
    • 45
    • 46
    • 47
    • 48
    • 1
    • 2
    • 3
    • 19
    • 22
    • 25
    • 26
    • 28
    • 40
    • 46
    • 51
    • 56
    • 57
    • 58
    • 59
    • 60
    • 65
    • 9
    • 10
    • 21
    • 23
    • 33
    • 37
    • 47
    • 48
    • 49
    • 50
    • 4024
    • 4026
    • 4032
    • 4033
    V100
    • 8
    • 9
    • 10
    • 12
    • 16
    • 26
    • 30
    • 31
    • 34
    • 42
    • 49
    • 1
    • 2
    • 3
    • 8
    • 12
    • 13
    • 19
    • 21
    • 25
    • 26
    • 29
    • 37
    • 38
    • 44
    • 6
    • 7
    • 21
    • 35
    • 36
    • 43
    • 44
    • 51
    • 52
    • 4002
    • 4015
    • 4008
    • 4018
    • 4019
    • 4020
    • 4030
    T4
    • 9
    • 12
    • 13
    • 14
    • 16
    • 18
    • 26
    • 30
    • 31
    • 34
    • 42
    • 50
    • 1
    • 2
    • 3
    • 8
    • 9
    • 12
    • 13
    • 15
    • 18
    • 19
    • 21
    • 25
    • 26
    • 29
    • 30
    • 37
    • 39
    • 44
    • 45
    • 12
    • 21
    • 27
    • 28
    • 43
    • 44
    • 51
    • 53
    • 4015
    • 4003
    • 4004
    • 4009
    • 4018
    • 4019
    • 4020
    • 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX =1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this is specifying the ops library after cnn on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.

Deprecated and Removed Features

The following features are deprecated in cuDNN 8.8.0:
  • cuDNN 8.7.0 was the last release supporting NVIDIA Kepler (SM 3.x) devices. It has been removed in cuDNN 8.8.0.

1.7. cuDNN Release 8.7.0

These are the NVIDIA cuDNN 8.7.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

  • Added the cudnnRngDistribution_t, CUDNN_BACKEND_OPERATION_RNG_DESCRIPTOR and CUDNN_BACKEND_RNG_DESCRIPTOR functions to the Backend API. This new operation helps a cuDNN graph to create a tensor using a probability distribution which can then be used as an input to other operations. For example, it can be used as a mask in dropout. Currently, it has limited support via the runtime fusion engine in particular patterns. For more information refer to the NVIDIA cuDNN Developer Guide. Support will be extended in future versions. For more information, refer to the NVIDIA cuDNN Backend API documentation.
  • Added support for MatMul-MatMul fusions via the runtime fusion engine, targeting MHA inference. For more information, refer to the MatMul-MatMul fusion pattern in the NVIDIA cuDNN Developer Guide. The cuDNN implementation provides a speedup of ~4x-4.5x in BERT and T5 patterns in inference over native unfused PyTorch.
  • Added native NVIDIA Hopper support for matrix multiplication and its fusions in FP16 mixed precision (FP16 I/O with FP32 compute), which improves performance of matmul ops on Hopper compared to cuDNN 8.6.0.
  • Added FP8 input support for convolution backward data and backward weights operations, with two possible compute precision types CUDNN_DATA_FLOAT and CUDNN_DATA_FAST_FLOAT_FOR_FP8 (faster but lower precision).
  • cudnnPoolingBackward() enables both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.
  • Added support for ConvBNfprop and ConvBNwgrad fusion patterns on NVIDIA Hopper GPU’s. For more information, refer to the ConvBNfprop and ConvBNwgrad patterns listed in the NVIDIA cuDNN Developer Guide.
  • Additional tensor layout support was added for the forward and backwards resampling modes that were added in cuDNN 8.6.0.

Fixed Issues

  • The backend engine 6000 was not respecting the CUDNN_KNOB_TYPE_TILE_SIZE knob that was passed by the user and it had runtime failures on Windows. Both of these issues have been fixed in this release.
  • In CUDA graph capture mode, CUDA streams internal to cuDNN were not guaranteed to have the same priority as the user stream that is set by cudnnSetStream(). This issue is fixed in cuDNN 8.7.0, but requires CUDA 11.8 or later.
  • On Turing, Volta, Kepler, and Maxwell GPUs, 3D convolutions used to exhibit some slowdowns when the padding size was larger than the filter size. 2D convolutions used to encounter an illegal memory access error when the padding size was larger than the filter size and the horizontal stride was larger than 1. These issues have been fixed in this release.
  • The performance of the runtime fusion engine was suboptimal on Windows. This has been fixed in this release.
  • Users of cuDNN's CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING could see CUDNN_STATUS_BAD_PARAM returned for a problem that should otherwise be supported by that choice of algo. This has been fixed in this release.
  • cudnnDropoutForward() and cudnnDropoutBackward() would return incorrect results when input and/or output tensors have overlapping strides. This issue has been fixed in this release.
  • Users of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for batch normalization forwards training and batch normalization backwards could obtain incorrect results when the batch size was greater than 1 and when channel count was not evenly divisible by 8. These values of CUDNN_ATTR_ENGINE_GLOBAL_INDEX correspond to newly added multi-GPU batch normalization support within cuDNN 8.5. Use of single-GPU batch normalization was unaffected by this issue. This limitation has been fixed in this release.
  • In cuDNN 8.5 built with CUDA 11.x, all RNN APIs started to use internal CUDA streams of the same priority as the user stream passed through the cudnnSetStream() function. This update introduced a bug. When the internal heuristic decided to transpose RNN weights in cudnnRNNForward()or corresponding, deprecated functions: cudnnRNNForwardInference(), cudnnRNNForwardTraining(), cudnnRNNForwardInferenceEx(), cudnnRNNForwardTrainingEx(), and cudnnRNNAlgo_t was set to CUDNN_RNN_ALGO_STANDARD, the matrix transposition was performed in the synchronous stream 0 instead of a CUDA stream of the same priority as user stream. This caused serial execution of transpose kernels and excessive synchronization in the initial stage of the forward API. The bug affected cuDNN 8.5 and 8.6 built with both CUDA 11.x and CUDA 10.2. This issue has been fixed in this release.
  • cudnnNormalizationForwardTraining() did not support BFLOAT16. If an input tensor used BFLOAT16, the API would return BAD_PARAM. This issue has been fixed in this release.

Known Issues

  • Use of cudnnFusedOpsExecute() on Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • The cuDNN 8.7.0 library may exhibit some slowdowns in wgrad calculation for EfficientDet, EfficientNet, Mask R-CNN, ResNet, ResNeXt, and SSD layers when it was built with CUDA Toolkit 10.2.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated in this release to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • For packed NCHW tensors using the FP16 datatype, cuDNN attempts to run an optimized kernel if the values of N, C, H, and W are even. In cuDNN versions before 8.4, it is possible that incorrect values are generated if odd values for the strides of N or C are used.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
  • There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.
  • On Windows 10, the knob settings of CUDNN_CONVOLUTION_CUTLASS_ANALYTIC_16816_NHWC_ENGINE and CUDNN_CONVOLUTION_IMPLICIT_PRECOMPUTED_GEMM_CUTLASS_16816_NHWC_ENGINE are incorrect. Therefore, these two engines cannot be configured properly. It will impact the performance of convolution cases that use them on a Windows system.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Within the cuDNN version 8 backend API, the following engines are known to not be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 7. Engines That Are Not Thread-Safe
    CUDNN_ATTR_ENGINE_GLOBAL_INDEX convolution forward convolution backward data convolution backward filter cudnnConvolutionBiasActivationForward
    A100
    • 36
    • 38
    • 45
    • 46
    • 47
    • 48
    • 1
    • 2
    • 3
    • 19
    • 22
    • 25
    • 26
    • 28
    • 40
    • 46
    • 51
    • 56
    • 57
    • 58
    • 59
    • 60
    • 65
    • 9
    • 10
    • 21
    • 23
    • 33
    • 37
    • 47
    • 48
    • 49
    • 50
    • 4024
    • 4026
    • 4032
    • 4033
    V100
    • 8
    • 9
    • 10
    • 12
    • 16
    • 26
    • 30
    • 31
    • 34
    • 42
    • 49
    • 1
    • 2
    • 3
    • 8
    • 12
    • 13
    • 19
    • 21
    • 25
    • 26
    • 29
    • 37
    • 38
    • 44
    • 6
    • 7
    • 21
    • 35
    • 36
    • 43
    • 44
    • 51
    • 52
    • 4002
    • 4015
    • 4008
    • 4018
    • 4019
    • 4020
    • 4030
    T4
    • 9
    • 12
    • 13
    • 14
    • 16
    • 18
    • 26
    • 30
    • 31
    • 34
    • 42
    • 50
    • 1
    • 2
    • 3
    • 8
    • 9
    • 12
    • 13
    • 15
    • 18
    • 19
    • 21
    • 25
    • 26
    • 29
    • 30
    • 37
    • 39
    • 44
    • 45
    • 12
    • 21
    • 27
    • 28
    • 43
    • 44
    • 51
    • 53
    • 4015
    • 4003
    • 4004
    • 4009
    • 4018
    • 4019
    • 4020
    • 4031
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The cuDNN static builds load NVRTC dynamically when using the runtime fusion engine.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX =1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of cudnn_cnn_infer_static.a may need to update their application linkage so that symbols absent in that library are subsequently made available with cudnn_ops_infer_static.a. On Linux, this is specifying the ops library after cnn on the linker line. The same applies to cudnn_cnn_train_static.a and cudnn_ops_train_static.a.

1.8. cuDNN Release 8.6.0

These are the NVIDIA cuDNN 8.6.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Added support for the NVIDIA Hopper™ (H100) architecture.
  • Added support for FP8 on H100 using the runtime fusion engine. Support is currently limited to convolution forward. We will release code examples in the cuDNN frontend GitHub repository shortly.
  • Added support for the NVIDIA Ada Lovelace architecture.
  • Added support for the following new resampling modes:
    • Resampling forward: CUDNN_RESAMPLE_AVGPOOL_EXCLUDE_PADDING
    • Resampling backward: CUDNN_RESAMPLE_AVGPOOL_EXCLUDE_PADDING, CUDNN_RESAMPLE_AVGPOOL_INCLUDE_PADDING, and CUDNN_RESAMPLE_MAXPOOL

Fixed Issues

The following issues have been fixed in this release:
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX 58 for forward convolution, 63 for backwards data, and 62 for backwards filter used to falsely advertise the Tensor Core numerical note on SM 7.2 and SM 7.5 when running FP32 input, FP32 output, and FP32 accumulation convolutions. They are fixed in this release and correctly advertise non Tensor Core numerical notes.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX 58 for forward convolution, 63 for backwards data, and 62 for backwards filter used to allow tensor alignments of less than 16 bytes. To execute the advertised tensor core property, they have been fixed to require 16 byte alignment.
  • With the cuDNN version 8 backend API, CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 for forward convolution is not thread-safe when being executed simultaneously with multi-threads that share the same execution plan. This issue has been fixed in this release.
  • cudNN 8.5.0 introduced the approximated version of GELU for both forward and backward paths as new pointwise modes. While CUDNN_POINTWISE_GELU_APPROX_TANH_FWD introduced a performance improvement over CUDNN_POINTWISE_GELU_FWD, CUDNN_POINTWISE_GELU_APPROX_TANH_BWD was showing regressions compared to CUDNN_POINTWISE_GELU_BWD. This has now been addressed, and the approximated version of backward GELU is now showing slight improvements.

Known Issues

  • On Turing, Volta, Kepler, and Maxwell GPUs, 3D convolutions may exhibit some slowdowns when the padding size is larger than the filter size. 2D convolutions may encounter an illegal memory access error when the padding size is larger than the filter size and the horizontal stride is larger than 1. This issue will be resolved in the next release.
  • The performance of the runtime fusion engine is suboptimal on Windows.
  • Use of cudnnFusedOpsExecute() on Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
  • The cuDNN 8.6.0 library may exhibit some slowdowns in wgrad calculation for EfficientDet, EfficientNet, Mask R-CNN, ResNet, ResNeXt, and SSD layers when it was built with CUDA Toolkit 10.2.
  • cudnnNormalizationForwardTraining() does not currently support BFLOAT16. If an input tensor uses BFLOAT16, the API will return BAD_PARAM.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in the next CUDA Toolkit update.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated in this release to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • For packed NCHW tensors using the FP16 datatype, cuDNN attempts to run an optimized kernel if the values of N, C, H, and W are even. In cuDNN versions before 8.4, it is possible that incorrect values are generated if odd values for the strides of N or C are used.
  • Users of cuDNN's CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING may see CUDNN_STATUS_BAD_PARAM returned for a problem that should otherwise be supported by that choice of algo.
  • cudnnDropoutForward() and cudnnDropoutBackward() will return incorrect results when input or output tensors have overlapping strides.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • In CUDA graph capture mode, CUDA streams internal to cuDNN are not guaranteed to have the same priority as the user stream that is set by cudnnSetStream().
  • cudnnPoolingBackward() enables both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
  • There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Within the cuDNN version 8 backend API, the following engines are known to not be thread-safe when executed simultaneously with multiple threads sharing the same execution plan:
    Table 8. Engines That Are Not Thread-Safe
    CUDNN_ATTR_ENGINE_GLOBAL_INDEX Fprop Dgrad Wgrad
    A100
    • 36
    • 38
    • 45
    • 46
    • 47
    • 48
    • 1
    • 2
    • 3
    • 19
    • 22
    • 25
    • 26
    • 28
    • 40
    • 46
    • 51
    • 56
    • 57
    • 58
    • 59
    • 60
    • 65
    • 9
    • 10
    • 21
    • 23
    • 33
    • 37
    • 47
    • 48
    • 49
    • 50
    V100
    • 8
    • 9
    • 10
    • 12
    • 16
    • 26
    • 30
    • 31
    • 34
    • 42
    • 49
    • 1
    • 2
    • 3
    • 8
    • 12
    • 13
    • 19
    • 21
    • 25
    • 26
    • 29
    • 37
    • 38
    • 44
    • 6
    • 7
    • 21
    • 35
    • 36
    • 43
    • 44
    • 51
    • 52
    T4
    • 9
    • 12
    • 13
    • 14
    • 16
    • 18
    • 26
    • 30
    • 31
    • 34
    • 42
    • 50
    • 1
    • 2
    • 3
    • 8
    • 9
    • 12
    • 13
    • 15
    • 18
    • 19
    • 21
    • 25
    • 26
    • 29
    • 30
    • 37
    • 39
    • 44
    • 45
    • 12
    • 21
    • 27
    • 28
    • 43
    • 44
    • 51
    • 53
  • The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
  • The cuDNN static builds load NVRTC dynamically when using the runtime fusion engine.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX =1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • Versions of cuDNN before the 8.0 release series do not support the NVIDIA Ampere Architecture and will generate incorrect results if used on that architecture. Furthermore, if used, training operations can succeed with a NaN loss for every epoch.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for batch normalization forwards training and batch normalization backwards may obtain incorrect results when batch size is greater than 1 and when channel count is not evenly divisible by 8. These values of CUDNN_ATTR_ENGINE_GLOBAL_INDEX correspond to newly added multi-GPU batch normalization support within cuDNN 8.5. Use of single-GPU batch normalization is unaffected by this issue. cuDNN will be revised to reject incorrectly supported multi-GPU batch normalization problems in a future release.

1.9. cuDNN Release 8.5.0

These are the NVIDIA cuDNN 8.5.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Achieved 30% reduction in library size by removing unused kernels. The current cuDNN 8.5.0 library size is 850 MB down from 1.2 GB compared to the 8.4.x releases.
  • Four new pointwise modes were added:
    • CUDNN_POINTWISE_GELU_APPROX_TANH_FWD and CUDNN_POINTWISE_GELU_APPROX_TANH_BWD, which are used for approximating GELU in the forward and backward pass, respectively.
    • CUDNN_POINTWISE_ERF, which can be used to piecewise create the GELU operator.
    • CUDNN_POINTWISE_IDENTITY, which can be used for explicitly converting between formats.
  • Improved graph API runtime compilation support:
    • Added support for performant adaptive pooling for NHWC layout supporting flexible I/O datatypes. It also supports large tensors with more than 4 trillion elements.
    • Added support for passing host scalars by value as B tensor in the pointwise operations.
    • Added support for generating code for more broadcasting patterns in the pointwise operations.
    • The CPU overhead associated with the subgraph execution has been reduced by 30 - 40%.
    • NVIDIA Ampere Architecture INT8 conv fusion heuristics have been updated to recommend more performant kernel configs for smaller problem sizes.
  • Added support for error reporting in the RNN APIs.
  • When using cuDNN builds against CUDA 11.x with cuBLAS version >= 11.6 U1, all kernels are now guaranteed to be launched in streams whose priorities match the user stream that is set by cudnnSetStream().
  • Documented operation specific constraints for the runtime fusion engine in the newly added Operation Specific Constraints for the Runtime Fusion Engine section.
  • Double precision support for the CTC loss.
  • Added support for Ubuntu 22.04 on x86_64 and AArch64 ARM. For more information, refer to the supported Linux versions of cuDNN section.
  • Added support for CUDA 11.7. For more information, refer to the GPU, CUDA Toolkit, and CUDA Driver Requirements section.
  • Added the cudnnBackendNormFwdPhase_t, cudnnBackendNormMode_t, CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR, CUDNN_BACKEND_OPERATION_NORM_BACKWARD_DESCRIPTOR, cudnnSignalMode_t, CUDNN_BACKEND_OPERATION_CONCAT_DESCRIPTOR, and CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR functions to the Backend API. These new operations help a cuDNN graph communicate and/or synchronize with another cuDNN graph possibly on a peer GPU. For more information, refer to the NVIDIA cuDNN Backend API documentation.
  • Added new data structure cudnnFraction_t to the Backend API. This more precisely describes the size ratio between the I/O images under fractional up/downsampling and adaptive pooling use cases. For more information, refer to the NVIDIA cuDNN Backend API documentation.

Fixed Issues

  • For packed NCHW tensors using the FP16 data-type, cuDNN attempted to run an optimized kernel if the values of N, C, H, and W were even. In cuDNN versions prior to 8.5, it was possible that incorrect values were generated if odd values for N or C were used. Starting in cuDNN 8.5, if an odd value for N or C is specified, cuDNN runs with an unoptimized kernel.
  • cuDNN was not enforcing the CUDNN_ATTR_EXECUTION_PLAN_HANDLE attribute for the CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR. It is now enforced in cuDNN 8.5.0. cudnnBackendFinalize() returns CUDNN_STATUS_BAD_PARAM if the handle attribute is not set.
  • Running depthwise convolutions in NHWC layout with CUDNN_CONVOLUTION mode and batch size >= 8 could produce incorrect results with cuDNN 8.1 and later. This has been fixed in this release.
  • Cases of folding transform which were not supported were erroring out with BAD_PARAM, this has been fixed to return the correct error code of NOT_SUPPORTED.
  • Improved runtime fusion heuristics for INT8 convolution, correcting small problem sizes.
  • Fixed an issue to ensure CUDNN_HEUR_MODE_B redirects to CUDNN_HEUR_MODE_A when unsupported. Frontend version 0.6.3 has been updated with a similar change to redirect CUDNN_HEUR_MODE_B to CUDNN_HEUR_MODE_A in older cuDNN versions when CUDNN_HEUR_MODE_B is not supported.
  • Fixed an issue in the runtime fusion engine where successive broadcasting patterns (for example, scalars broadcasting into vectors, then broadcasting into tensors) are not handled correctly and may produce wrong results.
  • In the build for CUDA 11.x, we fixed a couple of issues where some cuDNN internal streams were not guaranteed to match the priority of the stream set by cudnnSetStream. Now, all internal streams have that guarantee, except in the case of CUDA graph capture mode.
  • It was suggested that users of the static library requiring the best possible convolution performance use whole-archive linking with the cnn_infer and cnn_train static sub-libraries. This is no longer needed, however, this will come at a cost to the binary size of the application. This linkage requirement will be relaxed in a future release.

Performance Results

The following table shows the average speed-up of unique cuDNN 3D convolution calls for each network on V100 and A100 GPUs that satisfies the conditions in Recommended Settings section of the cuDNN Developer Guide. The end-to-end training performance will depend on a number of factors, such as framework overhead, kernel run time, and model architecture type.
Table 9. cuDNN version 8.5.0 compared to 8.4.1
Model Batchsize A100 8.5.0 vs V100 8.4.1 V100 8.5.0 vs V100 8.4.1
FP16 FP32 FP16 FP32
V-Net (3D-Image segmentation) 2 1.1x 2.9x 1.0x 1.0x
8 1.4x 3.4x 1.0x 1.0x
16 1.6x 3.8x 1.0x 1.1x
32 1.8x 3.7x 1.0x 1.0x
3D-UNet (3D-Image Segmentation) 2 2.1x 6.0x 1.0x 1.2x
4 2.1x 5.7x 1.0x 1.4x

Known Issues

  • cudnnNormalizationForwardTraining() does not currently support BFLOAT16. If an input tensor uses BFLOAT16, the API will return BAD_PARAM.
  • With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in the next CUDA Toolkit update.
  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated in this release to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • For packed NCHW tensors using the FP16 datatype, cuDNN attempts to run an optimized kernel if the values of N, C, H, and W are even. In cuDNN versions before 8.4, it is possible that incorrect values are generated if odd values for the strides of N or C are used. This issue will be resolved in a future release.
  • Users of cuDNN's CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING may see CUDNN_STATUS_BAD_PARAM returned for a problem that should otherwise be supported by that choice of algo.
  • cudnnDropoutForward() and cudnnDropoutBackward() will return incorrect results when input or output tensors have overlapping strides.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • In CUDA graph capture mode, CUDA streams internal to cuDNN are not guaranteed to have the same priority as the user stream that is set by cudnnSetStream().
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero, however, is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • cudnnPoolingBackward() enables both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.
  • There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
  • There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the NVIDIA cuDNN Support Matrix.

Limitations

  • When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1025 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX =1001 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • Versions of cuDNN before the 8.0 release series do not support the NVIDIA Ampere Architecture and will generate incorrect results if used on that architecture. Furthermore, if used, training operations can succeed with a NaN loss for every epoch.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Users of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 11000 and 12000 may obtain incorrect results when batch size is greater than 1 and when channel count is not evenly divisible by 8. These values of CUDNN_ATTR_ENGINE_GLOBAL_INDEX correspond to newly added multi-GPU batch normalization support within cuDNN 8.5. Use of single-GPU batch normalization is unaffected by this issue. cuDNN will be revised to reject incorrectly supported multi-GPU batch normalization problems in a future release.

1.10. cuDNN Release 8.4.1

These are the NVIDIA cuDNN 8.4.1 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Improved runtime subgraph compilation support
    • Added support for CUDNN_BACKEND_OPERATION_RESAMPLE_FWD for the CUDNN_ATTR_RESAMPLE_MODE set to CUDNN_RESAMPLE_AVGPOOL and CUDNN_RESAMPLE_MAXPOOL through the runtime fusion engine. It can achieve up to 3x speed up compared to the legacy cudnnPoolingForward() API. Pointwise fusions to the output of this operation are also supported. Documentation about the patterns supported can be found in the Supported Graph Patterns section.
    • Newly added micro tile sizes for pointwise fusions that provide significantly improved performance on smaller problem sizes.
  • The NVIDIA cuDNN Developer Guide now includes an expanded section on supported patterns of the Graph API. It takes a systematic approach to explain which graph patterns are supported, along with various graphical examples, and details on some of the restrictions.

Fixed Issues

  • A buffer was shared between threads and caused segmentation faults. There was previously no way to have a per-thread buffer to avoid these segmentation faults. The buffer has been moved to the cuDNN handle. Ensure you have a cuDNN handle for each thread because the buffer in the cuDNN handle is only for the use of one thread and cannot be shared between two threads.
  • Fixed operation graph logging under cudnnBackendExecuteGraphVisualize() section upon calling cudnnBackendExecute() on generic fusion patterns. Added logging for CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR and CUDNN_BACKEND_MATMUL_DESCRIPTOR. Fixed logging for pointwise mode to show the enum value name.
  • Users specifying backend engines 58, 1063, 2062, and 4039 using CUDNN_ATTR_ENGINE_GLOBAL_INDEX with 1x1 convolutions and tensors with more than two GB elements (2G) would see CUDNN_STATUS_EXECUTION_FAILED in cuDNN 8.3.x. This issue has been fixed in this release.
  • cuDNN returned CUDNN_STATUS_EXECUTION_FAILED from cudnnConvolutionForward(), cudnnConvolutionBiasActivationForward(), or cudnnConvolutionBackwardData() when computing convolutions with large spatial dimensions and batch sizes. This issue has been fixed. Such problems instead return CUDNN_STATUS_NOT_SUPPORTED where applicable.

Known Issues

  • A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated in this release to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors.
  • cuDNN is not enforcing the CUDNN_ATTR_EXECUTION_PLAN_HANDLE attribute for the CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR. This issue will be fixed in a future release.
  • If cuDNN 8.4.1 or earlier statically links with libcudart.so from the CUDA Toolkit 11.7 or later, when the LFL feature is activated, the results from cudnnFind*Algo will not be accurate.
  • For packed NCHW tensors using the FP16 datatype, cuDNN attempts to run an optimized kernel if the values of N, C, H, and W are even. In cuDNN versions before 8.4, it is possible that incorrect values are generated if odd values for the strides of N or C are used. This issue will be resolved in a future release.
  • Users of cuDNN's CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING may see CUDNN_STATUS_BAD_PARAM returned for a problem that should otherwise be supported by that choice of algo.
  • cudnnDropoutForward() and cudnnDropoutBackward() will return incorrect results when input or output tensors have overlapping strides.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • CUDA streams internal to cuDNN are not guaranteed to have the same priority as the user stream that is set by cudnnSetStream(). We recently discovered some issues that break our ability to document exceptions to this clearly.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero, however, is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • cudnnPoolingBackward() enables both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.
  • Users of the static library requiring the best possible convolution performance should use whole-archive linking with the cnn_infer and cnn_train static sub libraries. This will come at a cost to the binary size of the application. This linkage requirement will be relaxed in a future release.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the NVIDIA cuDNN Support Matrix.

Limitations

  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1025 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1001 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • Versions of cuDNN before the 8.0 release series do not support the NVIDIA Ampere Architecture and will generate incorrect results if used on that architecture. Furthermore, if used, training operations can succeed with a NaN loss for every epoch.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the nondeterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.

1.11. cuDNN Release 8.4.0

These are the NVIDIA cuDNN 8.4.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • API additions
    • Added API and support for the GEN_INDEX capability. CUDNN_POINTWISE_GEN_INDEX returns the position of an element in an input tensor along a given axis. This operation is similar to NumPy’s mesh grid operation as it returns a tensor with the index of all elements calculated according to the specified axis in the original tensor dimensions.
    • Added API and support for the BINARY_SELECT capability. CUDNN_POINTWISE_BINARY_SELECT is similar to the ternary operation and selects between two input elements based on a predicate element.
    • Experimentally supports serialization of execution plans to or from a string representation to enable the user to avoid recompilation of the fusion kernels. This feature only supports the runtime fusion engine currently. Generalized support for additional engines is planned for future releases.
  • Runtime fusion engine improvements
    • Previous versions of the runtime fusion engine only supported a minimum 128-bit alignment for tensors in all the operations. From this release onwards, the minimum alignment requirement has been relaxed down to 32 bit for input tensors in matrix multiplication and convolution for NVIDIA Ampere Architecture GPUs. For output tensors in any operation and input tensors for pointwise operations, the minimum alignment requirement has been relaxed down to 8 bit.
    • Added support for ARM servers.
  • Documentation improvements functions:

Fixed Issues

  • Users of cuDNN’s CUDNN_ATTR_ENGINE_GLOBAL_INDEX when set to 3000 previously could experience a floating point exception when the filter size (filter width * filter height) is greater than or equal to 32. This issue is fixed in this release.
  • Users of cuDNN's CUDNN_ATTR_ENGINE_GLOBAL_INDEX when set to 58, 1063, or 2062 may now use the knob count CUDNN_KNOB_TYPE_WORKSPACE to set the allowable workspace of these engines.
  • The documentation of cudnnNormalizationForwardInference() and cudnnBatchNormalizationForwardInference() has been improved for clarity.
  • Previous versions of cuDNN may produce wrong results when used to compute a matrix multiplication or fusions containing a matrix multiplication on NVIDIA Ampere Architecture based GPUs. This issue has been fixed in this release.

Known Issues

  • Users of cuDNN's CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING may see CUDNN_STATUS_BAD_PARAM returned for a problem that should otherwise be supported by that choice of algo.
  • cudnnConvolutionForward(), cudnnConvolutionBiasActivationForward(), and cudnnConvolutionBackwardData() may generate illegal memory address errors on the NVIDIA Volta and NVIDIA Turing architectures. This issue existed in previous 8.3 releases as well.
  • cudnnDropoutForward() and cudnnDropoutBackward() will return incorrect results when input or output tensors have overlapping strides.
  • Users specifying backend engines 58, 1063, 2062, and 4039 using CUDNN_ATTR_ENGINE_GLOBAL_INDEX with 1x1 convolutions and tensors with more than two GB elements (2G) will see CUDNN_STATUS_EXECUTION_FAILED in cuDNN 8.3.x.
  • cuDNN may return CUDNN_STATUS_EXECUTION_FAILED from cudnnConvolutionForward(), cudnnConvolutionBiasActivationForward(), or cudnnConvolutionBackwardData() when computing convolutions with large spatial dimensions and batch sizes. This issue will be addressed in a future release so that such problems will instead return CUDNN_STATUS_NOT_SUPPORTED where applicable.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1025 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine results in incorrect results.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • CUDA streams internal to cuDNN are not guaranteed to have the same priority as the user stream that is set by cudnnSetStream(). We recently discovered some issues that break our ability to document exceptions to this clearly.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero, however, is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • cudnnPoolingBackward() enables both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.
  • Users of the static library requiring the best possible convolution performance should use whole-archive linking with the cnn_infer and cnn_train static sub libraries. This will come at a cost to the binary size of the application. This linkage requirement will be relaxed in a future release.
  • Users of cuDNN 8.4.0 may observe a slowdown in the Single Shot Multibox Detector (SSD) model. This will be fixed in a future release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the NVIDIA cuDNN Support Matrix.

Limitations

  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1001 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • Versions of cuDNN before the 8.0 release series do not support the NVIDIA Ampere Architecture and will generate incorrect results if used on that architecture. Furthermore, if used, training operations can succeed with a NaN loss for every epoch.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.

1.12. cuDNN Release 8.3.3

These are the NVIDIA cuDNN 8.3.3 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Various improvements were made in the runtime fusion engine:
    • Added heuristics for convolution + x fusion and matmul + x fusion for NVIDIA Volta and NVIDIA Turing architectures.
    • Updated the heuristics for matmul + x fusion for NVIDIA Ampere Architecture.
    • Small performance improvement for the matmul + x fusion.
    • Compilation time reduction.
  • Improved the performance for NHWC INT8 max pooling.
  • Updated the list of supported enums in the following data type references:
  • Updated and migrated the content from the Best Practices For Using cuDNN 3D Convolutions to the NVIDIA cuDNN Developer Guide. The Best Practices document has been deprecated.

Fixed Issues

The following issues have been fixed in this release:
  • Fixed an issue when fusing pointwise operation with a scalar (that is a [1, 1, 1, 1] shaped tensor) at the output of a matmul or a convolution. When the output is of integer type, the results may be inaccurate or wrong (due to float to INT8 truncation). After the fix, it will properly round to nearest with clamping.
  • Fixed an issue inside the batch norm finalize descriptor where an implementation detail was erroneously logged. Such unexpected access could intermittently cause a segment fault.
  • Convolution batch norm fusion engines invoked through the graph API only worked with cudnnBackendTensor descriptors with dimensions specified in “n,c,g,h,w” format. This has been fixed and cudnnBackendTensors with dimensions specified any of “n,c,h,w” and “n,c,g,h,w” formats can now be passed.
  • Fixed a numerical overflow issue in the computation of softplus activation function in the runtime fusion engine that was resulting in log(exp(x)) being computed as infinity for sufficiently large positive values of x.
  • In previous releases, cudnnTransformFilter() andcudnnTransformTensorEx() could produce wrong values at some pixels when doing a folding transform. This has been fixed in the current release.
  • Documentation has been updated for pooling forward and backward API functions. The documentation now discusses which data types and vectorizations are supported for the tensor descriptor arguments (this information was previously incomplete). For more information, refer to the cudnnPoolingForward() and cudnnPoolingBackward().

Known Issues

  • cudnnConvolutionForward(), cudnnConvolutionBiasActivationForward(), and cudnnConvolutionBackwardData() may generate illegal memory address errors on the NVIDIA Volta and NVIDIA Turing architectures. This issue existed in previous 8.3 releases as well.
  • cudnnDropoutForward() and cudnnDropoutBackward() will return incorrect results when input or output tensors have overlapping strides.
  • Users specifying backend engines 58, 1063, 2062, and 4039 using CUDNN_ATTR_ENGINE_GLOBAL_INDEX with 1x1 convolutions and tensors with more than two GB elements (2G) will see CUDNN_STATUS_EXECUTION_FAILED in cuDNN 8.3.x.
  • cuDNN may return CUDNN_STATUS_EXECUTION_FAILED from cudnnConvolutionForward(), cudnnConvolutionBiasActivationForward(), or cudnnConvolutionBackwardData() when computing convolutions with large spatial dimensions and batch sizes. This issue will be addressed in a future release so that such problems will instead return CUDNN_STATUS_NOT_SUPPORTED where applicable.
  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1025 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine results in incorrect results.
  • The documentation for cudnnReorderFilterAndBias() requires corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • CUDA streams internal to cuDNN are not guaranteed to have the same priority as the user stream that is set by cudnnSetStream(). We recently discovered some issues that break our ability to document exceptions to this clearly.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero, however, is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • cudnnPoolingBackward() enables both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.
  • Users of the static library requiring the best possible convolution performance should use whole-archive linking with the cnn_infer and cnn_train static sub libraries. This will come at a cost to the binary size of the application. This linkage requirement will be relaxed in a future release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the NVIDIA cuDNN Support Matrix.

Limitations

  • CUDNN_ATTR_ENGINE_GLOBAL_INDEX =1001 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
  • Versions of cuDNN before the 8.0 release series do not support the NVIDIA Ampere Architecture and will generate incorrect results if used on that architecture.
  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later. It also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples must be installed in a writable location. If not installed in a writable location, the samples can crash.
  • RNN and multihead attention API calls may exhibit nondeterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This happens when two buffer sizes (16 KB and 4 MB) are used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.

Deprecated Features

The following features are deprecated in cuDNN 8.3.3:
  • We are deprecating the reporting of performance results in the Best Practices For Using cuDNN 3D Convolutions and will instead update these Release Notes if there is anything interesting to report release-over-release. Starting with cuDNN 8.4.0, this section will be removed. For past performance tables, refer to the NVIDIA cuDNN Archives.
  • Updated and migrated the content from the Best Practices For Using cuDNN 3D Convolutions to the NVIDIA cuDNN Developer Guide. The Best Practices document has been deprecated.

1.13. cuDNN Release 8.3.2

This is the NVIDIA cuDNN 8.3.2 Release Notes. This release includes fixes from the previous cuDNN v8.1.x releases as well as the following additional changes. These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • In the runtime fusion engine, pointwise fusion for batched matmul was extended to support operations with full tensor in the epilog.

Announcements

Fixed Issues

  • cuDNN multihead attention produces incorrect results in case the postDropout feature is enabled. The issue has been fixed in this release.
  • Running convBiasAct in CUDNN_CROSS_CORRELATION mode could result in incorrect results if the GroupedDirect engine is selected.
  • Documentation has been updated for pooling forward and backward API functions, to update which data types and vectorizations are supported for the tensor descriptor arguments. For more information, refer to the cudnnPoolingForward() and cudnnPoolingBackward().
  • The documentation in the Reproducibility section in the NVIDIA cuDNN Developer Guide has been improved upon for clarity.
  • The documentation for the CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR, CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR, and CUDNN_BACKEND_OPERATION_RESAMPLE_BWD_DESCRIPTOR in the NVIDIA cuDNN API Reference has been improved upon for clarity.
  • Use of CUDNN_TENSOR_NCHW_VECT_C with cudnnReorderFilterAndBias() could generate incorrect results when the reordered filter data was used incorrectly within cuDNN. Direct use of cudnnConvolutionForward() or cudnnConvolutionBiasActivationForward() without cudnnReorderFilterAndBias() was unaffected by this issue.
  • Compared to cuDNN 8.3.0, there was an overall ~5% regression on convBiasAct layers on PG199/PG189. The maximum performance regression was around 3x for a select few cases. This issue has been fixed in this release.

Known Issues

  • cuDNN may return CUDNN_STATUS_EXECUTION_FAILED from cudnnConvolutionForward(), cudnnConvolutionBiasActivationForward(), or cudnnConvolutionBackwardData() when computing convolutions with large spatial dimensions and batch sizes. This issue will be addressed in a future release so that such problems will instead return CUDNN_STATUS_NOT_SUPPORTED where applicable.
  • Versions of cuDNN before the 8.0 release series do not support the NVIDIA Ampere Architecture and will generate incorrect results if used on that architecture.
  • Data gradient backendEngine 25 (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine results in incorrect results.
  • The documentation for cudnnReorderFilterAndBias() needs some corrections for clarity.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • CUDA streams internal to cuDNN are not guaranteed to have the same priority as the user stream that is set by cudnnSetStream(). We recently discovered some issues that break our ability to document exceptions to this clearly.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero, however, is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • cudnnPoolingBackward() allows both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.
  • Users of the static library requiring the best possible convolution performance should use whole-archive linking with the cnn_infer and cnn_train static sub libraries. This will come at a cost to the binary size of the application. This linkage requirement will be relaxed in a future release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the NVIDIA cuDNN Support Matrix.

Limitations

  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples can crash unless they are installed in a writable location.
  • RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.

Deprecated Features

The following features are deprecated in cuDNN 8.3.2:
  • We are deprecating the reporting of performance results in the Best Practices For Using cuDNN 3D Convolutions and will instead update these Release Notes if there is anything interesting to report release-over-release. Starting with cuDNN 8.4.0, this section will be removed. For past performance tables, refer to the NVIDIA cuDNN Archives > Best Practices For Using cuDNN 3D Convolutions.

1.14. cuDNN Release 8.3.1

This is the NVIDIA cuDNN 8.3.1 Release Notes. This release includes fixes from the previous cuDNN v8.1.x releases as well as the following additional changes. These Release Notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Announcements

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • In the runtime fusion engine:
    • Pointwise logical and comparison operators are now supported, including CUDNN_POINTWISE_LOGICAL_AND, CUDNN_POINTWISE_LOGICAL_OR, CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_CMP_EQ, CUDNN_POINTWISE_CMP_NEQ, CUDNN_POINTWISE_CMP_GT, CUDNN_POINTWISE_CMP_GE, CUDNN_POINTWISE_CMP_LT, and CUDNN_POINTWISE_CMP_LE. As part of this feature, support for loading/storing/computing with boolean tensors has also been added.
    • Batch support was added for matmul operation. Also, it is allowed to have batch broadcasting. The same matrix A or B can be broadcasted across the batch for matmul operation.
    • The leading dimension support (reflected in the strides of the tensors) was added for matmul operation. It is allowed to compute matmul operation with unpacked tensors.

Fixed Issues

The following issues have been fixed in this release:
  • cudnnConvolutionBiasActivationForward() could in some cases silently apply a ReLU operation when Identity was requested. This issue has been fixed in this release.
  • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_FFT_TILING, CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT_TILING, and CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING could exhibit illegal memory access in cuDNN v8 releases. This issue has been fixed in this release.
  • CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0 was wrongly marked with numerical note CUDNN_NUMERICAL_NOTE_REDUCED_PRECISION_REDUCTION when output data type is float or double. This issue has been fixed in this release.
  • There was an error in the documentation for determinism of cudnnConvolutionBackwardFilter() by algo. This issue has been corrected in this release.
  • When the user selected algo0 (CUDNN_RNN_ALGO_STANDARD) in cudnnRNNBackwardData_v8() or invoked the legacy functions, such as cudnnRNNBackwardDataEx(), cudnnRNNBackwardData(), and the number of RNN layers was more than eight in a unidirectional model or more than four in a bidirectional model, then some internal streams used to parallelize computations may be default streams (aka stream 0). The computational performance would most likely be affected in those cases. This issue has been fixed in this release.
  • Calling cudnnSoftmaxForward() with CUDNN_SOFTMAX_MODE_CHANNEL mode and N==1 in NCHW layout would result in incorrect results in cuDNN 8.3.0. This has been fixed in this release.

Known Issues

  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • In general, the internal CUDA streams inside cuDNN will have the same priority as the user stream that is set by cudnnSetStream() (instead of always having default priority). There are two exceptions:
    1. When the user stream is in capture mode (that is, cudaStreamCaptureStatusActive==1), the cuDNN-owned streams will still have default priority, and
    2. RNN functions cudnnRNNForward(), cudnnRNNBackwardData_v8(), cudnnRNNBackwardWeights_v8(), and their legacy counterparts still use default priority CUDA streams or higher priority streams to launch concurrent and cooperative grids.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero, however, is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • cudnnPoolingBackward() allows both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.
  • Users of the static library requiring the best possible convolution performance should use whole-archive linking with the cnn_infer and cnn_train static sub libraries. This will come at a cost to the binary size of the application. This linkage requirement will be relaxed in a future release.
  • Compared to cuDNN 8.3.0, there is an overall ~5% regression on convBiasAct layers on PG199/PG189. The maximum performance regression is around 3x for a select few cases.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the NVIDIA cuDNN Support Matrix.

Limitations

  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples can crash unless they are installed in a writable location.
  • RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.

1.15. cuDNN Release 8.3.0

This is the NVIDIA cuDNN 8.3.0 release notes. This release includes fixes from the previous cuDNN v8.1.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and NVIDIA JetPack™ users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previously released cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Announcements

  • cuDNN version 8.3.0 depends on cuBLAS as a shared library dependency.
  • The cuDNN version 8.3.0 libcudnn_static.a deliverable is replaced with the following:
    • libcudnn_ops_infer_static.a
    • libcudnn_ops_train_static.a
    • libcudnn_cnn_infer_static.a
    • libcudnn_cnn_train_static.a
    • libcudnn_adv_infer_static.a
    • libcudnn_adv_train_static.a
  • cuDNN version 8.3.0 depends on zlib as a shared library dependency. Refer to the zlib instructions in the NVIDIA cuDNN Installation Guide for instructions.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • WSL 2 is released as a preview feature in this cuDNN 8.3.0.
  • Various improvements were made in our multihead attention API:
    • Added HSH support - FP16 data type with FP32 math precision. Allow to achieve FP16 mixed-precision Tensor Core performance without sacrificing accuracy.
    • Added support to bias gradient computation. Before cuDNN 8.3.0, bias was supported only for inference.
    • Multihead attention has two dropout layers active in training mode. The first dropout operation is applied directly to the softmax output. The second dropout operation alters the multihead attention output, just before the point where residual connections are added. Before cuDNN 8.3.0, only the first dropout layer was supported.
    • Significant performance improvement out of the box (no changes are required from users) for both multihead attention forward and backward paths.
  • Various improvements were made in our runtime fusion engine:
    • The cuDNN runtime fusion engine now supports resample operations of upsample and downsample. Support has been added for 2*2 average pooling with stride 2 and upsample by a factor of 2 using bilinear interpolation. The datatype supported is FP32 and the compute datatype is FP32. The resample operations can also be fused with other operations provided the input to the resample operation is located in global memory.
    • The cuDNN runtime fusion engine is now generalized to accurately obey the intermediate storage datatype users specified in the operation graph. The support datatypes include INT8, BF16, FP16, INT32, FP32. As a general rule, we recommend users to use FP32 as the intermediate storage type that provides balanced numerical precision and performance.
    • The cuDNN runtime fusion engine now supports batched matmul operation with row/column broadcast or row/column reduction operations in the epilog.
    • The cuDNN runtime fusion engine now does numeric clamping while converting from a data type with a larger dynamic range to one with a relatively smaller dynamic range to avoid numeric overflows at all times.
    • The cuDNN runtime fusion engine extends the support for broadcast pointwise operations in the epilogue to now include those between a tensor and a scalar value as well.
    • Extended general fusion heuristic support to convolution forward and backward data and weight gradient operation patterns, with FP16, TF32, INT8 I/O data types, to ensure a good heuristic selection to improve out-of-the-box performance.
  • Added more detailed error reporting that is accessible from the existing API log or the logging callback function. Error and warning severity levels are added into the error reporting. Environment variables CUDNN_LOGERR_DBG and CUDNN_LOGWARN_DBG can be used to enable these severity levels respectively. Within these error severity levels, the error or warning message will now include a traceback of the error conditions triggered the error as hints for troubleshooting purposes.
  • Engine heuristics now supports a new mode called HEUR_MODE_FALLBACK which gives a list of engine configurations that run most of the convolution problems without the performance guarantee. Use this mode when all engines suggested by heuristics are not supported.
  • In prior cuDNN versions, certain engines required reordered filters for int8x32 format, but there was no way to disambiguate whether the filter was reordered. Engines that require reordered filters now have a new behavior note CUDNN_BEHAVIOR_NOTE_REQUIRES_FILTER_REORDER which specifies the tensors must be reordered before being passed to the engine.
  • RNN functions cudnnRNNBackwardData_v8(), cudnnRNNBackwardDataEx(), and cudnnRNNBackwardData()have been improved to internally invoke the cooperative group API cudaLaunchCooperativeKernel() to launch GPU kernels when threads must synchronize across all CUDA® thread blocks of a grid. Starting in CUDA 11.2, the cudaLaunchCooperativeKernel() function is able to run multiple cooperative grids concurrently in multiple streams. This feature has been used in CUDNN_RNN_ALGO_PERSIST_STATIC and CUDNN_RNN_ALGO_PERSIST_DYNAMIC algorithms to improve the computational performance. A method of launching these types of kernels using cudaLaunchCooperativeKernel() is more robust in preventing potential deadlocks when in rare scenarios when multiple cooperative grids are launched concurrently.

    cuDNN 8.3 compiled with CUDA 10.2 must still rely on a regular method of launching kernels. Deadlocks are mitigated by employing higher priority CUDA streams. Currently, cuDNN RNN APIs still use higher priority streams, however, in future cuDNN versions, priorities of auxiliary streams will match the priority of the user stream defined by the cudnnSetStream() call. Future cuDNN versions will also use the cudaLaunchCooperativeKernel() API to launch cooperative grids in forward RNN functions such as cudnnRNNForward().

Fixed Issues

The following issues have been fixed in this release:
  • cudnnAddTensor() did not support some tensor shapes that were previously specified as supported. This issue has been fixed in this release.
  • When using the cuDNN CTC Loss API function, the computed gradients array was not zero initialized. This meant it was possible the gradients array returned a mix of valid values and uninitialized values. This issue has been fixed in this release.
  • Compared to version 8.0.5, legacy convolution APIs increased CPU computational costs. On x86, this was measured to be as high as 10 microseconds. This issue has been fixed in this release.
  • cudnnSetStream() API was generating errors when graph capture was enabled. This issue is fixed in this release.
  • There was a known 60% performance regression for ResNet-50 on the GTX 1080 when run using FP16 data with large batch sizes (over 128). This regression has been fixed in this release.
  • In some cases, cudnnConvolutionBackwardFilter() generates numerically imprecise results when used with algo set to CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1. This issue is most frequently encountered with three-dimensional spatial tensors. Users of the backend API may explicitly avoid backend engine 2032 or consider the numerical notes of engines and reject any marked as offering reduced precision reduction (CUDNN_NUMERICAL_NOTE_REDUCED_PRECISION_REDUCTION).
  • For cudnnConvolutionBiasActivationForward(), there was previously a restriction on aliasing device memory pointers labeled X and Z in the documentation of that function. This restriction has been relaxed so that X may alias Z by pointing to the same device memory location if desired. Note that the restriction against aliasing the pointers labeled X and Y remains.
  • cuDNN may be observed to contain a small leak related to the use of dlopen. Currently, this is believed to be a false positive when indicated by valgrind. Should this thinking change, the known issues of this document will reflect that understanding in subsequent releases.
  • Previously, on NVIDIA Pascal and Maxwell architectures, users of cuDNN's 8.X's backend engine 34 with CUDNN_CONVOLUTION mode set for forward convolution could witness-illegal memory access when this engine is specifically selected outside of heuristic query. This issue has been fixed in this release.
  • Previously, on K80 GPUs when cudnnConvolutionForward() is used with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half I/O data types a silent error might occur when the output width Q is 1 and both height and width padding are zero; this particular case will now be rejected by cuDNN as not supported in this and all other successive releases for this GPU architecture.
  • cuDNN does not package libfreeimg as a static library for users of cuDNN's MNIST sample code. The included readme.txt file contains instructions on where to locate this dependency and how to compile and link this sample.
  • The parameters section for cudnnBatchNormalizationForwardInference() has been updated to reflect a correct *y description.
  • Compared to cuDNN 7.6.5, there was a known performance regression on various convolutional models using INT8 data types on NVIDIA Volta GPUs. This issue has been fixed in this release.
  • A memory leak as well as a possible delayed memory deallocation in the cuDNN runtime fusion engine have been fixed.
  • In previous releases of cuDNN 8, user applications might crash in rare instances due to large stack allocation requirements; this issue is fixed in this release by preferring heap allocation in cases where large stack allocations were previously occurring.
  • Some dgrad batchnorm fusion engines were previously not supported on Windows. We now support this starting in cuDNN 8.3.0.
  • The documentation for cudnnReorderFilterAndBias() needed some corrections for clarity. The topic has been updated in this release.

Known Issues

  • When the user selects algo0 (CUDNN_RNN_ALGO_STANDARD) in cudnnRNNBackwardData_v8() or invokes the legacy functions, such as cudnnRNNBackwardDataEx(), cudnnRNNBackwardData(), and the number of RNN layers is more than eight in a unidirectional model or more than four in a bidirectional model, then some internal streams used to parallelize computations may be default streams (aka stream 0). RNN algo0 dgrad APIs will not crash and the numerical results will be correct but the computational performance will likely be affected in those cases.
  • Users of the static library requiring best possible convolution performance should use whole-archive linking. This will come at a cost to binary size that will require resolution in future releases, either through static sub libraries or relaxing the whole-archive linkage requirement altogether.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • The internal CUDA streams inside cuDNN 8.3.0 will have the same priority as the user stream that is set by cudnnSetStream() (instead of always having default priority). There are two limitations:
    1. When the user stream is in capture mode (that is, cudaStreamCaptureStatusActive==1), the cuDNN-owned streams will still have default priority, and
    2. RNN functions cudnnRNNForward(), cudnnRNNBackwardData_v8(), cudnnRNNBackwardWeights_v8(), and their legacy counterparts still use default priority CUDA streams or higher priority streams to launch concurrent and cooperative grids.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero, however, is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
  • cudnnPoolingBackward() allows both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.
  • When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
  • Calling cudnnSoftmaxForward() with CUDNN_SOFTMAX_MODE_CHANNEL mode and N==1 in NCHW layout may result in incorrect results. This will be fixed in the next release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the NVIDIA cuDNN Support Matrix.

Limitations

  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples can crash unless they are installed in a writable location.
  • RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory that might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
  • The spatial persistent batch normalization API is only available for NVIDIA Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
  • cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.

1.16. cuDNN Release 8.2.4

This is the cuDNN 8.2.4 release notes. This release includes fixes from the previous cuDNN v8.1.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and NVIDIA JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Known Issues

  • Users of the static library requiring best possible convolution performance should use whole-archive linking. This will come at a cost to binary size that will require resolution in future releases, either through static sub libraries or relaxing the whole-archive linkage requirement altogether.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • Compared to version 8.0.5, legacy convolution APIs have increased CPU computational costs. On x86, this has been measured to be as high as 10 microseconds.
  • cudnnAddTensor() does not support all tensor shapes even though the cuDNN documentation says otherwise.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • Compared to cuDNN 7.6.5, there are known performance regressions on various convolutional models using INT8 data types on NVIDIA Volta GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • The documentation for cudnnReorderFilterAndBias() needs some corrections for clarity.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • NVIDIA Turing users of cuDNN can observe intermittent illegal memory access errors for some convolution workloads.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • In a multi-GPU setting, with complex scheduling, cuDNN can segfault. It is not clear that this is a cuDNN issue, but the issue is under active investigation so that the known limitations section of this document can be updated in a future release.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • There is a known 60% performance regression for ResNet-50 on the GTX 1080 when run using FP16 data with large batch sizes (over 128).
  • Users of the static library requiring the best possible convolution performance should use whole-archive linking. This will come at a cost to the binary size that will require resolution in a future release, either through static sub libraries or relaxing the whole-archive linkage requirement altogether.
  • cuDNN may contain a small memory leak related to the usage of dlopen() within the library; this is not confirmed but currently under investigation. The possible leak does not affect Windows users or users of the static library.
  • On NVIDIA Pascal and Maxwell architectures, users of cuDNN's 8.0 backend engine 34 for forward convolution can witness-illegal memory access when this engine is specifically selected outside of heuristic query. Heuristics users of these architectures will not witness this issue, as happened in previous versions. The possibility of the illegal memory access affects all previous versions of cuDNN 8.0 and will be fixed in a future release.
  • When using the cuDNN CTC Loss API function, the computed gradients array is not zero initialized. When sequence lengths are exceeded, some gradient entries are returned uninitialized.
  • The internal CUDA streams inside cuDNN 8.2.4 will have the same priority (instead of the default priority) as the user stream that is set by cudnnSetStream(), while an exception/limitation is that they will have priority as (highest - 1) for the user stream with the highest priority. This is true only when the user stream is NOT in capture mode (cudaStreamCaptureStatusActive), otherwise the behavior does not change.
  • Fusion engine operation mode 16 is not currently supported on Windows; this will be fixed in a future release.
  • The functional support criteria of cuDNN's convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero, however, is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.

Limitations

  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples can crash unless they are installed in a writable location.
  • RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • On K80 GPUs, when cudnnConvolutionForward() is used with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half I/O data types a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory that might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORT when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Starting in cuDNN version 8.1.0, we are no longer shipping the libfreeimg static library with the MNIST sample. Users can follow the instructions in the readme.txt file to download and compile the library separately and link with the MNIST sample.
  • For pre-Volta devices, users should align all buffers to at least 4 bytes; this applies to half-precision data as well.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
  • Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.

1.17. cuDNN Release 8.2.2

This is the cuDNN 8.2.2 release notes. This release includes fixes from the previous cuDNN v8.1.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and NVIDIA JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Experimental runtime fusion heuristics are now supported to facilitate an intelligent and efficient heuristic recommendation based on the predicted execution time for the runtime fusion engine. The current coverage is limited to fusion patterns involving a convolution forward operation in FP16 mixed precision configuration on NVIDIA Ampere Architecture GPUs. We will continue to expand the support and improve the prediction accuracy in future releases.
  • The cuDNN runtime fusion now supports pure pointwise fusion or pointwise plus reduction fusion. It supports FP16/FP32 as I/O type and FP32 as compute type.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, see the NVIDIA cuDNN Support Matrix.

Fixed Issues

The following issues have been fixed in this release:
  • For platforms that ship a compiler version older than GCC 6 by default, linking to static cuDNN using the default compiler is not supported.
  • There was a 15% performance regression for inference on the PyTorch WaveGlow model on the NVIDIA Turing architecture. This regression has been fixed.
  • The convolve_common_engine_int8_NHWC kernel had an undesired FP32 > INT32 truncation before outputting the FP32 result directly. This issue has been fixed in this release.
  • In previous cuDNN versions, cudnnRNNBackwardData(), cudnnRNNBackwardDataEx(), or cudnnRNNBackwardData_v8() could return CUDNN_STATUS_INTERNAL_ERROR when CUDNN_RNN_ALGO_PERSIST_STATIC and CUDNN_LSTM were selected. This issue occurred mainly on smaller GPUs, such as NVIDIA Turing with 36 or 48 SMs and smaller hiddenSize values. This issue has been fixed in this release.
  • NVIDIA Turing GTX 16xx users of cuDNN would observe invalid values in convolution output. This issue has been fixed in this release.
  • Users would experience NCHW transformations causing a floating point exception and the CPU reference code producing incorrect results for tensor format CUDNN_TENSOR_NCHW_VECT_C. Corner cases in the convolution sample code have been fixed in this release.

Known Issues

  • Users of the static library requiring best possible convolution performance should use whole-archive linking. This will come at a cost to binary size that will require resolution in future releases, either through static sub libraries or relaxing the whole-archive linkage requirement altogether.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • Compared to version 8.0.5, legacy convolution APIs have increased CPU computational costs. On x86, this has been measured to be as high as 10 microseconds.
  • cudnnAddTensor() does not support all tensor shapes even though the cuDNN documentation says otherwise.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • Compared to cuDNN 7.6.5, there are known performance regressions on various convolutional models using INT8 data types on NVIDIA Volta GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • It is possible, starting in cuDNN 7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • The documentation for cudnnReorderFilterAndBias() needs some corrections for clarity.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • NVIDIA Turing users of cuDNN can observe intermittent illegal memory access errors for some convolution workloads.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • In a multi-GPU setting, with complex scheduling, cuDNN can segfault. It is not clear that this is a cuDNN issue, but the issue is under active investigation so that the known limitations section of this document can be updated in a future release.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • There is a known 60% performance regression for ResNet-50 on the GTX 1080 when run using FP16 data with large batch sizes (over 128).
  • Users of the static library requiring the best possible convolution performance should use whole-archive linking. This will come at a cost to the binary size that will require resolution in a future release, either through static sub libraries or relaxing the whole-archive linkage requirement altogether.
  • cuDNN may contain a small memory leak related to the usage of dlopen() within the library; this is not confirmed but currently under investigation. The possible leak does not affect Windows users or users of the static library.
  • On NVIDIA Pascal and Maxwell architectures, users of cuDNN's 8.0 backend engine 34 for forward convolution can witness-illegal memory access; this affects all previous versions of cuDNN 8.0 and will be fixed in a future release.
  • When using the cuDNN CTC Loss API function, the computed gradients array is not zero initialized. When sequence lengths are exceeded, some gradient entries are returned uninitialized.

Limitations

  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples can crash unless they are installed in a writable location.
  • RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • On K80 GPUs, when cudnnConvolutionForward() is used with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half I/O data types a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory that might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORT when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Starting in cuDNN version 8.1.0, we are no longer shipping the libfreeimg static library with the MNIST sample. Users can follow the instructions in the readme.txt file to download and compile the library separately and link with the MNIST sample.
  • For pre-Volta devices, users should align all buffers to at least 4 bytes; this applies to half-precision data as well.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.

Deprecated Features

The following features are deprecated in cuDNN 8.2.2:
  • Support for Ubuntu 16.04 has been deprecated in cuDNN 8.2.2 for CUDA 11.4. For a list of supported OS, refer to the NVIDIA cuDNN Support Matrix.
  • Support for RHEL7 for ppc64le has been deprecated in cuDNN 8.2.2 for CUDA 11.4. For a list of supported OS, refer to the NVIDIA cuDNN Support Matrix.

1.18. cuDNN Release 8.2.1

This is the cuDNN 8.2.1 release notes. This release includes fixes from the previous cuDNN v8.1.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and NVIDIA JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, see the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • The cuDNN runtime fusion engine now supports generating Tensor Core kernels with input tensors of:
    • Bfloat16 type and compute precision of FP32 (requires compute capability 8.0 or later). For Bfloat16 support, convolution I/O channels are required to be a multiple of 8.
    • INT8 and compute precision of INT32 (requires compute capability 7.5 or later) datatype and in NHWC layout. For INT8 support, the convolution I/O channels are required to be a multiple of 16, and unlike the NCHW_VECT_C kernels, filter and bias reordering is not required.

    In the fused pointwise/reduction operations, FP32 is the compute precision supported.

  • The cuDNN runtime fusion engine has added experimental Tensor Core kernel generation support for NVIDIA Volta (compute capability 7.0) and NVIDIA Xavier (compute capability 7.2). The supported input tensor data type is FP16, compute precision is FP32, and the supported layout is NHWC. However, reduction fusion is not yet supported and we are working on further generalizing the support.
  • Equations in the documentation are now supported in Chrome.
  • The backend API now supports fused convolution-scale-bias-activation with per-channel-scaling by matching the operation graph.
  • cudnnPoolingBackward() allows both x and y data pointers (together with the related tensor descriptor handles) to be NULL for avg-pooling. This could save memory footprint and bandwidth.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Fixed Issues

The following issues have been fixed in this release:
  • In some cases, NVIDIA Ampere Architecture users of cuDNN 8.1 cudnnGetConvolutionBackwardFilterAlgorithm_v7() could receive a workspace that was insufficient for computing the calculation with cudnnConvolutionBackwardFilter(). This issue has been fixed in this release.
  • Many convolution models were experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This included ResNet-50 with up to 2x performance difference and ResNeXt up to 10x the performance difference. Many of these performance issues have been fixed in this release.
  • Compared to cuDNN version 8.0.5, there was a known 8% performance regression on the SSD ResNet-50 model on the NVIDIA Ampere Architecture. This issue has been fixed in this release.
  • L4T users of cuDNN could observe CUDNN_STATUS_EXECUTION_FAILED errors in some cases when performing convolutions using CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM. This issue has been fixed in this release.
  • In cuDNN 8.2.0, if the user runs a Bi-directional RNN network with dropout enabled, the user may see non-deterministic outputs. This issue has been fixed in 8.2.1.
  • There was a known 18% performance regression for inference on the PyTorch ResNet-50 v1.5 model on the NVIDIA Turing architecture. This issue has been fixed in this release.
  • Known regressions on certain layers in cuDNN 8 regression in algorithm selection heuristics have been fixed on NVIDIA Volta and NVIDIA Pascal platforms.
  • In older versions of cuDNN, when calling the API cudnnSetDropoutDescriptor(), a kernel launched by this API used to require a substantial amount of GPU memory for the stack. The memory is released when the kernel finishes and the stack size is changed back in a way that is not thread safe. Starting in the 8.2.1 release, the extra memory is no longer required, and as a result, the thread safety concern is no longer present.
  • In cuDNN 8.1.1, compared to cuDNN 8.1.0, there was a known regression in performance of the runtime fusion engine for convolution fused with ReLU in the epilog. This was caused due to the generalized support for parameterized ReLU. This issue has been fixed since the 8.2.0 release.
  • Since cuDNN 8.0.4 until 8.2.0, certain SKUs of V100 GPU may encounter CUDNN_STATUS_EXECUTION_FAILED status or unspecified launch failure in a subsequent call to cudaDeviceSynchronize() when running RNN with cell mode of CUDNN_LSTM and CUDNN_RNN_ALGO_PERSIST_STATIC algorithm. This issue has been fixed in this release.
  • Between cuDNN 8.1.0 and 8.2.0, if the user runs cudnnRNN*() API under CUDA compute sanitizer with CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H algorithm, users may see errors like Invalid __global__ read reported by the CUDA compute sanitizer. This issue has been fixed in this release.
  • Compared to cuDNN 8.0.0 preview, there is a known ~12% performance regression on vgg16 when run on Jetson Nano and TX2. This issue has been fixed in this release.
  • Compared to cuDNN 7.6, there is a significant performance regression on Darknet when run on Jetson Nano. This issue has been fixed in this release.

Known Issues

  • Users of the static library requiring best possible convolution performance should use whole-archive linking. This will come at a cost to binary size that will require resolution in future releases, either through static sub libraries or relaxing the whole-archive linkage requirement altogether.
  • Some convolution models are experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
  • Compared to version 8.0.5, legacy convolution APIs have increased CPU computational costs. On x86, this has been measured to be as high as 10 microseconds.
  • cudnnAddTensor() does not support all tensor shapes even though the cuDNN documentation says otherwise.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • Compared to cuDNN 7.6.5, there are known performance regressions on various convolutional models using INT8 data types on NVIDIA Volta GPUs.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
  • It is possible, starting in cuDNN v7.6 and up to but not including 8.1.1, to leak memory when computing common convolution operations in rare cases.
  • There is a known 15% performance regression for inference on the PyTorch WaveGlow model on the NVIDIA Turing architecture.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • The documentation for cudnnReorderFilterAndBias() needs some corrections for clarity.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
  • NVIDIA Turing GTX 16xx users of cuDNN can observe invalid values in convolution output.
  • NVIDIA Turing users of cuDNN can observe intermittent illegal memory access errors for some convolution workloads.
  • FFT and Winograd based algorithms for convolution do not support graph capture.
  • In a multi-GPU setting, with complex scheduling, cuDNN can segfault. It is not clear that this is a cuDNN issue, but the issue is under active investigation so that the known limitations section of this document can be updated in a future release.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case.
  • There is a known 60% performance regression for ResNet-50 on the GTX 1080 when run using FP16 data with large batch sizes (over 128).

Limitations

  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples can crash unless they are installed in a writable location.
  • RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • On K80 GPUs, when cudnnConvolutionForward() is used with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half I/O data types a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory that might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORT when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Starting in cuDNN version 8.1.0, we are no longer shipping the libfreeimg static library with the MNIST sample. Users can follow the instructions in the readme.txt file to download and compile the library separately and link with the MNIST sample.
  • For pre-Volta devices, users should align all buffers to at least 4 bytes; this applies to half-precision data as well.
  • Users of cuDNN must add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.

Deprecated Features

The following features are deprecated in cuDNN 8.2.1:
  • Support for Ubuntu 16.04 will be deprecated in cuDNN 8.2.2 for CUDA 11.4. For a list of supported OS, refer to the NVIDIA cuDNN Support Matrix.
  • Support for RHEL7 for ppc64le will be deprecated in cuDNN 8.2.2 for CUDA 11.4. For a list of supported OS, refer to the NVIDIA cuDNN Support Matrix.

1.19. cuDNN Release 8.2.0

This is the cuDNN 8.2.0 release notes. This release includes fixes from the previous cuDNN v8.1.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and NVIDIA JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • Convolution with the backend API now supports tensor with more than 2**31 elements. The size and stride of each tensor dimension are still limited to 32-bit values.
  • Convolution Heuristics Generalization has been improved for several GPUs. These improvements can be observed in the legacy API and version 8 API. In the version 8 API, these improvements are available in both CUDNN_HEUR_MODE_INSTANT and CUDNN_HEUR_MODE_B.
  • The cuDNN runtime fusion engine now supports generating Tensor Core based fusion kernels in the following scenarios:
    • When there is a scale+bias+ReLU pattern in the graph fused to the x input of a convolution forward operation
    • When the graph contains 3D convolution forward, backward data, or backward filter operation
    • When the graph contains a convolution backward data operation with non-unit convolution strides

    We are working on further generalization of this support.

  • cuDNN C++ frontend has released the 0.2 version that adds more general support to activation forward and backward operations, matMul operation, and contains various bug fixes and clean ups. A few runtime fusion samples have also been added. For more information, refer to GitHub: cuDNN frontend.
  • The new RNN ALGO_STANDARD implementation and heuristics tuning provides significant speedup (up to 100%), especially when the overall problem size is small (hidden size, batch size, and the number of timesteps).
  • The RNN dropout implementation has been heavily optimized. The new implementation brings significant speed-up to all RNN algorithms when dropout is enabled.
  • cuDNN RNN has moved to calling cuBLASLt on newer architectures (compute capability >= 7.0). As a result, the CUBLAS_WORKSPACE_CONFIG workaround for cuBLAS non-deterministic behavior is no longer needed on those architectures. In addition, under repeated CUDA graph capture, cuBLASLt no longer allocates workspace repeatedly like cuBLAS.
  • In the cuDNN v8 backend API, a new CUDNN_ATTR_ENGINE_BEHAVIOR_NOTE attribute has been added. Users can query the engine behaviors using this attribute similar to the numerical behaviors queried through the CUDNN_ATTR_ENGINE_NUMERICAL_NOTE attribute. Currently, the engine behavior note only shows whether an engine does runtime compilation or not. More behaviors may be added in future releases.
  • cuDNN API logging for the v8 backend API has been significantly improved. Now more detailed information can be printed from the backend data structures, for example, tensors, operations, engines, and execution plans. We hope this can improve the development and debugging experience of cuDNN. Refer to this link for more instructions of how to enable API logging.
  • cuDNN now supports SWISH activation in both forward and backward directions. It can be configured for use with cudnnActivationForward() and cudnnActivationBackward() by using CUDNN_ACTIVATION_SWISH with cudnnSetActivationDescriptor(). SWISH activation's parameter, commonly known as beta, may further be set using the newly added API function cudnnSetActivationDescriptorSwishBeta() and queried with cudnnGetActivationDescriptorSwishBeta().

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Fixed Issues

The following issues have been fixed in this release:
  • There was a performance regression in certain use cases comparing NVIDIA RTX 3090 using cuDNN version 8.x to NVIDIA RTX 2080 Ti using cuDNN version 7.x. This regression has been fixed in this release.
  • There was a performance regression in the runtime engine for convolution fused with ReLU in the epilog in cuDNN 8.1.1. This regression has been fixed in this release.
  • Compared to cuDNN version 7.6.5, there was a performance regression in certain grouped ConvolutionBackwardFilter cases on the NVIDIA Volta GPU architecture. This regression has been fixed in this release.
  • CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT returned an internal error when the number of channels in the filter was greater than or equal to 65536. This issue has been fixed in this release.
  • Compared to cuDNN version 8.0.2, there was a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case. This issue has been fixed in this release.
  • Although the overall cuDNN library size has improved in cuDNN 8.1.0 with CUDA Toolkit 11.2 and greater, as compared to cuDNN 8.0.x, the cuDNN library remains large. We have attempted to moderate the severity of this issue in this release.
  • Many convolution models were experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This included ResNet-50 with up to 2x performance difference, ResNeXt up to 10x performance difference and U-Net up to 3x performance difference. The performance issues have been fixed in this release.
  • The ResNet-50 native FP32 inference issues have been fixed on NVIDIA Volta, NVIDIA Turing, and NVIDIA Ampere Architecture GPUs.
  • We have eliminated anonymous structs in cuDNN public headers cudnn_cnn_infer.h, cudnn_cnn_train.h, and cudnn_ops_infer.h to allow forward struct declarations. The following five typedef-s were updated: cudnnConvolutionFwdAlgoPerf_t, cudnnConvolutionBwdDataAlgoPerf_t, cudnnConvolutionBwdFilterAlgoPerf_t, cudnnAlgorithm_t, and cudnnDebug_t
  • cudnnActivationForward() could generate illegal memory access errors for tensors of more than 2**30 elements in the previous version of cuDNN 8. This issue has been fixed in this release.
  • In previous releases, cudnnRNNBackwardWeights(), cudnnRNNBackwardWeightsEx(), and cudnnRNNBackwardWeights_v8() may generate wrong and non-deterministic results when dropout is enabled. A stream dependency issue has been fixed in the current release so users will no longer observe this issue.
  • The heuristics in cudnnConvolutionBackwardFilter() have been improved for generalized cases. We have observed several convolution cases with up to ~100x performance improvements compared to cuDNN version 8.1.
  • Compared to cuDNN 8.0.4, there was a known ~6% performance regression on ONNX-WaveGlow when run on NVIDIA TITAN RTX. This issue has been fixed in this release.
  • Compared to cuDNN 7.6, there were known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs. This issue has been fixed in this release.

Known Issues

  • Users of the static library requiring best possible convolution performance should use whole-archive linking. This will come at a cost to binary size that will require resolution in future releases, either through static sub libraries or relaxing the whole-archive linkage requirement altogether.
  • Compared to version 8.0.5, legacy convolution APIs have increased CPU computational costs. On x86, this has been measured to be as high as 10 microseconds.
  • cudnnAddTensor() does not support all broadcast-able tensor shapes even though the cuDNN documentation says otherwise.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Compared to cuDNN 8.0.0 Preview, there is a known ~12% performance regression on vgg16 when run on Jetson Nano and TX2.
  • Compared to cuDNN 8.0.4, there is a known ~6% performance regression on ONNX-WaveGlow when run on NVIDIA TITAN RTX.
  • Compared to cuDNN 7.6, there is a significant performance regression on Darknet when run on Jetson Nano.
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • L4T users of cuDNN may observe CUDNN_STATUS_EXECUTION_FAILED errors in some cases when performing convolutions using CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM. This issue is being investigated.
  • Users of the static library requiring the best possible convolution performance should use whole-archive linking. This will come at a cost to the binary size that will require resolution in a future release, either through static sub libraries or relaxing the whole-archive linkage requirement altogether.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • Compared to cuDNN 7.6.5, there are known performance regressions on various convolutional models using INT8 data types on NVIDIA Volta GPUs.
  • Compared to cuDNN 8.1.0, there is a known regression in performance of the runtime fusion engine for convolution fused with ReLU in the epilog. This is caused due to the generalized support for parameterized ReLU. Further optimizations are being worked on.
  • The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet. This is being worked on and will be resolved in a future release.
  • It is possible, starting in cuDNN v7.6, to leak memory when computing common convolution operations in rare cases.
  • There is a known 15% performance regression for inference on the PyTorch WaveGlow model on the NVIDIA Turing architecture.
  • There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
  • There is a known 18% performance regression for inference on the PyTorch ResNet-50 v1.5 model on the NVIDIA Turing architecture.
  • The documentation for cudnnReorderFilterAndBias() needs some corrections for clarity. This will be fixed in a future release.
  • Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.

Limitations

  • The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples can crash unless they are installed in a writable location.
  • RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • On K80 GPUs, when cudnnConvolutionForward() is used with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half I/O data types a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory that might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORT when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Starting in cuDNN version 8.1.0, we are no longer shipping the libfreeimg static library with the MNIST sample. Users can follow the instructions in the readme.txt file to download and compile the library separately and link with the MNIST sample.
  • For pre-Volta devices, users should align all buffers to at least 4 bytes; this applies to half-precision data as well.

1.20. cuDNN Release 8.1.1

This is the cuDNN 8.1.1 release notes. This release includes fixes from the previous cuDNN v8.0.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and NVIDIA JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • The runtime fusion engine now supports the canonical NCHW/KCRS/NKPQ format for describing a tensor, in addition to the version 8 format that has the explicit group dimension NGCHW/GKCRS/NGKPQ that is already supported.
  • The runtime fusion engine now supports NVIDIA Ampere Architecture cards with compute capability 86 (that is, GA10x) in addition to compute capability 80 (GA100) and compute capability 75 (Tu10x).
  • The runtime fusion engine fully supports fusing configurable ReLU (a generalization of ReLU, clipped ReLU, and leaky ReLU), Tanh, Sigmoid, configurable EluGelu, configurable Softplus, and configurable Swish forward and backward activations into the epilog of a convolution forward, a convolution backward data, or a matrix multiplication operation.
  • The runtime fusion engine now fully supports [N, H, W, C] to [1, 1, 1, C] reduction and [N, H, W, C] to [N, H, W, 1] reduction on the output of a convolution forward, a convolution backward data operation, and [1, M, N] to [1, M, 1] and [1, M, N] to [1, 1, N] reduction in matrix multiplication operations. For convolution backward filter operation, [N, H, W, C] to [1, H, W, C] reduction and [N, H, W, C] to> [N, 1, 1, 1]reduction are supported. The supported reduction operators are CUDNN_REDUCE_TENSOR_ADD, CUDNN_REDUCE_TENSOR_MIN, and CUDNN_REDUCE_TENSOR_MAX.
  • API logging in cudnnBackendExecute() has been greatly improved to print out the internal information in descriptors like operation graphs, engines, and execution plans.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Fixed Issues

The following issues have been fixed in this release:
  • In some cases, cudnnConvolutionBackwardData(), on the NVIDIA Turing and NVIDIA Volta architectures performs operations on the GPU resulting in illegal memory access. This issue was fixed in version 8.0 and subsequent releases.
  • When running a convolution forward, convolution backward data/weights, or a matrix multiplication fusion with pointwise and reduction operations with engine index 0, the runtime fusion engine used to be allowed to run on the CUDA Toolkit 10.2. However, not all the features it relies upon are supported in the CUDA Toolkit 10.2. For better stability and targeted optimizations, the engine now requires CUDA Toolkit 11.2 update 1. We have blocked the engine from running in cuDNN built against CUDA Toolkit 10.2. See the Limitations section for more details.
  • The supported check in the runtime fusion engine has been improved to return proper error codes in currently unsupported operation graphs, such as:
    • an operation graph that contains more than one convolution of matrix multiplication operations
    • convolutions with compute type that is not FP32
    • grouped convolutions
    • when the convolution mode is not CUDNN_CROSS_CORRELATION
  • When running a convolution backward data/weights fusion with pointwise and reduction operations with engine index 0, the runtime fusion engine may be launching kernel with more shared memory specified than necessary, causing sub-optimal performance. This issue has been fixed in this release.
  • Execution of a plan for convolution forward operation graph, with engine-global index 1 returned CUDNN_STATUS_INTERNAL_ERROR when the filter format is NHWC and padding was larger than zero. This issue has been fixed in this release.
  • Compared to cuDNN version 8.0.5, there was a known performance regression of 10-50% on Tacotron2 and WaveGlow models. This issue has been fixed in this release.
  • cudnnConvolutionBiasActivationForward() does not invoke TF32 Tensor Core kernels when the math type in the convolution descriptor is set to CUDNN_DEFAULT_MATH. This leads to suboptimal performance under this math mode. This issue has been fixed in this release.
  • Fixed an issue where the version 8 graph API’s execution plan descriptor may internally refer to a descriptor outside of the data structure, which can cause unexpected errors when the external descriptors have been destroyed. Now all the information is recorded within the data structure.
  • There was a performance regression where NHWC was slower than NCHW on 3D convolution up to 40% on V100 and NVIDIA A100 GPUs. This issue has been fixed in this release.
  • There was a performance regression in certain use cases comparing NVIDIA RTX 3090 using cuDNN version 8.x to NVIDIA RTX 2080 Ti using cuDNN version 7.x. This regression has been fixed in this release.
  • Compared to cuDNN version 7.6, there were known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Volta and NVIDIA Ampere Architecture GPUs. These regressions has been fixed in this release.
  • When calling: the API will crash when called with CUDNN_RNN_ALGO_PERSIST_DYNAMIC algo but the cudnnPersistentRNNPlan_t was not created. This has been fixed in this release.

Known Issues

  • Users of the static library requiring best possible convolution performance should use whole-archive linking. This will come at a cost to binary size that will require resolution in future releases, either through static sub libraries or relaxing the whole-archive linkage requirement altogether.
  • The ResNet-50 native FP32 inference issues have been fixed on NVIDIA Volta and NVIDIA Turing. Few performance regressions exist in the NVIDIA Ampere Architecture GPUs.
  • Compared to version 8.0.5, legacy convolution APIs have increased CPU computational costs. On x86, this has been measured to be as high as 10 microseconds.
  • cudnnAddTensor() does not support all broadcast-able tensor shapes even though the cuDNN documentation says otherwise.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Compared to cuDNN 8.0.0 Preview, there is a known ~12% performance regression on vgg16 when run on Jetson Nano and TX2.
  • Compared to cuDNN 8.0.4, there is a known ~6% performance regression on ONNX-WaveGlow when run on NVIDIA TITAN RTX.
  • Compared to cuDNN 7.6, there is a significant performance regression on Darknet when run on Jetson Nano.
  • For pre-Volta devices, users should align all buffers to at least 4 bytes; this applies to half-precision data as well.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case. We are not aware of any popular model that uses this unique use case.
  • Although the overall cuDNN library size has improved in cuDNN 8.1.0 with CUDA Toolkit 11.2 and greater as compared to cuDNN 8.0.x, the cuDNN library remains large; future releases will attempt to moderate the severity of this issue.
  • CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT returns an internal error when the number of channels in the filter is greater than or equal to 65536.
  • Execution of a plan for convolution forward operation graph, with engine-global index 1 returns CUDNN_STATUS_INTERNAL_ERROR when the filter format in NHWC and padding is larger than zero.
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing GPUs.
  • cudnnActivationForward() can cause an illegal memory access CUDA error for tensors with more than 2**30 elements.
  • L4T users of cuDNN may observe CUDNN_STATUS_EXECUTION_FAILED errors in some cases when performing convolutions using CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM. This issue is being investigated.
  • Users of the static library requiring the best possible convolution performance should use whole-archive linking. This will come at a cost to the binary size that will require resolution in a future release, either through static sub libraries or relaxing the whole-archive linkage requirement altogether.
  • Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
  • Compared to cuDNN 7.6.5, there are known performance regressions on various convolutional models using INT8 data types on NVIDIA Volta GPUs.
  • Compared to cuDNN 8.1.0, there is a known regression in performance of the runtime fusion engine for convolution fused with ReLU in the epilog. This is caused due to the generalized support for parameterized ReLU. Further optimizations are being worked on.

Limitations

  • The runtime fusion engine was only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1; it now requires the NVRTC from CUDA 11.2 update 1. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
  • Samples can crash unless they are installed in a writable location.
  • RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • On K80 GPUs when cudnnConvolutionForward() is used with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half I/O data types a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory that might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORT when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Starting in cuDNN version 8.1.0, we are no longer shipping the libfreeimg static library with the MNIST sample. Users can follow the instructions in the readme.txt file to download and compile the library separately and link with the MNIST sample.

1.21. cuDNN Release 8.1.0

This is the cuDNN 8.1.0 release notes. This release includes fixes from the previous cuDNN v8.0.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and NVIDIA JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • A preview of the cuDNN runtime operation fusion capabilities is included in this release. This feature is exposed as a new backend engine in the version 8.0 graph API. With runtime op fusion, the engine can generate and compile fused tensor-core kernels on the fly for the specified operation graph during the execution plan finalization stage. Some of the operation graph patterns supported in this preview are: convolution or matrix multiplication operation with arbitrary combination of one or more pointwise operations, and reduction operations fused onto the output tensor. Examples include but are not limited to conv-bias-leaky_relu, and gemm-bias-gelu. This feature is supported on GPUs with compute capability 7.5 and 8.0. The current implementation supports FP16 I/O with FP32 compute or FP32 (TF32) I/O with FP32 compute. In this release, the support for this feature is restricted to Linux on x86-64. We will continue to work on this feature to provide additional support and improved performance. In the meantime, we welcome your feedback. E-mail:cudnn@nvidia.com
  • We have released our C++ frontend via GitHub which implements a series of classes wrapping around the v8 backend C API. The user must include a few headers to enjoy the convenience from graph construction, heuristics query to execution. The frontend also implements a significantly improved autotuning feature that can accurately time the executions from a list of functionally equivalent implementations and return the fastest implementation.
  • Heuristics have been improved for TF32 and PSEUDO_HALF (with Tensor Core enabled) convolutions. On one known model, performance was improved 1.3x (when not auto-tuning). On select cases in several models, we have seen performance improvements up to ~50x.
  • Added support for PSEUDO_BFLOAT16_CONFIG on NVIDIA Ampere Architecture GPU for CNNs. While most of the algos/engines that are available for PSEUDO_HALF_CONFIG are available for PSEUDO_BFLOAT16_CONFIG, a few are not available. The available engines for PSEUDO_BFLOAT16_CONFIG can achieve at least 90% performance of PSEUDO_HALF_CONFIG for layers in standard models. There is a known limitation for layers having three or four channels for the filter and convolution of stride 2, such as the first layer of ResNet.
  • EfficientNet performances have improved. Depthwise convolution is now optimized in NHWC layout in cuDNN 8.1.0. From EfficientNet, we see an average of 2.9x speed-up for 5x5 layers, and 1.7x speed-up for 3x3 layers.
  • Added support for TF32 engines to compute operation graphs that match the fused conv-bias-activation pattern. TF32 kernels are also supported in cudnnConvolutionBiasActivationForward() API.
  • Added support for a new RNN algo CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H, specialized for small hidden sizes. It is expected to be faster than other algos for those small hidden sizes.
  • The cuDNN build against CUDA Toolkit 11.2 is now backward compatible with earlier CUDA 11 drivers, including 450, 455, in addition to the 460 driver.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Fixed Issues

The following issues have been fixed in this release:
  • Kernel calc_bias_diff_nhwc_packedhas a known functional issue from 8.0.2. It happens when running cudnnConvolutionBackwardBias(bgrad) with NHWC/NDHWC packed tensors and even C && (C >= 6). This issue was fixed in this release.
  • Kernel convolve_common_engine_int8 may cause accuracy degradation when running cudnnConvolutionBiasActivationForward() with INT8 in cuDNN version 8.0.5 because there was not a rounding when converting the results from single to INT8. This issue was fixed in this release.
  • On some NVIDIA Turing GPUs, when users are running persistent RNN with hiddenSize greater than or equal to 768 but less than 1024, users may get incorrect results and refer to CUDA error 719, cudaErrorLaunchFailure the next time they call cudaDeviceSynchronize(). This bug has been fixed in this release.
  • Calling cudnnConvolutionBiasActivationForward() or executing a cuDNN backend plan for fused convolution-bias-activation operation graphs, can lead to a memory leak. This issue is fixed in the current release.
  • On Windows, calling API cudnnGetFoldedConvBackwardDataDescriptors() results in failure to find symbols. This issue has been present in all versions since cuDNN version 8.0.0 and is fixed in this release.
  • The backend convolution operation had external dependencies on the user created backend tensor descriptors even after finalization. Deletion of the tensor descriptors might cause the operation to seg-fault when constructing the operation graph. This bug affects all versions since cuDNN version 8.0.0, and has been fixed in 8.1.0.
  • Many convolution models were experiencing lower performance on NVIDIA RTX 3090 compared to 2080 Ti. This included ResNet-50 with up to 2x performance difference, ResNeXt up to 10x performance difference and U-Net up to 3x performance difference. The performance issues have been fixed in this release.

Known Issues

  • Users of the static library requiring best possible convolution performance should use whole-archive linking. This will come at a cost to binary size that will require resolution in future releases, either through static sub libraries or relaxing the whole-archive linkage requirement altogether.
  • The ResNet-50 native FP32 inference issues have been fixed on NVIDIA Volta and NVIDIA Turing. Few performance regressions exist in the NVIDIA Ampere Architecture GPU.
  • Compared to version 8.0.5, legacy convolution APIs have increased CPU computational costs. On x86, this has been measured to be as high as 10 microseconds.
  • cudnnAddTensor() does not support all broadcast-able tensor shapes even though the cuDNN documentation says otherwise.
  • Users have reported that in RNN training with non-zero dropout rate, and if the RNN network is unidirectional, the output of cudnnRNNBackwardWeights() may be non-deterministic. We are still investigating this issue.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Compared to cuDNN 8.0.0 Preview, there is a known ~12% performance regression on vgg16 when run on Nano and TX2.
  • Compared to cuDNN 8.0.4, there is a known ~6% performance regression on ONNX-WaveGlow when run on NVIDIA TITAN RTX.
  • Compared to cuDNN 7.6, there is a significant performance regression on Darknet when run on Nano.
  • For pre-Volta devices, users should align all buffers to at least 4 bytes; this applies to half-precision data as well.
  • Compared to cuDNN version 8.0.2, there is a known 3x performance regression for a single cudnnConvolutionBackwardFilter() use case. We are not aware of any popular model that uses this unique use case.
  • Although the overall cuDNN library size has improved in cuDNN 8.1.0 with CUDA Toolkit 11.2 and greater as compared to cuDNN 8.0.x, the cuDNN library remains large; future releases will attempt to moderate the severity of this issue.
  • CUDNN_CONVOLUTION_BWD_DATA_ALGO_FFT returns an internal error when the number of channels in the filter is greater than or equal to 65536.
  • Execution of a plan for convolution forward operation graph, with engine-global index 1 returns CUDNN_STATUS_INTERNAL_ERROR when the filter format in NHWC and padding is larger than zero.
  • Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
  • Compared to cuDNN 7.6, there are known performance regressions up to 2x on select configurations for AlexNet-like models on NVIDIA Turing, NVIDIA Volta, and NVIDIA Ampere Architecture GPUs.

Limitations

  • Samples can crash unless they are installed in a writable location.
  • RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multistream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • On K80 GPUs, when cudnnConvolutionForward() is used with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half I/O data types a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory that might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORT when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionCheck()) to load the kernels in the sub library before opening graph capture.
  • Starting in cuDNN version 8.1.0, we are no longer shipping the libfreeimg static library with the MNIST sample. Users can follow the instructions in the readme.txt file to download and compile the library separately and link with the MNIST sample.

1.22. cuDNN Release 8.0.5

This is the cuDNN 8.0.5 release notes. This release includes fixes from the previous cuDNN v8.0.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and NVIDIA JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
  • RNN now supports zero-length sequences within the batch when the RNN data layout is CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED or CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED. For more information, see cudnnSetRNNDataDescriptor().
  • Users can now set the environment variable CUDNN_CONV_WSCAP_DBG to a value in MiB to limit the workspace size returned by cudnnConvolutionForwardGetWorkspaceSize(), cudnnConvolutionBackwardDataGetWorkspaceSize(), and cudnnConvolutionBackwardFilterGetWorkspaceSize(). Limiting the workspace might result in performance lost.
  • Significant performance improvements were made for NVIDIA RTX 3090 for many models on many configurations.
  • Performance improvements were made:
    • For EfficientNet, when run using NHWC FP16 Tenor Core configurations on V100 and A100 GPU architectures.
    • For PilotNet, AH-Net, MobileNet V3 on V100 and A100 GPU architectures.
    • For various 3D convolution cases on NVIDIA RTX 8000.
  • Support for the 3D NDHWC layout was added in cudnnConvolutionBackwardFilter().
  • Added instructions for installing cuDNN using the Package Manager for Linux and RHEL users. For step-by-step instructions, refer to the Package Manager Installation.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Fixed Issues

The following issues have been fixed in this release:
  • cudnnBackendFinalize(descriptor), where descriptor is of type CUDNN_BACKEND_ENGINE_DESCRIPTOR() or CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR(), might result in a hang if the operation graph has backward filter operation and the user links against libcudnn.so (cudnn64.dll on Windows). This issue has been fixed in this release.
  • Call to cudnnConvolutionBiasActivationForward() might result in a memory leak in release 8.0.1. This issue has been fixed.
  • Performance regression on the U-Net Industrial network on NVIDIA Volta for certain batch sizes has been fixed.
  • cudnnRNN*() with LSTM mode may produce incorrect results on the cy outputs when clipping is enabled on all GPUs. This issue also exists in previous cuDNN releases since version 7.2.1. This issue has been fixed in this release.
  • cudnnRNNForward* with LSTM mode may produce incorrect results in case of clipping when CUDNN_RNN_ALGO_PERSIST_STATIC is used. This issue also exists in previous cuDNN releases since version 7.2.1. This issue has been fixed in this release.
  • In previous cuDNN versions, cudnnRNNBackwardData() or cudnnRNNBackwardDataEx()may produce non-deterministic outputs when running configurations such as hiddenSize=128 or less, LSTM cell type, and FP32 with CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION. This issue has been fixed in this release.
  • Compared to cuDNN 7.6, there was a known ~6% performance regression on Inception V3 and ResNet-50 models when run using NHWC FP16 configurations on various NVIDIA Turing and NVIDIA TITAN V architectures. This issue has been fixed in this release.
  • Compared to cuDNN v8.0.3, there was a known ~18% performance regression on the U-Net Industrial model when run using NCHW TF32 configurations on V100 and A100 GPU architectures. This issue has been fixed in this release.
  • Updated: November 25, 2020

    When calling cudnnConvolutionBiasActivationForward() with INT8x4 or INT8x32 I/O tensors, it could result in CUDNN_STATUS_BAD_PARAM in 8.0.4. This issue has been fixed in this release.

Known Issues

  • When using cudnnRNN* APIs with the problem sizes (input size, hidden size) not being multiples of 16 for FP16 tensors or multiples of 8 for FP32 tensors, users encountered a return status of CUDNN_STATUS_EXECUTION_FAILED in cudnn built against CUDA 11.0. This issue has been fixed with cuDNN built against CUDA 11.1.
  • The ResNet-50 native FP32 inference issues have been fixed on NVIDIA Volta and NVIDIA Turing. Few performance regressions exist in the NVIDIA Ampere Architecture GPU.
  • cudnnAddTensor() does not support all broadcast-able tensor shapes even though the cuDNN documentation says otherwise.
  • Users have reported that in RNN training with non-zero dropout rate, and if the RNN network is unidirectional, the output of cudnnRNNBackwardWeights() may be non-deterministic. We are still investigating this issue.
  • cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
  • Compared to cuDNN 8.0.0 Preview, there is a known ~12% performance regression on vgg16 when run on Nano and TX2.
  • Compared to cuDNN 8.0.4, there is a known ~6% performance regression on ONNX-WaveGlow when run on NVIDIA TITAN RTX.
  • Compared to cuDNN 7.6, there is a significant performance regression on Darknet when run on Nano.

Limitations

  • Samples can crash unless they are installed in a writable location.
  • RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 16 byte boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than NVIDIA Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in NVIDIA Volta and later, pad at least one of the dimensions to an even value.
  • On K80 GPUs, when cudnnConvolutionForward() is used with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM algorithm and half I/O data types a silent error might occur when the output width Q is 1 and both height and width padding are zero.
  • Several cuDNN APIs are unable to directly support computations using integer types (CUDNN_DATA_INT8, CUDNN_DATA_INT8x4, CUDNN_DATA_INT8x32 or CUDNN_DATA_INT32). Floating types (particularly CUDNN_DATA_FLOAT) are much more widely supported. If an API does not support the desired type, cudnnTransformTensor() can be used to support the use case by converting to/from a supported type and the desired type. Here are the steps for doing so:
    1. Convert all input tensors from their native type to a supported type (CUDNN_DATA_FLOAT is recommended).
    2. Run cuDNN API using the converted input tensors and output tensor descriptors set as CUDNN_DATA_FLOAT.
    3. Convert all output tensors from a supported type to your desired output type.
    Note: This will require extra memory use for the temporary buffers. Further, this will introduce an additional round trip to memory that might noticeably impact performance.
  • In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
  • In prior versions of cuDNN, some convolution algorithms can use texture-based load structure for performance improvements particularly in older hardware architectures. Users can opt out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who want to continue to use texture-based load, can adapt the new backend API, and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
  • In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 that is 2^29.
  • cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORT when the number of channels exceeds 1024.
  • When using graph-capture, users should call the sub library version check API (for example, cudnnOpsInferVersionChec()) to load the kernels in the sub library before opening graph capture.

1.23. cuDNN Release 8.0.4

This is the cuDNN 8.0.4 release notes. This release includes fixes from the previous cuDNN v8.0.x releases as well as the following additional changes. These release notes are applicable to both cuDNN and NVIDIA JetPack users of cuDNN unless appended specifically with (not applicable for Jetson platforms).

For previous cuDNN documentation, refer to the NVIDIA cuDNN Archives.

Key Features and Enhancements

The following features and enhancements have been added to this release:
GA102 support with improved convolution performance
Now includes convolution heuristics targeting the NVIDIA GA102 GPU. (not applicable for Jetson platforms)
RNN API v8 sample
The new RNN sample illustrating the usage of the new RNN version 8 API has been added. The sample's workflow consists of the several routines to create RNN descriptors, create RNN data descriptors, set up weight space, and compute routines. The sample takes several input parameters that can set up different RNN configurations and input data specifications (data type, cell mode, bias mode, and so on).
RNN functional and performance improvements
ARM Server Base System Architecture (SBSA)
Added support for ARM SBSA for Linux.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the NVIDIA cuDNN Support Matrix.

Limitations

  • Samples can crash unless they are installed in a writable location.
  • RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN 8.0.4 library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream using the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

    When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multihead attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

    The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.

  • The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
  • Some computational options in cuDNN 8.0.4 now require increased alignment on tensors in order to run efficiently. As always, cuDNN recommends users to align tensors to 128-bit boundaries that will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.4 compared to cuDNN v7.6.
  • For certain algorithms, when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.4 users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
  • For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and