Release Notes#

To review cuDNN documentation versions 8.5.0 - 8.6.7, refer to the cuDNN Documentation Archives.

To review cuDNN documentation 9.0.0 and more recent, choose a version from the bottom left navigation selector toggle.

Refer also to the cuDNN frontend release notes.

cuDNN 9.12.0#

These are the NVIDIA cuDNN backend 9.12.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Announcements

This release introduces cuDNN JIT, a runtime configuration of cuDNN that provides significant reduction in download and installed binary sizes by supporting only runtime fusion engines, through the graph API, and requiring a smaller subset of cuDNN libraries. For details, refer to cuDNN JIT.

Key Features And Enhancements

The following features and enhancements have been added to this release:

Deviceless ahead-of-time compilation is now supported by cuDNN. You can query cuDNN heuristics and finalize an execution plan with a cuDNN device properties descriptor on CPU-only nodes. Note the following limitations:
- GPU architectures before NVIDIA Ampere are not supported.
- cuDNN heuristics mode CUDNN_HEUR_MODE_B is not supported.
- For some rare cases, not all engineCfgs are available.

Known Issues

For certain convolution-related workloads, memory allocations are made that are not released until process termination. This memory leak does not grow over time.
Runtime compilation of LayerNorm and RMSNorm execution plans might be protracted on compute capability 12.0 devices.
Barrier errors might be reported during forward training and inference for Layer or RMS Normalization when the library is tested with compute-sanitizer synccheck.
Scaled dot-product attention might lead to undefined behavior when the sequence length of key/value is 1.
Decode-phase scaled dot-product attention should usually be run without causal masking because the query sequence is naturally masked to not know future tokens. If it is run with causal masking turned on, and if the layer has Q head count != K/V head count, the result might be mismatches or NaNs. To avoid this issue, do not use causal masking, and provide actual sequence length with the help of a padding mask.
The FORT/FFMA backend does not support dynamic shapes for pooling.
For tensors in the CUDNN_TENSOR_NCHW_VECT_C layout, the serialized vectorized dim (vectDim) might be incorrect. If the serialized tensor is deserialized, it cannot be finalized. Therefore, avoid using serialization with tensors in the CUDNN_TENSOR_NCHW_VECT_C layout. A fix is planned for a future release.
The arm64-sbsa and aarch64-jetson cross-compiled builds of cuDNN install without unversioned symbolic links. As a result, compilation of cuDNN samples fails.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

cuDNN 9.11.0#

These are the NVIDIA cuDNN backend 9.11.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Key Features And Enhancements

The following features and enhancements have been added to this release:

Forwards and backwards attention support for DeepSeek use cases on Blackwell-architecture GPUs has been significantly optimized.
cuDNN now supports block scale Matmul with flexible 1D/2D block sizes through the runtime fusion engine on GPUs based on the NVIDIA Ampere architecture and later GPU architectures.
Paged attention now supports FP8 inputs for both Q and the non-contiguous KV-containers.
LayerNorm forwards now supports the fusion pattern for generating a ReLU bitmask.
LayerNorm backwards now supports the fusion pattern for applying dReLU by using a ReLU bitmask generated by LayerNorm forwards.
GPU architectures earlier than the NVIDIA Turing architecture (namely the NVIDIA Maxwell, NVIDIA Pascal, and NVIDIA Volta GPU architectures) are no longer supported.
The following software releases are no longer supported:
- All NVIDIA CUDA Toolkit 11.x releases
- Ubuntu 20.04

Fixed Issues

The following issues have been fixed in this release:

An issue that caused the dQ and dK buffers to be overwritten with zeros on Ampere-architecture GPUs has been fixed.
An issue where paged scaled dot-product attention might cause illegal memory access on NVIDIA Ampere, Hopper, and Blackwell architectures has been fixed.
An issue with scaled dot-product attention dBias that could cause illegal memory accesses has been fixed.
The global Amax calculation for fp8 scaled dot-product attention now force sets the initial value to 0.0f.
Runtime compilation of scaled dot-product attention for Blackwell-architecture GPUs now uses the exact target architecture flag that matches the device on which the cuDNN graph is being run.
A hang issue on NVIDIA Blackwell-architecture GPUs for paged scaled dot-product attention has been fixed.
An issue that caused scaled dot-product attention statistics to be incorrect if the statistics are not in BHSD layout on GPUs based on the NVIDIA Ampere, NVIDIA Hopper, and NVIDIA Blackwell GPU architectures has been fixed. Now the statistics are allowed to be any layout.
In NVIDIA CUDA Toolkit 12.9, the NVRTC library might produce incorrect output when supporting cuDNN runtime fusion engines. This issue has been fixed with CUDA Toolkit 12.9 Update 1.
An issue with ConvolutionBwdDataRdgrad that could cause an illegal memory access on Turing-architecture GPUs has been fixed.
An issue with all previous cuDNN 9.x releases that could cause cuDNN to heuristically recommend for convolutional workloads engine configurations that improperly required a stricter device pointer alignment than necessary has been fixed.
An issue that caused some JSON serialization fields of data type int64 to be read as int32, causing serialization to fail, has been fixed.
The use of asymmetric padding with CUDNN_ATTR_ENGINE_GLOBAL_INDEX set to 2 for the ConvolutionBwdData operation is now supported.

Known Issues

For certain convolution-related workloads, memory allocations are made that are not released until process termination. This memory leak does not grow over time.
Runtime compilation of LayerNorm and RMSNorm execution plans might be protracted on compute capability 12.0 devices.
Barrier errors might be reported during forward training and inference for Layer or RMS Normalization when the library is tested with compute-sanitizer synccheck.
Scaled dot-product attention might lead to undefined behavior when the sequence length of key/value is 1.
Decode-phase scaled dot-product attention should usually be run without causal masking because the query sequence is naturally masked to not know future tokens. If it is run with causal masking turned on, and if the layer has Q head count != K/V head count, the result might be mismatches or NaNs. To avoid this issue, do not use causal masking, and provide actual sequence length with the help of a padding mask.
The FORT/FFMA backend does not support dynamic shapes for pooling.
For tensors in the CUDNN_TENSOR_NCHW_VECT_C layout, the serialized vectorized dim (vectDim) might be incorrect. If the serialized tensor is deserialized, it cannot be finalized. Therefore, avoid using serialization with tensors in the CUDNN_TENSOR_NCHW_VECT_C layout. A fix is planned for a future release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

Limitations

To specify unused segments in scaled dot-product attention with ragged offsets, or to specify a batch with zero actual sequence length, the corresponding value in device memory must be exactly 0 and nothing else.

Deprecated and Removed Features

The following features, which are available only in the cuDNN backend, are deprecated in cuDNN 9.11.0:

Fused Attention fprop, for which the preferred alternative is Fused flash attention fprop
Fused Attention bprop, for which the preferred alternative is Fused flash attention bprop

These features still work in the current release. However, in preparation for the possible removal of these features, use the preferred alternative for each feature.

GPU architectures earlier than the NVIDIA Turing architecture (namely the NVIDIA Maxwell, NVIDIA Pascal, and NVIDIA Volta GPU architectures) are no longer supported.

The following software releases are no longer supported:

All NVIDIA CUDA Toolkit 11.x releases
Ubuntu 20.04

cuDNN 9.10.2#

These are the NVIDIA cuDNN backend 9.10.2 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Fixed Issues

The following issues have been fixed in this release:

A potential hang in Fused Flash Attention fprop has been removed. It is unknown if this hang was observed in production.
Several issues related to using the cuDNN Frontend scaled dot-product attention node’s left-bound and right-bound API have been fixed.

In a future release, the Frontend restriction will be lifted to allow for arbitrary left-bound and right-bound settings.
An illegal memory access in Fused Attention bprop that might affect compute capability 80, 89, and 120 has been fixed.
An issue where incorrect results might be produced with Fused Flash Attention bprop on compute capability 90 when the head dim is larger than 128 and cuDNN Frontend scaled dot-product attention node’s deterministic mode is turned on has been fixed.
An issue that caused JIT compilation failures of attention runtime fusion engines on unsupported GPU architectures has been fixed.
An issue that could cause the API log not to be printed fully when values that are infinity are encountered has been fixed.
An issue in the runtime fusion engine, where a convolution with asymmetric padding or with dilation is called on compute capability 90 and 100, has been fixed.

Known Issues

For certain convolution-related workloads, memory allocations are made that are not released until process termination. This memory leak does not grow over time.
Runtime compilation of LayerNorm and RMSNorm execution plans might be protracted on compute capability 12.0 devices.
Barrier errors might be reported during forward training and inference for Layer or RMS Normalization when the library is tested with compute-sanitizer synccheck.
Scaled dot-product attention might lead to undefined behavior when the sequence length of key/value is 1.
Decode-phase scaled dot-product attention should usually be run without causal masking because the query sequence is naturally masked to not know future tokens. If it is run with causal masking turned on, and if the layer has Q head count != K/V head count, the result might be mismatches or NaNs. To avoid this issue, do not use causal masking, and provide actual sequence length with the help of a padding mask.
The FORT/FFMA backend does not support dynamic shapes for pooling.
For tensors in the CUDNN_TENSOR_NCHW_VECT_C layout, the serialized vectorized dim (vectDim) might be incorrect. If the serialized tensor is deserialized, it cannot be finalized. Therefore, avoid using serialization with tensors in the CUDNN_TENSOR_NCHW_VECT_C layout. A fix is planned for a future release.
The use of asymmetric padding with CUDNN_ATTR_ENGINE_GLOBAL_INDEX set to 2 for the ConvolutionBwdData operation is not supported and might result in a floating point exception.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

cuDNN 9.10.1#

These are the NVIDIA cuDNN backend 9.10.1 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Announcements

The following dev base packages now depend on a common headers package libcudnn9-headers-cuda-X for the cuDNN header files:
- libcudnn9-dev-cuda-X (Ubuntu/Debian)
- libcudnn9-devel-cuda-X (RHEL/Rocky)

Fixed Issues

The following issues have been fixed in this release:

A deadlock issue with scaled dot-product attention when used with compute capability sm10.0 (Blackwell-architecture) GPUs and the FP8 datatype that caused the kernel to hang under some circumstances, such as when the problem size is large or the GPU is running multiple kernels simultaneously, has been fixed.
An issue caused by a change that broke binary backwards-compatibility has been fixed.

Binary backwards-compatibility was broken because the ordering of enums in the header prevented users from running an application that was compiled against an older cuDNN release with a newer cuDNN release. This issue affected only applications that use the block-scale datatypes within cuDNN.

Known Issues

For certain convolution-related workloads, memory allocations are made that are not released until process termination. This memory leak does not grow over time.
Runtime compilation of LayerNorm and RMSNorm execution plans might be protracted on compute capability 12.0 devices.
Barrier errors might be reported during forward training and inference for Layer or RMS Normalization when the library is tested with compute-sanitizer synccheck.
Scaled dot-product attention might lead to undefined behavior when the sequence length of key/value is 1.
Decode-phase scaled dot-product attention should usually be run without causal masking because the query sequence is naturally masked to not know future tokens. If it is run with causal masking turned on, and if the layer has Q head count != K/V head count, the result might be mismatches or NaNs. To avoid this issue, do not use causal masking, and provide actual sequence length with the help of a padding mask.
The FORT/FFMA backend does not support dynamic shapes for pooling.
For tensors in the CUDNN_TENSOR_NCHW_VECT_C layout, the serialized vectorized dim (vectDim) might be incorrect. If the serialized tensor is deserialized, it cannot be finalized. Therefore, avoid using serialization with tensors in the CUDNN_TENSOR_NCHW_VECT_C layout. A fix is planned for a future release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

cuDNN 9.10.0#

These are the NVIDIA cuDNN backend 9.10.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Key Features And Enhancements

The following features and enhancements have been added to this release:

Enhancements to Fused Flash Attention fprop:
- Performance has been improved for bottom right causal mask, causal mask with shifted diagonals, and sliding window mask.
- Support for arbitrary head dimensions that are multiples of 8 for FP16 or BF16 and multiples of 16 for FP8 has been added to fprop for all architectures.
- Numerical behavior is now more robust against NaNs.
- Page tables can now optionally be provided in a packed format through ragged tensors.
- Performance for any head dimension that is a multiple of 64 has been significantly improved for both fprop and bprop on Hopper-architecture GPUs. For DeepSeek layers in which the QK head dimension is 192 and the V head dimension is 128, performance can be up to 70% faster.

Fixed Issues

The following issues have been fixed in this release:

Operation nodes of type cudnnResampleMode_t with mode CUDNN_RESAMPLE_MAXPOOL now support serialization in forward compatibility mode.
An issue that caused some multi-GPU executions of Matrix Multiplication (for example, MXFP8 Matrix Multiplications) in which the same finalized execution plan is used across different GPUs to be functionally incorrect through the cublasLt engine path when run with the cublasLt library from NVIDIA CUDAToolkit 12.9 has been fixed.
Convolution engines that can support asymmetrical padding now correctly run. Some convolution engines might return CUDNN_NOT_SUPPORTED_PADDING if they cannot handle asymmetrical padding. Engines that worked in cuDNN releases earlier than 9.6.0 that now return CUDNN_NOT_SUPPORTED_PADDING did not properly support that feature and may not have converged.
A numerical issue in wgrad on Hopper-architecture GPUs, which occurred when convolution strides such as depth, width, or height exceeded 8, has been fixed.

Known Issues

For certain convolution-related workloads, memory allocations are made that are not released until process termination. This memory leak does not grow over time.
Runtime compilation of LayerNorm and RMSNorm execution plans might be protracted on compute capability 12.0 devices.
Barrier errors might be reported during forward training and inference for Layer or RMS Normalization when the library is tested with compute-sanitizer synccheck.
Scaled dot-product attention might lead to undefined behavior when the sequence length of key/value is 1.
Decode-phase scaled dot-product attention should usually be run without causal masking because the query sequence is naturally masked to not know future tokens. If it is run with causal masking turned on, and if the layer has Q head count != K/V head count, the result might be mismatches or NaNs. To avoid this issue, do not use causal masking, and provide actual sequence length with the help of a padding mask.
The FORT/FFMA backend does not support dynamic shapes for pooling.
For tensors in the CUDNN_TENSOR_NCHW_VECT_C layout, the serialized vectorized dim (vectDim) might be incorrect. If the serialized tensor is deserialized, it cannot be finalized. Therefore, avoid using serialization with tensors in the CUDNN_TENSOR_NCHW_VECT_C layout. A fix is planned for a future release.
With compute capability sm10.0 (Blackwell-architecture) GPUs, the FP8 datatype with scaled dot-product attention contains a deadlock that causes the kernel to hang under some circumstances, such as when the problem size is large or the GPU is running multiple kernels simultaneously. A fix is planned for a future release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

Deprecated and Removed Features

Side-by-side installation with previous versions of cuDNN is no longer supported.

cuDNN 9.9.0#

These are the NVIDIA cuDNN backend 9.9.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Announcements

As of this release, support for GPU architectures earlier than the NVIDIA Turing architecture (namely the NVIDIA Maxwell, NVIDIA Pascal, and NVIDIA Volta GPU architectures) is deprecated.

Key Features And Enhancements

The following features and enhancements have been added to this release:

Enhancements to Fused Flash Attention fprop:
- Scaled dot-product attention now supports an arbitrary head dimension on Blackwell-architecture GPUs for the prefill or forward-pass of training phase.
- True-FP32 fused flash attention support has been added for scientific computing use cases for all architectures with compute capability >= 80.
- Performance has been significantly improved for Paged Attention on Blackwell-architecture GPUs for both prefill and decode. For information about how to obtain the best performance, refer to Fused Flash Attention fprop in the Frontend Developer Guide.
- Performance has been significantly improved on Blackwell-architecture GPUs for graphs with attention score masking when the sequence length is not a multiple of 128.
Support for Adaptive Layer Normalization has been added. This feature can be enabled by using the CUDNN_ADA_LAYER_NORM enum value in cudnnBackendNormMode_t.
The performance of ConvolutionFwd and ConvolutionBwdData mainloop fusions on NVIDIA Blackwell-architecturel GPUs has been improved.
Performance for CUDA compute capability 12.0 and 10.0 on normalization and convolution operations has been improved.

Fixed Issues

The following issues have been fixed in this release:

Decode-phase scaled dot-product attention with group-query or multi-query mode now correctly rejects graphs when k_head % v_heads != 0 (or vice-versa, when v_heads is larger).
An issue that caused the cuDNN runtime engine to occasionally crash on NVIDIA Hopper-architecture and Blackwell-architecture GPUs has been fixed. The fix might cause some performance regressions on these GPUs. A fix for these performance regressions is planned for the next release.
Various s_q = 1 issues with decode-phase scaled dot-product attention have been fixed.

Known Issues

For certain convolution-related workloads, memory allocations are made that are not released until process termination. This memory leak does not grow over time.
Runtime compilation of LayerNorm and RMSNorm execution plans might be protracted on compute capability 12.0 devices.
Operation nodes of type cudnnResampleMode_t with mode CUDNN_RESAMPLE_MAXPOOL do not support serialization in forward compatibility mode. A fix is planned for a future release.
Barrier errors might be reported during forward training and inference for Layer or RMS Normalization when the library is tested with compute-sanitizer synccheck.
Scaled dot-product attention might lead to undefined behavior when the sequence length of key/value is 1.
Decode-phase scaled dot-product attention should usually be run without causal masking because the query sequence is naturally masked to not know future tokens. If it is run with causal masking turned on, and if the layer has Q head count != K/V head count, the result might be mismatches or NaNs. To avoid this issue, do not use causal masking, and provide actual sequence length with the help of a padding mask.
Some convolutions that previously supported asymmetrical padding, no longer do, even though in prior versions they worked correctly. A fix is planned for a future release.
Some multi-GPU executions of Matrix Multiplication (for example, MXFP8 Matrix Multiplications) in which the same finalized execution plan is used across different GPUs will be functionally incorrect through the cublasLt engine path when run with the cublasLt library from NVIDIA CUDAToolkit 12.9.

To avoid any functional regressions, use the cublasLt library from NVIDIA CUDA Toolkit 12.8 with the cuDNN 9.9.0 release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

cuDNN 9.8.0#

These are the NVIDIA cuDNN backend 9.8.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Key Features And Enhancements

The following features and enhancements have been added to this release:

Performance has been significantly improved for fused scaled dot-product attention inference decoder (single-query step) use cases for GPUs based on the NVIDIA Ampere, NVIDIA Ada Lovelace, NVIDIA Hopper, and NVIDIA Blackwell architectures.
Performance has been significantly improved for fused scaled dot-product attention fprop and bprop for use cases on Blackwell-architecture GPUs that have causal masking enabled and for which the overall problem size is relatively small.
It is no longer necessary to wrap the cuDNN static sublibraries with the --whole-archive and --no-whole-archive linker flags within the linker command. Furthermore, if cmake is used, it is no longer necessary to specify WHOLE_ARCHIVE with $<LINK_LIBRARY:WHOLE_ARCHIVE,cudnn_engines_runtime_compiled_static> but only to specify RESCAN with $<LINK_GROUP:RESCAN,cudnn_engines_runtime_compiled_static,cudnn_graph_static>.
Layer Normalization and RMS Normalization fusion support has been enhanced to support partial row broadcast when the number of row dimensions <= 3.
The cuDNN device property has been exposed as a new backend descriptor, which can be set to a cuDNN backend engine or engine heuristics descriptor directly. In addition, the concept of cuDNN backend operation graphs has been improved to be device independent.
Performance has been improved for Matmul with fused operations for GPUs based on the NVIDIA Ampere, NVIDIA Ada Lovelace, NVIDIA Hopper, and NVIDIA Blackwell architectures.
In a scaled dot-product attention forward pass, bias can now be broadcast along the sequence KV dimension. This support for broadcasting is available on both Hopper and Blackwell architectures.
Dynamic shape and kernel cache support for convolution and Matmul runtime fusion has been extended to the NVIDIA Hopper architecture.

Fixed Issues

An issue with the use of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 6 with narrow-precision formats supported only on Blackwell-architecture GPUs that could result in numerically incorrect output for Layer or RMS Normalization forwards inference workloads has been fixed.
An issue that caused the ConvolutionBwdData operation to result in a CUDNN_STATUS_EXECUTION_FAILED error when CUDNN_ATTR_ENGINE_GLOBAL_INDEX was set to 82 has been fixed.
A memory leak in the runtime fusion engines has been fixed.
An integer overflow issue in fused attention support that might cause illegal memory access when the input/output tensors are very large has been fixed.
A segmentation fault that occurs in 3D convolution dynamic shape when cuDNN backend logging is enabled has been fixed
The behavior difference in the dynamic shape feature between full cuDNN and the cuDNN JIT library configuration has been fixed.
An accuracy issue in Matmul runtime fusion dynamic shape has been fixed.
An issue where the fp16 compute type is accidentally allowed in fused attention support and can cause CUDA illegal instruction errors has been fixed.
An illegal instruction issue in forward and backward convolution with padding size exceeding 16 has been fixed.

Known Issues

For certain convolution-related workloads, memory allocations are made that are not released until process termination. This memory leak does not grow over time.
Runtime compilation of LayerNorm and RMSNorm execution plans might be protracted on compute capability 12.0 devices.
The CUDNN_ADA_LAYER_NORM enum value for cudnnBackendNormMode_t currently returns CUDNN_STATUS_NOT_SUPPORTED. A fix is planned for a future release.
Operation nodes of type cudnnResampleMode_t with mode CUDNN_RESAMPLE_MAXPOOL do not support serialization in forward compatibility mode. A fix is planned for a future release.
Barrier errors might be reported during forward training and inference for Layer or RMS Normalization when the library is tested with compute-sanitizer synccheck.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

Deprecated and Removed Features

The following features are deprecated in cuDNN 9.8.0:

The attribute CUDNN_ATTR_OPERATIONGRAPH_HANDLE has been deprecated in favor of the following attributes:
- CUDNN_ATTR_ENGINE_DEVICEPROP
- CUDNN_ATTR_ENGINEHEUR_DEVICEPROP
The attribute CUDNN_ATTR_EXECUTION_PLAN_HANDLE has been deprecated in favor of the CUDNN_ATTR_EXECUTION_PLAN_DEVICEPROP attribute.

cuDNN 9.7.1#

These are the NVIDIA cuDNN backend 9.7.1 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Fixed Issues

The following issues have been fixed in this release:

An issue that caused the ConvolutionBwdFilter operation to produce incorrect results if CUDNN_ATTR_ENGINE_GLOBAL_INDEX was set to 12, 13, 63, or 72 has been fixed.
An issue that caused Layer and RMS Normalization backward engines to produce inaccurate results when zero-centered gamma is enabled for problems with a hidden size of less than 32 has been fixed.
A concurrency bug that caused Matmul operations for block scale datatypes MXFP8 and NVFP4 through the cublasLt engine path to return incorrect results when an execution plan is executed in parallel by multiple threads has been fixed.
A host memory leak in the runtime fusion engine has been fixed.

Known Issues

For certain convolution-related workloads, memory allocations are made that are not released until process termination. This memory leak does not grow over time.
Runtime compilation of LayerNorm and RMSNorm execution plans might be protracted on compute capability 12.0 devices.
The CUDNN_ADA_LAYER_NORM enum value for cudnnBackendNormMode_t currently returns CUDNN_STATUS_NOT_SUPPORTED. A fix is planned for a future release.
Operation nodes of type cudnnResampleMode_t with mode CUDNN_RESAMPLE_MAXPOOL do not support serialization in forward compatibility mode. A fix is planned for a future release.
For the ConvolutionBwdData operation, setting CUDNN_ATTR_ENGINE_GLOBAL_INDEX to 82 might result in a CUDNN_STATUS_EXECUTION_FAILED error. A fix is planned for a future release.
The use of CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 6 with narrow-precision formats supported only on Blackwell-architecture GPUs might result in numerically incorrect output for Layer or RMS Normalization forwards inference workloads. A fix is planned for a future release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

cuDNN 9.7.0#

These are the NVIDIA cuDNN backend 9.7.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Announcements

This release of cuDNN introduces support for the CUDA compute capability 12.0 and 10.0.

Key Features And Enhancements

The following features and enhancements have been added to this release:

Support for CUDA compute capability 12.0 and 10.0 has been added.
Native scaled dot-product attention support with BF16, FP16, and FP8 input and output tensors has been added for compute capability 10.0 using the latest hardware features. The support is at parity with compute capability 9.0 except for backward propagation bias gradient support.
Native matmul and convolution fusion support with TF32, BF16, FP16, FP8, MXFP8, NVFP4 input and output tensors has been added for compute capability 10.0 and 12.0 using the latest hardware features. Mainloop fusion and epilog fusion support for compute capability 10.0 and 12.0 is at parity with compute capability 9.0.
Support for block scale input and output datatypes MXFP8 and NVFP4 in matmul through the cublasLt engine path has been added.
Layer Normalization and RMS Normalization fusion support has been enhanced to support the following prologue and epilogue activation pointwise fusions: Relu, Tanh, Sigmoid, Elu, Gelu, Gelu Approx Tanh, Softplus, Swish.
Block scale quantize and dequantize operations have been introduced to support microscaling data formats such as MXFP8 and NVFP4, particularly for matmul and layer normalization.
The cuDNN documentation has been extensively revised to include documentation for cuDNN frontend and to separate documentation for cuDNN backend from documentation from cuDNN frontend.

Fixed Issues

The following issues have been fixed in this release:

An issue that could cause cudnnFusedOpsExecute() on the Hopper GPU architecture, when used with tensors in the CUDNN_TENSOR_NCHW_VECT_C layout, to result in a CUDNN_STATUS_INTERNAL_ERROR has been fixed.
Instance norm engines are now supported in cuDNN’s GRAPH_JIT_ONLY config.
Documentation for earlier cuDNN 9 releases incorrectly stated that CUDA 11 is supported on ARMv8 (aarch64-jetson). The documentation has been updated to show that cuDNN 9 only supports CUDA 12 on ARMv8 (aarch64-jetson).

Known Issues

For certain convolution-related workloads, memory allocations are made that are not released until process termination. This memory leak does not grow over time.
Runtime compilation of LayerNorm and RMSNorm execution plans might be protracted on compute capability 12.0 devices.
The CUDNN_ADA_LAYER_NORM enum value for cudnnBackendNormMode_t currently returns CUDNN_STATUS_NOT_SUPPORTED. A fix is planned for a future release.
Layer and RMS Normalization backward engines might produce inaccurate results when zero-centered gamma is enabled for problems with a hidden size of less than 32. A fix is planned for a future release.
Operation nodes of type cudnnResampleMode_t with mode CUDNN_RESAMPLE_MAXPOOL do not support serialization in forward compatibility mode. A fix is planned for a future release.
For the ConvolutionBwdFilter operation, setting CUDNN_ATTR_ENGINE_GLOBAL_INDEX to 12, 13, 63, or 72 produces incorrect results. Users should use the cuDNN front-end errata filter to prevent usage of these engines. A fix is planned for the next release.
Matmul opertaions for block scale datatypes MXFP8 and NVFP4 through the cublasLt engine path have a concurrency bug when an execution plan is executed in parallel by multiple threads, leading to incorrect results. A fix is planned for the upcoming cuDNN 9.7.1 release.

Limitations

cuDNN does not strictly follow the Open Compute Project (OCP) specification for the microscaling data formats as enabled by the new block scale quantize and dequantize operations. In particular, to compute the scaling factor of a block, cuDNN divides the absolute maximum value by the largest representable value of the output type, instead of computing the quotient of the exponents. cuDNN then rounds up to obtain the scaling factor. This behavior is motivated by better observed numerical properties than the OCP’s suggested implementation.
On compute-capability 10.0 and 12.0 devices, the specialized ConvBNfprop pattern does not support patterns that were previously supported on 9.0 devices, namely:
- scale+bias+add+ReLU+Conv+GenStats (SBARCS)
- Dual scale+bias+add+ReLU+Conv+GenStats (DSBARCS)

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

Deprecated and Removed Features

The following features are deprecated in cuDNN 9.7.0:

Previously, in forward compatibility mode, users were able to specify engine configurations directly without querying cuDNN heuristics and the library would attempt to support the specified engine configuration. This feature has now been deprecated. Henceforth, users who want to use the forward compatibility feature must call heuristics in forward compatibility mode and use only the engine configurations recommended by heuristics.

cuDNN 9.6.0#

These are the NVIDIA cuDNN backend 9.6.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Announcements

For the best performance on convolution models, upgrade the CUDA driver version to 12.6 Update 2 or later.

Key Features And Enhancements

The following features and enhancements have been added to this release:

The compilation time of the runtime fusion engine is optimized on the NVIDIA Ampere, Ada, and Hopper GPU architectures.
In addition to the engines supported in cuDNN 9.5.0, the Native CUDA Graph API now supports all attention runtime fusion engines.
Support for scaled dot-product attention with sequence-packed layout has been enhanced and optimized as follows:
- The performance of scaled dot-product attention with sequence-packed layout has been improved for both forward and backward paths for all features, such as MQA and GQA.
- Packed workspace sizes with group-query attention using the following new frontend APIs are now supported:
sdpa_backward_attribute.set_max_total_seq_len_q

sdpa_backward_attribute.set_max_total_seq_len_kv

This feature enables users to achieve significantly reduced memory usage when using group-query attention.
- Ragged offsets represented as the as the CUDNN_DATA_INT64 datatype are now supported.
Layer Norm fusion support has been enhanced as follows:
- The following prologue/epilogue pointwise fusions are now supported: Abs, Ceil, Cos, Exp, Floor, Neg, Sin, Erf, Identity, Rsqrt, Log, Reciprocal, Tan, Add, Add Square, Max, Min, Mul, Sub, Div, Pow, Atan2, Mod.
- Any number of prologue/epilogue fusions is now supported, whereas previously only one prologue/epilogue fusion was supported.
- Row broadcast, column broadcast, and scalar broadcast in prologue/epilogue fusions are now supported.
Error status reporting and logging have been improved in forward compatibility mode.

Fixed Issues

This release of cuDNN provides a workaround for issues with fp8 matrix multiplications in the version of CUDA cuBLAS in CUDA 12.6 Update 2 (namely, CUDA cuBLAS 12.6.3.3). These issues caused cuBLAS to return fp8 matrix multiplication computation graphs as not supported.

Known Issues

cudnnFusedOpsExecute on the Hopper GPU architecture, when used with tensors in the CUDNN_TENSOR_NCHW_VECT_C layout, can result in a CUDNN_STATUS_INTERNAL_ERROR. A fix is planned for a future release.
Instance norm engines are not supported in cuDNN’s GRAPH_JIT_ONLY config. Support is planned for a future release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

cuDNN 9.5.1#

These are the NVIDIA cuDNN 9.5.1 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Fixed Issues

The attribute CUDNN_ATTR_TENSOR_IS_BY_VALUE for CUDNN_ATTR_OPERATION_NORM_FWD_EPSILON_DESC and CUDNN_ATTR_OPERATION_NORM_BWD_EPSILON_DESC must now be set to true.
The wgrad performance of depthwise convolution on A100 GPUs has been improved.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 2 for norm forward inference, norm forward training, and norm backward operations can fail with CUDNN_STATUS_ARCH_MISMATCH on GPU architectures before Ampere and Ada Lovelace. In cuDNN 9.5.0, the returned error was CUDNN_STATUS_EXECUTION_FAILED. A complete fix for this issue is planned for a future release.
A potential hang during the operation graph finalization phase has been fixed.
An issue that caused cuDNN to implicitly expect grad-output and output to have the same stride during Fused Flash Attention bprop has been fixed.

Known Issues

cudnnFusedOpsExecute on the Hopper GPU architecture, when used with tensors in the CUDNN_TENSOR_NCHW_VECT_C layout, can result in a CUDNN_STATUS_INTERNAL_ERROR. A fix is planned for a future release.
Instance norm engines are not supported in cuDNN’s GRAPH_JIT_ONLY config. Support is planned for a future release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

cuDNN 9.5.0#

These are the NVIDIA cuDNN 9.5.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Announcements

A new Native CUDA Graph API has been introduced. For select engines, this allows the direct construction of a CUDA graph (not to be confused with a cuDNN graph) representing an execution plan. This is a more flexible alternative to using CUDA graph capture, as it allows the updating of an existing CUDA graph with new variant pack pointers.

Key Features and Enhancements

The following features and enhancements have been added to this release:

Stream-K flash attention has been introduced to speed up the decoding phase of LLM inference, where the sequence length of a Q tensor equals 1 for both Hopper and Ampere GPUs, by 200% on average.
Support for Layer and RMS normalization in the GRAPH_JIT_ONLY config of cuDNN has been enhanced.
Performance of GEMM fusions and convolution weight gradient computation fusions with large gemm-K dimension is improved in the runtime fusion engines of cuDNN.
Support for fused flash attention with all d values less than or equal to 256 and multiples of 8 for bprop has been added in the cuDNN Runtime Fused Flash Attention Engine for Hopper GPUs.
The FP16 and BF16 scaled dot-product attention engine now supports the int64_t datatype for ragged offsets, which are used in packed sequence tensor layouts.
The FP16 and BF16 scaled dot-product attention engine now supports the addition of ragged offsets to softmax_stats tensors.
Performance for matrix multiplication with certain fusions through the cuBLASLt engine for all supported data types - FP16, BF16, E4M3 and E5M2 - has been enhanced. The newly added coverage for the engine includes epilogue fusion for dReLu and dGeLu with an optional dBias. These backward pass fusions consume the auxiliary tensor’s output by the corresponding forward pass compute primitives.
Pooling performance was upgraded significantly for pooling windows with a dimension greater than 8.

Fixed Issues

The following issues have been fixed in this release:

An issue in the runtime fusion engine that could cause the convolution fprop and convolution backward data computations to cause IMA with downcasted mixed input on Ampere, Ada Lovelace, and Hopper GPUs has been fixed.
An issue in the runtime fusion engine that could cause grouped convolution fprop and convolution backward data computations with channel broadcast mainloop fusions to get incorrect results on Hopper GPUs has been fixed.
An issue issue in the runtime fusion engine that could cause convolution fprop and convolution backward data computations with layer reduction fusion or the convolution backward weights with scalar or channel reduction fusion to get incorrect results on Ampere, Ada Lovelace, and Hopper GPUs has been fixed.
An issue in the runtime fusion engine that could cause the matmul, convolution fprop, convolution backward data, and convolution backward weights with column or instance reduction fusion to hang on Hopper GPUs has been fixed.
A regression to zero sequence length support for FP16 and BF16 scaled dot-product attention which caused an illegal instruction error has now been fixed.
For some ConvolutionBwdFilter depthwise convolutional workloads, cuDNN previously could generate racecheck hazard warnings. This error has been corrected.

Known Issues

Instance norm engines are not supported in cuDNN’s GRAPH_JIT_ONLY config. Support is planned for a future release.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 2 for norm forward infer, norm forward train and norm backward operations might fail with CUDNN_STATUS_EXECUTION_FAILED on GPU architectures before Ampere and Ada Lovelace. A fix is planned for a future release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

Limitations

The new Native CUDA Graph API is only supported for a limited number of engines, with support for more engines expected to be added in future releases. For the initial release of this feature, a selection of convolution, matmul and pointwise runtime fusion patterns are supported. Supported engines have the behavior note CUDNN_BEHAVIOR_NOTE_SUPPORTS_NATIVE_GRAPH_API. The Native CUDA graph API also requires a cuDNN compiled to CUDA runtime 12.x or above.
The dynamic shape feature requires CUDA runtime 12.x or above, because the underlying engine currently supporting this feature has this requirement.

cuDNN 9.4.0#

These are the NVIDIA cuDNN 9.4.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Key Features and Enhancements

The following features and enhancements have been added to this release:

Paged attention is now enabled by the newly introduced paged cache load operation CUDNN_BACKEND_OPERATION_PAGED_CACHE_LOAD_DESCRIPTOR, which is supported by Ampere and Hopper GPUs. This operation can be combined with the existing flash fprop attention kernel. Paged attention allows for more efficient memory usage by storing K/V caches in non-contiguous memory, and using page tables to reconstruct them. For more information, refer to the Developer Guide, CUDNN_BACKEND_OPERATION_PAGED_CACHE_LOAD_DESCRIPTOR in cudnn_graph Library, and the Paged Attention paper.
Improved performance for scaled dot product attention on Hopper. Speedup may vary from 5% to 30% depending on layer shape and input data type.
Function names of attention kernels have been enhanced with more details on instruction and kernel type. For example, cudnn_generated_fort_native_sdpa_sm80_flash_fprop_wmma_f16_knob_32_64x64x64_4x1x1_kernel0_0.
Performance improvements for scaled dot product attention with variable sequence length on Hopper. Kernel execution times for forward pass are now proportional to actual sequence lengths of query as opposed to maximum sequence lengths of query in earlier versions.
Dynamic shape with kernel cache is added as a new feature to significantly reduce execution plan finalizing time for use cases that have same-topology dynamic shape operation graph. This is done by binding the previously compiled applicable kernel to the execution plan instead of re-compiling a new one from scratch. Refer to CUDNN_ATTR_EXECUTION_PLAN_KERNEL_CACHE and CUDNN_ATTR_OPERATIONGRAPH_IS_DYNAMIC_SHAPE_ENABLED from the API Reference for more details.
Performance for matrix multiplication with certain fusions through the cuBLASLt engine for E4M3 and E5M2 data types has been enhanced. The newly added coverage for the engine includes:
- Support for GEMM alpha and beta scale factors.
- Epilogue fusion for Bias with ReLu and GeLu, and with the ReLu or GeLu auxiliary tensor output required for the backward pass.
- Fusion with Bias gradients for input matrices A or B.
Runtime fusion engine performance improvements for matrix multiplication with large K dimension on NVIDIA’s Ampere, Ada, and Hopper GPUs.
Added zero centered gamma support for layer norm and RMS norm.

Fixed Issues

The following issues have been fixed in this release:

Scaled dot-product attention (SDPA) numerics enhancements by using more accurate math in the softmax part of the kernel.
Graphs containing norm forward inference operation would fail to serialize in cuDNN 9.3. This has been fixed in cuDNN 9.4.

Known Issues

Generic runtime fusion engine support surface 70 does not support conv, dgrad, and wgrad filters with spatial dimensions larger than 32.
The FP16 and BF16 scaled dot-product attention (SDPA) engine with variable sequence length has a regression from cuDNN version 9.3.0 where enabling zero-sequence-length values results in an illegal instruction error.
For some ConvolutionBwdFilter depthwise convolutional workloads, cuDNN may generate racecheck hazard warnings. This issue exists in previous v9 releases and will be fixed in a future release.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

cuDNN 9.3.0#

These are the NVIDIA cuDNN 9.3.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Announcements

The cudnnBackendInitialize() function has been marked deprecated.

Key Features and Enhancements

The following features and enhancements have been added to this release:

The cuDNN v8 convolution API has been extended to support tensors with a batch size larger than 2 Giga-elements.
FP16 and BF16 scaled-dot-product attention (SDPA) with variable sequence length supports zero-sequence-length values. This feature enables the use of dynamic batch sizes without the need for recompilation. cuDNN performs a no-op for a batch when the query sequence length and the key/value sequence length are both zero.
Support for SM Carveout has been extended to the backend API for batch norm on NVIDIA Ampere and Hopper GPUs.
Error messages generated during retrieval of the CUDNN_ATTR_ENGINEHEUR_RESULTS attribute are accessible through the cudnnGetLastErrorString() function.
The forward compatibility of the library has been enhanced as follows:
- Batch Normalization and APIs related to Normalization in the Legacy API have been made forward compatible.
- RNN APIs have been made forward compatible.
- Fusion patterns involving grouped convolutions, that is convolutions with G>1, have been made forward compatible for convolution forward operations.
Performance for matrix multiplication with certain fusions through the cuBLASLt engine for FP16 and BF16 data types has been enhanced. The newly added coverage for the engine includes:
- Support for GEMM alpha and beta scale factors
- Epilogue fusion for Bias with ReLu and GeLu, and with the ReLu or GeLu auxiliary tensor output required for the backward pass
For detailed information about supported datatypes, refer to cublasLtMatmul() in the cuBLAS documentation.
The runtime fusion engine now supports tensors that are not fully packed for matmul and convolution on the NVIDIA’s Ampere, Ada Lovelace, and Hopper architectures.
Support for serialization and deserialization of execution plans to avoid recompilation of runtime-generated kernels has been extended to support all normalization backend engines.
Layer and RMS Normalization now support optional amax when outputting tensors with FP8 data types.

Fixed Issues

The following issues have been fixed in this release:

Scaled dot-product attention (SDPA) with ragged offsets, when executed multiple times, no longer exhibits undefined behavior. Earlier, successive runs of SDPA with ragged offsets could hang or cause illegal accesses.
SDPA with ragged offsets, when used with variable sequence lengths, no longer causes invalid memory access.
Additional fixes have been made to address a numerical issue in FP16 and BF16 SDPA fprop, where the kernel would generate unexpected outputs in the softmax stats tensor for large query and key tensor values. A partial fix was released in cuDNN 9.2.1, and the latest fix offers a more complete resolution.
The cuBLASLt library in NVIDIA CUDA Toolkit 12.6 fixes the bug with the e5m2 data type for FP8 matrix multiplication. Therefore, the cuBLASLt engine is reenabled for such cases for cuBLASLt from version 12.6 onwards.
The cudnnReduceTensor() function can now correctly fetch data from the input tensor when the input tensor has the same data type as the compute type and a format other than NCHW.
Running conv-bias-act fusions with CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 14 no longer generates incorrect results when the activation mode is CUDNN_ACTIVATION_IDENTITY, or there’s not an activation node in the computation graph.
In-place operation is now allowed for the cudnnSoftmaxForward() function.

Known Issues

There are known performance regressions on several convolution models with NCHW format on Orin cards compared to cuDNN 8.9.x. For better performance, switch to NHWC format.
With the cuDNN runtime fusion engine for GEMMs, output tensors with a packed-Boolean data type might return incorrect results when the batch size is greater than one.
Some graphs containing the norm forward operation in inference mode fail to serialize.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

Deprecated and Removed Features

The cudnnStatus_t cudnnBackendInitialize(cudnnBackendDescriptorType_t) function has been marked deprecated. This function is not used in cudnn-frontend either.

cuDNN 9.2.1#

These are the NVIDIA cuDNN 9.2.1 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Key Features and Enhancements

The following features and enhancements have been added to this release:

Enhanced heuristics for mixed input matmul runtime fusions with large gemm-k dimension.

Fixed Issues

The following issues have been fixed in this release:

Fixed a numerical issue in FP16 and BF16 fused flash attention fprop where the kernel would generate unexpected outputs in softmax stats tensor for large query and key tensor values.
Fixed a functional bug in FP8 fused flash attention fprop where the kernel gave the wrong results since version cuDNN 9.1.1.
When the product of the convolution N, C and K dimension exceeds the max value of int32_t, CUDNN_CONVOLUTION_BWD_FILTER_1X1_AS_GEMM_ENGINE engine could fail with out-of-bounds memory access. This issue has been fixed.

Known Issues

The ConvBNwgrad pre-compiled engine does not support Cooperative Group Array = 3*3, 1*9 and 9*1 on NVIDIA Hopper.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 67 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 51 for convolution forward with bias and activation (ConvBiasAct operation) may fail race-check testing when the library is tested under compute-sanitizer.
Running cuDNN with cuBlasLt prior to 11.3 could fail under graph capture mode.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 for Norm Backward operations with cudnnBackendNormMode_t set to CUDNN_RMS_NORM is not CUDA minor version compatible for toolkits 12.2 and 12.3. Users of this engine should install the updated driver that ships with the toolkit.
It is known that cudnnNanPropagation_t may not be respected in conv-bias-relu fusions.
There are caveats to the support offered by forward compatibility mode in this initial version. Refer to the cuDNN Developer Guide for more information.
For the ConvBiasAct operation, CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 39 may return CUDNN_STATUS_EXECUTION_FAILED when running on a system with CUDA Toolkit version 11.0.3 through 11.1. Upgrading the CUDA Toolkit version to 11.1 Update 1 or later should resolve this issue.
ZLIB version 1.2.13 is statically linked into the cuDNN Windows dynamic libraries. Changing to the static linkage of ZLIB for other platforms will be addressed in future cuDNN releases.
A race condition in memory write accesses was flagged by the compute-sanitizer tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. This issue happens with the CUDA Toolkit 12.2 but has been addressed in CUDA Toolkit 12.3.
cuDNN’s usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
Use of cudnnFusedOpsExecute() on Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
Some convolution models are experiencing lower performance on RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
FFT and Winograd based algorithms for convolution do not support graph capture.
There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
There is a known regression when running some convolutions with high group count. The issue is more severe on V100.
The support for FP8 matrix multiplication through the cuBLASLt library path is restricted only to e4m3 data types due to an existing bug in cuBLASLt available with the NVIDIA CUDA Toolkit 12.5. The support for e5m2 data type will be added in a future cuDNN version after the bug in cuBLASLt is fixed.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

Limitations

Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
The functional support criteria of cuDNN’s convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
Samples must be installed in a writable location, otherwise the samples can crash.
The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
Some computational options in cuDNN require increased alignment on tensors in order to run performantly. As always, cuDNN recommends users to align tensors to 16 byte boundaries which will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 which is 2^29.
cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsVersionCheck() or cudnnGraphVersionCheck()) to load the kernels in the sub-library prior to opening graph capture. cuDNN 9.0.0 APIs that poll for resource usage, such as requested workspace sizes, are not always compatible with CUDA graph-capture. Users that rely upon these APIs being CUDA graph-capture compatible, should first execute their workloads during a “warm up” run before attempting graph-capture.
Users of cuDNN need to add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
The spatial persistent batch normalization API is only available for Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
The fused attention and flash attention runtime engines have been disabled for NVRTC 11.8 due to compiler limitations.

cuDNN 9.2.0#

These are the NVIDIA cuDNN 9.2.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Announcements

Introducing the GRAPH_JIT_ONLY configuration of cuDNN, which includes all JIT engines while dropping engines based on precompiled kernels. This configuration dramatically reduces the required binary size. It supports NVIDIA Ampere and later GPUs, covering a large subset of the graph API. Refer to the cuDNN Library Configuration section of the cuDNN Developer Guide for more information.

Key Features and Enhancements

The following features and enhancements have been added to this release:

Support for activation functions like CUDNN_POINTWISE_TANH_FWD has been added to fused flash attention fprop and bprop on Hopper GPUs. This support has also been added in fused flash attention fprop on Ampere GPUs. For more information, refer to the cuDNN Developer Guide.
Support for matrix multiplication through the cuBLASLt library in cuDNN is extended to FP8 data types. Prior to this release, the support was restricted to FP16 data types. For cuDNN graphs with FP8 data types and associated quantization and dequantization scales, if the graph is supported by cuBLASLt, cuDNN heuristics will return the engine using the cuBLASLt library first in the list. This support through cuBLASLt doesn’t exist for the GRAPH_JIT_ONLY configuration.
Added returned cudnnStatus_t error codes logging to all cudnnBackend*() APIs in the INFO level logging.
Added logging of visualizable graph information on cudnnBackendFinalize() for CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR and CUDNN_BACKEND_ENGINECFG_DESCRIPTOR in INFO level logging.
You may now check whether a workload is supported by finalizing an EngineConfig; previously it was required to finalize an ExecutionPlan (with the EngineConfig) to determine support. Additionally, EngineConfigs returned by heuristics queries are now guaranteed to be executable.
In the runtime fusion engine:
- ConvolutionBwdFilter and ConvolutionBwdData with FP32 and FP8 input data type now support NHWC input tensor layout on NVIDIA Hopper.
- Optimized the performance for the ConvolutionFwd operation on use cases with narrow channel size on NVIDIA Ampere, NVIDIA Ada Lovelace, and Hopper.
- Relaxed the input tensor alignment requirement from 128-bit to 32-bit for all the Matmul and ConvolutionFwd use cases with mainloop fusion on Ampere and Ada.
- Relaxed the output alignment requirement down to 8-bit on Ampere, Ada, and Hopper.
Fused Batch Normalization + ReLU graph pattern now support specifying ReLU with lower and upper clips.
Layer and RMS Normalization now support FP8 data types for output tensors.

Fixed Issues

The following issues have been fixed in this release:

Layer and RMS Normalization now support 2D/3D tensors and attempt to infer normalizing dimensions from input tensor shapes. If inference is not possible, the engines default to normalizing over the last dimension for 2D/3D tensors and over all but the first dimension for 4D/5D tensors.
cudnnCTCLoss() now supports execution under graph capture mode.
Runtime engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) for Layer and RMS Normalization forward could compile slowly when dealing with large hidden sizes in previous releases. This slow compilation issue has been fixed in this release.

Known Issues

CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 67 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 51 for convolution forward with bias and activation (ConvBiasAct operation) may fail race-check testing when the library is tested under compute-sanitizer.
Running cuDNN with cuBlasLt prior to 11.3 could fail under graph capture mode.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 for Norm Backward operations with cudnnBackendNormMode_t set to CUDNN_RMS_NORM is not CUDA minor version compatible for toolkits 12.2 and 12.3. Users of this engine should install the updated driver that ships with the toolkit.
It is known that cudnnNanPropagation_t may not be respected in conv-bias-relu fusions.
There are caveats to the support offered by forward compatibility mode in this initial version. Refer to the cuDNN Developer Guide for more information.
For the ConvBiasAct operation, CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 39 may return CUDNN_STATUS_EXECUTION_FAILED when running on a system with CUDA Toolkit version 11.0.3 through 11.1. Upgrading the CUDA Toolkit version to 11.1 Update 1 or later should resolve this issue.
ZLIB version 1.2.13 is statically linked into the cuDNN Windows dynamic libraries. Changing to the static linkage of ZLIB for other platforms will be addressed in future cuDNN releases.
A race condition in memory write accesses was flagged by the compute-sanitizer tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. This issue happens with the CUDA Toolkit 12.2 but has been addressed in CUDA Toolkit 12.3.
cuDNN’s usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
Use of cudnnFusedOpsExecute() on Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
Some convolution models are experiencing lower performance on RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
FFT and Winograd based algorithms for convolution do not support graph capture.
There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
There is a known regression when running some convolutions with high group count. The issue is more severe on V100.
The support for FP8 matrix multiplication through the cuBLASLt library path is restricted only to e4m3 data types due to an existing bug in cuBLASLt available with the NVIDIA CUDA Toolkit 12.5. The support for e5m2 data type will be added in a future cuDNN version after the bug in cuBLASLt is fixed.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

Limitations

Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
The functional support criteria of cuDNN’s convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
Samples must be installed in a writable location, otherwise the samples can crash.
The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
Some computational options in cuDNN require increased alignment on tensors in order to run performantly. As always, cuDNN recommends users to align tensors to 16 byte boundaries which will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 which is 2^29.
cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsVersionCheck() or cudnnGraphVersionCheck()) to load the kernels in the sub-library prior to opening graph capture. cuDNN 9.0.0 APIs that poll for resource usage, such as requested workspace sizes, are not always compatible with CUDA graph-capture. Users that rely upon these APIs being CUDA graph-capture compatible, should first execute their workloads during a “warm up” run before attempting graph-capture.
Users of cuDNN need to add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
The spatial persistent batch normalization API is only available for Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
The fused attention and flash attention runtime engines have been disabled for NVRTC 11.8 due to compiler limitations.

Deprecated and Removed Features

The following features are removed in cuDNN 9.2.0:

cuDNN documentation prior to version 8.5.0 will be removed from the web.. Refer to the Documentation Archived topic for access to previously released documentation.

cuDNN 9.1.1#

These are the NVIDIA cuDNN 9.1.1 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Key Features and Enhancements

The following features and enhancements have been added to this release:

Expanded the support for FP8 fused flash attention fprop on NVIDIA Hopper GPUs to a hidden dimension of up to 256.

Fixed Issues

The following issues have been fixed in this release:

On NVIDIA Hopper and Ampere architectures, fused flash attention bprop would return an execution failed when sequence lengths greater than or equal to 512k were used. This issue has been fixed.
Fixed heuristics regression in version 9.1.0 with mixed input precision matmul/convolution use cases.
On NVIDIA Hopper architectures, fused flash attention bprop could lead to incorrect results in version 8.9.7; we missed documenting the issue in that version. This issue was fixed in version 9.0.0.
The scale and bias tensor descriptor checks for Layer and RMS Normalization have been updated to require the normalizing dimensions to be specified in the correct order. These checks were missing in cuDNN version 9.0 and prior versions.

Known Issues

CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 67 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 51 for convolution forward with bias and activation (ConvBiasAct operation) may fail race-check testing when the library is tested under compute-sanitizer.
Running cuDNN with cuBlasLt prior to 11.3 could fail under graph capture mode.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 for Norm Backward operations with cudnnBackendNormMode_t set to CUDNN_RMS_NORM is not CUDA minor version compatible for toolkit 12.2 and beyond. Users of this engine should install the updated driver that ships with the toolkit.
It is known that cudnnNanPropagation_t may not be respected in conv-bias-relu fusions.
There are caveats to the support offered by forward compatibility mode in this initial version. Refer to the cuDNN Developer Guide for more information.
For the ConvBiasAct operation, CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 39 may return CUDNN_STATUS_EXECUTION_FAILED when running on a system with CUDA Toolkit version 11.0.3 through 11.1. Upgrading the CUDA Toolkit version to 11.1 Update 1 or later should resolve this issue.
ZLIB version 1.2.13 is statically linked into the cuDNN Windows dynamic libraries. Changing to the static linkage of ZLIB for other platforms will be addressed in future cuDNN releases.
A race condition in memory write accesses was flagged by the compute-sanitizer tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. This issue happens with the CUDA Toolkit 12.2 but has been addressed in CUDA Toolkit 12.3.
cuDNN’s usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
Use of cudnnFusedOpsExecute() on Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
Some convolution models are experiencing lower performance on RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
FFT and Winograd based algorithms for convolution do not support graph capture.
There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

Limitations

Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
The functional support criteria of cuDNN’s convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
Samples must be installed in a writable location, otherwise the samples can crash.
RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multi-head attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.
The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
Some computational options in cuDNN require increased alignment on tensors in order to run performantly. As always, cuDNN recommends users to align tensors to 16 byte boundaries which will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
In prior versions of cuDNN, some convolution algorithms can use texture-based load instructure for performance improvements particularly in older hardware architectures. Users can opt-out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who wish to continue to use texture-based load, can adapt the new backend API and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 which is 2^29.
cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsVersionCheck() or cudnnGraphVersionCheck()) to load the kernels in the sub-library prior to opening graph capture. cuDNN 9.0.0 APIs that poll for resource usage, such as requested workspace sizes, are not always compatible with CUDA graph-capture. Users that rely upon these APIs being CUDA graph-capture compatible, should first execute their workloads during a “warm up” run before attempting graph-capture.
Users of cuDNN need to add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
The spatial persistent batch normalization API is only available for Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
The fused attention and flash attention runtime engines have been disabled for NVRTC 11.8 due to compiler limitations.

Deprecated and Removed Features

The following features are removed in cuDNN 9.1.1:

cuDNN documentation prior to version 8.5.0 will be removed from the web.. Refer to the Documentation Archived topic for access to previously released documentation.

cuDNN 9.1.0#

These are the NVIDIA cuDNN 9.1.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Announcements

Frame pointers have been enabled which allows for better runtime visibility and traceability, and allows for easier exchange of runtime information with NVIDIA when needed for debugging purposes. Refer to the cuDNN Developer Guide for more information on how to symbolize stack traces with obfuscated symbols from the cuDNN symbol server.

Key Features and Enhancements

The following features and enhancements have been added to this release:

cuDNN Fused Flash Attention support has been expanded to FP8 datatypes for NVIDIA Hopper GPUs. For more information, refer to the cuDNN Developer Guide.
cuDNN BF16 and FP16 Fused Flash Attention now supports embedding dim = 256 use cases in forward propagation.
Expanded support of FP16 and BF16 Fused Flash Attention by adding the sliding window attention feature on NVIDIA Ampere and Hopper GPUs. For more information, refer to the cuDNN Developer Guide.
Improved single op matmul (that is, unfused) performance with a new engine that can dispatch calls to cuBLASLt when it supports the given matmul parameters.
The cudnnGetRNNWeightParams() function was improved without changing its prototype. You can now make two separate calls to cudnnGetRNNWeightParams():
- one invocation to retrieve weight matrix shapes (bDesc=NULL, bAddr=NULL),
- one invocation to retrieve bias dimensions (mDesc=NULL, mAddr=NULL)
This calling pattern resembles two dedicated RNN APIs as in cuDNN 7.x: cudnnGetRNNLinLayerMatrixParams() and cudnnGetRNNLinLayerBiasParams(). The cudnnGetRNNWeightParams() function permits the weightSpace argument to be NULL. This way, you can retrieve weight/bias offsets instead of actual pointers within the buffer specified by the weightSpace address.
Elementwise affine operations (Scale, DScale, Bias, DBias) are now optional for both forward and backward passes in LayerNorm and RMSNorm.

Fixed Issues

The following issues have been fixed in this release:

On NVIDIA Hopper architectures, incorrect results were possible in Fused Flash Attention with packed layout when the embedding dimension per head for query and value were different. This issue has been fixed.
When using the cuDNN static library, you had to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. This restriction has been lifted for the 12.x build, which now allows you to use any minor version of CUDA Toolkit 12. The 11.x build still has this restriction, which is documented in the cuDNN Support Matrix.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights was observing a performance regression when moving from cuDNN 8.8 to cuDNN 8.9. The regression has been fixed in cuDNN 9.0.0.
Max pooling for INT8 returned -127 when the values in the windows were all equal to -128. This issue is now resolved as of this release.

Known Issues

Running cuDNN with cuBlasLt prior to 11.3 could fail under graph capture mode.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 for Norm Backward operations with cudnnBackendNormMode_t set to CUDNN_RMS_NORM is not CUDA minor version compatible for toolkit 12.2 and beyond. Users of this engine should install the updated driver that ships with the toolkit.
It is known that cudnnNanPropagation_t may not be respected in conv-bias-relu fusions.
There are caveats to the support offered by forward compatibility mode in this initial version. Refer to the cuDNN Developer Guide for more information.
For the ConvBiasAct operation, CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 39 may return CUDNN_STATUS_EXECUTION_FAILED when running on a system with CUDA Toolkit version 11.0.3 through 11.1. Upgrading the CUDA Toolkit version to 11.1 Update 1 or later should resolve this issue.
ZLIB version 1.2.13 is statically linked into the cuDNN Windows dynamic libraries. Changing to the static linkage of ZLIB for other platforms will be addressed in future cuDNN releases.
A race condition in memory write accesses was flagged by the compute-sanitizer tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. This issue happens with the CUDA Toolkit 12.2 but has been addressed in CUDA Toolkit 12.3.
cuDNN’s usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
Use of cudnnFusedOpsExecute() on Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
Some convolution models are experiencing lower performance on RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
FFT and Winograd based algorithms for convolution do not support graph capture.
There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

Limitations

Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
The functional support criteria of cuDNN’s convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
Samples must be installed in a writable location, otherwise the samples can crash.
RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multi-head attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.
The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
Some computational options in cuDNN require increased alignment on tensors in order to run performantly. As always, cuDNN recommends users to align tensors to 16 byte boundaries which will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
In prior versions of cuDNN, some convolution algorithms can use texture-based load instructure for performance improvements particularly in older hardware architectures. Users can opt-out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who wish to continue to use texture-based load, can adapt the new backend API and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 which is 2^29.
cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsVersionCheck() or cudnnGraphVersionCheck()) to load the kernels in the sub-library prior to opening graph capture. cuDNN 9.0.0 APIs that poll for resource usage, such as requested workspace sizes, are not always compatible with CUDA graph-capture. Users that rely upon these APIs being CUDA graph-capture compatible, should first execute their workloads during a “warm up” run before attempting graph-capture.
Users of cuDNN need to add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
The spatial persistent batch normalization API is only available for Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
The fused attention and flash attention runtime engines have been disabled for NVRTC 11.8 due to compiler limitations.

Deprecated and Removed Features

The following features are removed in cuDNN 9.1.0:

ppc64le is no longer supported. The last release to support ppc64le was 9.0.0.
RHEL7 is no longer supported. The last release to support RHEL7 was 9.0.0.
cuDNN documentation prior to version 8.5.0 will be removed in an upcoming release.

cuDNN 9.0.0#

These are the NVIDIA cuDNN 9.0.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.

Announcements

This is the first major version bump of cuDNN in almost 4 years. There are some exciting new features and also some changes that may be disruptive to current applications built against prior versions of cuDNN. This section provides more details.
The cuDNN library is reorganized into several sub-libraries, which, in a future cuDNN version, will allow for more flexibility in loading selected parts of the cuDNN library. For more information, refer to the API Overview.
For a list of added, deprecated, and removed APIs, refer to API Changes for cuDNN 9.0.0.
cuDNN no longer depends on the cuBLAS library; instead cuDNN now depends on the cuBLASLt library for certain primitive linear algebra operators.
The definition of CUDNN_VERSION has been changed to CUDNN_MAJOR * 10000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL from CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL. Refer to Version Checking Against CUDNN_VERSION in the cuDNN Developer Guide.
cuDNN now has RPM and Debian meta-packages available for easy installation.
sudo apt-get install -y cudnn
This command installs the latest available cuDNN for the latest available CUDA version. Refer to the cuDNN Installation Guide for more details.
Starting with cuDNN 9.0.0, an important subset of operation graphs are hardware forward compatible. cuDNN 9.0.0 and subsequent releases will work on all current and future GPU architectures subject to specific constraints as documented in the cuDNN Developer Guide.

Key Features and Enhancements

The following features and enhancements have been added to this release:

The cuDNN backend API uses less memory for many execution plans which should be beneficial for users who cache execution plans.
FP16 and BF16 fused flash attention engine performance has been significantly improved for NVIDIA GPUs:
- Speed-up of up to 50% over cuDNN 8.9.7 on Hopper GPUs.
- Speed-up of up to 100% over cuDNN 8.9.7 on Ampere GPUs.
Expanded support of FP16 and BF16 flash attention by adding the gradient for relative positional encoding on NVIDIA Ampere GPUs.
The fusion engine enables pointwise operations in the mainloop to be fused on both input A and B for matmul. The fused pointwise operation can be either a scalar, row, column broadcast, or a full tensor pointwise operation. Mixed precision is also supported for both input A and B. This new feature is only available for NVIDIA Hopper GPUs.
Updated the cuDNN Graph API execution plan serialization JSON schema to version 3.
Introduced more specific error codes and error categories (BAD_PARAM, NOT_SUPPORTED, INTERNAL_ERROR, EXECUTION_FAILED) which helps checking errors in these two levels of granularities. A macro CUDNN_ERROR_CATEGORY is introduced for extracting the error category from a specific error code.
Introduced nested logging levels, by setting CUDNN_LOGLEVEL_DBG, where the more severe levels are included by enabling the less severe levels. This better adheres to common practices and increases error reporting visibility. cuDNN version 8.x logging environment variables CUDNN_LOGERR_DBG, CUDNN_LOGWARN_DBG, and CUDNN_LOGINFO_DBG are deprecated and will continue to work during the cuDNN version 9.x grace period for compatibility.
Introduced cudnnGetLastErrorString API to fetch the latest error message.
The thread-safety of cuDNN is notably improved in this release. Concurrent execution of execution plans is now supported

Fixed Issues

On NVIDIA Ampere and Hopper architectures, invalid memory accesses were possible in variable sequence lengths and when the padded sequence length was not a multiple of 64. This issue has been fixed.
On NVIDIA Ampere and Hopper architectures, incorrect results were possible when the sequence length for query was less than 64. This issue has been fixed.
Fixed an accuracy issue in which FP32 input data is truncated instead of rounded into TF32 data on the NVIDIA Hopper fusion engine.
Previously on Linux, when cuDNN would load one of its other sub-libraries, it might attempt to load a mismatched version of cuDNN, possibly causing an application error. This issue has been fixed; it will look for the library file with complete version suffix first, and fall back to more generic version suffixes.
Fixed a serialization issue when a deserialized execution plan produced wrong results due to passing kernel parameters incorrectly.
For the ConvBNFusion operation, CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 2 had a race condition for some problem sets. This issue was fixed in cuDNN version 8.9.6.
The NaN propagation is guaranteed under the CUDNN_PROPAGATE_NAN mode.

Known Issues

Running cuDNN with cuBlasLt prior to 11.3 could fail under graph capture mode.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 for Norm Backward operations with cudnnBackendNormMode_t set to CUDNN_RMS_NORM is not CUDA minor version compatible for toolkit 12.2 and beyond. Users of this engine should install the updated driver that ships with the toolkit.
It is known that cudnnNanPropagation_t may not be respected in conv-bias-relu fusions.
There are caveats to the support offered by forward compatibility mode in this initial version. Refer to the cuDNN Developer Guide for more information.
For the ConvBiasAct operation, CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 39 may return CUDNN_STATUS_EXECUTION_FAILED when running on a system with CUDA Toolkit version 11.0.3 through 11.1. Upgrading the CUDA Toolkit version to 11.1 Update 1 or later should resolve this issue.
ZLIB version 1.2.13 is statically linked into the cuDNN Windows dynamic libraries. Changing to the static linkage of ZLIB for other platforms will be addressed in future cuDNN releases.
A race condition in memory write accesses was flagged by the “compute-sanitizer” tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. Extensive testing on H100, with different clock speeds and computational loads, did not reveal any impact on functional results that were always identical and correct. This issue is currently being investigated.
cuDNN’s usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
A compiler bug in NVRTC in CUDA version 11.7 and earlier, was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operates on boolean tensors.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
Use of cudnnFusedOpsExecute() on Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights may see a performance regression when moving from cuDNN 8.8 to cuDNN 8.9.
Some convolution models are experiencing lower performance on RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
The numeric behavior of INT8 operations including saturation behavior, accumulator data types, and so on, have not been documented as of yet.
There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
FFT and Winograd based algorithms for convolution do not support graph capture.
There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
There is a known regression when running some convolutions with high group count. The issue is more severe on V100.

Compatibility

For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.

Limitations

Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
When using the cuDNN static library, you will need to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. Refer to the cuDNN Support Matrix for the exact supported CUDA versions.
In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
The functional support criteria of cuDNN’s convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
Samples must be installed in a writable location, otherwise the samples can crash.
RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.

When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multi-head attention APIs, by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2.

The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, we have two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, it used the :16:8 non-adjustable configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.
The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
Some computational options in cuDNN require increased alignment on tensors in order to run performantly. As always, cuDNN recommends users to align tensors to 16 byte boundaries which will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
In prior versions of cuDNN, some convolution algorithms can use texture-based load instructure for performance improvements particularly in older hardware architectures. Users can opt-out of using texture using the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who wish to continue to use texture-based load, can adapt the new backend API and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 which is 2^29.
cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsVersionCheck() or cudnnGraphVersionCheck()) to load the kernels in the sub-library prior to opening graph capture. cuDNN 9.0.0 APIs that poll for resource usage, such as requested workspace sizes, are not always compatible with CUDA graph-capture. Users that rely upon these APIs being CUDA graph-capture compatible, should first execute their workloads during a “warm up” run before attempting graph-capture.
Users of cuDNN need to add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
The spatial persistent batch normalization API is only available for Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
The fused attention and flash attention runtime engines have been disabled for NVRTC 11.8 due to compiler limitations.

Deprecated and Removed Features

The following features are deprecated in cuDNN 9.0.0:

For a list of deprecated and removed APIs, refer to API Changes for cuDNN 9.0.0.