For previously released cuDNN documentation, refer to the cuDNN Documentation Archives.
These are the NVIDIA cuDNN 9.0.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.
This is the first major version bump of cuDNN in almost 4 years. There are some exciting new features and also some changes that may be disruptive to current applications built against prior versions of cuDNN. This section provides more details.
The cuDNN library is reorganized into several sub-libraries, which, in a future cuDNN version, will allow for more flexibility in loading selected parts of the cuDNN library. For more information, refer to the API Overview.
For a list of added, deprecated, and removed APIs, refer to API Changes for cuDNN 9.0.0.
cuDNN no longer depends on the cuBLAS library; instead cuDNN now depends on the cuBLASLt library for certain primitive linear algebra operators.
The definition of CUDNN_VERSION has been changed from CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL to CUDNN_MAJOR * 10000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL. Refer to Version Checking Against CUDNN_VERSION in the cuDNN Developer Guide.
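Under the new scheme, cuDNN 9.0.0 encodes as 90000 rather than 9000, so hard-coded version literals from 8.x-era checks are no longer valid. A minimal sketch of a scheme-consistent check; the macro values below are stand-ins for those normally provided by cudnn.h:

```c
/* Stand-ins for the macros normally provided by cudnn.h; the values
 * below assume cuDNN 9.0.0 purely for illustration. */
#define CUDNN_MAJOR 9
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 0

/* New 9.x encoding: major * 10000, so 9.0.0 encodes as 90000
 * (the old 8.x scheme used major * 1000, so 8.9.7 encoded as 8907). */
#define CUDNN_VERSION (CUDNN_MAJOR * 10000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

/* Compute the threshold with the same scheme instead of hard-coding a
 * literal, so the comparison stays correct under the new encoding. */
static int cudnn_at_least(int major, int minor, int patch) {
    return CUDNN_VERSION >= major * 10000 + minor * 100 + patch;
}
```

The helper name is illustrative; the point is that both sides of the comparison must use the 9.x encoding.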
cuDNN now has RPM and Debian meta-packages available for easy installation.
sudo apt-get install -y cudnn
This command installs the latest available cuDNN for the latest available CUDA version. Refer to the cuDNN Installation Guide for more details.
Starting with cuDNN 9.0.0, an important subset of operation graphs are hardware forward compatible. cuDNN 9.0.0 and subsequent releases will work on all current and future GPU architectures subject to specific constraints as documented in the cuDNN Developer Guide.
Key Features and Enhancements
The following features and enhancements have been added to this release:
The cuDNN backend API uses less memory for many execution plans, which should benefit users who cache execution plans.
FP16 and BF16 fused flash attention engine performance has been significantly improved for NVIDIA GPUs:
Speed-up of up to 50% over cuDNN 8.9.7 on Hopper GPUs.
Speed-up of up to 100% over cuDNN 8.9.7 on Ampere GPUs.
Expanded support of FP16 and BF16 flash attention by adding the gradient for relative positional encoding on NVIDIA Ampere GPUs.
The fusion engine enables pointwise operations in the mainloop to be fused on both inputs A and B for MatMul. The fused pointwise operation can be a scalar, a row or column broadcast, or a full-tensor pointwise operation. Mixed precision is also supported for both inputs A and B. This new feature is only available on NVIDIA Hopper GPUs.
Updated the cuDNN Graph API execution plan serialization JSON schema to version 3.
Introduced more specific error codes and error categories (for example, the EXECUTION_FAILED category), which help check errors at these two levels of granularity. A macro, CUDNN_ERROR_CATEGORY, is introduced for extracting the error category from a specific error code.
Introduced nested logging levels, set by CUDNN_LOGLEVEL_DBG, where the more severe levels are included by enabling the less severe levels. This better adheres to common practice and increases error-reporting visibility. The cuDNN 8.x logging environment variables (such as CUDNN_LOGINFO_DBG) are deprecated but will continue to work during the cuDNN 9.x grace period for compatibility.
Added the cudnnGetLastErrorString API to fetch the latest error message.
The thread-safety of cuDNN is notably improved in this release. Concurrent execution of execution plans is now supported.
On NVIDIA Ampere and Hopper architectures, invalid memory accesses were possible in variable sequence lengths and when the padded sequence length was not a multiple of 64. This issue has been fixed.
On NVIDIA Ampere and Hopper architectures, incorrect results were possible when the sequence length for query was less than 64. This issue has been fixed.
Fixed an accuracy issue in which FP32 input data is truncated instead of rounded into TF32 data on the NVIDIA Hopper fusion engine.
Previously on Linux, when cuDNN loaded one of its sub-libraries, it could attempt to load a mismatched version of cuDNN, possibly causing an application error. This issue has been fixed: cuDNN now looks for the library file with the complete version suffix first and falls back to more generic version suffixes.
Fixed a serialization issue when a deserialized execution plan produced wrong results due to passing kernel parameters incorrectly.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 2 had a race condition for some problem sets. This issue was fixed in cuDNN version 8.9.6.
The NaN propagation is guaranteed under the
Running cuDNN with a cuBLASLt version prior to 11.3 could fail under graph capture mode.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 for Norm Backward operations with CUDNN_RMS_NORM is not CUDA minor-version compatible for toolkit 12.2 and beyond. Users of this engine should install the updated driver that ships with the toolkit.
It is known that cudnnNanPropagation_t may not be respected in conv-bias-relu fusions.
There are caveats to the support offered by forward compatibility mode in this initial version. Refer to the cuDNN Developer Guide for more information.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 39 may return CUDNN_STATUS_EXECUTION_FAILED when running on a system with CUDA Toolkit version 11.0.3 through 11.1. Upgrading the CUDA Toolkit to 11.1 Update 1 or later should resolve this issue.
ZLIB version 1.2.13 is statically linked into the cuDNN Windows dynamic libraries. Changing to the static linkage of ZLIB for other platforms will be addressed in future cuDNN releases.
A race condition in memory write accesses was flagged by the “compute-sanitizer” tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. Extensive testing on H100, with different clock speeds and computational loads, did not reveal any impact on functional results that were always identical and correct. This issue is currently being investigated.
cuDNN’s usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
A compiler bug in NVRTC in CUDA version 11.7 and earlier was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, the known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_OR operate on boolean tensors.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights may see a performance regression when moving from cuDNN 8.8 to cuDNN 8.9.
Some convolution models are experiencing lower performance on RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor outside the extent recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
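Staying inside the recommended extent avoids the NaN pixels. A sketch of the conventional floor-division formula for one spatial dimension; this is an assumption here, standing in for what cudnnGetPooling2dForwardOutputDim() reports, not a quote of its implementation:

```c
/* Recommended output extent for one spatial dimension of a pooling
 * operation, using the conventional floor formula
 *   out = 1 + (in + 2*pad - window) / stride.
 * Output pixels at or beyond this index may read NaN under
 * CUDNN_POOLING_AVG, per the known issue above. */
static int pooling_out_dim(int in_dim, int window, int padding, int stride) {
    return 1 + (in_dim + 2 * padding - window) / stride;
}
```

For example, a 32-wide input with a 2-wide window and stride 2 yields an extent of 16; writes beyond index 15 are outside the recommended region.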
Convolution backward filter (ConvolutionBackwardFilter) may experience performance regressions when run with CUDNN_DATA_FLOAT data (input and output).
Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
The numeric behavior of INT8 operations, including saturation behavior and accumulator data types, has not yet been documented.
There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
FFT and Winograd based algorithms for convolution do not support graph capture.
There is a known regression when running some convolutions with filter size 1x1. The severity differs depending on the version of the CUDA Toolkit in use.
There is a known regression when running some convolutions with high group count. The issue is more severe on V100.
For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.
Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
When using the cuDNN static library, you will need to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. Refer to the cuDNN Support Matrix for the exact supported CUDA versions.
In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0, and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x toolkits.
The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
The functional support criteria of cuDNN's convolution kernels are not required to take padding into account. As a result, users can see an unexpected lack of problem support when the forward convolution spatial dimensions are smaller than the filter size, even when the nonzero padding is sufficient to extend the spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
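As a concrete illustration, the standard convolution output-size arithmetic (a sketch; dilation omitted for brevity) shows how padding can make such a problem mathematically well-defined even though a kernel that ignores padding may still reject it:

```c
/* Forward convolution output extent for one spatial dimension, using the
 * standard formula out = 1 + (in + 2*pad - filter) / stride, with dilation
 * omitted. A negative span means the filter does not fit even with padding. */
static int conv_out_dim(int in_dim, int filter, int padding, int stride) {
    int span = in_dim + 2 * padding - filter;
    return span < 0 ? 0 : 1 + span / stride;
}
/* Example matching the note above: a 2x2 input with a 3x3 filter and
 * padding 1 has a valid output extent of 2, yet a kernel that only checks
 * in_dim >= filter can still report the problem as unsupported. */
```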
When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides overlap; however, there is no guarantee of deterministic results.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backward data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals or exceeds 2^31.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backward data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals or exceeds 2^31. This issue has been present in all previous releases of cuDNN, and exercising the use case for the engine would show incorrect results.
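A defensive pre-check against these limits might look like the following sketch; the 2^31 bound comes from the notes above, and the helper name is illustrative:

```c
#include <stdint.h>

/* Returns 1 when the product of the given output-gradient dimensions stays
 * below 2^31, i.e. the tensor is within the limits described above. The
 * product is accumulated in 64 bits so the check itself cannot overflow,
 * and it bails out early once the limit is reached. */
static int dims_product_below_2_31(const int64_t *dims, int ndims) {
    const int64_t limit = INT64_C(1) << 31;
    int64_t prod = 1;
    for (int i = 0; i < ndims; ++i) {
        prod *= dims[i];
        if (prod >= limit) return 0;
    }
    return 1;
}
```

Pass N, C, H, W (or N, H, W, depending on which engine's limit applies) as the dimension list.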
The runtime fusion engine is only supported in cuDNN builds based on CUDA Toolkit 11.2 update 1 or later; it also requires NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
Samples must be installed in a writable location, otherwise the samples can crash.
RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.
When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.
The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory, while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.
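The single-size configuration can also be set from within the application before the first cuBLAS or cuDNN call; a sketch using the POSIX setenv call (the choice of :4096:2 is just one of the two single-size options above, and the function name is illustrative):

```c
#include <stdlib.h>

/* Select a single cuBLAS workspace buffer size (two 4 MB buffers) so that
 * kernel selection, and therefore numerical results, stay deterministic in
 * multi-stream use. Must run before cuBLAS/cuDNN initialize; returns 0 on
 * success, matching setenv's convention. */
static int force_deterministic_cublas_workspace(void) {
    /* ":16:8" (eight 16 KB buffers) would work equally well. */
    return setenv("CUBLAS_WORKSPACE_CONFIG", ":4096:2", /*overwrite=*/1);
}
```

Setting the variable in the launch environment (for example, in a shell profile or job script) achieves the same effect without code changes.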
The tensor pointers and the filter pointers require at minimum 4-byte alignment, including for INT8 data, in the cuDNN library.
Some computational options in cuDNN require increased tensor alignment in order to run performantly. As always, cuDNN recommends aligning tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
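A quick way to verify, or produce, the recommended alignment, sketched with standard C11 facilities (the helper names are illustrative):

```c
#include <stdint.h>
#include <stdlib.h>

/* True when ptr meets the given power-of-two alignment. cuDNN's stated
 * minimum for tensor/filter pointers is 4 bytes; 16 bytes is the
 * recommended safe bound. */
static int is_aligned(const void *ptr, size_t alignment) {
    return ((uintptr_t)ptr % alignment) == 0;
}

/* Host-side illustration: aligned_alloc (C11) guarantees the requested
 * alignment, but requires the size to be a multiple of it, hence the
 * round-up. Buffers from cudaMalloc are already at least 256-byte aligned;
 * it is usually sub-buffer offsets that break alignment. */
static void *alloc_16B(size_t bytes) {
    return aligned_alloc(16, (bytes + 15) / 16 * 16);
}
```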
For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with an odd product of the dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and later, pad at least one of the dimensions to an even value.
In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, the W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
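The v8.0.x condition can be checked up front; a sketch following the notation above (W, H are the image extents, R, S the filter extents, and the pairing of R with dilationW and S with dilationH follows the note as written):

```c
/* cuDNN v7.6 accepted W >= (R-1)*dilationW && H >= (S-1)*dilationH for
 * INT8x32 Tensor Core cases; v8.0.x drops the equality cases, so a
 * v8-safe check uses strict inequality. Returns 1 when the shape stays
 * within the v8.0.x-supported region described above. */
static int int8x32_supported_v8(int W, int H, int R, int S,
                                int dilationW, int dilationH) {
    return W > (R - 1) * dilationW && H > (S - 1) * dilationH;
}
```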
In prior versions of cuDNN, some convolution algorithms could use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users could opt out of texture loading via the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who wish to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912, which is 2^29.
CUDNN_STATUS_NOT_SUPPORTED is returned when the number of channels exceeds 1024.
When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsVersionCheck() or cudnnGraphVersionCheck()) to load the kernels in the sub-library prior to opening graph capture. cuDNN 9.0.0 APIs that poll for resource usage, such as requested workspace sizes, are not always compatible with CUDA graph-capture. Users that rely upon these APIs being CUDA graph-capture compatible, should first execute their workloads during a “warm up” run before attempting graph-capture.
Users of cuDNN need to add the cuBLAS dependencies explicitly to the linker command to resolve undefined symbols from the cuDNN static libraries.
Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
The spatial persistent batch normalization API is only available for Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as it results from CUDA backward compatibility checking whether a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not cause any functional issues.
The fused attention and flash attention runtime engines have been disabled for NVRTC 11.8 due to compiler limitations.
Deprecated and Removed Features
The following features are deprecated in cuDNN 9.0.0:
For a list of deprecated and removed APIs, refer to API Changes for cuDNN 9.0.0.