For previously released cuDNN documentation, refer to the cuDNN Documentation Archives.
These are the NVIDIA cuDNN 9.0.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.
This is the first major version bump of cuDNN in almost 4 years. There are some exciting new features and also some changes that may be disruptive to current applications built against prior versions of cuDNN. This section provides more details.
The cuDNN library is reorganized into several sub-libraries, which, in a future cuDNN version, will allow for more flexibility in loading selected parts of the cuDNN library. For more information, refer to the API Overview.
For a list of added, deprecated, and removed APIs, refer to API Changes for cuDNN 9.0.0.
cuDNN no longer depends on the cuBLAS library; instead cuDNN now depends on the cuBLASLt library for certain primitive linear algebra operators.
The definition of CUDNN_VERSION has been changed from CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL to CUDNN_MAJOR * 10000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL. Refer to Version Checking Against CUDNN_VERSION in the cuDNN Developer Guide.
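Under the new scheme, cuDNN 9.0.0 encodes as 90000 rather than 9000, so hard-coded version literals from 8.x-era checks are no longer valid. A minimal sketch of a scheme-consistent check; the macro values below are stand-ins for those normally provided by cudnn.h:

```c
/* Stand-ins for the macros normally provided by cudnn.h; the values
 * below assume cuDNN 9.0.0 purely for illustration. */
#define CUDNN_MAJOR 9
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 0

/* New 9.x encoding: major * 10000, so 9.0.0 encodes as 90000
 * (the old 8.x scheme used major * 1000, so 8.9.7 encoded as 8907). */
#define CUDNN_VERSION (CUDNN_MAJOR * 10000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

/* Compute the threshold with the same scheme instead of hard-coding a
 * literal, so the comparison stays correct under the new encoding. */
static int cudnn_at_least(int major, int minor, int patch) {
    return CUDNN_VERSION >= major * 10000 + minor * 100 + patch;
}
```

The helper name is illustrative; the point is that both sides of the comparison must use the 9.x encoding.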
cuDNN now has RPM and Debian meta-packages available for easy installation.
sudo apt-get install -y cudnn
This command installs the latest available cuDNN for the latest available CUDA version. Refer to the cuDNN Installation Guide for more details.
Starting with cuDNN 9.0.0, an important subset of operation graphs are hardware forward compatible. cuDNN 9.0.0 and subsequent releases will work on all current and future GPU architectures subject to specific constraints as documented in the cuDNN Developer Guide.
Key Features and Enhancements
The following features and enhancements have been added to this release:
The cuDNN backend API uses less memory for many execution plans, which should benefit users who cache execution plans.
FP16 and BF16 fused flash attention engine performance has been significantly improved for NVIDIA GPUs:
Speed-up of up to 50% over cuDNN 8.9.7 on Hopper GPUs.
Speed-up of up to 100% over cuDNN 8.9.7 on Ampere GPUs.
Expanded support of FP16 and BF16 flash attention by adding the gradient for relative positional encoding on NVIDIA Ampere GPUs.
The fusion engine enables pointwise operations in the mainloop to be fused on both inputs A and B for MatMul. The fused pointwise operation can be a scalar, a row or column broadcast, or a full-tensor pointwise operation. Mixed precision is also supported for both inputs A and B. This new feature is only available on NVIDIA Hopper GPUs.
Updated the cuDNN Graph API execution plan serialization JSON schema to version 3.
Introduced more specific error codes and error categories (for example, the EXECUTION_FAILED category), which help check errors at these two levels of granularity. A macro, CUDNN_ERROR_CATEGORY, is introduced for extracting the error category from a specific error code.
Introduced nested logging levels, set by CUDNN_LOGLEVEL_DBG, where the more severe levels are included by enabling the less severe levels. This better adheres to common practice and increases error-reporting visibility. The cuDNN 8.x logging environment variables (such as CUDNN_LOGINFO_DBG) are deprecated but will continue to work during the cuDNN 9.x grace period for compatibility.
Added the cudnnGetLastErrorString API to fetch the latest error message.
The thread-safety of cuDNN is notably improved in this release. Concurrent execution of execution plans is now supported.
On NVIDIA Ampere and Hopper architectures, invalid memory accesses were possible in variable sequence lengths and when the padded sequence length was not a multiple of 64. This issue has been fixed.
On NVIDIA Ampere and Hopper architectures, incorrect results were possible when the sequence length for query was less than 64. This issue has been fixed.
Fixed an accuracy issue in which FP32 input data is truncated instead of rounded into TF32 data on the NVIDIA Hopper fusion engine.
Previously on Linux, when cuDNN loaded one of its sub-libraries, it could attempt to load a mismatched version of cuDNN, possibly causing an application error. This issue has been fixed: cuDNN now looks for the library file with the complete version suffix first and falls back to more generic version suffixes.
Fixed a serialization issue when a deserialized execution plan produced wrong results due to passing kernel parameters incorrectly.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 2 had a race condition for some problem sets. This issue was fixed in cuDNN version 8.9.6.
The NaN propagation is guaranteed under the
Running cuDNN with a cuBLASLt version prior to 11.3 could fail under graph capture mode.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 for Norm Backward operations with CUDNN_RMS_NORM is not CUDA minor-version compatible for toolkit 12.2 and beyond. Users of this engine should install the updated driver that ships with the toolkit.
It is known that cudnnNanPropagation_t may not be respected in conv-bias-relu fusions.
There are caveats to the support offered by forward compatibility mode in this initial version. Refer to the cuDNN Developer Guide for more information.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 39 may return CUDNN_STATUS_EXECUTION_FAILED when running on a system with CUDA Toolkit version 11.0.3 through 11.1. Upgrading the CUDA Toolkit to 11.1 Update 1 or later should resolve this issue.
ZLIB version 1.2.13 is statically linked into the cuDNN Windows dynamic libraries. Changing to the static linkage of ZLIB for other platforms will be addressed in future cuDNN releases.
A race condition in memory write accesses was flagged by the “compute-sanitizer” tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. Extensive testing on H100, with different clock speeds and computational loads, did not reveal any impact on functional results that were always identical and correct. This issue is currently being investigated.
cuDNN’s usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
A compiler bug in NVRTC in CUDA version 11.7 and earlier was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround has been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, the known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_OR operate on boolean tensors.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights may see a performance regression when moving from cuDNN 8.8 to cuDNN 8.9.
Some convolution models are experiencing lower performance on RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor outside the extent recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
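Staying inside the recommended extent avoids the NaN pixels. A sketch of the conventional floor-division formula for one spatial dimension; this is an assumption here, standing in for what cudnnGetPooling2dForwardOutputDim() reports, not a quote of its implementation:

```c
/* Recommended output extent for one spatial dimension of a pooling
 * operation, using the conventional floor formula
 *   out = 1 + (in + 2*pad - window) / stride.
 * Output pixels at or beyond this index may read NaN under
 * CUDNN_POOLING_AVG, per the known issue above. */
static int pooling_out_dim(int in_dim, int window, int padding, int stride) {
    return 1 + (in_dim + 2 * padding - window) / stride;
}
```

For example, a 32-wide input with a 2-wide window and stride 2 yields an extent of 16; writes beyond index 15 are outside the recommended region.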
Convolution backward filter (ConvolutionBackwardFilter) may experience performance regressions when run with CUDNN_DATA_FLOAT data (input and output).
Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
The numeric behavior of INT8 operations, including saturation behavior and accumulator data types, has not yet been documented.
There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
FFT and Winograd based algorithms for convolution do not support graph capture.
There is a known regression when running some convolutions with filter size 1x1. The severity differs depending on the version of the CUDA Toolkit in use.
There is a known regression when running some convolutions with high group count. The issue is more severe on V100.
For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.
Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
When using the cuDNN static library, you will need to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. Refer to the cuDNN Support Matrix for the exact supported CUDA versions.
In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0, and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x toolkits.
The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
The functional support criteria of cuDNN's convolution kernels are not required to take padding into account. As a result, users can see an unexpected lack of problem support when the forward convolution spatial dimensions are smaller than the filter size, even when the nonzero padding is sufficient to extend the spatial dimensions to or beyond the filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
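As a concrete illustration, the standard convolution output-size arithmetic (a sketch; dilation omitted for brevity) shows how padding can make such a problem mathematically well-defined even though a kernel that ignores padding may still reject it:

```c
/* Forward convolution output extent for one spatial dimension, using the
 * standard formula out = 1 + (in + 2*pad - filter) / stride, with dilation
 * omitted. A negative span means the filter does not fit even with padding. */
static int conv_out_dim(int in_dim, int filter, int padding, int stride) {
    int span = in_dim + 2 * padding - filter;
    return span < 0 ? 0 : 1 + span / stride;
}
/* Example matching the note above: a 2x2 input with a 3x3 filter and
 * padding 1 has a valid output extent of 2, yet a kernel that only checks
 * in_dim >= filter can still report the problem as unsupported. */
```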
When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides overlap; however, there is no guarantee of deterministic results.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backward data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals or exceeds 2^31.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backward data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals or exceeds 2^31. This issue has been present in all previous releases of cuDNN, and exercising the use case for the engine would show incorrect results.
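A defensive pre-check against these limits might look like the following sketch; the 2^31 bound comes from the notes above, and the helper name is illustrative:

```c
#include <stdint.h>

/* Returns 1 when the product of the given output-gradient dimensions stays
 * below 2^31, i.e. the tensor is within the limits described above. The
 * product is accumulated in 64 bits so the check itself cannot overflow,
 * and it bails out early once the limit is reached. */
static int dims_product_below_2_31(const int64_t *dims, int ndims) {
    const int64_t limit = INT64_C(1) << 31;
    int64_t prod = 1;
    for (int i = 0; i < ndims; ++i) {
        prod *= dims[i];
        if (prod >= limit) return 0;
    }
    return 1;
}
```

Pass N, C, H, W (or N, H, W, depending on which engine's limit applies) as the dimension list.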
The runtime fusion engine is only supported in cuDNN builds based on CUDA Toolkit 11.2 update 1 or later; it also requires NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
Samples must be installed in a writable location, otherwise the samples can crash.
RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.
When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.
The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory, while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.
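The single-size configuration can also be set from within the application before the first cuBLAS or cuDNN call; a sketch using the POSIX setenv call (the choice of :4096:2 is just one of the two single-size options above, and the function name is illustrative):

```c
#include <stdlib.h>

/* Select a single cuBLAS workspace buffer size (two 4 MB buffers) so that
 * kernel selection, and therefore numerical results, stay deterministic in
 * multi-stream use. Must run before cuBLAS/cuDNN initialize; returns 0 on
 * success, matching setenv's convention. */
static int force_deterministic_cublas_workspace(void) {
    /* ":16:8" (eight 16 KB buffers) would work equally well. */
    return setenv("CUBLAS_WORKSPACE_CONFIG", ":4096:2", /*overwrite=*/1);
}
```

Setting the variable in the launch environment (for example, in a shell profile or job script) achieves the same effect without code changes.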
The tensor pointers and the filter pointers require at minimum 4-byte alignment, including for INT8 data, in the cuDNN library.
Some computational options in cuDNN require increased tensor alignment in order to run performantly. As always, cuDNN recommends aligning tensors to 16-byte boundaries, which is sufficient for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
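A quick way to verify, or produce, the recommended alignment, sketched with standard C11 facilities (the helper names are illustrative):

```c
#include <stdint.h>
#include <stdlib.h>

/* True when ptr meets the given power-of-two alignment. cuDNN's stated
 * minimum for tensor/filter pointers is 4 bytes; 16 bytes is the
 * recommended safe bound. */
static int is_aligned(const void *ptr, size_t alignment) {
    return ((uintptr_t)ptr % alignment) == 0;
}

/* Host-side illustration: aligned_alloc (C11) guarantees the requested
 * alignment, but requires the size to be a multiple of it, hence the
 * round-up. Buffers from cudaMalloc are already at least 256-byte aligned;
 * it is usually sub-buffer offsets that break alignment. */
static void *alloc_16B(size_t bytes) {
    return aligned_alloc(16, (bytes + 15) / 16 * 16);
}
```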
For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with an odd product of the dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and later, pad at least one of the dimensions to an even value.
In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, the W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
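The v8.0.x condition can be checked up front; a sketch following the notation above (W, H are the image extents, R, S the filter extents, and the pairing of R with dilationW and S with dilationH follows the note as written):

```c
/* cuDNN v7.6 accepted W >= (R-1)*dilationW && H >= (S-1)*dilationH for
 * INT8x32 Tensor Core cases; v8.0.x drops the equality cases, so a
 * v8-safe check uses strict inequality. Returns 1 when the shape stays
 * within the v8.0.x-supported region described above. */
static int int8x32_supported_v8(int W, int H, int R, int S,
                                int dilationW, int dilationH) {
    return W > (R - 1) * dilationW && H > (S - 1) * dilationH;
}
```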
In prior versions of cuDNN, some convolution algorithms could use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users could opt out of texture loading via the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed and texture loading is turned off by default. Users who wish to continue using texture-based loads can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912, which is 2^29.
CUDNN_STATUS_NOT_SUPPORTED is returned when the number of channels exceeds 1024.
When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsVersionCheck() or cudnnGraphVersionCheck()) to load the kernels in the sub-library prior to opening graph capture. cuDNN 9.0.0 APIs that poll for resource usage, such as requested workspace sizes, are not always compatible with CUDA graph-capture. Users that rely upon these APIs being CUDA graph-capture compatible, should first execute their workloads during a “warm up” run before attempting graph-capture.
Users of cuDNN need to add the cuBLAS dependencies explicitly to the linker command to resolve undefined symbols from the cuDNN static libraries.
Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
The spatial persistent batch normalization API is only available for Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as it results from CUDA backward compatibility checking whether a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not cause any functional issues.
The fused attention and flash attention runtime engines have been disabled for NVRTC 11.8 due to compiler limitations.
Deprecated and Removed Features
The following features are deprecated in cuDNN 9.0.0:
For a list of deprecated and removed APIs, refer to API Changes for cuDNN 9.0.0.