Release Notes
To review cuDNN documentation versions 8.5.0 - 8.6.7, refer to the cuDNN Documentation Archives.
To review cuDNN documentation 9.0.0 and more recent, choose a version from the bottom left navigation selector toggle.
cuDNN 9.4.0
These are the NVIDIA cuDNN 9.4.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.
Key Features and Enhancements
The following features and enhancements have been added to this release:
Paged attention is now enabled by the newly introduced paged_cache_load operation, which is supported by Ampere and Hopper GPUs. This operation can be combined with the existing flash fprop attention kernel. Paged attention allows for more efficient memory usage by storing K/V caches in non-contiguous memory, and using page tables to reconstruct them. For more information, refer to the Developer Guide, API Reference, and the Paged Attention paper.
Improved performance for scaled dot product attention on Hopper. Speedup may vary from 5% to 30% depending on layer shape and input data type.
Function names of attention kernels have been enhanced with more details on instruction and kernel type. For example, cudnn_generated_fort_native_sdpa_sm80_flash_fprop_wmma_f16_knob_32_64x64x64_4x1x1_kernel0_0.
Performance improvements for scaled dot product attention with variable sequence length on Hopper. Kernel execution times for forward pass are now proportional to actual sequence lengths of query as opposed to maximum sequence lengths of query in earlier versions.
Dynamic shape with kernel cache is added as a new feature to significantly reduce execution plan finalizing time for use cases that have same-topology dynamic shape operation graph. This is done by binding the previously compiled applicable kernel to the execution plan instead of re-compiling a new one from scratch. Refer to CUDNN_ATTR_EXECUTION_PLAN_KERNEL_CACHE and CUDNN_ATTR_OPERATIONGRAPH_IS_DYNAMIC_SHAPE_ENABLED from the API Reference for more details.
Performance for matrix multiplication with certain fusions through the cuBLASLt engine for E4M3 and E5M2 data types has been enhanced. The newly added coverage for the engine includes:
Support for GEMM alpha and beta scale factors.
Epilogue fusion for Bias with ReLu and GeLu, and with the ReLu or GeLu auxiliary tensor output required for the backward pass.
Fusion with Bias gradients for input matrices A or B.
Runtime fusion engine performance improvements for matrix multiplication with large K dimension on NVIDIA’s Ampere, Ada, and Hopper GPUs.
Added zero centered gamma support for layer norm and RMS norm.
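The page-table idea behind paged attention can be illustrated with a minimal pure-Python sketch. This is not cuDNN code: it assumes a hypothetical layout in which every page holds a fixed number of token rows and a per-sequence page table lists page indices in logical order. The names gather_paged_cache, PAGE_SIZE, and pool are illustrative only and do not correspond to the paged_cache_load API.

```python
# Conceptual sketch of rebuilding a K (or V) cache from non-contiguous pages.
# All names are hypothetical; this is not the cuDNN paged_cache_load API.

PAGE_SIZE = 2  # tokens per page (assumed for illustration)


def gather_paged_cache(pool, page_table, seq_len):
    """Return the first seq_len token rows of one sequence in logical order.

    pool       -- list of pages; each page is a list of PAGE_SIZE token rows
    page_table -- page indices into pool, in logical order for this sequence
    seq_len    -- number of valid tokens in the sequence
    """
    rows = []
    for token in range(seq_len):
        page = page_table[token // PAGE_SIZE]  # which physical page
        offset = token % PAGE_SIZE             # row inside that page
        rows.append(pool[page][offset])
    return rows


# Four physical pages; this sequence's pages are scattered (3, then 0, then 2).
pool = [
    [[20], [21]],  # page 0
    [[90], [91]],  # page 1 (belongs to another sequence)
    [[40], [41]],  # page 2
    [[10], [11]],  # page 3
]
k_cache = gather_paged_cache(pool, page_table=[3, 0, 2], seq_len=5)
```

Because only the page table has to be contiguous, a sequence can grow one page at a time without reserving memory for its maximum length up front.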
Fixed Issues
Scaled dot-product attention (SDPA) numerics have been enhanced by using more accurate math in the softmax part of the kernel.
Graphs containing norm forward inference operation would fail to serialize in cuDNN 9.3. This has been fixed in cuDNN 9.4.
Known Issues
Generic runtime fusion engine support surface 70 does not support conv, dgrad, and wgrad filters with spatial dimensions larger than 32.
The FP16 and BF16 scaled dot-product attention (SDPA) engine with variable sequence length has a regression from cuDNN version 9.3.0 where enabling zero-sequence-length values results in an illegal instruction error.
For some ConvolutionBwdFilter depthwise convolutional workloads, cuDNN may generate racecheck hazard warnings. This issue exists in previous v9 releases and will be fixed in a future release.
Compatibility
For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.
cuDNN 9.3.0
These are the NVIDIA cuDNN 9.3.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.
Announcements
The cudnnBackendInitialize() function has been marked deprecated.
Key Features and Enhancements
The following features and enhancements have been added to this release:
The cuDNN v8 convolution API has been extended to support tensors with a batch size larger than 2 Giga-elements.
FP16 and BF16 scaled-dot-product attention (SDPA) with variable sequence length supports zero-sequence-length values. This feature enables the use of dynamic batch sizes without the need for recompilation. cuDNN performs a no-op for a batch when the query sequence length and the key/value sequence length are both zero.
Support for SM Carveout has been extended to the backend API for batch norm on NVIDIA Ampere and Hopper GPUs.
Error messages generated during retrieval of the CUDNN_ATTR_ENGINEHEUR_RESULTS attribute are accessible through the cudnnGetLastErrorString() function.
The forward compatibility of the library has been enhanced as follows:
Batch Normalization and APIs related to Normalization in the Legacy API have been made forward compatible.
RNN APIs have been made forward compatible.
Fusion patterns involving grouped convolutions, that is convolutions with G>1, have been made forward compatible for convolution forward operations.
Performance for matrix multiplication with certain fusions through the cuBLASLt engine for FP16 and BF16 data types has been enhanced. The newly added coverage for the engine includes:
Support for GEMM alpha and beta scale factors
Epilogue fusion for Bias with ReLu and GeLu, and with the ReLu or GeLu auxiliary tensor output required for the backward pass
For detailed information about supported datatypes, refer to cublasLtMatmul() in the cuBLAS documentation.
The runtime fusion engine now supports tensors that are not fully packed for matmul and convolution on the NVIDIA Ampere, Ada Lovelace, and Hopper architectures.
Support for serialization and deserialization of execution plans to avoid recompilation of runtime-generated kernels has been extended to support all normalization backend engines.
Layer and RMS Normalization now support optional amax when outputting tensors with FP8 data types.
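As a rough illustration of what an amax output provides, the sketch below RMS-normalizes one row in pure Python and also returns the largest absolute output value. It is not the cuDNN implementation. The function name rms_norm_with_amax is hypothetical, and deriving a quantization scale from amax is an assumption about how a caller might use the value, not something these notes specify.

```python
import math

E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 format


def rms_norm_with_amax(x, gamma, eps=1e-5):
    """RMS-normalize one row and also report amax, the largest |output|."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    y = [v / rms * g for v, g in zip(x, gamma)]
    amax = max(abs(v) for v in y)
    return y, amax


y, amax = rms_norm_with_amax([3.0, -4.0], gamma=[1.0, 1.0], eps=0.0)

# Assumed usage: pick a scale so the scaled outputs stay within E4M3 range.
scale = E4M3_MAX / amax
```

Having the normalization report amax in the same pass avoids a separate reduction over the output tensor before quantizing.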
Fixed Issues
The following issues have been fixed in this release:
Scaled dot-product attention (SDPA) with ragged offsets, when executed multiple times, no longer exhibits undefined behavior. Earlier, successive runs of SDPA with ragged offsets could hang or cause illegal accesses.
SDPA with ragged offsets, when used with variable sequence lengths, no longer causes invalid memory access.
Additional fixes have been made to address a numerical issue in FP16 and BF16 SDPA fprop, where the kernel would generate unexpected outputs in the softmax stats tensor for large query and key tensor values. A partial fix was released in cuDNN 9.2.1, and the latest fix offers a more complete resolution.
The cuBLASLt library in NVIDIA CUDA Toolkit 12.6 fixes the bug with the e5m2 data type for FP8 matrix multiplication. Therefore, the cuBLASLt engine is reenabled for such cases for cuBLASLt from version 12.6 onwards.
The cudnnReduceTensor() function can now correctly fetch data from the input tensor when the input tensor has the same data type as the compute type and a format other than NCHW.
Running conv-bias-act fusions with CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 14 no longer generates incorrect results when the activation mode is CUDNN_ACTIVATION_IDENTITY, or there’s not an activation node in the computation graph.
In-place operation is now allowed for the cudnnSoftmaxForward() function.
Known Issues
There are known performance regressions on several convolution models with NCHW format on Orin cards compared to cuDNN 8.9.x. For better performance, switch to NHWC format.
With the cuDNN runtime fusion engine for GEMMs, output tensors with a packed-Boolean data type might return incorrect results when the batch size is greater than one.
Some graphs containing the norm forward operation in inference mode fail to serialize.
Compatibility
For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.
Deprecated and Removed Features
The cudnnStatus_t cudnnBackendInitialize(cudnnBackendDescriptorType_t) function has been marked deprecated. This function is not used in cudnn-frontend either.
cuDNN 9.2.1
These are the NVIDIA cuDNN 9.2.1 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.
Key Features and Enhancements
The following features and enhancements have been added to this release:
Enhanced heuristics for mixed input matmul runtime fusions with large gemm-k dimension.
Fixed Issues
The following issues have been fixed in this release:
Fixed a numerical issue in FP16 and BF16 fused flash attention fprop where the kernel would generate unexpected outputs in softmax stats tensor for large query and key tensor values.
Fixed a functional bug in FP8 fused flash attention fprop where the kernel gave the wrong results since version cuDNN 9.1.1.
When the product of the convolution N, C and K dimension exceeds the max value of int32_t, CUDNN_CONVOLUTION_BWD_FILTER_1X1_AS_GEMM_ENGINE engine could fail with out-of-bounds memory access. This issue has been fixed.
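The int32_t condition in the last item can be checked with simple arithmetic. The following is a small sketch, with hypothetical problem sizes and a hypothetical helper name exceeds_int32, for telling whether a product of dimensions falls in the range that was affected.

```python
INT32_MAX = 2**31 - 1  # maximum value of a signed 32-bit integer


def exceeds_int32(*dims):
    """Return True if the product of dims does not fit in a signed 32-bit int."""
    product = 1
    for d in dims:
        product *= d
    return product > INT32_MAX


# Hypothetical N, C, K sizes for a 1x1 backward-filter problem.
small = exceeds_int32(32, 256, 512)       # product is 4,194,304
large = exceeds_int32(4096, 1024, 1024)   # product is 4,294,967,296
```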
Known Issues
The ConvBNwgrad pre-compiled engine does not support Cooperative Group Array = 3*3, 1*9 and 9*1 on NVIDIA Hopper.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 67 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 51 for convolution forward with bias and activation (ConvBiasAct operation) may fail race-check testing when the library is tested under compute-sanitizer.
Running cuDNN with cuBLASLt prior to 11.3 could fail under graph capture mode.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 for Norm Backward operations with cudnnBackendNormMode_t set to CUDNN_RMS_NORM is not CUDA minor version compatible for toolkits 12.2 and 12.3. Users of this engine should install the updated driver that ships with the toolkit.
It is known that cudnnNanPropagation_t may not be respected in conv-bias-relu fusions.
There are caveats to the support offered by forward compatibility mode in this initial version. Refer to the cuDNN Developer Guide for more information.
For the ConvBiasAct operation, CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 39 may return CUDNN_STATUS_EXECUTION_FAILED when running on a system with CUDA Toolkit version 11.0.3 through 11.1. Upgrading the CUDA Toolkit version to 11.1 Update 1 or later should resolve this issue.
ZLIB version 1.2.13 is statically linked into the cuDNN Windows dynamic libraries. Changing to the static linkage of ZLIB for other platforms will be addressed in future cuDNN releases.
A race condition in memory write accesses was flagged by the compute-sanitizer tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. This issue happens with the CUDA Toolkit 12.2 but has been addressed in CUDA Toolkit 12.3.
cuDNN’s usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
A compiler bug in NVRTC in CUDA version 11.7 and earlier was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operate on boolean tensors.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
Use of cudnnFusedOpsExecute() on Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
Some convolution models are experiencing lower performance on RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not been documented yet.
There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
FFT and Winograd based algorithms for convolution do not support graph capture.
There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
There is a known regression when running some convolutions with high group count. The issue is more severe on V100.
The support for FP8 matrix multiplication through the cuBLASLt library path is restricted only to e4m3 data types due to an existing bug in cuBLASLt available with the NVIDIA CUDA Toolkit 12.5. The support for e5m2 data type will be added in a future cuDNN version after the bug in cuBLASLt is fixed.
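For the average-pooling item above, it can help to compute the recommended output extent before allocating the output tensor. The sketch below assumes the usual relation outputDim = 1 + (inputDim + 2*padding - windowDim) / stride with integer division. Treat that formula and the helper name pooling_output_dim as assumptions to be confirmed against the API Reference for cudnnGetPooling2dForwardOutputDim().

```python
def pooling_output_dim(input_dim, padding, window, stride):
    """Output extent along one spatial axis, using integer (floor) division."""
    return 1 + (input_dim + 2 * padding - window) // stride


# Hypothetical 7x8 input, 3x2 window, stride 2, padding 0 (rows) and 1 (cols).
h_out = pooling_output_dim(input_dim=7, padding=0, window=3, stride=2)
w_out = pooling_output_dim(input_dim=8, padding=1, window=2, stride=2)
```

Sizing the output tensor to exactly these extents avoids the out-of-range pixels that the known issue describes.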
Compatibility
For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.
Limitations
Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
The functional support criteria of cuDNN’s convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
Samples must be installed in a writable location, otherwise the samples can crash.
The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
Some computational options in cuDNN require increased alignment on tensors in order to run performantly. As always, cuDNN recommends users to align tensors to 16 byte boundaries which will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912 which is 2^29.
cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsVersionCheck() or cudnnGraphVersionCheck()) to load the kernels in the sub-library prior to opening graph capture. cuDNN 9.0.0 APIs that poll for resource usage, such as requested workspace sizes, are not always compatible with CUDA graph-capture. Users that rely upon these APIs being CUDA graph-capture compatible, should first execute their workloads during a “warm up” run before attempting graph-capture.
Users of cuDNN need to add the dependencies of cuBLAS to the linkers command explicitly to resolve the undefined symbols from cuDNN static libraries.
Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
The spatial persistent batch normalization API is only available for Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
The fused attention and flash attention runtime engines have been disabled for NVRTC 11.8 due to compiler limitations.
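Several of the limitations above reduce to size or alignment arithmetic that an application can check before choosing an engine. The following sketch restates three of them in Python. The helper names are hypothetical, and the functions only mirror the thresholds quoted in this list; they do not query cuDNN.

```python
LIMIT_2_31 = 2**31
LIMIT_2_29 = 2**29  # 536,870,912


def bwd_data_algo0_supported(n, c, h, w):
    """Engine index 25 (legacy BWD_DATA_ALGO_0): N*C*H*W of dy must be below 2^31."""
    return n * c * h * w < LIMIT_2_31


def bwd_data_algo1_supported(n, h, w):
    """Engine index 1 (legacy BWD_DATA_ALGO_1): N*H*W of dy must be below 2^31."""
    return n * h * w < LIMIT_2_31


def fwd_engine1_supported(c, h, w):
    """Backend forward engine index 1: C*H*W of the input must not exceed 2^29."""
    return c * h * w <= LIMIT_2_29


def is_aligned(address, boundary=16):
    """True if an address, given as an integer, sits on the byte boundary."""
    return address % boundary == 0
```

For example, is_aligned(ptr, 4) expresses the minimum 4-byte requirement, while the default 16-byte boundary matches the alignment recommended for any computational option.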
cuDNN 9.2.0
These are the NVIDIA cuDNN 9.2.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.
Announcements
Introducing the GRAPH_JIT_ONLY configuration of cuDNN, which includes all JIT engines while dropping engines based on precompiled kernels. This configuration dramatically reduces the required binary size. It supports NVIDIA Ampere and later GPUs, covering a large subset of the graph API. Refer to the cuDNN Library Configuration section of the cuDNN Developer Guide for more information.
Key Features and Enhancements
The following features and enhancements have been added to this release:
Support for activation functions like CUDNN_POINTWISE_TANH_FWD has been added to fused flash attention fprop and bprop on Hopper GPUs. This support has also been added in fused flash attention fprop on Ampere GPUs. For more information, refer to the cuDNN Developer Guide.
Support for matrix multiplication through the cuBLASLt library in cuDNN is extended to FP8 data types. Prior to this release, the support was restricted to FP16 data types. For cuDNN graphs with FP8 data types and associated quantization and dequantization scales, if the graph is supported by cuBLASLt, cuDNN heuristics will return the engine using the cuBLASLt library first in the list. This support through cuBLASLt doesn’t exist for the GRAPH_JIT_ONLY configuration.
Added returned cudnnStatus_t error codes logging to all cudnnBackend*() APIs in the INFO level logging.
Added logging of visualizable graph information on cudnnBackendFinalize() for CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR and CUDNN_BACKEND_ENGINECFG_DESCRIPTOR in INFO level logging.
You may now check whether a workload is supported by finalizing an EngineConfig; previously it was required to finalize an ExecutionPlan (with the EngineConfig) to determine support. Additionally, EngineConfigs returned by heuristics queries are now guaranteed to be executable.
In the runtime fusion engine:
ConvolutionBwdFilter and ConvolutionBwdData with FP32 and FP8 input data type now support NHWC input tensor layout on NVIDIA Hopper.
Optimized the performance for the ConvolutionFwd operation on use cases with narrow channel size on NVIDIA Ampere, NVIDIA Ada Lovelace, and Hopper.
Relaxed the input tensor alignment requirement from 128-bit to 32-bit for all the Matmul and ConvolutionFwd use cases with mainloop fusion on Ampere and Ada.
Relaxed the output alignment requirement down to 8-bit on Ampere, Ada, and Hopper.
The fused Batch Normalization + ReLU graph pattern now supports specifying ReLU with lower and upper clips.
Layer and RMS Normalization now support FP8 data types for output tensors.
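A ReLU with lower and upper clips, as mentioned for the fused Batch Normalization + ReLU pattern, amounts to clamping the normalized value to a range. The scalar sketch below is a plain-Python illustration under that assumption. The function name batch_norm_clipped_relu and its parameters are hypothetical, and it does not show how the clips are configured in a cuDNN graph.

```python
import math


def batch_norm_clipped_relu(x, mean, var, gamma, beta, lower, upper, eps=1e-5):
    """Normalize one value with given statistics, then clamp to [lower, upper]."""
    y = (x - mean) / math.sqrt(var + eps) * gamma + beta
    return min(max(y, lower), upper)


# With lower=0 and upper=6 the activation behaves like the common ReLU6.
params = dict(mean=0.0, var=1.0, gamma=1.0, beta=0.0, lower=0.0, upper=6.0, eps=0.0)
clipped_high = batch_norm_clipped_relu(10.0, **params)
clipped_low = batch_norm_clipped_relu(-3.0, **params)
passed_through = batch_norm_clipped_relu(2.5, **params)
```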
Fixed Issues
The following issues have been fixed in this release:
Layer and RMS Normalization now support 2D/3D tensors and attempt to infer normalizing dimensions from input tensor shapes. If inference is not possible, the engines default to normalizing over the last dimension for 2D/3D tensors and over all but the first dimension for 4D/5D tensors.
cudnnCTCLoss() now supports execution under graph capture mode.
Runtime engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) for Layer and RMS Normalization forward could compile slowly when dealing with large hidden sizes in previous releases. This slow compilation issue has been fixed in this release.
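The fallback rule for Layer and RMS Normalization dimensions described in the first item can be written down directly. The sketch below covers only the documented default, which applies when the normalizing dimensions cannot be inferred from the input shapes. The function name default_normalized_axes is hypothetical, and the inference step itself is not modeled.

```python
def default_normalized_axes(rank):
    """Axes normalized by default when they cannot be inferred from shapes."""
    if rank in (2, 3):
        return [rank - 1]            # last dimension only
    if rank in (4, 5):
        return list(range(1, rank))  # all but the first dimension
    raise ValueError(f"unsupported tensor rank: {rank}")


axes_3d = default_normalized_axes(3)
axes_4d = default_normalized_axes(4)
```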
Known Issues
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 67 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 51 for convolution forward with bias and activation (ConvBiasAct operation) may fail race-check testing when the library is tested under compute-sanitizer.
Running cuDNN with cuBLASLt prior to 11.3 could fail under graph capture mode.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 for Norm Backward operations with cudnnBackendNormMode_t set to CUDNN_RMS_NORM is not CUDA minor version compatible for toolkits 12.2 and 12.3. Users of this engine should install the updated driver that ships with the toolkit.
It is known that cudnnNanPropagation_t may not be respected in conv-bias-relu fusions.
There are caveats to the support offered by forward compatibility mode in this initial version. Refer to the cuDNN Developer Guide for more information.
For the ConvBiasAct operation, CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 39 may return CUDNN_STATUS_EXECUTION_FAILED when running on a system with CUDA Toolkit version 11.0.3 through 11.1. Upgrading the CUDA Toolkit version to 11.1 Update 1 or later should resolve this issue.
ZLIB version 1.2.13 is statically linked into the cuDNN Windows dynamic libraries. Changing to the static linkage of ZLIB for other platforms will be addressed in future cuDNN releases.
A race condition in memory write accesses was flagged by the compute-sanitizer tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. This issue happens with the CUDA Toolkit 12.2 but has been addressed in CUDA Toolkit 12.3.
cuDNN’s usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
A compiler bug in NVRTC in CUDA version 11.7 and earlier was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operate on boolean tensors.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
Use of cudnnFusedOpsExecute() on Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
Some convolution models are experiencing lower performance on RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixel in output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not been documented yet.
There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
FFT and Winograd based algorithms for convolution do not support graph capture.
There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
There is a known regression when running some convolutions with high group count. The issue is more severe on V100.
The support for FP8 matrix multiplication through the cuBLASLt library path is restricted only to e4m3 data types due to an existing bug in cuBLASLt available with the NVIDIA CUDA Toolkit 12.5. The support for e5m2 data type will be added in a future cuDNN version after the bug in cuBLASLt is fixed.
Compatibility
For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.
Limitations
Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
The functional support criteria of cuDNN’s convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals to or exceeds 2^31.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals to or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
Samples must be installed in a writable location, otherwise the samples can crash.
The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
Some computational options in cuDNN require increased alignment on tensors in order to run performantly. As always, cuDNN recommends users to align tensors to 16 byte boundaries which will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912, which is 2^29.
cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsVersionCheck() or cudnnGraphVersionCheck()) to load the kernels in the sub-library prior to opening graph capture. cuDNN 9.0.0 APIs that poll for resource usage, such as requested workspace sizes, are not always compatible with CUDA graph-capture. Users that rely upon these APIs being CUDA graph-capture compatible should first execute their workloads during a “warm up” run before attempting graph-capture.
Users of cuDNN need to add the cuBLAS dependencies to the linker command explicitly to resolve the undefined symbols from cuDNN static libraries.
Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
The spatial persistent batch normalization API is only available for Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
The fused attention and flash attention runtime engines have been disabled for NVRTC 11.8 due to compiler limitations.
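The tensor size limits listed above for specific engines can be checked before an operation graph is built. The following is a minimal sketch in Python, assuming the limits exactly as stated in these notes; the helper names (bwd_data_engine_supported, fwd_engine_1_supported) are hypothetical and are not part of the cuDNN API.

```python
# Hypothetical pre-flight helpers; the names are illustrative, not cuDNN APIs.

LIMIT_BWD_DATA = 2**31      # backward data engines 25 and 1: product must stay below 2^31
LIMIT_FWD_ENGINE_1 = 2**29  # forward engine 1: channels * height * width must not exceed 2^29


def bwd_data_engine_supported(engine_index, n, c, h, w):
    """Check an output gradient tensor shape against the documented limit."""
    if engine_index == 25:
        # Engine 25 is limited by the full product N*C*H*W.
        return n * c * h * w < LIMIT_BWD_DATA
    if engine_index == 1:
        # Engine 1 is limited by N*H*W only (C is not part of the product).
        return n * h * w < LIMIT_BWD_DATA
    raise ValueError(f"no size limit is listed here for engine {engine_index}")


def fwd_engine_1_supported(channels, height, width):
    """Check an input image shape for the convolution forward engine 1."""
    return channels * height * width <= LIMIT_FWD_ENGINE_1


if __name__ == "__main__":
    print(bwd_data_engine_supported(25, 2, 1024, 1024, 1024))
    print(fwd_engine_1_supported(512, 1024, 1024))
```

A check like this only mirrors the limits written in the notes; the status returned by the backend API remains the authoritative answer on whether an engine supports a problem.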
Deprecated and Removed Features
The following features are removed in cuDNN 9.2.0:
cuDNN documentation prior to version 8.5.0 will be removed from the web. Refer to the Documentation Archived topic for access to previously released documentation.
cuDNN 9.1.1
These are the NVIDIA cuDNN 9.1.1 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.
Key Features and Enhancements
The following features and enhancements have been added to this release:
Expanded the support for FP8 fused flash attention fprop on NVIDIA Hopper GPUs to a hidden dimension of up to 256.
Fixed Issues
The following issues have been fixed in this release:
On NVIDIA Hopper and Ampere architectures, fused flash attention bprop would return an execution failed status when sequence lengths greater than or equal to 512k were used. This issue has been fixed.
Fixed a heuristics regression in version 9.1.0 with mixed input precision matmul/convolution use cases.
On NVIDIA Hopper architectures, fused flash attention bprop could lead to incorrect results in version 8.9.7; we missed documenting the issue in that version. This issue was fixed in version 9.0.0.
The scale and bias tensor descriptor checks for Layer and RMS Normalization have been updated to require the normalizing dimensions to be specified in the correct order. These checks were missing in cuDNN version 9.0 and prior versions.
Known Issues
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 67 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 51 for convolution forward with bias and activation (ConvBiasAct operation) may fail race-check testing when the library is tested under compute-sanitizer.
Running cuDNN with cuBLASLt prior to 11.3 could fail under graph capture mode.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 for Norm Backward operations with cudnnBackendNormMode_t set to CUDNN_RMS_NORM is not CUDA minor version compatible for toolkit 12.2 and beyond. Users of this engine should install the updated driver that ships with the toolkit.
It is known that cudnnNanPropagation_t may not be respected in conv-bias-relu fusions.
There are caveats to the support offered by forward compatibility mode in this initial version. Refer to the cuDNN Developer Guide for more information.
For the ConvBiasAct operation, CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 39 may return CUDNN_STATUS_EXECUTION_FAILED when running on a system with CUDA Toolkit version 11.0.3 through 11.1. Upgrading the CUDA Toolkit version to 11.1 Update 1 or later should resolve this issue.
ZLIB version 1.2.13 is statically linked into the cuDNN Windows dynamic libraries. Changing to the static linkage of ZLIB for other platforms will be addressed in future cuDNN releases.
A race condition in memory write accesses was flagged by the compute-sanitizer tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. This issue happens with the CUDA Toolkit 12.2 but has been addressed in CUDA Toolkit 12.3.
cuDNN’s usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
A compiler bug in NVRTC in CUDA version 11.7 and earlier was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operate on boolean tensors.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
Use of cudnnFusedOpsExecute() on Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
Some convolution models are experiencing lower performance on RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not been documented yet.
There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
FFT and Winograd based algorithms for convolution do not support graph capture.
There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
There is a known regression when running some convolutions with high group count. The issue is more severe on V100.
Compatibility
For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.
Limitations
Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
The functional support criteria of cuDNN’s convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals or exceeds 2^31.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals or exceeds 2^31. This issue has been present in all previous releases of cuDNN, and exercising this use case for the engine would show incorrect results.
The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
Samples must be installed in a writable location; otherwise, the samples can crash.
RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.
When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2. The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, there are two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.
The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
Some computational options in cuDNN require increased alignment on tensors in order to run performantly. As always, cuDNN recommends users to align tensors to 16 byte boundaries which will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
In prior versions of cuDNN, some convolution algorithms can use texture-based load instructions for performance improvements, particularly in older hardware architectures. Users can opt out of using texture with the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who wish to continue to use texture-based load can adopt the new backend API and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912, which is 2^29.
cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsVersionCheck() or cudnnGraphVersionCheck()) to load the kernels in the sub-library prior to opening graph capture. cuDNN 9.0.0 APIs that poll for resource usage, such as requested workspace sizes, are not always compatible with CUDA graph-capture. Users that rely upon these APIs being CUDA graph-capture compatible should first execute their workloads during a “warm up” run before attempting graph-capture.
Users of cuDNN need to add the cuBLAS dependencies to the linker command explicitly to resolve the undefined symbols from cuDNN static libraries.
Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
The spatial persistent batch normalization API is only available for Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
The fused attention and flash attention runtime engines have been disabled for NVRTC 11.8 due to compiler limitations.
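The CUBLAS_WORKSPACE_CONFIG format described in the limitations above (:SIZE:COUNT pairs, with sizes in KB) can be validated before a workload starts. The following is a minimal sketch in Python; the helper names are hypothetical, and it assumes the variable is set in the process environment before cuBLAS is initialized.

```python
import os


def parse_workspace_config(value):
    """Parse ':SIZE_KB:COUNT[:SIZE_KB:COUNT...]' into (size_kb, count) pairs."""
    fields = value.split(":")
    # A valid string starts with ':' and carries an even number of numeric fields.
    if len(fields) < 3 or fields[0] != "" or len(fields) % 2 == 0:
        raise ValueError(f"malformed workspace config: {value!r}")
    numbers = [int(field) for field in fields[1:]]
    return list(zip(numbers[0::2], numbers[1::2]))


def is_single_buffer_size(value):
    """True when every buffer in the configuration has the same size."""
    sizes = {size_kb for size_kb, _count in parse_workspace_config(value)}
    return len(sizes) == 1


def use_single_size_workspace(config=":4096:2"):
    """Export a single-size configuration (hypothetical helper)."""
    if not is_single_buffer_size(config):
        raise ValueError("configuration mixes buffer sizes")
    # Assumption: this runs before any library in the process creates a cuBLAS handle.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = config


if __name__ == "__main__":
    print(parse_workspace_config(":16:8:4096:2"))  # the default in cuBLAS 10.2 and 11.0
    use_single_size_workspace(":4096:2")
```

The default :16:8:4096:2 mixes two buffer sizes, so is_single_buffer_size rejects it, while :16:8 and :4096:2 each pass.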
Deprecated and Removed Features
The following features are removed in cuDNN 9.1.1:
cuDNN documentation prior to version 8.5.0 will be removed from the web. Refer to the Documentation Archived topic for access to previously released documentation.
cuDNN 9.1.0
These are the NVIDIA cuDNN 9.1.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.
Announcements
Frame pointers have been enabled which allows for better runtime visibility and traceability, and allows for easier exchange of runtime information with NVIDIA when needed for debugging purposes. Refer to the cuDNN Developer Guide for more information on how to symbolize stack traces with obfuscated symbols from the cuDNN symbol server.
Key Features and Enhancements
The following features and enhancements have been added to this release:
cuDNN Fused Flash Attention support has been expanded to FP8 datatypes for NVIDIA Hopper GPUs. For more information, refer to the cuDNN Developer Guide.
cuDNN BF16 and FP16 Fused Flash Attention now supports embedding dim = 256 use cases in forward propagation.
Expanded support of FP16 and BF16 Fused Flash Attention by adding the sliding window attention feature on NVIDIA Ampere and Hopper GPUs. For more information, refer to the cuDNN Developer Guide.
Improved single op matmul (that is, unfused) performance with a new engine that can dispatch calls to cuBLASLt when it supports the given matmul parameters.
The cudnnGetRNNWeightParams() function was improved without changing its prototype. You can now make two separate calls to cudnnGetRNNWeightParams():
one invocation to retrieve weight matrix shapes (bDesc=NULL, bAddr=NULL),
one invocation to retrieve bias dimensions (mDesc=NULL, mAddr=NULL)
This calling pattern resembles two dedicated RNN APIs as in cuDNN 7.x: cudnnGetRNNLinLayerMatrixParams() and cudnnGetRNNLinLayerBiasParams(). The cudnnGetRNNWeightParams() function permits the weightSpace argument to be NULL. This way, you can retrieve weight/bias offsets instead of actual pointers within the buffer specified by the weightSpace address.
Elementwise affine operations (Scale, DScale, Bias, DBias) are now optional for both forward and backward passes in LayerNorm and RMSNorm.
Fixed Issues
The following issues have been fixed in this release:
On NVIDIA Hopper architectures, incorrect results were possible in Fused Flash Attention with packed layout when the embedding dimension per head for query and value were different. This issue has been fixed.
When using the cuDNN static library, you had to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. This restriction has been lifted for the 12.x build, which now allows you to use any minor version of CUDA Toolkit 12. The 11.x build still has this restriction, which is documented in the cuDNN Support Matrix.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights showed a performance regression when moving from cuDNN 8.8 to cuDNN 8.9. The regression has been fixed in cuDNN 9.0.0.
Max pooling for INT8 returned -127 when the values in the windows were all equal to -128. This issue is now resolved as of this release.
Known Issues
Running cuDNN with cuBLASLt prior to 11.3 could fail under graph capture mode.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 for Norm Backward operations with cudnnBackendNormMode_t set to CUDNN_RMS_NORM is not CUDA minor version compatible for toolkit 12.2 and beyond. Users of this engine should install the updated driver that ships with the toolkit.
It is known that cudnnNanPropagation_t may not be respected in conv-bias-relu fusions.
There are caveats to the support offered by forward compatibility mode in this initial version. Refer to the cuDNN Developer Guide for more information.
For the ConvBiasAct operation, CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 39 may return CUDNN_STATUS_EXECUTION_FAILED when running on a system with CUDA Toolkit version 11.0.3 through 11.1. Upgrading the CUDA Toolkit version to 11.1 Update 1 or later should resolve this issue.
ZLIB version 1.2.13 is statically linked into the cuDNN Windows dynamic libraries. Changing to the static linkage of ZLIB for other platforms will be addressed in future cuDNN releases.
A race condition in memory write accesses was flagged by the compute-sanitizer tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. This issue happens with the CUDA Toolkit 12.2 but has been addressed in CUDA Toolkit 12.3.
cuDNN’s usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
A compiler bug in NVRTC in CUDA version 11.7 and earlier was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operate on boolean tensors.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
Use of cudnnFusedOpsExecute() on Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
Some convolution models are experiencing lower performance on RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not been documented yet.
There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
FFT and Winograd based algorithms for convolution do not support graph capture.
There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
There is a known regression when running some convolutions with high group count. The issue is more severe on V100.
Compatibility
For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.
Limitations
Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
The functional support criteria of cuDNN’s convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping, however, there is no guarantee of deterministic results.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals or exceeds 2^31.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals or exceeds 2^31. This issue has been present in all previous releases of cuDNN, and exercising this use case for the engine would show incorrect results.
The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
Samples must be installed in a writable location; otherwise, the samples can crash.
RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.
When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environmental variable, for example, :16:8 or :4096:2. The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, there are two buffer sizes. Earlier cuBLAS libraries, such as cuBLAS 10.0, used the non-adjustable :16:8 configuration. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.
The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
Some computational options in cuDNN require increased alignment on tensors in order to run performantly. As always, cuDNN recommends users to align tensors to 16 byte boundaries which will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
In prior versions of cuDNN, some convolution algorithms can use texture-based load instructions for performance improvements, particularly in older hardware architectures. Users can opt out of using texture with the environmental variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who wish to continue to use texture-based load can adopt the new backend API and toggle the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
In the backend API, convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912, which is 2^29.
cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsVersionCheck() or cudnnGraphVersionCheck()) to load the kernels in the sub-library prior to opening graph capture. cuDNN 9.0.0 APIs that poll for resource usage, such as requested workspace sizes, are not always compatible with CUDA graph-capture. Users that rely upon these APIs being CUDA graph-capture compatible should first execute their workloads during a “warm up” run before attempting graph-capture.
Users of cuDNN need to add the cuBLAS dependencies to the linker command explicitly to resolve the undefined symbols from cuDNN static libraries.
Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
The spatial persistent batch normalization API is only available for Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
The fused attention and flash attention runtime engines have been disabled for NVRTC 11.8 due to compiler limitations.
Deprecated and Removed Features
The following features are removed in cuDNN 9.1.0:
ppc64le is no longer supported. The last release to support ppc64le was 9.0.0.
RHEL7 is no longer supported. The last release to support RHEL7 was 9.0.0.
cuDNN documentation prior to version 8.5.0 will be removed in an upcoming release.
cuDNN 9.0.0
These are the NVIDIA cuDNN 9.0.0 Release Notes. These Release Notes include fixes from the previous cuDNN releases as well as the following additional changes.
Announcements
This is the first major version bump of cuDNN in almost 4 years. There are some exciting new features and also some changes that may be disruptive to current applications built against prior versions of cuDNN. This section provides more details.
The cuDNN library is reorganized into several sub-libraries, which, in a future cuDNN version, will allow for more flexibility in loading selected parts of the cuDNN library. For more information, refer to the API Overview.
For a list of added, deprecated, and removed APIs, refer to API Changes for cuDNN 9.0.0.
cuDNN no longer depends on the cuBLAS library; instead cuDNN now depends on the cuBLASLt library for certain primitive linear algebra operators.
The definition of CUDNN_VERSION has been changed to CUDNN_MAJOR * 10000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL from CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL. Refer to Version Checking Against CUDNN_VERSION in the cuDNN Developer Guide.
cuDNN now has RPM and Debian meta-packages available for easy installation.
sudo apt-get install -y cudnn
This command installs the latest available cuDNN for the latest available CUDA version. Refer to the cuDNN Installation Guide for more details.
Starting with cuDNN 9.0.0, an important subset of operation graphs are hardware forward compatible. cuDNN 9.0.0 and subsequent releases will work on all current and future GPU architectures subject to specific constraints as documented in the cuDNN Developer Guide.
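The CUDNN_VERSION change announced above is plain arithmetic, so a short worked example may help. The sketch below mirrors the two formulas from the announcement; the Python function names are hypothetical and exist only for illustration.

```python
def cudnn_version_v9(major, minor, patch):
    """Encoding used from cuDNN 9.0.0: MAJOR * 10000 + MINOR * 100 + PATCHLEVEL."""
    return major * 10000 + minor * 100 + patch


def cudnn_version_legacy(major, minor, patch):
    """Encoding used before cuDNN 9.0.0: MAJOR * 1000 + MINOR * 100 + PATCHLEVEL."""
    return major * 1000 + minor * 100 + patch


def split_cudnn_version_v9(version):
    """Recover (major, minor, patch) from a value in the 9.x encoding."""
    major, rest = divmod(version, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


print(cudnn_version_legacy(8, 9, 7))   # 8907
print(cudnn_version_v9(9, 0, 0))       # 90000
print(split_cudnn_version_v9(90400))   # (9, 4, 0)
```

Code that decodes CUDNN_VERSION by dividing by 1000 to obtain the major version would need to be updated for 9.x, since 90000 // 1000 yields 90 rather than 9.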
Key Features and Enhancements
The following features and enhancements have been added to this release:
The cuDNN backend API uses less memory for many execution plans which should be beneficial for users who cache execution plans.
FP16 and BF16 fused flash attention engine performance has been significantly improved for NVIDIA GPUs:
Speed-up of up to 50% over cuDNN 8.9.7 on Hopper GPUs.
Speed-up of up to 100% over cuDNN 8.9.7 on Ampere GPUs.
Expanded support of FP16 and BF16 flash attention by adding the gradient for relative positional encoding on NVIDIA Ampere GPUs.
The fusion engine enables pointwise operations in the mainloop to be fused on both input A and B for matmul. The fused pointwise operation can be either a scalar, row, column broadcast, or a full tensor pointwise operation. Mixed precision is also supported for both input A and B. This new feature is only available for NVIDIA Hopper GPUs.
Updated the cuDNN Graph API execution plan serialization JSON schema to version 3.
Introduced more specific error codes and error categories (BAD_PARAM, NOT_SUPPORTED, INTERNAL_ERROR, EXECUTION_FAILED), which helps with checking errors at these two levels of granularity. A macro CUDNN_ERROR_CATEGORY is introduced for extracting the error category from a specific error code.
Introduced nested logging levels, set through CUDNN_LOGLEVEL_DBG, where the more severe levels are included by enabling the less severe levels. This better adheres to common practices and increases error reporting visibility. cuDNN version 8.x logging environment variables CUDNN_LOGERR_DBG, CUDNN_LOGWARN_DBG, and CUDNN_LOGINFO_DBG are deprecated and will continue to work during the cuDNN version 9.x grace period for compatibility.
Introduced cudnnGetLastErrorString API to fetch the latest error message.
The thread-safety of cuDNN is notably improved in this release. Concurrent execution of execution plans is now supported.
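The nested logging levels introduced above can be pictured as a prefix of a severity list. This is a minimal sketch, not cuDNN code: it assumes CUDNN_LOGLEVEL_DBG takes the values 0 (off) through 3, ordered error, warning, info. That numeric mapping is an assumption of this example and should be checked against the Developer Guide. The function name is hypothetical.

```python
# Assumed ordering, most severe first. The numeric mapping is an assumption.
SEVERITIES = ["error", "warning", "info"]


def enabled_severities(loglevel):
    """Return the severities that would be logged at a given nested level.

    Enabling a less severe level also enables every more severe one.
    """
    if loglevel < 0:
        raise ValueError("log level must be non-negative")
    return SEVERITIES[:min(loglevel, len(SEVERITIES))]
```

With this model, level 2 reports both errors and warnings, whereas the deprecated 8.x variables toggled each severity separately.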
Fixed Issues
On NVIDIA Ampere and Hopper architectures, invalid memory accesses were possible in variable sequence lengths and when the padded sequence length was not a multiple of 64. This issue has been fixed.
On NVIDIA Ampere and Hopper architectures, incorrect results were possible when the sequence length for query was less than 64. This issue has been fixed.
Fixed an accuracy issue in which FP32 input data is truncated instead of rounded into TF32 data on the NVIDIA Hopper fusion engine.
Previously on Linux, when cuDNN would load one of its other sub-libraries, it might attempt to load a mismatched version of cuDNN, possibly causing an application error. This issue has been fixed; it will look for the library file with complete version suffix first, and fall back to more generic version suffixes.
Fixed a serialization issue when a deserialized execution plan produced wrong results due to passing kernel parameters incorrectly.
For the ConvBNFusion operation, CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 2 had a race condition for some problem sets. This issue was fixed in cuDNN version 8.9.6.
The NaN propagation is guaranteed under the CUDNN_PROPAGATE_NAN mode.
Known Issues
Running cuDNN with cuBLASLt prior to 11.3 could fail under graph capture mode.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 for Norm Backward operations with cudnnBackendNormMode_t set to CUDNN_RMS_NORM is not CUDA minor version compatible for toolkit 12.2 and beyond. Users of this engine should install the updated driver that ships with the toolkit.
It is known that cudnnNanPropagation_t may not be respected in conv-bias-relu fusions.
There are caveats to the support offered by forward compatibility mode in this initial version. Refer to the cuDNN Developer Guide for more information.
For the ConvBiasAct operation, CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 39 may return CUDNN_STATUS_EXECUTION_FAILED when running on a system with CUDA Toolkit version 11.0.3 through 11.1. Upgrading the CUDA Toolkit version to 11.1 Update 1 or later should resolve this issue.
ZLIB version 1.2.13 is statically linked into the cuDNN Windows dynamic libraries. Changing to the static linkage of ZLIB for other platforms will be addressed in future cuDNN releases.
A race condition in memory write accesses was flagged by the “compute-sanitizer” tool in some cuBLAS kernels invoked by the cuDNN multihead attention API cudnnMultiHeadAttnForward() on H100 GPUs. Extensive testing on H100, with different clock speeds and computational loads, did not reveal any impact on functional results that were always identical and correct. This issue is currently being investigated.
cuDNN’s usage of cuBLAS from CUDA Toolkit 12.1 may result in race-check failures when the library is tested under compute sanitizer. These failures are believed to be a cuBLAS issue and are being investigated. A workaround for this issue is to use cuBLAS from CUDA Toolkit 12.0.
A compiler bug in NVRTC in CUDA version 11.7 and earlier was causing incorrect outputs when computing logical operations on boolean input tensors in the runtime fusion engine. A workaround had been integrated to avoid the most common issues. However, it is highly recommended to update to at least CUDA version 11.7u1 for a fix. Specifically, known failure cases are when pointwise operations of mode CUDNN_POINTWISE_LOGICAL_NOT, CUDNN_POINTWISE_LOGICAL_AND, or CUDNN_POINTWISE_LOGICAL_OR operate on boolean tensors.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 57 for convolution forward and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 62 for convolution backward data may have performance regressions for non-zero beta problems. However, they are not typically recommended by cuDNN heuristics, so the observable impact should be minimal.
Use of cudnnFusedOpsExecute() on Volta compatible architectures hosted on AArch64 systems may generate incorrect results when used with cudnnFusedOps_t set to CUDNN_FUSED_SCALE_BIAS_ACTIVATION_WGRAD.
With CUDA 11.7, the NVRTC library may cause a small memory leak when using the runtime fusion engine. The issue has been addressed in CUDA Toolkit 11.8 or later.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 0 for DgradDreluBnBwdWeights may see a performance regression when moving from cuDNN 8.8 to cuDNN 8.9.
Some convolution models are experiencing lower performance on RTX 3090 compared to 2080 Ti. This includes EfficientNet with up to 6x performance difference, UNet up to 1.6x performance difference and Tacotron up to 1.6x performance difference.
cudnnPoolingForward() with pooling mode CUDNN_POOLING_AVG might output NaN for pixels in the output tensor outside the value recommended by cudnnGetPoolingNdForwardOutputDim() or cudnnGetPooling2dForwardOutputDim().
Convolutions (ConvolutionForward, ConvolutionBackwardData, and ConvolutionBackwardFilter) may experience performance regressions when run with math type CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION on CUDNN_DATA_FLOAT data (input and output).
Compared to cuDNN 8.1.0, there are known performance regressions on certain dgrad NHWC configurations from FastPitch and WaveGlow models on V100 and NVIDIA A100 GPUs.
The numeric behavior of INT8 operations, including saturation behavior, accumulator data types, and so on, has not yet been documented.
There is a known 25% performance regression for inference on the PyTorch SSD model on the NVIDIA Turing architecture.
Compared to cuDNN 8.0.5, there is a known ~17% performance regression on SSD models running on V100.
FFT and Winograd based algorithms for convolution do not support graph capture.
There is a known regression when running some convolutions with filter size 1x1. The severity would be different depending on which version of the CUDA Toolkit the user is using.
There is a known regression when running some convolutions with high group count. The issue is more severe on V100.
Compatibility
For the latest compatibility software versions of the OS, CUDA, the CUDA driver, and the NVIDIA hardware, refer to the cuDNN Support Matrix.
Limitations
Disabling CUDA context preemption on Windows can sometimes lead to CUDNN_INTERNAL_ERRORS being returned from convolution runs. When using cuDNN, do not disable CUDA context preemption.
When using the cuDNN static library, you will need to use the same major.minor version of the CUDA Toolkit by which cuDNN was built to build your application. Refer to the cuDNN Support Matrix for the exact supported CUDA versions.
In cuDNN 8.9.0, runtime fusion engines (with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) will only work with NVRTC from CUDA Toolkit 11.8, 12.0 and 12.1. They are not guaranteed to be forward compatible with future CUDA 12.x Toolkits.
The status returned by cudnnBackendFinalize() or cudnnBackendExecute() on a CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR may change depending on the version of the dynamic dependencies of cuDNN. As of this writing, only cuBLAS is known to affect the return status of these function calls.
The functional support criteria of cuDNN’s convolution kernels is not required to consider padding. Users of cuDNN can witness an unexpected lack of problem support when forward convolution spatial dimensions are less than the filter size and padding is nonzero but is sufficient to extend spatial dimensions to or beyond filter dimensions. This is commonly observed with, but not limited to, INT8 convolution kernels.
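The shapes this limitation describes can be identified with the standard convolution output-size formula. The sketch below is illustrative only: the function names are hypothetical, symmetric padding is assumed, and the code does not query cuDNN. It only flags shapes that are mathematically valid yet smaller than the filter before padding.

```python
def conv_output_dim(input_dim, pad, filter_dim, stride=1, dilation=1):
    """Output size along one spatial dimension of a forward convolution."""
    effective_filter = (filter_dim - 1) * dilation + 1
    padded = input_dim + 2 * pad
    if padded < effective_filter:
        raise ValueError("filter does not fit even after padding")
    return 1 + (padded - effective_filter) // stride


def fits_only_with_padding(input_dim, pad, filter_dim, dilation=1):
    """True when the input is smaller than the filter but padding closes the gap."""
    effective_filter = (filter_dim - 1) * dilation + 1
    return input_dim < effective_filter <= input_dim + 2 * pad
```

For example, a spatial dimension of 2 with padding 1 and a filter of size 3 produces a valid output of size 2, yet it falls into the category where an engine may report the problem as unsupported.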
When performing batch normalization in cuDNN, the operation is allowed to proceed if the output tensor strides are overlapping; however, there is no guarantee of deterministic results.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 25 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_0) does not support tensors in which the product N*C*H*W of the output gradient tensor equals or exceeds 2^31.
CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 1 for convolution backwards data (which is part of legacy CUDNN_CONVOLUTION_BWD_DATA_ALGO_1) does not support tensors in which the product N*H*W of the output gradient tensor equals or exceeds 2^31. This issue has been present in all previous releases of cuDNN and exercising the use case for the engine would show incorrect results.
The runtime fusion engine is only supported in the cuDNN build based on CUDA Toolkit 11.2 update 1 or later; it also requires the NVRTC from CUDA 11.2 update 1 or later. If this condition is not satisfied, the error status of CUDNN_STATUS_NOT_SUPPORTED or CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING will be returned.
Samples must be installed in a writable location, otherwise the samples can crash.
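The 2^31 size limits on the backward data engines with global index 25 and 1, described above, can be checked before an engine is selected. This is a minimal sketch with hypothetical helper names. It encodes only the two products named in the limitation and does not call cuDNN.

```python
ELEMENT_LIMIT = 2 ** 31  # products equal to or above this value are unsupported


def bwd_data_engine25_in_range(n, c, h, w):
    """Engine index 25: the product N*C*H*W of the output gradient must stay below 2^31."""
    return n * c * h * w < ELEMENT_LIMIT


def bwd_data_engine1_in_range(n, h, w):
    """Engine index 1: the product N*H*W of the output gradient must stay below 2^31."""
    return n * h * w < ELEMENT_LIMIT
```

An output gradient of shape N=32, C=64, H=1024, W=1024 has exactly 2^31 elements, so it is already out of range for engine index 25, while the same tensor is well within the N*H*W limit for engine index 1.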
RNN and multihead attention API calls may exhibit non-deterministic behavior when the cuDNN library is built with CUDA Toolkit 10.2 or higher. This is the result of a new buffer management and heuristics in the cuBLAS library. As described in Results Reproducibility, numerical results may not be deterministic when cuBLAS APIs are launched in more than one CUDA stream via the same cuBLAS handle. This is caused by two buffer sizes (16 KB and 4 MB) used in the default configuration.
When a larger buffer size is not available at runtime, instead of waiting for a buffer of that size to be released, a smaller buffer may be used with a different GPU kernel. The kernel selection may affect numerical results. The user can eliminate the non-deterministic behavior of cuDNN RNN and multi-head attention APIs by setting a single buffer size in the CUBLAS_WORKSPACE_CONFIG environment variable, for example, :16:8 or :4096:2.
The first configuration instructs cuBLAS to allocate eight buffers of 16 KB each in GPU memory while the second setting creates two buffers of 4 MB each. The default buffer configuration in cuBLAS 10.2 and 11.0 is :16:8:4096:2, that is, there are two buffer sizes. In earlier cuBLAS libraries, such as cuBLAS 10.0, the non-adjustable :16:8 configuration was used. When buffers of only one size are available, the behavior of cuBLAS calls is deterministic in multi-stream setups.
The tensor pointers and the filter pointers require at a minimum 4-byte alignment, including INT8 data in the cuDNN library.
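The CUBLAS_WORKSPACE_CONFIG values discussed above follow a :SIZE:COUNT pattern, with SIZE in KB according to the examples given. The sketch below parses such a string and reports whether it describes a single buffer size. The helper names are hypothetical and the parser is inferred from the quoted examples only, not taken from cuBLAS.

```python
def parse_workspace_config(config):
    """Parse ':SIZE_KB:COUNT[:SIZE_KB:COUNT...]' into a list of (size_kb, count)."""
    if not config.startswith(":"):
        raise ValueError("configuration must start with ':'")
    fields = config[1:].split(":")
    if len(fields) % 2 != 0 or not all(f.isdigit() for f in fields):
        raise ValueError("expected ':SIZE:COUNT' pairs of non-negative integers")
    return [(int(fields[i]), int(fields[i + 1])) for i in range(0, len(fields), 2)]


def has_single_buffer_size(config):
    """True when every buffer in the configuration has the same size."""
    sizes = {size_kb for size_kb, _ in parse_workspace_config(config)}
    return len(sizes) == 1


print(parse_workspace_config(":16:8:4096:2"))   # [(16, 8), (4096, 2)]
print(has_single_buffer_size(":4096:2"))        # True
print(has_single_buffer_size(":16:8:4096:2"))   # False
```

In practice the variable would be set in the process environment before the application starts, so that cuBLAS reads it when it initializes.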
Some computational options in cuDNN require increased alignment on tensors in order to run performantly. As always, cuDNN recommends users to align tensors to 16 byte boundaries which will be sufficiently aligned for any computational option in cuDNN. Doing otherwise may cause performance regressions in cuDNN 8.0.x compared to cuDNN v7.6.
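The alignment guidance above (4 bytes at minimum, 16 bytes recommended) comes down to simple address arithmetic. The following is a small sketch on integer addresses with hypothetical helper names. How a device pointer is obtained as an integer depends on the framework in use and is outside the scope of this example.

```python
MIN_ALIGNMENT = 4           # minimum required for tensor and filter pointers
RECOMMENDED_ALIGNMENT = 16  # recommended for any computational option


def is_aligned(address, alignment):
    """True when the address is a multiple of the alignment."""
    return address % alignment == 0


def align_up(address, alignment=RECOMMENDED_ALIGNMENT):
    """Round an address up to the next multiple of the alignment."""
    return (address + alignment - 1) // alignment * alignment
```

An address such as 0x1004 meets the 4-byte minimum but not the 16-byte recommendation. Rounding it up gives 0x1010.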
For certain algorithms when the computation is in float (32-bit float) and the output is in FP16 (half float), there are cases where the numerical accuracy between the different algorithms might differ. cuDNN 8.0.x users can target the backend API to query the numerical notes of the algorithms to get the information programmatically. There are cases where algo0 and algo1 will have a reduced precision accumulation when users target the legacy API. In all cases, these numerical differences are not known to affect training accuracy even though they might show up in unit tests.
For the _ALGO_0 algorithm of convolution backward data and backward filter, grouped convolution with groups larger than 1 and with odd product of dimensions C, D (if 3D convolution), H, and W is not supported on devices older than Volta. To prevent a potential illegal memory access by an instruction that only has a 16-bit version in Volta and above, pad at least one of the dimensions to an even value.
In INT8x32 Tensor Core cases, the parameters supported by cuDNN v7.6 are limited to W >= (R-1) * dilationW && H >= (S-1) * dilationH, whereas, in cuDNN v8.0.x, W == (R-1) * dilationW || H == (S-1) * dilationH cases are no longer supported.
In prior versions of cuDNN, some convolution algorithms could use texture-based load instructions for performance improvements, particularly on older hardware architectures. Users could opt out of using texture with the environment variable CUDNN_TEXOFF_DBG. In cuDNN 8.x, this variable is removed. Texture loading is turned off by default. Users who wish to continue to use texture-based load can adopt the new backend API and set the engine knob CUDNN_KNOB_TYPE_USE_TEX to 1 for engines that support texture-based load instructions.
In the backend API, the convolution forward engine with CUDNN_ATTR_ENGINE_GLOBAL_INDEX=1 is not supported when the product (channels * height * width) of the input image exceeds 536,870,912, which is 2^29.
cudnnSpatialTfSamplerBackward() returns CUDNN_STATUS_NOT_SUPPORTED when the number of channels exceeds 1024.
When using graph-capture, users should call the sub-library version check API (for example, cudnnOpsVersionCheck() or cudnnGraphVersionCheck()) to load the kernels in the sub-library prior to opening graph capture. cuDNN 9.0.0 APIs that poll for resource usage, such as requested workspace sizes, are not always compatible with CUDA graph-capture. Users who rely upon these APIs being CUDA graph-capture compatible should first execute their workloads during a “warm up” run before attempting graph-capture.
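Two of the limits above are simple thresholds that can be screened for ahead of time: the 2^29 input-size limit for the forward engine with global index 1, and the 1024-channel limit for cudnnSpatialTfSamplerBackward(). The helpers below are a hypothetical sketch that encodes only the numbers quoted in this section.

```python
MAX_CHW_FWD_ENGINE1 = 2 ** 29   # 536,870,912
MAX_CHANNELS_TF_SAMPLER_BWD = 1024


def fwd_engine1_in_range(channels, height, width):
    """The forward engine is unsupported only when C*H*W exceeds 2^29."""
    return channels * height * width <= MAX_CHW_FWD_ENGINE1


def tf_sampler_backward_in_range(channels):
    """NOT_SUPPORTED is reported only when the channel count exceeds 1024."""
    return channels <= MAX_CHANNELS_TF_SAMPLER_BWD
```

Both limits are stated as "exceeds", so a value exactly at the threshold is still within range. This differs from the 2^31 limits on the backward data engines, which also exclude the threshold itself.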
Users of cuDNN need to add the dependencies of cuBLAS to the linker command explicitly to resolve the undefined symbols from cuDNN static libraries.
Starting in version 8.1, cuDNN uses AVX intrinsics on the x86_64 architecture; users of this architecture without support for AVX intrinsics may see illegal instruction errors.
The spatial persistent batch normalization API is only available for Pascal and later architectures. Pre-Pascal architectures return CUDNN_STATUS_ARCH_MISMATCH instead. The affected APIs include:
cudnnAddTensor() performance may regress from 8.2 to 8.3 for pre-Pascal architectures.
When applications using cuDNN with an older 11.x CUDA toolkit in compatibility mode are tested with compute-sanitizer, cuGetProcAddress failures with error code 500 will arise due to missing functions. This error can be ignored, or suppressed with the --report-api-errors no option, as this is due to CUDA backward compatibility checking if a function is usable with the CUDA toolkit combination. The functions are introduced in a later version of CUDA but are not available on the current platform. The absence of these functions is harmless and will not give rise to any functional issues.
The fused attention and flash attention runtime engines have been disabled for NVRTC 11.8 due to compiler limitations.
Deprecated and Removed Features
The following features are deprecated in cuDNN 9.0.0:
For a list of deprecated and removed APIs, refer to API Changes for cuDNN 9.0.0.