Compatibility#
There are two important “compatibility” concepts that we document in this chapter:
cuDNN API compatibility, which refers to forward and backward compatibility of cuDNN with applications built against other versions of cuDNN
cuDNN hardware forward compatibility, which refers to compatibility of a given cuDNN version with future hardware
The rest of the chapter details these concepts.
cuDNN API Compatibility#
Beginning in cuDNN 7, binary compatibility of patch and minor releases is maintained as follows:
Any patch release x.y.z is forward- or backward-compatible with applications built against another cuDNN patch release x.y.w (that is, the same major and minor version, but w != z).
cuDNN minor releases are binary backward-compatible with applications built against the same or earlier minor release (meaning, cuDNN x.y is binary compatible with an app built against cuDNN x.z, where z<=y).
Applications built against cuDNN version x.z are not guaranteed to work with x.y release when z>y.
API compatibility is not guaranteed across major versions of cuDNN. In other words, applications built against cuDNN version y.x are not guaranteed to work with cuDNN version z.x if y != z.
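As a minimal illustration of these rules, the following sketch (not part of the cuDNN API itself; the helper name is illustrative) compares the version the application was built against (the CUDNN_MAJOR/CUDNN_MINOR macros from cudnn.h) with the version of the library actually loaded at runtime, queried via cudnnGetProperty().
```cpp
// Sketch: check at runtime that the loaded cuDNN library satisfies the API
// compatibility rules above (same major version, and a minor version equal to
// or newer than the one the application was built against).
#include <cudnn.h>
#include <library_types.h>  // libraryPropertyType (MAJOR_VERSION, MINOR_VERSION)
#include <cstdio>

bool cudnnVersionIsCompatible() {
    int major = 0, minor = 0;
    cudnnGetProperty(MAJOR_VERSION, &major);
    cudnnGetProperty(MINOR_VERSION, &minor);
    std::printf("built against cuDNN %d.%d, loaded cuDNN %d.%d\n",
                CUDNN_MAJOR, CUDNN_MINOR, major, minor);
    // Same major version required; the loaded minor version must not be older
    // than the build-time minor version.
    return major == CUDNN_MAJOR && minor >= CUDNN_MINOR;
}
```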
cuDNN Hardware Forward Compatibility#
A build of cuDNN is hardware forward compatible if it functionally works on future hardware. This allows an application linked against cuDNN to continue working on future hardware without updating to a newer cuDNN release. However, there are caveats:
This feature was added in cuDNN 9.0.0, and only for builds which use CUDA toolkit 12 or higher. Prior versions of cuDNN are not hardware forward compatible.
cuDNN has some limitations to its hardware forward compatibility. These limitations are documented later in this section.
The CUDNN_LIB_CONFIG setting needs to be set to FULL for forward compatibility.
In this section, the usage of forward-compatibility or compatibility is meant to indicate hardware forward-compatibility unless explicitly noted otherwise.
Each cuDNN release natively supports a list of SM versions, documented in the cuDNN Support Matrix. When running on one of those SM versions, the library is in native mode. By native support, we mean the library includes explicit CUDA kernels specifically compiled for that SM architecture.
On the other hand, when running on a newer device than what is natively supported, the library is in forward-compatibility mode, whereby PTX JIT compilation of CUDA kernels will be used to support library workloads.
Note
Each version of the cuDNN library stores the highest SM number it supports natively; this will most often be the SM number of the latest NVIDIA GPU in production when that cuDNN version was released. This value can be queried by calling the function cudnnGetMaxDeviceVersion(). Any GPU with a higher SM number is supported in forward-compatibility mode.
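For example, a program can compare the value returned by cudnnGetMaxDeviceVersion() with the SM of the current device to tell whether cuDNN will run natively or in forward-compatibility mode. This is a hedged sketch: the exact integer encoding of the value returned by cudnnGetMaxDeviceVersion() should be confirmed against the API reference, and the device-side encoding below is an assumption made for illustration only.
```cpp
#include <cudnn.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Highest SM number this cuDNN build supports natively.
    auto maxNativeSm = cudnnGetMaxDeviceVersion();

    // SM of the device we are about to run on.
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, /*device=*/0);
    // Assumption for illustration: encode the device SM the same way as the
    // value returned above (for example, SM 9.0 -> 90). Check the cuDNN API
    // reference for the exact encoding used by cudnnGetMaxDeviceVersion().
    unsigned long long deviceSm =
        static_cast<unsigned long long>(prop.major) * 10ULL + prop.minor;

    std::printf("highest natively supported SM: %llu, device SM: %llu\n",
                static_cast<unsigned long long>(maxNativeSm), deviceSm);
    if (deviceSm > static_cast<unsigned long long>(maxNativeSm)) {
        std::printf("device is newer than this cuDNN build; "
                    "workloads will run in forward-compatibility mode\n");
    }
    return 0;
}
```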
Configuring Forward-Compatibility Mode#
By default, forward-compatibility mode is automatically enabled when the library detects that it is being run on a new GPU which it does not support natively. If that is not desired, export the environment variable CUDNN_FORWARD_COMPAT_DISABLE=1 to disable forward-compatibility mode (a minimal sketch follows the list below).
In that case, the library may fail on a future device which it does not support natively.
If support for a future hardware device is required, we recommend one of the following:
Upgrade the cuDNN library to a version which supports the device natively (this gives the best available support on the new hardware); or
Leave forward-compatibility support enabled by default (that is, CUDNN_FORWARD_COMPAT_DISABLE=0 or not exported), in which case the library will try to support future NVIDIA GPU architectures by running in forward-compatibility mode.
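The usual way to disable forward-compatibility mode is to export CUDNN_FORWARD_COMPAT_DISABLE=1 in the environment that launches the application. The sketch below sets it programmatically instead (POSIX setenv); this is illustrative only and assumes the variable is set before the first cuDNN call, since that is when the library is expected to read it.
```cpp
// Sketch: opt out of forward-compatibility mode before initializing cuDNN.
// On a GPU that this cuDNN build does not support natively, subsequent cuDNN
// calls may then fail instead of falling back to PTX JIT kernels.
#include <cudnn.h>
#include <cstdlib>  // setenv (POSIX)

int main() {
    // Assumption: this takes effect only if set before cuDNN reads the
    // variable; exporting it in the launching shell is the more robust route.
    setenv("CUDNN_FORWARD_COMPAT_DISABLE", "1", /*overwrite=*/1);

    cudnnHandle_t handle = nullptr;
    if (cudnnCreate(&handle) == CUDNN_STATUS_SUCCESS) {
        // ... use cuDNN as usual ...
        cudnnDestroy(handle);
    }
    return 0;
}
```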
Forward-Compatibility and the Graph API#
As a general principle, any use of the cuDNN graph API that works functionally on current hardware should work on future hardware, with the caveat that not all graph patterns are supported currently (refer to the Supported Graph Patterns section).
Note that like in native mode, deserialization of an execution plan that was serialized on older hardware is not supported. The execution plan must be serialized and deserialized on matching hardware.
Even advanced usage like direct specification of an engine ID or engine config knobs will not prevent forward-compatibility mode from working functionally. Under the hood, if necessary, the library will replace a non-forward compatible engine config with a forward compatible engine config (more details on this later).
However, for best possible performance, we recommend the heuristics workflow. This is the recommended flow even when not in forward-compatibility mode, so we don’t expect this to be an issue for most use cases.
The following subsections give more details on the different workflows.
Heuristics-Based Workflow#
This use case applies when the following are true:
You want the fastest available engine config for the operation graph in forward-compatibility mode.
You do not mind querying cuDNN heuristics when running on the new device the first time.
You are okay with running an engine other than the one you were running previously in native mode.
In such a case, we recommend always calling the cuDNN heuristics (specifically CUDNN_HEUR_MODE_A, because currently no other modes are tuned for forward compatibility) to query the list of engine configs recommended for the operation graph, and picking among them. This implies you should never reuse a previously auto-tuned or otherwise finalized engine config when switching to the new device for the first time from an older device.
In forward-compatibility mode, CUDNN_HEUR_MODE_A has the following important properties:
It will not return an engine config that is not forward compatible when the heuristics are queried in forward-compatibility mode.
It is specifically tuned for forward-compatibility mode and hence can offer more performant configs.
Note that even outside forward-compatibility mode, cuDNN heuristics do not guarantee that the returned list of engine configs will finalize for the given input problem. At worst, none of the engine configs returned from the heuristic query may finalize. You are required to catch such cases and handle them appropriately by using a fallback config from the CUDNN_HEUR_MODE_FALLBACK heuristic mode. In forward-compatibility mode, this behavior is kept consistent, and you should build your application to handle the corner case where none of the engine configs returned from the heuristic query can be finalized successfully.
Since the recommended engine config may differ when the heuristic is queried in forward-compatibility mode versus native mode, the numeric properties and performance of the recommended engine configs are not comparable between the two modes. You are expected to query the behavior notes of the recommended engines and filter out engines whose numerical properties you do not want, in forward-compatibility mode just as in native mode.
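The sketch below shows one way to implement this workflow with the backend (graph) API, assuming an already finalized operation graph descriptor opGraph: query CUDNN_HEUR_MODE_A, try to finalize an execution plan from each returned engine config in order, and fall back to CUDNN_HEUR_MODE_FALLBACK if none finalizes. Error checking of the create/set calls and descriptor cleanup are omitted for brevity, and the helper names are illustrative.
```cpp
#include <cudnn.h>
#include <vector>

static cudnnBackendDescriptor_t tryPlansFromHeuristic(cudnnHandle_t handle,
                                                      cudnnBackendDescriptor_t opGraph,
                                                      cudnnBackendHeurMode_t mode) {
    // Build and finalize the heuristics descriptor for the requested mode.
    cudnnBackendDescriptor_t heur;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR, &heur);
    cudnnBackendSetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &opGraph);
    cudnnBackendSetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_MODE,
                             CUDNN_TYPE_HEUR_MODE, 1, &mode);
    cudnnBackendFinalize(heur);

    // Retrieve the recommended engine configs (up to a fixed small count here).
    const int64_t maxConfigs = 8;
    std::vector<cudnnBackendDescriptor_t> cfgs(maxConfigs);
    for (auto &cfg : cfgs)
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &cfg);
    int64_t returned = 0;
    cudnnBackendGetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR,
                             maxConfigs, &returned, cfgs.data());

    // Try each config in order; the first one whose execution plan finalizes wins.
    for (int64_t i = 0; i < returned; ++i) {
        cudnnBackendDescriptor_t plan;
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR, &plan);
        cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
                                 CUDNN_TYPE_HANDLE, 1, &handle);
        cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
                                 CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &cfgs[i]);
        if (cudnnBackendFinalize(plan) == CUDNN_STATUS_SUCCESS) {
            return plan;  // caller owns the finalized plan; cleanup of heur/cfgs omitted
        }
        cudnnBackendDestroyDescriptor(plan);  // this config did not finalize; keep trying
    }
    return nullptr;  // no config from this heuristic mode finalized
}

// Query CUDNN_HEUR_MODE_A first (currently the only mode tuned for
// forward-compatibility mode), then fall back to CUDNN_HEUR_MODE_FALLBACK.
cudnnBackendDescriptor_t buildPlanWithHeuristics(cudnnHandle_t handle,
                                                 cudnnBackendDescriptor_t opGraph) {
    cudnnBackendDescriptor_t plan = tryPlansFromHeuristic(handle, opGraph, CUDNN_HEUR_MODE_A);
    if (plan == nullptr)
        plan = tryPlansFromHeuristic(handle, opGraph, CUDNN_HEUR_MODE_FALLBACK);
    return plan;
}
```
The same flow works in both native and forward-compatibility mode; only the configs returned by the heuristics differ.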
Specifics When Not Following A Heuristics-Based Workflow#
As mentioned earlier, the heuristics workflow is recommended, but not required, to get forward-compatibility support. If you are not following the heuristics workflow today, you do not need to change your code or the way you call into cuDNN; a previously working graph API call should continue to work seamlessly.
When forward-compatibility support is enabled, the cuDNN library automatically detects when it is running on a newer SM than native mode supports, and turns on forward-compatibility mode. The library then tries to catch all forward-compatibility related errors (errors which would not exist when run on a natively supported GPU). When the library catches such an error, for example when a user tries to run an engine config which is specific to an older architecture and is not supported on a newer one, it tries to safely replace the underlying engine config with another engine config which has comparable numerics, and ensures the operation succeeds.
Since the engine config is replaced by the library when a forward-compatibility error is encountered, the knobs which were relevant to the previous engine config are no longer relevant and will be ignored. Essentially, the library tries its best to honor the user's choice of engine config, but if that config fails for forward-compatibility reasons, the library has built-in error handling mechanisms to swap it with a forward-compatible config, guaranteeing that a previously working operation graph will always continue to succeed. This ensures the error is not propagated up the stack to the user application.
Currently, the substituted engine config may not provide performance comparable to that of a natively supported engine config, but functionality will not be compromised.
Because forward-compatibility mode relies on PTX JIT (Just-In-Time) compilation, there can be compilation overheads. As with any CUDA application, those overheads can be managed with lazy loading (for example, refer to the CUDA_MODULE_LOADING environment variable) and by caching JITed kernels; refer to the CUDA documentation on JIT caching for more information.
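As an illustration only, the standard CUDA environment variables for lazy loading and the JIT cache can be set before the first CUDA/cuDNN call; exporting them in the launching environment is the more typical approach, and the specific path and size below are placeholder assumptions, not recommendations.
```cpp
// Sketch: configure lazy module loading and the CUDA JIT kernel cache before
// any CUDA or cuDNN call, to reduce and amortize PTX JIT compilation overhead
// in forward-compatibility mode. These are standard CUDA environment variables.
#include <cstdlib>  // setenv (POSIX)

void configureJitEnvironment() {
    // Load kernels lazily instead of all at once at module load time.
    setenv("CUDA_MODULE_LOADING", "LAZY", /*overwrite=*/0);
    // Keep JIT-compiled kernels in an on-disk cache between runs.
    // Placeholder path and size, for illustration only.
    setenv("CUDA_CACHE_PATH", "/var/tmp/cuda-jit-cache", /*overwrite=*/0);
    setenv("CUDA_CACHE_MAXSIZE", "1073741824", /*overwrite=*/0);  // 1 GiB
}
```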
Supported Graph Patterns#
While the goal is to make all operation graphs that can be successfully finalized in cuDNN forward-compatible, that is not yet true today. The following is the set of graph operations which are forward-compatible as of this release:
Pre-Compiled Single Operation Engines
This includes ConvolutionFwd, ConvolutionBwdFilter, ConvolutionBwdData, ConvolutionBwBias, NormalizationForward, and NormalizationBackward.
Specialized Pre-Compiled Engines
ConvBNFprop:
Full pattern: Scale + Bias + ReLU + Conv + GenStats
Partial pattern: Conv + GenStats
The operation graphs which are not listed above are not yet forward-compatible; however, they are expected to be made forward-compatible in subsequent releases. Some notable examples of currently unsupported graph patterns are:
Specialized Runtime Fusion Engines for graph patterns of Fused Attention and Fused Flash Attention (fprop and bprop)
Specialized Pre-Compiled Engines for graph patterns of:
ConvBNFprop: DBARCS pattern (short for Dual, Scale, Bias, Add, ReLU, Conv genStats)
Any graph operation involving the FP8 data type
There are some known issues in forward-compatibility mode for the following cases within the generic runtime fusion engines:
Some convolution fusions involving the following patterns:
Convolution Forward operation or data gradient operation followed by operations of g2 (in Generic Runtime Fusion Engines) supports only broadcasting to full tensors and does not support broadcasting between vector or scalar tensors. Using the latter may produce incorrect results.
Fusion patterns involving grouped convolutions, that is, convolutions with G>1, are not supported for data gradient and weight gradient operations.
Convolution Forward operation followed by operations of g2 (in Generic Runtime Fusion Engines) involving CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR does not support the following patterns of reduction:
[N, K, P, Q] -> [N, 1, 1, 1]
[N, K, P, Q] -> [N, K, 1, 1]
Fusion patterns in g1 (in Generic Runtime Fusion Engines) combining CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with the CUDNN_POINTWISE_MUL (pointwise:Scale), CUDNN_POINTWISE_ADD (pointwise:Bias), and CUDNN_POINTWISE_RELU_FWD (pointwise:Relu) modes support only half I/O data types with float32 compute type for convolutionForward/ConvolutionBackwardFilter, while ConvolutionBackwardData is not supported for any data type.
Matmul operations with mixed-precision input tensors are not supported.
Matmul fusions where mainloop fusions have more than one full tensor on either input branch are not supported.
For fusion patterns involving float inputs to data gradient operations, only NHWC layout format is supported.
Forward-Compatibility and the Legacy API#
Most of the legacy API is forward compatible today. For example, the following commonly used legacy API routines are forward compatible:
Tensor Transformations
RNN (Recurrent Neural Network)
Batch Normalization
The following is an exhaustive list of legacy API routines that are not yet forward compatible. All other legacy API usage is forward compatible.
Fusions
cudnnFusedOpsExecute() with the following values for the cudnnFusedOps_t ops parameter:
CUDNN_FUSED_SCALE_BIAS_ACTIVATION_CONV_BNSTATS
CUDNN_FUSED_BN_FINALIZE_STATISTICS_TRAINING
CUDNN_FUSED_BN_FINALIZE_STATISTICS_INFERENCE
CUDNN_FUSED_SCALE_BIAS_ADD_ACTIVATION_GEN_BITMASK
CUDNN_FUSED_DACTIVATION_FORK_DBATCHNORM
It is also worth noting that while portions of the legacy API are deprecated, this does not impact forward compatibility. The above list is the source of truth for what isn’t forward compatible within the legacy API.