Hardware Forward-Compatibility

Starting with cuDNN version 9.0.0, the graph patterns listed under the Supported Operations section are guaranteed to be hardware forward-compatible. In the following paragraphs, forward-compatibility or compatibility refers to hardware forward-compatibility unless explicitly noted otherwise.

Forward-Compatibility Mode Vs Native Mode

Each release of the cuDNN library supports a set of SM versions natively, listed in the cuDNN Support Matrix. When running on one of those SM versions, the library is in native mode. By native support, we mean that the library includes CUDA kernels explicitly compiled for that SM architecture.

On the other hand, when running on a newer device than what is natively supported, the library is in forward-compatibility mode, whereby PTX JIT compilation of CUDA kernels will be used to support library workloads.

Note

Each version of the cuDNN library records the highest SM number it supports natively; this is most often the SM number of the latest NVIDIA GPU in production when that cuDNN version was released. This value can be queried by calling the function cudnnGetMaxDeviceVersion(). Any GPU with an SM number higher than this value is supported in forward-compatibility mode.
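
For illustration, the following sketch compares the compute capability of the current device against the value returned by cudnnGetMaxDeviceVersion(). The major * 100 + minor * 10 encoding used for the comparison is an assumption of this example, not part of the API contract; consult the API reference for the exact meaning of the returned value.

```cpp
// Sketch: check whether cuDNN will treat the current GPU as natively supported
// or fall back to forward-compatibility mode. The SM encoding below is an
// assumption made for this example.
#include <cudnn.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, /*device=*/0) != cudaSuccess) {
        std::printf("failed to query device properties\n");
        return 1;
    }
    // SM version of the device (assumed encoding, see note above).
    size_t device_sm = static_cast<size_t>(prop.major) * 100 +
                       static_cast<size_t>(prop.minor) * 10;

    // Highest SM version this cuDNN build supports natively.
    size_t max_native_sm = cudnnGetMaxDeviceVersion();

    if (device_sm > max_native_sm) {
        std::printf("device SM %zu exceeds native max %zu: forward-compatibility mode\n",
                    device_sm, max_native_sm);
    } else {
        std::printf("device SM %zu is natively supported\n", device_sm);
    }
    return 0;
}
```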

Configuring Forward-Compatibility Mode

By default, forward-compatibility mode is automatically enabled when the library detects that it is running on a new GPU which it does not support natively. If that is not desired, export the environment variable CUDNN_FORWARD_COMPAT_DISABLE=1 to disable forward-compatibility mode.
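
As a minimal sketch, the variable can also be set from inside the process before the first cuDNN call; this assumes the library reads it at first use, and exporting it in the launch environment is the documented approach.

```cpp
// Sketch: disable forward-compatibility mode from inside the process.
#include <cudnn.h>
#include <cstdlib>

int main() {
    // Equivalent to `export CUDNN_FORWARD_COMPAT_DISABLE=1` before launch.
    setenv("CUDNN_FORWARD_COMPAT_DISABLE", "1", /*overwrite=*/1);

    cudnnHandle_t handle;
    cudnnStatus_t status = cudnnCreate(&handle);
    // On a GPU newer than the library's native support, cuDNN calls may now
    // fail instead of falling back to forward-compatibility mode.
    if (status == CUDNN_STATUS_SUCCESS) {
        cudnnDestroy(handle);
    }
    return status == CUDNN_STATUS_SUCCESS ? 0 : 1;
}
```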

In such a case, the library may fail on a future device which it does not support natively.

If support for a future hardware device is required, we recommend one of the following:

  • Upgrade the cuDNN library to a version which supports the device natively (this gives the best available support on the new hardware); or

  • Leave forward-compatibility support enabled (that is, CUDNN_FORWARD_COMPAT_DISABLE=0 or not exported), in which case the library will try to support future NVIDIA GPU architectures by running in forward-compatibility mode.

Forward-Compatibility Workflows

As a general principle, any use of the cuDNN graph API that works functionally on current hardware should work on future hardware, with the caveat that not all graph patterns are supported currently (refer to the Supported Operations section).

This means that even advanced usage, such as direct specification of an engine ID or engine config knobs, will not prevent forward-compatibility mode from working functionally. Under the hood, if necessary, the library will replace a non-forward-compatible engine config with a forward-compatible one (more details on this later).

However, for best possible performance, we recommend the heuristics workflow. This is the recommended flow even when not in forward-compatibility mode, so we don’t expect this to be an issue for most use cases.

The following subsections give more details on the different workflows.

Heuristics-Based Workflow

This use case applies when the following are true:

  • You want the fastest available engine config for the operation graph in forward-compatibility mode.

  • You do not mind querying cuDNN heuristics when running on the new device the first time.

  • You are okay with running an engine other than the one you were running previously in native mode.

In such a case, we recommend always calling the cuDNN heuristics (specifically CUDNN_HEUR_MODE_A, because currently no other modes are tuned for forward compatibility) to query the list of engine configs recommended for the operation graph and picking among them. This implies that you should never reuse a previously auto-tuned or otherwise finalized engine config when switching to the new device for the first time from an older device.
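
A minimal sketch of such a heuristics query through the backend API is shown below; it assumes op_graph is an already finalized CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR (with its handle attribute set) and omits status checking for brevity.

```cpp
// Sketch: query CUDNN_HEUR_MODE_A for a finalized operation graph and collect
// the recommended engine configs through the cuDNN backend API.
#include <cudnn.h>
#include <vector>

std::vector<cudnnBackendDescriptor_t>
query_heur_mode_a(cudnnBackendDescriptor_t op_graph) {
    cudnnBackendDescriptor_t heur;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR, &heur);

    cudnnBackendSetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &op_graph);
    cudnnBackendHeurMode_t mode = CUDNN_HEUR_MODE_A;  // tuned for forward compatibility
    cudnnBackendSetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_MODE,
                             CUDNN_TYPE_HEUR_MODE, 1, &mode);
    cudnnBackendFinalize(heur);

    // First ask how many configs the heuristic recommends, then fetch them.
    int64_t count = 0;
    cudnnBackendGetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 0, &count, nullptr);

    std::vector<cudnnBackendDescriptor_t> cfgs(count);
    for (auto &cfg : cfgs)
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &cfg);

    int64_t returned = 0;
    cudnnBackendGetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, count,
                             &returned, cfgs.data());
    cfgs.resize(returned);

    cudnnBackendDestroyDescriptor(heur);
    return cfgs;  // candidate configs, in recommended order
}
```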

Heuristics-Based Workflow Specifics

In forward-compatibility mode, CUDNN_HEUR_MODE_A has the following important properties:

  • It will not return an engine config that is not forward-compatible.

  • It is specifically tuned for forward-compatibility mode and hence can offer more performant configs.

Note that even outside forward-compatibility mode, the cuDNN heuristics do not guarantee that the returned list of engine configs will finalize for the given input problem. At worst, none of the engine configs returned by the heuristics query will finalize. Users are required to catch such cases and handle them appropriately, for example by using a fallback config from the CUDNN_HEUR_MODE_FALLBACK heuristics mode. In forward-compatibility mode, this behavior is kept consistent, and the user must build their application to handle the corner case where none of the engine configs returned by the heuristics query can be finalized successfully.
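
The sketch below illustrates such error handling: each recommended engine config is tried in order, and a null result tells the caller to repeat the query with CUDNN_HEUR_MODE_FALLBACK. The helper name build_plan is illustrative, not part of the cuDNN API, and status checks are again omitted.

```cpp
// Sketch: turn the recommended configs into an execution plan, handling the
// corner case where none of them finalizes for the given problem.
#include <cudnn.h>
#include <vector>

cudnnBackendDescriptor_t
build_plan(cudnnHandle_t handle,
           const std::vector<cudnnBackendDescriptor_t>& cfgs) {
    for (cudnnBackendDescriptor_t cfg : cfgs) {
        cudnnBackendDescriptor_t plan;
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR, &plan);
        cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
                                 CUDNN_TYPE_HANDLE, 1, &handle);
        cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
                                 CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &cfg);
        // A config that cannot support this problem (or this GPU) fails here;
        // simply move on to the next candidate.
        if (cudnnBackendFinalize(plan) == CUDNN_STATUS_SUCCESS) {
            return plan;
        }
        cudnnBackendDestroyDescriptor(plan);
    }
    return nullptr;  // caller retries with CUDNN_HEUR_MODE_FALLBACK configs
}
```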

Since the recommended engine configs may differ when the heuristics are queried in forward-compatibility mode versus native mode, the numeric properties and performance of the recommended engine configs should not be compared between the two modes. Users are expected to query the behavior and numerical notes of the recommended engines and filter out engines whose numerical properties they do not want, in forward-compatibility mode just as in native mode.
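
For example, the numerical notes of a recommended engine can be queried through the backend API as sketched below; behavior notes are retrieved the same way via CUDNN_ATTR_ENGINE_BEHAVIOR_NOTE. The helper is illustrative and omits status checking.

```cpp
// Sketch: retrieve the numerical notes attached to the engine behind an
// engine config so undesired configs can be filtered out.
#include <cudnn.h>
#include <vector>

std::vector<cudnnBackendNumericalNote_t>
numerical_notes(cudnnBackendDescriptor_t eng_cfg) {
    // The engine config references the engine, and the engine carries the notes.
    cudnnBackendDescriptor_t engine;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINE_DESCRIPTOR, &engine);
    int64_t n = 0;
    cudnnBackendGetAttribute(eng_cfg, CUDNN_ATTR_ENGINECFG_ENGINE,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &n, &engine);

    std::vector<cudnnBackendNumericalNote_t> notes(CUDNN_NUMERICAL_NOTE_TYPE_COUNT);
    int64_t count = 0;
    cudnnBackendGetAttribute(engine, CUDNN_ATTR_ENGINE_NUMERICAL_NOTE,
                             CUDNN_TYPE_NUMERICAL_NOTE,
                             static_cast<int64_t>(notes.size()), &count, notes.data());
    notes.resize(count);

    cudnnBackendDestroyDescriptor(engine);
    return notes;  // for example, skip configs reporting
                   // CUDNN_NUMERICAL_NOTE_DOWN_CONVERT_INPUTS if that is undesired
}
```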

Specifics When Not Following A Heuristics-Based Workflow

As mentioned earlier, the heuristics workflow is recommended but not required for forward-compatibility support. If you are not following the heuristics workflow today and do not want to make code changes, you do not need to change your code or the way you call into cuDNN; a previously working graph API call should continue to work seamlessly.

When forward-compatibility support is enabled, the cuDNN library automatically detects that it is running on a newer SM than those it supports natively and turns on forward-compatibility mode. The library then tries to catch all forward-compatibility related errors (errors which would not exist when running on a natively supported GPU). When the library catches such an error, for example when a user tries to run an engine config which is specific to an older architecture and is not supported on a newer architecture, it safely replaces the underlying engine config with another engine config that has comparable numerics and runtime behavior (that is, static compilation versus runtime compilation) so that the operation succeeds.

Since the engine config is replaced by the library when a forward-compatibility error is encountered, the knobs which were relevant to the previous engine config are no longer relevant and are ignored. Essentially, the library does its best to honor the user's choice of engine config, but if that config fails for forward-compatibility reasons, built-in error handling swaps it with a forward-compatible config so that a previously working operation graph continues to succeed. This ensures the error is not propagated up the stack to the user application.

Currently, the substituted engine config may not provide performance comparable to that of the natively supported engine config, but functionality will not be compromised.

Because forward-compatibility mode relies on PTX JIT (Just-In-Time) compilation, there can be compilation overheads. As with any CUDA application, these overheads can be managed with lazy loading (for example, refer to the CUDA_MODULE_LOADING environment variable) and by caching JIT-compiled kernels; refer to the CUDA documentation on JIT caching for more information.
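
As an illustrative sketch, assuming the settings take effect before CUDA initializes in the process (these are standard CUDA environment variables, not cuDNN-specific, and are normally set in the launch environment):

```cpp
// Sketch: CUDA environment settings that reduce and amortize PTX JIT overhead.
#include <cstdlib>

void configure_jit_environment() {
    // Load modules lazily so only the kernels actually used get JIT-compiled.
    setenv("CUDA_MODULE_LOADING", "LAZY", /*overwrite=*/0);
    // Grow the on-disk JIT cache (value in bytes; 4 GiB is an arbitrary example)
    // so compiled kernels are reused across runs.
    setenv("CUDA_CACHE_MAXSIZE", "4294967296", /*overwrite=*/0);
}
```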

Supported Graph Patterns

While the goal is to make all operation graphs that can be successfully finalized in cuDNN forward-compatible, that is not yet true today. The following is the set of graph operations which are forward-compatible as of this release:

The operation graphs which are not listed above are not yet forward-compatible; however, they are expected to be made forward-compatible in subsequent releases. Some notable examples of currently unsupported graph patterns are:

Limitations to Forward-Compatibility Support in Generic Runtime Fusion Engines

There are some known issues in forward-compatibility mode for the following cases within the generic runtime fusion engines:

  • Any generic runtime fusion engine using FP8.

  • Some convolution fusions involving the following patterns:

    • Convolution Forward operation followed by operations of g₂ (in Generic Runtime Fusion Engines) supports only broadcasting to full tensors and does not support broadcasting between vector or scalar tensors. Using the latter may produce incorrect results.

    • Fusion patterns involving grouped convolutions, that is, convolutions with G > 1, are not supported.

    • Convolution Forward operation followed by operations of g₂ (in Generic Runtime Fusion Engines) involving CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR does not support the following patterns of reduction:

      • [N, K, P, Q] -> [N, 1, 1, 1]

      • [N, K, P, Q] -> [N, K, 1, 1]

  • Fusion patterns in g₁ (in Generic Runtime Fusion Engines) combining CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with the CUDNN_POINTWISE_MUL (pointwise:Scale), CUDNN_POINTWISE_ADD (pointwise:Bias), and CUDNN_POINTWISE_RELU_FWD (pointwise:Relu) modes support only half I/O data types with float32 compute type for ConvolutionForward / ConvolutionBackwardFilter, while ConvolutionBackwardData is not supported for any data type.

  • MatMul fusions with mixed-precision input tensors are not supported.

Prerequisites

cuDNN forward-compatibility requires the following prior to using the feature:

  • Builds which use CUDA Toolkit 12 or higher.

  • Use of cuDNN’s graph API.