Graph API

The cuDNN library provides a declarative programming model for describing computation as a graph of operations. This graph API was introduced in cuDNN 8.0 to provide a more flexible API, especially with the growing importance of operation fusion.

The user starts by building a graph of operations. At a high level, the user is describing a dataflow graph of operations on tensors. Given a finalized graph, the user then selects and configures an engine that can execute that graph. There are several methods for selecting and configuring engines, which have tradeoffs with respect to ease-of-use, runtime overhead, and engine performance.

The graph API has two entry points:

  • The cuDNN backend API, the lowest-level entry point into the graph API.

  • The cuDNN frontend API, an open-source convenience layer built on top of the backend API.

We expect that most users prefer the cuDNN frontend API because:

  • It is less verbose without loss of control - all functionality accessible through the backend API is also accessible through the frontend API.

  • It adds functionality on top of the backend API, like errata filters and autotuning.

  • It is open source.

In either case (that is, the backend or frontend API), the high level concepts are the same.

Key Concepts

As mentioned previously, the key concepts in the graph API are:

  • Operations and operation graphs

  • Engines and engine configurations

  • Heuristics

These concepts are covered in the subsections below. Later we’ll go through an example to tie them all together.

Operations and Operation Graphs

An operation graph is a dataflow graph of operations on tensors. It is meant to be a mathematical specification and is decoupled from the underlying engines that can implement it, as there may be more than one engine available for a given graph.

I/O tensors connect the operations implicitly, for example, an operation A may produce a tensor X, which is then consumed by operation B, implying that operation B depends on operation A.

Engines and Engine Configurations

For a given operation graph, there are some number of engines that are candidates for implementing that graph. The typical way to query for a list of candidate engines is through a heuristics query, covered below.

An engine has knobs for configuring properties of the engine, like tile size (refer to cudnnBackendKnobType_t).

Heuristics

A heuristic is a way to get a list of engine configurations that are intended to be sorted from the most performant to least performant for the given operation graph. There are three modes:

  • CUDNN_HEUR_MODE_A - intended to be fast and be able to handle most operation graph patterns. It returns a list of engine configs ranked by the expected performance.

  • CUDNN_HEUR_MODE_B - intended to be more generally accurate than mode A, but with the tradeoff of higher CPU latency to return the list of engine configs. The underlying implementation may fall back to the mode A heuristic in cases where we know mode A can do better.

  • CUDNN_HEUR_MODE_FALLBACK - intended to be fast and provide functional fallbacks without expectation of optimal performance.

The recommended workflow is to query either mode A or B and check for support. The first engine config with support is expected to have the best performance.
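
A minimal sketch of this query through the backend API is shown below, assuming opGraph is an already finalized operation graph descriptor; error handling and descriptor cleanup are omitted for brevity.

```cpp
// Minimal sketch (cuDNN backend API): query heuristics for a finalized
// operation graph and return the ranked engine configs.
#include <cudnn.h>
#include <vector>

std::vector<cudnnBackendDescriptor_t> queryEngineConfigs(
        cudnnBackendDescriptor_t opGraph, int64_t maxConfigs) {
    cudnnBackendDescriptor_t heur;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR, &heur);
    cudnnBackendSetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &opGraph);
    cudnnBackendHeurMode_t mode = CUDNN_HEUR_MODE_A;  // or CUDNN_HEUR_MODE_B
    cudnnBackendSetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_MODE,
                             CUDNN_TYPE_HEUR_MODE, 1, &mode);
    cudnnBackendFinalize(heur);

    // The results come back ordered from most to least performant (expected).
    std::vector<cudnnBackendDescriptor_t> cfgs(maxConfigs);
    for (auto &cfg : cfgs)
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &cfg);
    int64_t returned = 0;
    cudnnBackendGetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, maxConfigs,
                             &returned, cfgs.data());
    cfgs.resize(returned);

    cudnnBackendDestroyDescriptor(heur);
    return cfgs;
}
```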

You can “auto-tune”, that is, iterate over the list, time each engine config, and choose the best one for a particular problem on a particular device. The cuDNN frontend API provides a convenient function, cudnnFindPlan(), which does this.

If none of the engine configs returned by mode A or B is supported, use the fallback mode to find functional fallbacks.

Expert users may also want to filter engine configs based on properties of the engine, such as numerical notes, behavior notes, or adjustable knobs. Numerical notes inform the user about the numerical properties of the engine such as whether it does datatype down conversion at the input or during output reduction. The behavior notes can signal something about the underlying implementation like whether or not it uses runtime compilation. The adjustable knobs allow fine grained control of the engine’s behavior and performance.
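
As a rough sketch of such filtering, the following hypothetical helper uses the backend API to read the behavior notes of the engine behind an engine config and reports whether that engine relies on runtime compilation; numerical notes and knobs can be queried in the same way through CUDNN_ATTR_ENGINE_NUMERICAL_NOTE and CUDNN_ATTR_ENGINE_KNOB_INFO. Status checks are omitted.

```cpp
// Sketch: does the engine behind `engineCfg` use runtime compilation?
#include <cudnn.h>

bool usesRuntimeCompilation(cudnnBackendDescriptor_t engineCfg) {
    // Pull the engine descriptor out of the engine config.
    cudnnBackendDescriptor_t engine;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINE_DESCRIPTOR, &engine);
    int64_t count = 0;
    cudnnBackendGetAttribute(engineCfg, CUDNN_ATTR_ENGINECFG_ENGINE,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &count, &engine);

    // Read its behavior notes.
    cudnnBackendBehaviorNote_t notes[16];
    int64_t numNotes = 0;
    cudnnBackendGetAttribute(engine, CUDNN_ATTR_ENGINE_BEHAVIOR_NOTE,
                             CUDNN_TYPE_BEHAVIOR_NOTE, 16, &numNotes, notes);
    cudnnBackendDestroyDescriptor(engine);

    for (int64_t i = 0; i < numNotes; ++i)
        if (notes[i] == CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) return true;
    return false;
}
```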

Graph API Example with Operation Fusion

The following example implements a fusion of convolution, bias, and activation.

Creating Operation and Tensor Descriptors to Specify the Graph Dataflow

First, create three cuDNN backend operation descriptors.

In the following figure, the user specified one forward convolution operation (using CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR), a pointwise operation for the bias addition (using CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_ADD), and a pointwise operation for the ReLU activation (using CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_RELU_FWD). Refer to the Backend Descriptor Types for more details on setting the attributes of these descriptors. For an example of how a forward convolution can be set up, refer to the Setting Up An Operation Graph For A Grouped Convolution use case.

You should also create tensor descriptors for the inputs and outputs of all of the operations in the graph. The graph dataflow is implied by the assignment of tensors, for example, by specifying the backend tensor Tmp0 as both the output of the convolution operation and the input of the bias operation, cuDNN infers that the dataflow runs from the convolution into the bias. The same applies to tensor Tmp1. If the user doesn’t need the intermediate results Tmp0 and Tmp1 for any other use, then the user can specify them to be virtual tensors, so the memory I/Os can later be optimized out.
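
A minimal sketch of creating one such backend tensor descriptor is shown below; Tmp0 and Tmp1 from this example would be created the same way with the virtual flag set. The UIDs are arbitrary values chosen by the user and only need to be unique within the graph. Status checks are omitted.

```cpp
// Sketch: create and finalize a backend tensor descriptor.
#include <cudnn.h>

cudnnBackendDescriptor_t makeTensor(int64_t uid, const int64_t dims[4],
                                    const int64_t strides[4],
                                    cudnnDataType_t dtype, bool isVirtual) {
    cudnnBackendDescriptor_t t;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_TENSOR_DESCRIPTOR, &t);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_UNIQUE_ID, CUDNN_TYPE_INT64, 1, &uid);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_DATA_TYPE, CUDNN_TYPE_DATA_TYPE, 1, &dtype);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_DIMENSIONS, CUDNN_TYPE_INT64, 4, dims);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_STRIDES, CUDNN_TYPE_INT64, 4, strides);
    int64_t alignment = 16;  // byte alignment of the device pointer
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT, CUDNN_TYPE_INT64, 1, &alignment);
    bool virt = isVirtual;   // virtual tensors (for example Tmp0, Tmp1) never touch global memory
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_IS_VIRTUAL, CUDNN_TYPE_BOOLEAN, 1, &virt);
    cudnnBackendFinalize(t);
    return t;
}
```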

  • Graphs with more than one operation node do not support in-place operations (that is, where any of the input UIDs matches any of the output UIDs). Such in-place operations are considered cyclic in later graph analysis and deemed unsupported. In-place operations are supported for single-node graphs.

  • The operation descriptors can be created and passed into cuDNN in any order, as the tensor UIDs are enough to determine the dependencies in the graph.

A Set of Operation Descriptors the User Passes to the Operation Graph

Finalizing The Operation Graph

Second, the user finalizes the operation graph. As part of finalization, cuDNN performs the dataflow analysis to establish the dependency relationship between operations and connect the edges, as illustrated in the following figure. In this step, cuDNN performs various checks to confirm the validity of the graph.
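
A sketch of this step with the backend API, assuming convOp, biasOp, and actOp are the three finalized operation descriptors from the previous step (status checks omitted):

```cpp
// Sketch: collect the operations into an operation graph and finalize it.
#include <cudnn.h>

cudnnBackendDescriptor_t makeOpGraph(cudnnHandle_t handle,
                                     cudnnBackendDescriptor_t convOp,
                                     cudnnBackendDescriptor_t biasOp,
                                     cudnnBackendDescriptor_t actOp) {
    cudnnBackendDescriptor_t ops[] = {convOp, biasOp, actOp};  // any order is fine
    cudnnBackendDescriptor_t graph;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR, &graph);
    cudnnBackendSetAttribute(graph, CUDNN_ATTR_OPERATIONGRAPH_OPS,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 3, ops);
    cudnnBackendSetAttribute(graph, CUDNN_ATTR_OPERATIONGRAPH_HANDLE,
                             CUDNN_TYPE_HANDLE, 1, &handle);
    cudnnBackendFinalize(graph);  // dataflow analysis and validity checks happen here
    return graph;
}
```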

The Operation Graph after Finalization

Configuring An Engine That Can Execute The Operation Graph

Third, given the finalized operation graph, the user must select and configure an engine to execute that graph, which results in an execution plan. As mentioned in Heuristics, the typical way to do this is:

  1. Query heuristics mode A or B.

  2. Look for the first engine config with functional support (or auto-tune all the engine configs with functional support).

  3. If no engine config was found in (2), try querying the fallback heuristic for more options.
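
A sketch of step 2 with the backend API is shown below: attempt to finalize an execution plan from each engine config in heuristic order and keep the first one that succeeds; if none does, the caller can fall back to step 3.

```cpp
// Sketch: build an execution plan from the first functionally supported config.
#include <cudnn.h>
#include <vector>

cudnnBackendDescriptor_t buildFirstSupportedPlan(
        cudnnHandle_t handle, const std::vector<cudnnBackendDescriptor_t> &engineCfgs) {
    for (cudnnBackendDescriptor_t cfg : engineCfgs) {
        cudnnBackendDescriptor_t plan;
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR, &plan);
        cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
                                 CUDNN_TYPE_HANDLE, 1, &handle);
        cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
                                 CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &cfg);
        // Finalization fails (for example with CUDNN_STATUS_NOT_SUPPORTED) if this
        // config cannot execute the graph; fall through to the next candidate.
        if (cudnnBackendFinalize(plan) == CUDNN_STATUS_SUCCESS) return plan;
        cudnnBackendDestroyDescriptor(plan);
    }
    return nullptr;  // caller should retry with CUDNN_HEUR_MODE_FALLBACK
}
```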

Executing The Engine


Finally, with the execution plan constructed and when it comes time to run it, the user should construct the backend variant pack by providing the workspace pointer, an array of UIDs, and an array of device pointers. The UIDs and the pointers should be in the corresponding order. With the handle, the execution plan and variant pack, the execution API can be called and the computation is carried out on the GPU.
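
A sketch of that final step with the backend API (uids[i] must correspond to devPtrs[i]; status checks omitted):

```cpp
// Sketch: bind device pointers to tensor UIDs in a variant pack and execute the plan.
#include <cudnn.h>

void runPlan(cudnnHandle_t handle, cudnnBackendDescriptor_t plan,
             const int64_t *uids, void **devPtrs, int64_t numTensors, void *workspace) {
    cudnnBackendDescriptor_t varPack;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_VARIANT_PACK_DESCRIPTOR, &varPack);
    cudnnBackendSetAttribute(varPack, CUDNN_ATTR_VARIANT_PACK_UNIQUE_IDS,
                             CUDNN_TYPE_INT64, numTensors, uids);
    cudnnBackendSetAttribute(varPack, CUDNN_ATTR_VARIANT_PACK_DATA_POINTERS,
                             CUDNN_TYPE_VOID_PTR, numTensors, devPtrs);
    cudnnBackendSetAttribute(varPack, CUDNN_ATTR_VARIANT_PACK_WORKSPACE,
                             CUDNN_TYPE_VOID_PTR, 1, &workspace);
    cudnnBackendFinalize(varPack);

    cudnnBackendExecute(handle, plan, varPack);  // carries out the computation on the GPU

    cudnnBackendDestroyDescriptor(varPack);
}
```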

Supported Graph Patterns

The cuDNN Graph API supports a set of graph patterns. These patterns are supported by a large number of engines, each with their own support surfaces. These engines are grouped into four different classes, as reflected by the following four subsections: pre-compiled single operation engines, generic runtime fusion engines, specialized runtime fusion engines, and specialized pre-compiled fusion engines. The specialized engines, whether they use runtime compilation or pre-compilation, are targeted to a set of important use cases, and thus have a fairly limited set of patterns they currently support. Over time, we expect to support more of those use cases with the generic runtime fusion engines, whenever practical.

Since these engines have some overlap in the patterns they support, a given pattern may result in zero, one, or more engines.

Pre-Compiled Single Operation Engines

One basic class of engines includes pre-compiled engines that support an operation graph with just one operation; specifically: ConvolutionFwd, ConvolutionBwFilter, ConvolutionBwData, or ConvolutionBwBias. Their more precise support surface can be found in the NVIDIA cuDNN API Reference.

ConvolutionBwData

ConvolutionBwData computes the convolution data gradient of the tensor dy. In addition, it uses scaling factors α and β to blend this result with the previous output. This graph operation is similar to cudnnConvolutionBackwardData().
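
Written out, the blending follows the usual cuDNN alpha/beta convention (here \(dx_{prev}\) denotes the prior contents of the output buffer); the same form applies to ConvolutionBwFilter and ConvolutionFwd below:

\(dx = \alpha \cdot convolutionBwData\left( dy, W \right) + \beta \cdot dx_{prev}\)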

ConvolutionBwData Engine

ConvolutionBwFilter

ConvolutionBwFilter computes the convolution filter gradient of the tensor dy. In addition, it uses scaling factors α and β to blend this result with the previous output. This graph operation is similar to cudnnConvolutionBackwardFilter().

ConvolutionBwFilter Engine

ConvolutionFwd

ConvolutionFwd computes the convolution of X with filter data W. In addition, it uses scaling factors α and β to blend this result with the previous output. This graph operation is similar to cudnnConvolutionForward().

ConvolutionFwd Engine

NormalizationBackward

NormalizationBackward computes the gradient dX and the scale and bias gradients dScale and dBias. This operation supports multiple modes, which are set by the attribute CUDNN_ATTR_OPERATION_NORM_BWD_MODE. The precompiled engines support instance and layer norm backward, while batch norm backward is supported by a specialized runtime compiled engine (refer to DReluForkDBn). The mean and variance saved during the forward training pass are passed as input to the NormBackward operation.

NormalizationBackward Engine
Instance Norm, Layer Norm, and RMS Norm For NormalizationBackward

| Node and Other Attributes | Instance Normalization Backward | Layer Normalization Backward | RMS Normalization Backward |
|---|---|---|---|
| operation | normBwd | normBwd | normBwd |
| X | [N, C, (D), H, W], input, I type | [N, C, (D), H, W], input, I type | [N, C, (D), H, W], input, I type |
| Mean | [N, C, (1), 1, 1], input, compute type | [N, 1, (1), 1, 1], input, compute type | N/A |
| InvVariance | [N, C, (1), 1, 1], input, compute type | [N, 1, (1), 1, 1], input, compute type | [N, 1, (1), 1, 1], input, compute type |
| Scale | [1, C, (1), 1, 1], input weight, W type | [1, C, (D), H, W], input weight, W type | [1, C, (D), H, W], input weight, W type |
| DY | [N, C, (D), H, W], input, O type | [N, C, (D), H, W], input, O type | [N, C, (D), H, W], input, O type |
| DX | [N, C, (D), H, W], output, I type | [N, C, (D), H, W], output, I type | [N, C, (D), H, W], output, I type |
| Dscale | [1, C, (1), 1, 1], output, W type | [1, C, (D), H, W], output, W type | [1, C, (D), H, W], output, W type |
| Dbias | [1, C, (1), 1, 1], output, W type | [1, C, (D), H, W], output, W type | Optional |
| mode | CUDNN_INSTANCE_NORM | CUDNN_LAYER_NORM | CUDNN_RMS_NORM |
| Supported layout | NC(D)HW, N(D)HWC | NC(D)HW, N(D)HWC | NC(D)HW, N(D)HWC |
| Supported I and O types | FP16, FP32, BF16 | FP16, FP32, BF16 | FP16, FP32, BF16 |
| Supported compute type | FP32 | FP32 | FP32 |
| Supported W types | FP32 | FP16, FP32, BF16 | FP16, FP32, BF16 |
| Alignment requirements for I/O type | 8 bytes aligned | 16 bytes aligned | 16 bytes aligned |

For each operation, all applicable tensors must have the same layout. Neither mixed I/O types, nor mixed compute types are supported.

Layer Norm and RMS Norm are also supported by specialized runtime compiled engines, with CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 for normalization backward operations. For these more performant engines, the restriction sizeof(Itype) >= sizeof(Otype) applies.

NormalizationForward

NormalizationForward computes the normalization output Y from the input X. This operation is used in both the inference and training phase. The phases are distinguished by the attribute CUDNN_ATTR_OPERATION_NORM_FWD_PHASE.

NormalizationForward Engine

This operation supports different normalization modes which are set by the attribute CUDNN_ATTR_OPERATION_NORM_FWD_MODE. The dashed lines indicate optional inputs, which are typically used in the batch norm mode of this operation. Currently, instance norm and layer norm are supported via both precompiled and specialized runtime compiled engines while batch norm and RMS norm are supported using specialized runtime compiled engines (refer to BnAddRelu).
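
For reference, a standard formulation of the forward computation (the usual definitions these modes implement, not taken verbatim from the cuDNN specification) is:

\(\hat{x} = \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}}, \quad Y = Scale \cdot \hat{x} + Bias\)

where \(\mu\) and \(\sigma^{2}\) are computed per sample and channel over the spatial dimensions for instance norm, and per sample over the channel and spatial dimensions for layer norm. RMS norm omits the mean subtraction and normalizes by \(\sqrt{mean(x^{2}) + \epsilon}\) instead.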

Instance Norm, Layer Norm, and RMS Norm For NormalizationForward

| Node and Other Attributes | Instance Normalization Forward | Layer Normalization Forward | RMS Normalization Forward |
|---|---|---|---|
| operation | normFwd | normFwd | normFwd |
| X | [N, C, (D), H, W], input, I type | [N, C, (D), H, W], input, I type | [N, C, (D), H, W], input, I type |
| Mean | [N, C, (1), 1, 1], output, compute type, only applicable to fmode CUDNN_NORM_FWD_TRAINING | [N, 1, (1), 1, 1], output, compute type, only applicable to fmode CUDNN_NORM_FWD_TRAINING | N/A |
| InvVariance | [N, C, (1), 1, 1], output, compute type, only applicable to fmode CUDNN_NORM_FWD_TRAINING | [N, 1, (1), 1, 1], output, compute type, only applicable to fmode CUDNN_NORM_FWD_TRAINING | [N, 1, (1), 1, 1], output, compute type, only applicable to fmode CUDNN_NORM_FWD_TRAINING |
| Scale | [1, C, (1), 1, 1], input weight, W type | [1, C, (D), H, W], input weight, W type | [1, C, (D), H, W], input weight, W type |
| Bias | [1, C, (1), 1, 1], input weight, W type | [1, C, (D), H, W], input weight, W type | Optional (no bias by default) |
| Y | [N, C, (D), H, W], output, O type | [N, C, (D), H, W], output, O type | [N, C, (D), H, W], output, O type |
| epsilonDesc | [1, 1, 1, 1], input, constant | [1, 1, 1, 1], input, constant | [1, 1, 1, 1], input, constant |
| mode | CUDNN_INSTANCE_NORM | CUDNN_LAYER_NORM | CUDNN_RMS_NORM |
| Supported fmode | CUDNN_NORM_FWD_TRAINING, CUDNN_NORM_FWD_INFERENCE | CUDNN_NORM_FWD_TRAINING, CUDNN_NORM_FWD_INFERENCE | CUDNN_NORM_FWD_TRAINING, CUDNN_NORM_FWD_INFERENCE |
| Supported layout | NC(D)HW, N(D)HWC | NC(D)HW, N(D)HWC | NC(D)HW, N(D)HWC |
| Supported I and O types | FP16, FP32, BF16 | FP16, FP32, BF16 | FP16, FP32, BF16 |
| Supported compute type | FP32 | FP32 | FP32 |
| Supported weight types, W type | FP32 | FP16, FP32, BF16 | FP16, FP32, BF16 |
| Alignment requirements for I/O type | 8 bytes aligned | 16 bytes aligned | 16 bytes aligned |

For each operation, all applicable tensors must have the same layout.

Layer Norm and RMS Norm are also supported by specialized runtime compiled engines: CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 with fmode set to CUDNN_NORM_FWD_TRAINING, and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 4 with fmode set to CUDNN_NORM_FWD_INFERENCE. For these more performant engines, the restriction sizeof(Itype) >= sizeof(Otype) applies. For non-runtime compiled engines, W type is the compute type.

Generic Runtime Fusion Engines

The engines documented in the previous section support single-op patterns. Of course, for fusion to be interesting, the graph needs to support multiple operations. And ideally, we want the supported patterns to be flexible to cover a diverse set of use cases. To accomplish this generality, cuDNN has runtime fusion engines that generate the kernel (or kernels) at runtime based on the graph pattern. This section outlines the patterns supported by these runtime fusion engines (that is, engines with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION behavioral note).

We can think of the support surface as covering the following generic patterns:

  1. MatMul fusions: \(g_{2}\left( C = matmul\left( A = g_{1A}\left( inputs \right),\ B = g_{1B}\left( inputs \right) \right), inputs \right)\)

  2. ConvolutionFwd fusions: \(g_{2}\left( Y = convolutionFwd\left( X = g_{1}\left( inputs \right), W \right), inputs \right)\)

  3. ConvolutionBwFilter fusions: \(g_{2}\left( dw = convolutionBwFilter\left( dy, X = g_{1}\left( inputs \right) \right), inputs \right)\)

  4. ConvolutionBwData fusions: \(g_{2}\left( dx = convolutionBwData\left( dy = g_{1}\left( inputs \right), W \right), inputs \right)\)

  5. Pointwise fusions: \(g_{2}\left( inputs \right)\)

Graphical Representation of the Generic Patterns Supported by the Runtime Fusion Engines

Note

  • g 1 (including g 1A and g 1B) indicates fusion operations applied to the inputs of the MatMul and Convolution operation.

  • g 2 indicates fusion operations applied to the output of the MatMul and Convolution operation.

  • g 2 can have more than one output.

  • The fusion patterns in g 1 will be referred to as the mainloop fusion, and the fusion patterns in g 2 will be referred to as the epilogue fusion.

  • The arrow going into g 2 can go into any of g 2 nodes and does not necessarily need to feed into a root node.

  • The abbreviated notations for operations are used in the diagrams and throughout the text for visualization purposes.
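
For example, the convolution, bias, and activation fusion described earlier is an instance of pattern 2 with an empty g 1 and with g 2 being the bias addition followed by the ReLU:

\(Y = g_{2}\left( convolutionFwd\left( X, W \right) \right) = ReLU\left( convolutionFwd\left( X, W \right) + b \right)\)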

Support Surface

The generic runtime fusion engine includes three independent support surfaces, indexed 90, 80, and 70. A cuDNN graph can be executed by the generic runtime fusion engine once it fully satisfies the requirements of at least one support surface. The following table summarizes the features of each support surface. Generally, we recommend building cuDNN graphs that target the support surface with the largest index, and using the lower-indexed support surfaces as fallbacks.

Summary of Supported Features of Each Support Surface

| Feature | Support Surface 90 | Support Surface 80 | Support Surface 70 |
|---|---|---|---|
| Compute Capability | >= 9.0 | >= 8.0 | >= 7.0 |
| MatMul Fusions | Supported | Supported | Supported |
| ConvolutionFwd Fusions | Supported | Supported | Supported |
| ConvolutionBwFilter Fusions | Supported | Not Supported | Supported |
| ConvolutionBwData Fusions | Partially Supported | Not Supported | Supported |
| Pointwise and Reduction Fusions | Not Supported | Not Supported | Supported |
| FP8 MatMul and Convolution Operations | Supported | Supported for Compute Capability >= 8.9 | Not Supported |
| g 1 (Mainloop) Fusions | Supported | Supported | Partially Supported |
| g 2 (Epilogue) Fusions | Supported | Supported | Supported |
| Mixed Input GEMM | Supported | Supported | Not Supported |
| Grouped Convolution | Supported | Supported | Not Supported |

The detailed features of each support surface are listed in the following subsections.

Support Surface 90

Compute Capability

  • NVIDIA GPUs with compute capability 9.0 are supported.

Generic Limitations

  • Strided ConvolutionBwData fusions are not supported.

  • Pointwise and Reduction fusions are not supported.

Advanced Supports

  • Mixed input MatMul, ConvolutionFwd, and ConvolutionBwData fusions are supported.

  • Grouped ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData fusions are supported.

I/O and Intermediate Data Type

  • The input tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}.

  • The input tensor data type of MatMul, ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData operations can be any of {FLOAT, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}.

  • The output tensor data type can be any of {INT64, FLOAT, INT32, HALF, BFLOAT16, INT8, UINT8, FP8_E4M3, FP8_E5M2}.

  • The output data types of CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations can only be FLOAT.

  • The intermediate virtual tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}, and this intermediate storage type is obeyed by the code-generator. Generally, FP32 is recommended.

Compute Data Type

  • Compute data type can be either FP32 or INT32 for CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations.

  • Compute data type can only be FP32 for CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations.

  • The support surface of the compute data type of MatMul, ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData operations depends on the input data types of the operation. The combinatory support surface is listed in the following table.

Combinatory Support Surface of Input Data Type and Compute Data Type of MatMul and Convolution Operations

| MatMul / Convolution Operation Input Data Type | MatMul / Convolution Operation Compute Data Type |
|---|---|
| INT8 | INT32 |
| FP8_E4M3, FP8_E5M2 | FLOAT, FAST_FLOAT_FOR_FP8 |
| HALF | FLOAT, HALF |
| BFLOAT16 | FLOAT |
| FLOAT | FLOAT |

Mainloop Fusions: g 1

  • g 1 is a directed acyclic graph (DAG) that can consist of zero or any number of the CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations.

  • All the input tensors must have an alignment of 128 bits. For grouped ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData fusions, the alignment requirement is per group.

  • All the intermediate tensors must be virtual.

  • The supported dimensions and layouts are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 1

Pattern: MatMul fusions

Dimension:

  • Tensor A must have dimension dim[B, M, K] or dim[1, M, K].

  • Dimensions of input tensors to g 1A can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, K], dim[B, 1, K], dim[1, M, K], or dim[B, M, K].

  • Tensor B must have dimension dim[B, K, N] or dim[1, K, N].

  • Dimensions of input tensors to g 1B can be dim[1, 1, 1], dim[B, 1, 1], dim[1, 1, N], dim[B, 1, N], dim[1, K, 1], dim[B, K, 1], dim[1, K, N], or dim[B, K, N].

Layout:

  • All tensor layouts can be either row-major or column-major, and do not need to be the same.

Pattern: ConvolutionFwd fusions

Dimension:

  • Tensor X must have dimension dim[N, C, (D,) H, W].

  • Dimensions of tensors operating with X can be dim[1, 1, (1,) 1, 1], dim[1, C, (1,) 1, 1], or dim[N, C, (D,) H, W]. Exception: dim[1, C, (1,) 1, 1] is not compatible in grouped ConvolutionFwd fusions.

  • Tensor W must have dimension dim[K, C, (T,) R, S].

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwFilter fusions

Dimension:

  • Tensor dy must have dimension dim[N, K, (O,) P, Q].

  • Tensor X must have dimension dim[N, C, (D,) H, W].

  • Fusion operation in g 1 is not supported.

Layout:

  • When the input data type of the ConvolutionBwFilter operation is HALF or BFLOAT16, all tensors must be fully packed NHWC layout.

  • When the input data type of the ConvolutionBwFilter operation is FLOAT, INT8, FP8_E4M3, or FP8_E5M2, all tensors must be fully packed CHWN layout.

Pattern: ConvolutionBwData fusions

Dimension:

  • Tensor dy must have dimension dim[N, K, (O,) P, Q].

  • Dimensions of tensors operating with dy can be dim[1, 1, (1,) 1, 1], dim[1, K, (1,) 1, 1], or dim[N, K, (O,) P, Q].

  • Tensor W must have dimension dim[K, C, (T,) R, S].

Layout:

  • When the input data type of the ConvolutionBwData operation is HALF or BFLOAT16, all tensors must be fully packed NHWC layout.

  • When the input data type of the ConvolutionBwData operation is FLOAT, INT8, FP8_E4M3, or FP8_E5M2, all input tensors to g 1 must be fully packed NHWC layout, and the W tensor must be fully packed CHWN layout.

Epilogue Fusion: g 2

  • g 2 is a directed acyclic graph (DAG) that can consist of zero or any number of the CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operation and zero or one CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation.

  • CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation can only be the exit node of g 2.

  • All the input and output tensors must have an alignment of 32 bits. For grouped ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData fusions, the alignment requirement is per group.

  • In CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations, the tensors being broadcasted cannot be placed as the first input.

  • The supported dimensions and layouts are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 2

Pattern: MatMul fusions

Dimension:

  • Tensor C must have dimension dim[B, M, N].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, N], dim[B, 1, N], dim[1, M, N], or dim[B, M, N].

  • Dimensions of output tensors from g 2 can be dim[B, M, N].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, N], dim[B, 1, N], or dim[1, M, N].

Layout:

  • All tensor layouts can be either row-major or column-major, but need to be the same.

  • If a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation is present, all tensor layouts must be row-major.

Pattern: ConvolutionFwd fusions

Dimension:

  • Tensor Y must have dimension dim[N, K, (O,) P, Q].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], dim[1, K, (1,) 1, 1], or dim[N, K, (O,) P, Q].

  • Dimensions of output tensors from g 2 can be dim[N, K, (O,) P, Q].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], dim[1, K, (1,) 1, 1], dim[N, K, (1,) 1, 1], or dim[N, 1, (1,) 1, 1].

  • Grouped ConvolutionFwd fusions cannot have a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation in g 2.

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwFilter fusions

Dimension:

  • Tensor dw must have dimension dim[K, C, (T,) R, S].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[1, C, (T,) R, S], dim[K, 1, (1,) 1, 1], or dim[K, C, (T,) R, S].

  • Dimensions of output tensors from g 2 can be dim[K, C, (T,) R, S].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[1, C, (T,) R, S], or dim[K, 1, (1,) 1, 1].

  • Grouped ConvolutionBwFilter fusions cannot have a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation in g 2.

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwData fusions

Dimension:

  • Tensor dx must have dimension dim[N, C, (D,) H, W].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[N, 1, (D,) H, W], dim[1, C, (1,) 1, 1], or dim[N, C, (D,) H, W].

  • Dimensions of output tensors from g 2 can be dim[N, C, (D,) H, W].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[N, 1, (D,) H, W], dim[1, C, (1,) 1, 1], dim[N, C, (1,) 1, 1], or dim[N, 1, (1,) 1, 1].

  • Grouped ConvolutionBwData fusions cannot have g 2.

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: Pointwise and Reduction fusions

Dimension: Not Supported

Layout: Not Supported

Support Surface 80

Compute Capability

  • NVIDIA GPUs with compute capability 8.0, 8.6, 8.7, 8.9, and 9.0 are supported.

Generic Limitations

  • ConvolutionBwFilter fusions are not supported.

  • ConvolutionBwData fusions are not supported.

  • Pointwise and Reduction fusions are not supported.

Advanced Supports

  • Mixed input MatMul and ConvolutionFwd fusions are supported.

  • Grouped ConvolutionFwd fusions are supported.

I/O and Intermediate Data Type

  • The input tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}.

  • The input tensor data type of MatMul and ConvolutionFwd operations can be any of {FLOAT, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}. FP8_E4M3 and FP8_E5M2 input tensor data type of MatMul and ConvolutionFwd operations are only available with compute capability 8.9.

  • The output tensor data type can be any of {INT64, FLOAT, INT32, HALF, BFLOAT16, INT8, UINT8, FP8_E4M3, FP8_E5M2}.

  • The output data types of CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations can only be FLOAT.

  • The intermediate virtual tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}, and this intermediate storage type is obeyed by the code-generator. Generally, FP32 is recommended.

  • FP8_E4M3 and FP8_E5M2 input, output, and intermediate tensor data type is only available with compute capability 8.9 and 9.0.

Compute Data Type

  • Compute data type can be either FP32 or INT32 for CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations.

  • Compute data type can only be FP32 for CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations.

  • The support surface of the compute data type of MatMul, ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData operations depends on the input data types of the operation. The combinatory support surface is listed in the following table.

Combinatory Support Surface of Input Data Type and Compute Data Type of MatMul and Convolution Operations

| MatMul / Convolution Operation Input Data Type | MatMul / Convolution Operation Compute Data Type | Note |
|---|---|---|
| INT8 | INT32 | |
| FP8_E4M3, FP8_E5M2 | FLOAT, FAST_FLOAT_FOR_FP8 | Available with compute capability 8.9 only |
| HALF | FLOAT, HALF | |
| BFLOAT16 | FLOAT | |
| FLOAT | FLOAT | |

Mainloop Fusions: g 1

  • g 1 is a directed acyclic graph (DAG) that can consist of zero or any number of the CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operation.

  • All the input tensors must have an alignment of 128 bits. For grouped ConvolutionFwd fusions, the alignment requirement is per group.

  • All the intermediate tensors must be virtual.

  • The supported dimensions and layouts are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 1

Pattern: MatMul fusions

Dimension:

  • Tensor A must have dimension dim[B, M, K] or dim[1, M, K].

  • Dimensions of input tensors to g 1A can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, K], dim[B, 1, K], dim[1, M, K], or dim[B, M, K].

  • Tensor B must have dimension dim[B, K, N] or dim[1, K, N].

  • Dimensions of input tensors to g 1B can be dim[1, 1, 1], dim[B, 1, 1], dim[1, 1, N], dim[B, 1, N], dim[1, K, 1], dim[B, K, 1], dim[1, K, N], or dim[B, K, N].

Layout:

  • All tensor layouts can be either row-major or column-major, and do not need to be the same.

Pattern: ConvolutionFwd fusions

Dimension:

  • Tensor X must have dimension dim[N, C, (D,) H, W].

  • Dimensions of tensors operating with X can be dim[1, 1, (1,) 1, 1], dim[1, C, (1,) 1, 1], or dim[N, C, (D,) H, W]. Exception: dim[1, C, (1,) 1, 1] is not compatible in grouped ConvolutionFwd fusions.

  • Tensor W must have dimension dim[K, C, (T,) R, S].

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwFilter fusions

Dimension: Not Supported

Layout: Not Supported

Pattern: ConvolutionBwData fusions

Dimension: Not Supported

Layout: Not Supported

Epilogue Fusion: g 2

  • g 2 is a directed acyclic graph (DAG) that can consist of zero or any number of the CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operation and zero or one CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation.

  • CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation can only be the exit node of g 2.

  • All the input and output tensors must have an alignment of 32 bits. For grouped ConvolutionFwd fusions, the alignment requirement is per group.

  • In CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations, the tensors being broadcasted cannot be placed as the first input.

  • The supported dimensions and layouts are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 2

Pattern: MatMul fusions

Dimension:

  • Tensor C must have dimension dim[B, M, N].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, N], dim[B, 1, N], dim[1, M, N], or dim[B, M, N].

  • Dimensions of output tensors from g 2 can be dim[B, M, N].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, N], dim[B, 1, N], or dim[1, M, N].

Layout:

  • All tensor layouts can be either row-major or column-major, but need to be the same.

  • If a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation is present, all tensor layouts must be row-major.

Pattern: ConvolutionFwd fusions

Dimension:

  • Tensor Y must have dimension dim[N, K, (O,) P, Q].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], dim[1, K, (1,) 1, 1], or dim[N, K, (O,) P, Q].

  • Dimensions of output tensors from g 2 can be dim[N, K, (O,) P, Q].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], dim[1, K, (1,) 1, 1], or dim[N, 1, (1,) 1, 1].

  • Grouped ConvolutionFwd fusions cannot have a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation in g 2.

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwFilter fusions

Dimension: Not Supported

Layout: Not Supported

Pattern: ConvolutionBwData fusions

Dimension: Not Supported

Layout: Not Supported

Pattern: Pointwise and Reduction fusions

Dimension: Not Supported

Layout: Not Supported

Support Surface 70

Compute Capability

  • NVIDIA GPUs with compute capability 7.0, 7.2, 7.5, 8.0, 8.6, 8.7, 8.9, and 9.0 are supported.

I/O and Intermediate Data Type

  • The input tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}.

  • The input tensor data type of MatMul, ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData operations can be any of {FLOAT, HALF, BFLOAT16, INT8}.

  • The output tensor data type can be any of {INT64, FLOAT, INT32, HALF, BFLOAT16, INT8, UINT8, FP8_E4M3, FP8_E5M2, BOOLEAN}.

  • The output data types of CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations can only be FLOAT.

  • The intermediate virtual tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2, BOOLEAN}, and this intermediate storage type is obeyed by the code-generator. Generally, FP32 is recommended.

  • FP8_E4M3 and FP8_E5M2 data types are only allowed in pure Pointwise and Reduction fusions.

Compute Data Type

  • Compute data type can be FP32 or BOOLEAN for CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations.

  • Compute data type can only be FP32 for CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations.

  • The support surface of the compute data type of MatMul, ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData operations depends on the input data types of the operation. The combinatory support surface is listed in the following table.

Combinatory Support Surface of Input Data Type and Compute Data Type of MatMul and Convolution Operations

| MatMul / Convolution Operation Input Data Type | MatMul / Convolution Operation Compute Data Type | Note |
|---|---|---|
| INT8 | INT32 | Not available for ConvolutionBwFilter and ConvolutionBwData fusions |
| HALF | FLOAT, HALF | |
| BFLOAT16 | FLOAT | |
| FLOAT | FLOAT | |

Mainloop Fusions: g 1

  • g 1 is a directed acyclic graph (DAG) that can consist of zero or any number of the following operations:

    • CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_CONCAT_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR

  • CUDNN_BACKEND_OPERATION_CONCAT_DESCRIPTOR or CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR operations, if present, should be before any Pointwise operations.

  • For compute capability < 8.0, g 1 is not supported.

  • All the input tensors must have an alignment of 32 bits.

  • All the intermediate tensors must be virtual.

  • In CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations, the tensors being broadcasted cannot be placed as the first input.

  • The supported dimensions and layouts are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 1

Pattern: MatMul fusions

Dimension:

  • Tensor A must have dimension dim[B, M, K].

  • If g 1 is present, tensor A must have HALF data type; the other input tensors that are broadcast and operated with tensor A can have any data type.

  • Dimensions of other input tensors to g 1A can be dim[1, 1, 1], dim[B, M, 1], dim[B, 1, K], or dim[B, M, K]. If the input tensor has dimension dim[B, M, K], it must have data type HALF as well.

  • Tensor B must have dimension dim[B, K, N].

  • Fusion operation in g 1B is not supported.

Layout:

  • All tensor layouts can be either row-major or column-major.

  • All input tensors to g 1A must have the same layout.

  • If the data type of the input tensor to the MatMul operation is INT8, all the input tensors to g 1A must be in row-major and the tensor B must be in column-major.

Pattern: ConvolutionFwd fusions

Dimension:

  • Tensor X must have dimension dim[N, C, (D,) H, W].

  • Tensor W must have dimension dim[K, C, (T,) R, S].

  • Fusion operations on the X tensor can only be a chain of three specific Pointwise operations, in this exact order: CUDNN_POINTWISE_MUL, CUDNN_POINTWISE_ADD, and CUDNN_POINTWISE_RELU_FWD. This specific support is added to realize convolution batch norm fusion use cases.

  • All tensors involved can only be HALF data type.

  • CUDNN_POINTWISE_MUL and CUDNN_POINTWISE_ADD can only be operated with a tensor with dimension dim[1, C, (1,) 1, 1].

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwFilter fusions

Dimension:

  • Tensor dy must have dimension dim[N, K, (O,) P, Q].

  • Tensor X must have dimension dim[N, C, (D,) H, W].

  • Fusion operations on the X tensor can only be a chain of three specific Pointwise operations, in this exact order: CUDNN_POINTWISE_MUL, CUDNN_POINTWISE_ADD, and CUDNN_POINTWISE_RELU_FWD. This specific support is added to realize convolution batch norm fusion use cases.

  • All tensors involved can only be HALF data type.

  • CUDNN_POINTWISE_MUL and CUDNN_POINTWISE_ADD can only be operated with a tensor with dimension dim[1, C, (1,) 1, 1].

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwData fusions

Dimension:

  • Tensor dy must have dimension dim[N, K, (O,) P, Q].

  • Tensor W must have dimension dim[K, C, (T,) R, S].

  • Fusion operation in g 1 is not supported.

Layout:

  • All tensors must be fully packed NHWC layout.

Epilogue Fusion: g 2

  • g 2 is a directed acyclic graph (DAG) that can consist of zero or one CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation and zero or any number of the following operations:

    • CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_RESAMPLE_BWD_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_GEN_STATS_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR

  • CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation can only be the exit node of g 2.

  • CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR operations, if present, must be the final nodes in g 2. Hence, CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR operations cannot be used in conjunction with the CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation.

  • The input tensor to a CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR or CUDNN_BACKEND_OPERATION_RESAMPLE_BWD_DESCRIPTOR operation should not be produced by another operation within this graph, but should come from global memory. These two operations cannot be used in the MatMul, ConvolutionBwFilter, and ConvolutionBwData fusions, and are only supported with compute capability >= 7.5.

  • All the input and output tensors must have an alignment of 32 bits, except the output of a CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operation can have an alignment of 8 bits.

  • In CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations, the tensors being broadcasted cannot be placed as the first input.

  • The supported dimensions and layouts are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 2

Pattern: MatMul fusions

Dimension:

  • Tensor C must have dimension dim[B, M, N].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, 1], dim[B, M, 1], dim[B, 1, N], or dim[B, M, N].

  • Dimensions of output tensors from g 2 can be dim[B, M, N].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, 1], dim[B, M, 1], or dim[B, 1, N].

Layout:

  • All tensor layouts can be either row-major or column-major, but need to be the same.

  • If a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation is present, all tensor layouts must be row-major.

Pattern: ConvolutionFwd fusions

Dimension:

  • Tensor Y must have dimension dim[N, K, (O,) P, Q].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], dim[1, K, (1,) 1, 1], or dim[N, K, (O,) P, Q].

  • Dimensions of output tensors from g 2 can be dim[N, K, (O,) P, Q].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], or dim[1, K, (1,) 1, 1].

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwFilter fusions

Dimension:

  • Tensor dw must have dimension dim[K, C, (T,) R, S].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[1, C, (T,) R, S], dim[K, 1, (1,) 1, 1], or dim[K, C, (T,) R, S].

  • Dimensions of output tensors from g 2 can be dim[K, C, (T,) R, S].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[1, C, (T,) R, S], or dim[K, 1, (1,) 1, 1].

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwData fusions

Dimension:

  • Tensor dx must have dimension dim[N, C, (D,) H, W].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[N, 1, (D,) H, W], dim[1, C, (1,) 1, 1], or dim[N, C, (D,) H, W].

  • Dimensions of output tensors from g 2 can be dim[N, C, (D,) H, W].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[N, 1, (D,) H, W], or dim[1, C, (1,) 1, 1].

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: Pointwise and Reduction fusions

Dimension:

  • If all tensors are 3D, the same dimension requirements as MatMul g 2 apply.

  • If all tensors are 4D or 5D, the same dimension requirements as ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData g 2 apply.

  • The CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation does not support 3D tensors.

Layout:

  • If all tensors are 3D, the same layout requirements as MatMul g 2 apply.

  • If all tensors are 4D or 5D, the same layout requirements as ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData g 2 apply.

Operation Specific Constraints for the Runtime Fusion Engines

Every operation in the supported generic patterns of the runtime fusion engines is subject to a few specific constraints regarding their parameter surface. The following subsections document these.

Note that these constraints are in addition to (1) any constraints mentioned in the Backend Descriptor Types, and (2) limitations in relation to other operations in the directed acyclic graph (DAG), as mentioned in the Support Surface section.

MatMul

This operation represents matrix-matrix multiplication: A * B = C. For complete details on the interface, refer to the CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR section.

Convolutions

There are three operation nodes that represent different types of convolutions, namely:

ConvolutionFwd

This operation represents forward convolution, that is, computing the response tensor of an image tensor convolved with a filter tensor. For complete details on the interface, as well as general constraints, refer to the CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR section.

ConvolutionBwFilter

This operation represents convolution backward filters, that is, computing filter gradients from a response and an image tensor. For complete details on the interface, as well as general constraints, refer to the CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_FILTER_DESCRIPTOR section.

ConvolutionBwData

This operation represents convolution backward data, that is, computing input data gradients from a response and a filter tensor. For complete details on the interface, as well as general constraints, refer to the CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_DATA_DESCRIPTOR section.

Tensor Attributes for all Three Operations

| Operation | Input Tensor Attribute Names | Output Tensor Attribute Name |
|---|---|---|
| ConvolutionFwd | CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_X, CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_W | CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_Y |
| ConvolutionBwFilter | CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_X, CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DY | CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DW |
| ConvolutionBwData | CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_W, CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DY | CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DX |

The FP8 data type, available since the NVIDIA Ada Lovelace architecture, has two variants that can be used as I/O data types: CUDNN_DATA_FP8_E4M3 and CUDNN_DATA_FP8_E5M2. Using them as inputs to the operation will result in FP8 Tensor Cores being used. The precision of the accumulation inside the FP8 Tensor Cores is controlled by the compute type, which may have one of two possible values: CUDNN_DATA_FLOAT and CUDNN_DATA_FAST_FLOAT_FOR_FP8.

CUDNN_DATA_FAST_FLOAT_FOR_FP8 is faster and sufficiently accurate for inference or the forward pass of training. However, for FP8 training backward pass computations (that is, computing weight and activation gradients), we recommend choosing the more accurate CUDNN_DATA_FLOAT compute type to preserve a higher level of accuracy which may be necessary for some models.
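
As a small sketch (assuming convDesc is a not-yet-finalized CUDNN_BACKEND_CONVOLUTION_DESCRIPTOR), the accumulation precision is selected through the convolution descriptor's compute type:

```cpp
// Sketch: pick the FP8 accumulation precision for a convolution descriptor.
#include <cudnn.h>

void setFp8ConvComputeType(cudnnBackendDescriptor_t convDesc, bool backwardPass) {
    // Faster accumulation for inference / forward training; full FP32 for gradients.
    cudnnDataType_t compType =
        backwardPass ? CUDNN_DATA_FLOAT : CUDNN_DATA_FAST_FLOAT_FOR_FP8;
    cudnnBackendSetAttribute(convDesc, CUDNN_ATTR_CONVOLUTION_COMP_TYPE,
                             CUDNN_TYPE_DATA_TYPE, 1, &compType);
}
```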

Recommended Compute Type for FP8 Tensor Computations for Ada Lovelace and Hopper Architecture

| Operation | Recommended I/O Type | Recommended Compute Type |
|---|---|---|
| ConvolutionFwd | CUDNN_DATA_FP8_E4M3 | CUDNN_DATA_FAST_FLOAT_FOR_FP8, CUDNN_DATA_FLOAT |
| ConvolutionBwData | CUDNN_DATA_FP8_E4M3 | CUDNN_DATA_FLOAT |
| BatchNorm | CUDNN_DATA_FP8_E4M3 | CUDNN_DATA_FLOAT |
| Pooling | CUDNN_DATA_FP8_E4M3, CUDNN_DATA_FP8_E5M2 | CUDNN_DATA_FLOAT |
| Pointwise | CUDNN_DATA_FP8_E4M3, CUDNN_DATA_FP8_E5M2 | CUDNN_DATA_FLOAT |

Pointwise

Represents a pointwise operation that implements the equation Y = op (alpha1 * X) or Y = op (alpha1 * X, alpha2 * B). Refer to the CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR and CUDNN_BACKEND_POINTWISE_DESCRIPTOR sections for more information and general constraints.

The following tables list the constraints for Pointwise operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when these operations are used in the runtime fusion engines.
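
As a minimal sketch of how such a Pointwise node is assembled with the backend API (here a ReLU from xDesc to yDesc; for two-input modes such as CUDNN_POINTWISE_ADD a BDESC is set as well; status checks omitted):

```cpp
// Sketch: a Pointwise ReLU node built from a pointwise descriptor plus an
// operation descriptor that wires it to its input and output tensors.
#include <cudnn.h>

cudnnBackendDescriptor_t makeReluOp(cudnnBackendDescriptor_t xDesc,
                                    cudnnBackendDescriptor_t yDesc) {
    // The pointwise descriptor holds the mode and the math precision.
    cudnnBackendDescriptor_t pw;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_POINTWISE_DESCRIPTOR, &pw);
    cudnnPointwiseMode_t mode = CUDNN_POINTWISE_RELU_FWD;
    cudnnBackendSetAttribute(pw, CUDNN_ATTR_POINTWISE_MODE,
                             CUDNN_TYPE_POINTWISE_MODE, 1, &mode);
    cudnnDataType_t prec = CUDNN_DATA_FLOAT;  // math precision required here for ReLU
    cudnnBackendSetAttribute(pw, CUDNN_ATTR_POINTWISE_MATH_PREC,
                             CUDNN_TYPE_DATA_TYPE, 1, &prec);
    cudnnBackendFinalize(pw);

    // The operation descriptor connects the pointwise descriptor to its I/O tensors.
    cudnnBackendDescriptor_t op;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR, &op);
    cudnnBackendSetAttribute(op, CUDNN_ATTR_OPERATION_POINTWISE_PW_DESCRIPTOR,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &pw);
    cudnnBackendSetAttribute(op, CUDNN_ATTR_OPERATION_POINTWISE_XDESC,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &xDesc);
    cudnnBackendSetAttribute(op, CUDNN_ATTR_OPERATION_POINTWISE_YDESC,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &yDesc);
    cudnnBackendFinalize(op);
    return op;
}
```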

Constraints for Pointwise Operations for Support Surface 90 and 80

Attribute: Tensor data type for CUDNN_ATTR_OPERATION_POINTWISE_XDESC, CUDNN_ATTR_OPERATION_POINTWISE_YDESC and, if applicable, CUDNN_ATTR_OPERATION_POINTWISE_BDESC, CUDNN_ATTR_OPERATION_POINTWISE_TDESC

Requirement: For all operators, all data types are supported.

Attribute: CUDNN_ATTR_POINTWISE_MATH_PREC

Requirement:

  • For any of the logical operators (CUDNN_POINTWISE_LOGICAL_AND, CUDNN_POINTWISE_LOGICAL_OR, and CUDNN_POINTWISE_LOGICAL_NOT), math precision needs to be CUDNN_DATA_INT32.

  • For any of the following operations (CUDNN_POINTWISE_ADD, CUDNN_POINTWISE_ADD_SQUARE, CUDNN_POINTWISE_DIV, CUDNN_POINTWISE_MAX, CUDNN_POINTWISE_MIN, CUDNN_POINTWISE_MOD, CUDNN_POINTWISE_ABS, CUDNN_POINTWISE_CEIL, CUDNN_POINTWISE_FLOOR, CUDNN_POINTWISE_MUL, CUDNN_POINTWISE_SUB, CUDNN_POINTWISE_NEG, CUDNN_POINTWISE_CMP_EQ, CUDNN_POINTWISE_CMP_NEQ, CUDNN_POINTWISE_CMP_GT, CUDNN_POINTWISE_CMP_GE, CUDNN_POINTWISE_CMP_LT, CUDNN_POINTWISE_CMP_LE, CUDNN_POINTWISE_GEN_INDEX, CUDNN_POINTWISE_BINARY_SELECT), math precision can be either CUDNN_DATA_FLOAT or CUDNN_DATA_INT32.

  • For any of the CUDNN_POINTWISE_IDENTITY operations, the math precision can be any data type. However, if the math precision is other than CUDNN_DATA_INT32 or CUDNN_DATA_FLOAT, the input data type, the output data type, and the math precision must be the same.

  • For all other operators, only CUDNN_DATA_FLOAT math precision is supported.

Attribute: CUDNN_ATTR_OPERATION_POINTWISE_ALPHA1

Requirement: 1.0f

Attribute: CUDNN_ATTR_OPERATION_POINTWISE_ALPHA2

Requirement: 1.0f

Constraints for Pointwise Operations for Support Surface 70

Attribute: Tensor data type for CUDNN_ATTR_OPERATION_POINTWISE_XDESC, CUDNN_ATTR_OPERATION_POINTWISE_YDESC and, if applicable, CUDNN_ATTR_OPERATION_POINTWISE_BDESC

Requirement:

  • For any of the logical operators (CUDNN_POINTWISE_LOGICAL_AND, CUDNN_POINTWISE_LOGICAL_OR, and CUDNN_POINTWISE_LOGICAL_NOT), data type can be any of CUDNN_DATA_INT32, CUDNN_DATA_INT8, or CUDNN_DATA_BOOLEAN.

  • For all other operators, all data types are supported.

Attribute: CUDNN_ATTR_POINTWISE_MATH_PREC

Requirement:

  • For any of the logical operators (CUDNN_POINTWISE_LOGICAL_AND, CUDNN_POINTWISE_LOGICAL_OR, and CUDNN_POINTWISE_LOGICAL_NOT), math precision needs to be CUDNN_DATA_BOOLEAN.

  • For all other operators, only CUDNN_DATA_FLOAT is supported.

Attribute: CUDNN_ATTR_OPERATION_POINTWISE_ALPHA1

Requirement: 1.0f

Attribute: CUDNN_ATTR_OPERATION_POINTWISE_ALPHA2

Requirement: 1.0f

GenStats

Represents an operation that generates per-channel statistics. Refer to the CUDNN_BACKEND_OPERATION_GEN_STATS_DESCRIPTOR section for more information and general constraints.

The following table lists the constraints for GenStats operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when GenStats operations are used in the runtime fusion engines.

Constraints for GenStats Operations

| Attribute | Requirement |
|---|---|
| Tensor data type for CUDNN_ATTR_OPERATION_GENSTATS_XDESC | CUDNN_DATA_HALF (prior to the NVIDIA Ampere architecture GPU); CUDNN_DATA_HALF and CUDNN_DATA_FLOAT (NVIDIA Ampere architecture and later) |
| Tensor shape for CUDNN_ATTR_OPERATION_GENSTATS_SUMDESC and CUDNN_ATTR_OPERATION_GENSTATS_SQSUMDESC | Both should be of shape [1, C, 1, 1] for 2D conv or [1, C, 1, 1, 1] for 3D conv |
| Tensor data type for CUDNN_ATTR_OPERATION_GENSTATS_SUMDESC and CUDNN_ATTR_OPERATION_GENSTATS_SQSUMDESC | CUDNN_DATA_FLOAT |
| CUDNN_ATTR_POINTWISE_MATH_PREC | CUDNN_DATA_FLOAT |
| Tensor layout for CUDNN_ATTR_OPERATION_GENSTATS_XDESC, CUDNN_ATTR_OPERATION_GENSTATS_SUMDESC, and CUDNN_ATTR_OPERATION_GENSTATS_SQSUMDESC | NHWC fully packed |

Reduction

This operation represents reducing values of a tensor in one or more dimensions. Refer to the CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR section for more information and general constraints.

The following table lists constraints for Reduction forward operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when Reduction operations are used in the runtime fusion engines.

Constraints for Reduction Operations

| Attribute | Requirement |
|---|---|
| Tensor data type for CUDNN_ATTR_OPERATION_REDUCTION_YDESC | CUDNN_DATA_FLOAT |
| CUDNN_ATTR_REDUCTION_COMP_TYPE | CUDNN_DATA_FLOAT |
| Tensor layout for CUDNN_ATTR_OPERATION_REDUCTION_XDESC and CUDNN_ATTR_OPERATION_REDUCTION_YDESC | NHWC/NDHWC/BMN fully packed |
| CUDNN_ATTR_REDUCTION_OPERATOR | CUDNN_REDUCE_TENSOR_ADD, CUDNN_REDUCE_TENSOR_MIN, and CUDNN_REDUCE_TENSOR_MAX |

ResampleFwd

This operation represents resampling of the spatial dimensions of an image to a desired value. Resampling is supported in both directions, upsampling and downsampling. Downsampling represents the standard operation of pooling, commonly used in convolutional neural networks. Refer to the CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR section for more information and general constraints.

The following are constraints for ResampleFwd operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when ResampleFwd operations are used in the runtime fusion engines.

We allow a choice amongst four modes for resample. All modes have the following common support specifications:

  • Supported layout: NHWC or NDHWC, NCHW or NCDHW

  • Spatial dimensions supported: 2 or 3

  • Input dimensions supported: 4 or 5

  • Packed boolean data type is not supported.

  • If specified, the index tensor dimension should be equal to the response tensor dimension.

When the tensor format is NCHW/NCDHW, the following additional restrictions apply:

  • Upsampling is not supported.

  • Int64_t indices are not supported.

  • Only supports symmetric padding using the prepadding backend API.

There are also some mode-specific restrictions. The following tables list the values that are allowed for particular parameters. For the parameters not listed, we allow any value which is mathematically correct.

The following downsampling modes are supported:

  • CUDNN_RESAMPLE_AVGPOOL_INCLUDE_PADDING

  • CUDNN_RESAMPLE_AVGPOOL_EXCLUDE_PADDING

  • CUDNN_RESAMPLE_MAXPOOL

Specific Restrictions for the Downsampling Modes

| Attribute | Average Pooling | Max Pooling |
|---|---|---|
| CUDNN_ATTR_RESAMPLE_PADDING_MODE | CUDNN_ZERO_PAD | CUDNN_NEG_INF_PAD |
| CUDNN_ATTR_OPERATION_RESAMPLE_FWD_ALPHA | 1.0 | 1.0 |
| CUDNN_ATTR_OPERATION_RESAMPLE_FWD_BETA | 0.0 | 0.0 |
| CUDNN_ATTR_RESAMPLE_COMP_TYPE | CUDNN_DATA_FLOAT | CUDNN_DATA_FLOAT |

For the upsampling modes, CUDNN_RESAMPLE_NEAREST is not supported for any combination of parameters. CUDNN_RESAMPLE_BILINEAR has the following support specifications.

Specific Restrictions for Upsampling Mode CUDNN_RESAMPLE_BILINEAR

| Attribute | Bilinear |
|---|---|
| Input dimensions | Equal to 0.5 x output dimensions |
| CUDNN_ATTR_RESAMPLE_PRE_PADDINGS | 0.5 |
| CUDNN_ATTR_RESAMPLE_POST_PADDINGS | 1 |
| CUDNN_ATTR_RESAMPLE_STRIDES | 0.5 |
| CUDNN_ATTR_RESAMPLE_WINDOW_DIMS | 2 |
| Data type for CUDNN_ATTR_OPERATION_RESAMPLE_FWD_XDESC and CUDNN_ATTR_OPERATION_RESAMPLE_FWD_YDESC | CUDNN_DATA_FLOAT |
| CUDNN_ATTR_RESAMPLE_COMP_TYPE | CUDNN_DATA_FLOAT |
| CUDNN_ATTR_OPERATION_RESAMPLE_FWD_ALPHA | 1.0 |
| CUDNN_ATTR_OPERATION_RESAMPLE_FWD_BETA | 0.0 |
| CUDNN_ATTR_RESAMPLE_PADDING_MODE | CUDNN_EDGE_VAL_PAD |

Resampling Index Tensor Dump for Training

For max-pooling resampling mode, an index tensor can be provided to be used as a mask for backpropagation.

Values in the index tensors are:

  • Zero-indexed row-major position of maximum value of input tensor in the resampling window.

  • In case of multiple input pixels with maximum value, the first index in a left-to-right top-to-bottom scan is selected.

An example of index element selection:

Values in the Index Tensors

Select an appropriate element size for the index tensor. As a reference, any element size such that the maximum zero-indexed window position fits should be sufficient; for example, an 8-bit index is enough for a 3x3 resampling window, whose largest position is 8.

ResampleBwd

This operation represents backward resampling of the spatial dimensions of an output response to a desired value. Resampling is supported in both directions, upsampling and downsampling. Backwards downsampling represents the standard operation of backward pooling, commonly used in convolutional neural networks. Refer to the CUDNN_BACKEND_OPERATION_RESAMPLE_BWD_DESCRIPTOR section for more information and general constraints.

The following are constraints for Resample backward operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when Resample backward operations are used in the runtime fusion engines.

We allow a choice amongst four modes for resample. All modes have the following common support specifications:

  • Supported layout: NHWC or NDHWC, NCHW or NCDHW

  • Spatial dimensions supported: 2 or 3

  • Input dimensions supported: 4 or 5

For layout NHWC or NDHWC:

  • The index tensor should be provided for only max pooling mode, and should adhere to the format described in the Resampling Index Tensor Dump for Training section.

  • The index tensor dimensions should be equal to the input gradient tensor dimensions.

For layout NCHW or NCDHW:

  • X, Y, and DY are required when max pooling mode is used.

  • Int64_t indices are not supported.

There are also some mode-specific restrictions. The following tables list the values that are allowed for particular parameters. For the parameters not listed, we allow any value which is mathematically correct.

The following backward downsampling modes are supported:

  • CUDNN_RESAMPLE_AVGPOOL_INCLUDE_PADDING

  • CUDNN_RESAMPLE_AVGPOOL_EXCLUDE_PADDING

  • CUDNN_RESAMPLE_MAXPOOL

Specific Restrictions for the Backwards Downsampling Modes

| Attribute | Average Pooling | Max Pooling |
|---|---|---|
| CUDNN_ATTR_RESAMPLE_PADDING_MODE | CUDNN_ZERO_PAD | CUDNN_NEG_INF_PAD |
| CUDNN_ATTR_OPERATION_RESAMPLE_BWD_ALPHA | 1.0 | 1.0 |
| CUDNN_ATTR_OPERATION_RESAMPLE_BWD_BETA | 0.0 | 0.0 |
| CUDNN_ATTR_RESAMPLE_COMP_TYPE | CUDNN_DATA_FLOAT | CUDNN_DATA_FLOAT |

Backward upsampling modes are currently not supported.

Examples of Supported Patterns

The following sections provide examples of supported patterns, in order of increasing complexity. We employ the same color scheme as in the overall pattern to aid in identifying the structure of g 1 (blue) and g 2 (purple).

For illustration purposes, we abbreviated the operations used. For a full mapping to the actual backend descriptors, refer to the Mapping with Backend Descriptors.

Single Operation

The following example illustrates a convolution operation without any operations before or after it. This means, g 1 and g 2, are empty graphs.

Runtime Fusion Engines with a Single Operation
Pointwise Operations After Convolution 1

In this example, g 2 consists of a sequential set of two Pointwise operations after the convolution.

ConvolutionFwd Followed by a DAG with Two Operations
Pointwise Operations After Convolution 2

Similar to the previous example, g 2 consists of a sequential set of multiple Pointwise operations.

ConvolutionFwd Followed by a DAG with Three Operations
Pointwise Operations Before Matrix Multiplication

Pointwise operations can also precede a convolution or matrix multiplication, that is, g 1 is composed of Pointwise operations.

MatMul Preceded by a DAG with Two Operations
Convolution Producer Node in Middle of DAG

The following pattern shows g 1 as a DAG of Pointwise operations feeding into a convolution. In addition, g 2 is a DAG consisting of two Pointwise operations. Note that the output of the convolution is consumed in the middle of g 2, rather than by its first node. This is a valid pattern.

Fusion of Operations Before and After the ConvolutionFwd operation. In addition, we Observe that the Output of ConvolutionFwd can Feed Anywhere in g 2.
Mixed Input MatMul

The mixed input GEMM is implemented as a special case of the mainloop fusion. Inputs may have different data types; they are converted by a Pointwise:Identity operation to the desired data types that serve as the inputs of the MatMul or Convolution operation. The following pattern shows g 1 as a DAG with a Pointwise:Identity operation that converts the data type of input tensor A before it feeds the MatMul operation. This is a valid pattern.

Mixed Precision MatMul

Specialized Runtime Fusion Engines

The specialized runtime fusion engines target and optimize specialized graph patterns that commonly occur in popular deep learning models. These engines offer limited flexibility regarding supported fusion patterns, supported data types, and supported tensor layouts. Long term, these patterns are expected to be more generic.

The following sections highlight the supported patterns.

BnAddRelu

In ResNet-like vision models, batch normalization followed by ReLU activation is a commonly occurring pattern. The BNAddRelu fusion pattern, supported using a runtime compiled engine, aims to optimize this recurring operation graph. It also supports single node multi-GPU batch normalization for speeding up batch norm computation in multi-GPU systems. The pattern is intended for use in the forward pass during the training phase. The full pattern BNAddRelu with the add node is used in cases where there are skip connections in the model.

The pattern is illustrated in the following diagram and its options and limitations include:

  • The pointwise nodes: Add, ReLU, and GT (greater than) are optional.

  • All tensors should be in NHWC packed layout format.

  • Both 4D and 5D tensors are supported.

  • Only ReLU activation is supported.

  • The attribute CUDNN_ATTR_OPERATION_NORM_FWD_MODE for the norm forward operation must be set to CUDNN_BATCH_NORM.

  • The attribute CUDNN_ATTR_OPERATION_NORM_FWD_PHASE for the norm forward operation must be set to CUDNN_NORM_FWD_TRAINING.

  • The batch norm input tensors: Scale, Bias, Input_running_mean, and Input_running_var must be of float data type.

  • The batch norm output tensors: output_running_mean, output_running_var, mean, and InvVariance must be of float data type.

  • The batch norm input tensor BN_x, residual input Z, and output tensor Y can be any of {FP32, FP16, BF16} data types. For FP16 and BF16 data types, the channel count C for the tensors must be a multiple of 8, while for float data type the channel count must be a multiple of 4 (a small check is sketched after this list).

  • These patterns are supported on devices with compute capability >= 8.0.
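
As a quick illustration of the channel-count rule in the list above, a hypothetical check (not a cuDNN API) might look like this:

    #include <cstdint>

    // Illustrative check of the BnAddRelu channel-count rule described above:
    // FP16/BF16 tensors need the channel count C to be a multiple of 8,
    // FP32 tensors need a multiple of 4. Hypothetical helper, not a cuDNN API.
    bool bnAddReluChannelCountOk(int64_t c, bool isFp16OrBf16) {
        return isFp16OrBf16 ? (c % 8 == 0) : (c % 4 == 0);
    }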

BnAddRelu cuDNN Operation Graph

In case of single node multi-GPU batch norm, each GPU computes the local statistics based on its input data and writes out the local statistics to the peerTensors. Each peerTensor resides on a separate GPU on the node and is used for reading and writing local statistics from the peer GPUs. This is followed by a global stats computation phase where each GPU aggregates the statistics from the peers and computes the global mean and variance for the batch norm output computation on its local data. Apart from the options and limitations listed above, the following additional restrictions apply for using multi-GPU batch norm:

  • The attribute CUDNN_ATTR_OPERATION_NORM_FWD_PEER_STAT_DESCS of the NormForward operation must be set.

  • The size of the peerTensors vector should be equal to the number of GPUs in the node participating in the batch norm computation.

  • The maximum size of the peerTensors vector is 32.

  • Each GPU should operate on the same size of input data [N,C,H,W].

  • The size of each peerTensor in the peerTensors vector should be equal to num_gpu * 4 * C, where C is the channel count of the BN_x tensor and num_gpu is the number of GPUs in the node participating in the batch norm computation (a worked example follows this list).

  • All the elements of each tensor in the peerTensors vector should be memset to 0 before passing that tensor in the variant pack.
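
The following is a small worked example of the peerTensor sizing and zeroing rules above. It is illustrative only (not a cuDNN API); the GPU count and channel count are made-up values, and a real peerTensor would be a device buffer cleared with cudaMemset rather than a host vector.

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Worked example of the peerTensor sizing rule: each peerTensor holds
    // num_gpu * 4 * C float elements and must be zeroed before it is passed
    // in the variant pack.
    int main() {
        const int64_t numGpus  = 8;    // GPUs in the node participating in batch norm (assumed)
        const int64_t channels = 256;  // channel count C of the BN_x tensor (assumed)
        const int64_t elemsPerPeerTensor = numGpus * 4 * channels;  // 8192 float elements
        std::vector<float> peerTensorHostMirror(elemsPerPeerTensor);
        // Zero all elements, as required before passing the tensor in the variant pack.
        std::memset(peerTensorHostMirror.data(), 0,
                    peerTensorHostMirror.size() * sizeof(float));
        return 0;
    }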

Single Node Multi-GPU Batch Norm

DReluForkDBn

Similar to the BnAddRelu pattern, the DReluForkDBn pattern also targets ResNet-like vision networks. It is intended to be used in backpropagation during the training phase. The DReluForkDBn pattern is supported through a runtime compiled engine that usually complements the BnAddRelu pattern. It also supports single node multi-GPU batch normalization for speeding up batch norm backward computation in multi-GPU systems.

The pattern is illustrated in the following diagram and its options and limitations include:

  • The pointwise node dRelu is optional.

  • The intermediate tensor dZ can be virtual or non-virtual.

  • All tensors should be in NHWC packed layout format.

  • Both 4D and 5D tensors are supported.

  • Only dRelu activation is supported.

  • Bitmask tensor input is needed for the dRelu node.

  • The attribute CUDNN_ATTR_OPERATION_NORM_BWD_MODE for the norm backward operation must be set to CUDNN_BATCH_NORM.

  • The batch norm backward input tensors: Scale, Mean, InvVariance and the output tensors dScale and dBias must be of float data type.

  • dRelu input tensor dY, batch norm backward input BN_x, bias gradient dZ, and output tensor dX can be any of {FP32, FP16, BF16} data types. For FP16 and BF16 data types, the channel count C for the tensors must be a multiple of 8 while for float data type the channel count must be a multiple of 4.

  • These patterns are supported on devices with compute capability >= 8.0.

DReluForkDBn cuDNN Operation Graph

The single node multi-GPU version of this pattern is typically used for dScale and dBias gradient aggregation across GPUs. For using the multi-GPU version, the attribute CUDNN_ATTR_OPERATION_NORM_BWD_PEER_STAT_DESCS of the NormBackward operation must be set. Other restrictions for the peerTensors vector listed in the previous section apply for this pattern as well.

Fused Attention fprop

Mha-Fprop fusions \(O = matmul\left( S = g_{4}\left( P = matmul\left( Q, g_{3}\left( K \right) \right) \right), V \right)\) have been added to the runtime fusion engine to serve patterns that are commonly used in attention. These patterns can be used in BERT, T5, and so on.

There are two key differences to the flash fused attention patterns described in later sections:

  1. Input sizes supported contain small sequence lengths (<= 512).

  2. The operation graph is flexible to switch between different types of masks, different operations between the two matrix multiplications, and so on.

Mha-fprop cuDNN Operation Graph

g 3 can be an empty graph or a single scale operation with the scale being a scalar value (CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_MUL).

g 4 can be empty or the combination of the following DAGs of cuDNN operations. Each of these DAGs is optional, as shown by the dotted line.

DAGs of cuDNN operations

The combination has to obey the order in which we present them. For example, if you want to use the padding mask and softmax, the padding mask has to appear before softmax.

These operations are commonly used in attention. In the following diagram, we depict how to create a DAG for each of the operations. In later versions, we will be expanding the possible DAGs for g 3 and g 4.

Padding Mask

cuDNN graph depicting DAG:Padding Mask

Causal Mask

cuDNN graph depicting DAG:Causal Mask

Softmax

cuDNN graph depicting DAG:Softmax

Dropout

cuDNN graph depicting DAG:Dropout

g 4 is capable of storing an intermediate tensor to global memory marked as S, which can be used for fused multi-head attention bprop. Both DAG:Softmax and DAG:Dropout have this capability. Set S as the output from the last DAG in the graph.

The tensor descriptor marked as S must have the CUDNN_ATTR_TENSOR_REORDERING_MODE set to CUDNN_TENSOR_REORDERING_F16x16. This is because the tensor is stored in a special format and can only be consumed by fused attention bprop.
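
As a rough sketch of what this looks like against the backend API, the following sets the reordering mode on an already-created tensor descriptor. This is an assumption-laden illustration, not documentation of the API: error handling is omitted, the attribute type enum name is our assumption, and the remaining tensor attributes (dimensions, strides, UID, alignment) are assumed to be set elsewhere before finalization.

    #include <cudnn.h>

    // Hedged sketch: mark a tensor descriptor with the F16x16 reordering mode
    // required for the S tensor above. Assumes tensorDesc was created with
    // cudnnBackendCreateDescriptor(CUDNN_BACKEND_TENSOR_DESCRIPTOR, ...).
    void markTensorF16x16(cudnnBackendDescriptor_t tensorDesc) {
        cudnnBackendTensorReordering_t reorder = CUDNN_TENSOR_REORDERING_F16x16;
        cudnnBackendSetAttribute(tensorDesc,
                                 CUDNN_ATTR_TENSOR_REORDERING_MODE,
                                 CUDNN_TYPE_TENSOR_REORDERING_MODE,
                                 1, &reorder);
    }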

There is an additional option of generating the mask on the user end and passing it directly to the pointwise multiplier. The mask needs to be of I/O data type FP16/BF16 and S will store the mask in the sign bit to communicate to bprop.

Limitations Of Mha-fprop Fusions

MatMul:

  • Compute type for both MatMul ops must be float.

  • Input tensors must have data type FP16 or BF16.

  • Output tensors must have data type FP16, BF16, or FP32 (TF32).

Pointwise operations in g 3 and g 4: compute type must be FP32 (TF32).

Reduction operations in g 3 and g 4: I/O types and compute types must be FP32 (TF32).

RNG operation in g 3 and g 4:

  • Data type of yTensor must be FP32 (TF32).

  • The CUDNN_TYPE_RNG_DISTRIBUTION must be CUDNN_RNG_DISTRIBUTION_BERNOULLI.

Layout requirements of Mha-fprop fusions include:

  • All I/O tensors must have 4 dimensions, with the first two denoting the batch dimensions. The usage of rank-4 tensors in MatMul ops can be read from the Backend Descriptor Types documentation.

  • The contracting dimension (dimension K) for the first MatMul must be 64.

  • The non-contracting dimensions (M and N) for the first MatMul must be less than or equal to 512. In inference mode, any sequence length is functional. For training, support exists only for multiples of 64 (a small validation sketch follows this list).

  • The last dimension (corresponding to hidden dimensions) in Q, V, and O is expected to have stride 1.

  • For the K tensor, the stride is expected to be 1 for the 2nd last dimension.

  • The S tensor is expected to have CUDNN_ATTR_TENSOR_REORDERING_MODE set to CUDNN_TENSOR_REORDERING_F16x16.
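
The following is a minimal validation sketch for the first-MatMul constraints listed above. It is a hypothetical helper, not a cuDNN API, and only covers the dimension rules, not the stride or data-type requirements.

    #include <cstdint>

    // Illustrative check of the first-MatMul constraints for Mha-fprop fusions:
    // contracting dimension K must be 64, sequence lengths (M, N) are limited
    // to 512, and training additionally requires them to be multiples of 64.
    bool mhaFpropDimsSupported(int64_t m, int64_t n, int64_t k, bool training) {
        if (k != 64) return false;
        if (m > 512 || n > 512) return false;
        if (training && (m % 64 != 0 || n % 64 != 0)) return false;
        return true;
    }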

Fused Attention bprop

Mha-Bprop fusions are executed in a fused pattern in a single kernel.

\(dV = matmul\left( g_{5}\left( S \right), dO \right)\)

\(dS = matmul\left( dO, V^{T} \right)\)

\(dQ = matmul\left( g_{6}\left( dS \right), K \right)\)

\(dK = matmul\left( Q, g_{7}\left( dS \right) \right)\)

cuDNN supports the corresponding backpropagation graph for fused attention. This can be used together with the fused attention fprop graph to perform training on models that have similar architectures to BERT and T5. This is not compatible with the flash fused attention bprop operation graph.

Mha-bprop cuDNN Operation Graph

g 5, g 6, and g 7 can only support a fixed DAG. We are working towards generalizing these graphs.

cuDNN Graph Depicting g 5

g 6 represents the backward pass of softmax and masking, to get dP.

cuDNN Graph Depicting g 6

There are several options for the Mask DAG that you can opt into: use the padding/causal mask, supply a general mask as an input, or do no masking at all.

cuDNN Graph Depicting Mask DAG

The dBias DAG is useful for calculating the bprop of the relative positional encoding; it is optional and available for you to opt into.

cuDNN Graph Depicting dBias DAG

g 7 is the transpose of dP, the output of g 6.

cuDNN Graph Depicting g 7
Limitations Of Mha-bprop Fusions

MatMul:

  • Compute type for both MatMul ops must be float.

  • Input tensors must have data type FP16 or BF16.

  • Output tensors must have data type FP16, BF16, or FP32 (TF32).

Pointwise operations in g 5, g 6, and g 7: compute type must be FP32 (TF32).

Reduction operations in g 5, g 6, and g 7: I/O types and compute types must be FP32 (TF32).

Layout requirements of Mha-bprop fusions include:

  • All I/O tensors must have 4 dimensions, with the first two denoting the batch dimensions. The usage of rank-4 tensors in MatMul ops can be read from the Backend Descriptor Types documentation.

  • The contracting dimension (dimension K) for the second MatMul must be 64.

  • The contracting dimension (dimension K) for the first, second, and third MatMul must be less than or equal to 512 and a multiple of 64.

  • The last dimension (corresponding to hidden dimensions) in Q, K, V, O, and dO is expected to have stride 1.

  • The S and dP tensors are expected to have CUDNN_ATTR_TENSOR_REORDERING_MODE set to CUDNN_TENSOR_REORDERING_F16x16.

Fused Flash Attention fprop

cuDNN supports flash fused attention to perform scaled dot product attention, commonly used in models like GPT, BERT, and so on. The general pattern supported by this engine is BMM-Softmax-BMM, with many other optional features that you can opt into. You can choose to create the graph by yourself or use the custom sdpa node in the cuDNN frontend. Using the frontend node makes opting into the different options, like causal mask, dropout, alibi masking, and so on, very easy.

Flash fprop cuDNN Operation Graph

Pre-softmax optional DAGs cover multiple options for the users to configure:

  • pointwise Multiply node for attention scale after the first matmul

  • pointwise Add node for the relative positional encoding to add a bias after the first matmul

  • Different masking options like causal masking, padding masking, sliding window attention, and alibi masking. Users can choose multiple masking schemes together or no masking.

  • pointwise Multiply node that accepts a full tensor that can be used as a custom mask generated by the user

Post-softmax optional DAGs cover multiple options for the users to configure:

  • pointwise Multiply node with a RNG node to signify dropout

  • pointwise Multiply node with a user generated tensor acting as the dropout mask

All these DAGs are optional, and you can enable them depending on the cuDNN API you are targeting. If you use the sdpa node in the cuDNN frontend, you can set the provided API options to true, for example use_causal_mask(True), and the frontend internally adds the correct graph automatically. When using the graph API directly, you add the corresponding graph of the operations you want into the cuDNN graph.

Compound operations, for example, Causal Mask, Sliding Window Mask, Softmax, and so on, can be represented using the following operation graphs in cuDNN.

Snapshot of support of Fused Flash Attention fprop

Causal Mask

Flash fprop Causal Mask Operation Graph

Padding Mask

Flash fprop Padding Mask Operation Graph

Sliding Window Mask

Flash fprop Sliding Window Mask Operation Graph

Alibi Mask

Flash fprop Alibi Mask Operation Graph

Softmax

Flash fprop Softmax Operation Graph

Dropout

Flash fprop Dropout Operation Graph
Limitations For The Input And The Output Non-Virtual Tensors

Q, K^T, and V tensor:

  • All tensors must be either FP16 or BF16 data type.

  • Contracting dimension for Q must be a multiple of 8 with maximum value 256.

  • Non-contracting dimension for V must be a multiple of 8 with maximum value 256.

  • Contracting dimension for Q and K^T needs to have stride 1 in the layout.

  • Non-contracting dimension for V needs to have stride 1 in the layout.

  • The second dimension in K^T corresponding to the number of heads can be a factor of the number of heads of Q.

  • The second dimension in V corresponding to the number of heads can be a factor of the number of heads of Q.

Softmax stats:

  • Data type must be FP32.

  • Data must be in row major format.

O tensor:

  • Data type must be either FP16 or BF16.

  • The stride for the last dimension corresponding to the hidden dim per head should be 1.

Seed and Offset: INT32 or INT64 scalar in host or GPU.

Scale to Pointwise: attention scale can be FP16/BF16/FP32.

Inference mode can be turned on by passing the Softmax stats as a virtual tensor and setting the RNG node probability to 0.0f. The pattern is supported for GPUs with NVIDIA Ampere architecture and newer.

In cuDNN version >= 9.1.0, we have also introduced Fused Flash Attention in the FP8 data type, supported on NVIDIA Hopper GPUs. In addition to the standard fprop graph, there are additional dequantization scales, quantization scales, and absolute max (amax) calculations. The current FP8 support is a subset of the features available with BF16. We are actively working on expanding the support for the FP8 kernels. For more information about the scales and amax, refer to the Specialized Pre-Compiled Engines, FP8 Fused Flash Attention section.

In the following diagram, red tensors indicate FP8 datatype tensors and black tensors are in FP32 datatype.

FP8 Flash fprop Operation Graph

Pre-softmax optional DAGs cover multiple options for you to configure:

  • pointwise Multiply node for attention scale after the first MatMul

  • Masking options include causal masking and no masking

Post-softmax optional DAGs cover multiple options for you to configure:

  • Currently there is no support for dropout

Limitations For The Input And The Output Non-Virtual Tensors for FP8 Flash Attention

Q, K^T, and V tensor:

  • All tensors must be either E4M3 or E5M2 data type.

  • Contracting dimension for Q must be a multiple of 16 with maximum value of 128.

  • Non-contracting dimension for V must be a multiple of 16 with maximum value of 128.

  • Contracting dimension for Q and K^T needs to have stride 1 in the layout.

  • Non-contracting dimension for V needs to have stride 1 in the layout.

  • The second dimension in K^T corresponding to the number of heads can be a factor of the number of heads of Q.

  • The second dimension in V corresponding to the number of heads can be a factor of the number of heads of Q.

Softmax stats:

  • Data type must be FP32.

  • Data must be in row major format.

O tensor:

  • Data type must be either E4M3 or E5M2.

  • The stride for the last dimension corresponding to the hidden dim per head should be 1.

Scale to Pointwise: attention scale can be FP32.

Fused Flash Attention bprop

cuDNN supports the corresponding backpropagation graph for fused flash attention. This can be used together with the fprop graph to perform training on Large Language Models (LLMs).

All the options mentioned in fprop are applicable in the bprop graph as well. The corresponding bprop frontend node contains the same options and can be configured to perform the bprop. Users opting for the graph API again need to add the graphs of the operations they want. The graph shown below is for a standard attention layer in GPT with causal masking and dropout.

For Grouped Query Attention (GQA) and Multi Query Attention (MQA), you can configure an additional reduction node for dK and dV, which reduces the tensor from the full number of heads (Q heads) to the actual K and V heads.

For the input and output tensors, the limitations from the fprop graph are carried over. For the bprop specific tensors, the limitations are as follows:

Limitations For The bprop Specific Tensors

dQ, dK, and dV tensor:

  • All tensors must be either FP16 or BF16 data type.

  • The last dimension corresponding to the hidden dim per head must be a multiple of 8 with maximum value 128.

  • The stride for the last dimension corresponding to the hidden dim per head should be 1.

Softmax sum:

  • Data type must be FP32.

  • Data must be in row major format.

dO tensor:

  • Data type must be either FP16 or BF16.

  • The last dimension corresponding to the hidden dim per head must be a multiple of 8 with maximum value 128.

  • The stride for the last dimension corresponding to the hidden dim per head should be 1.

  • The layout of the tensor is required to be the same as the O tensor.

dqAccum tensor:

  • Data type must be FP32.

  • The tensor must be memset to zero before passing to cuDNN.

  • Data must be in row major format.

Flash bprop cuDNN Operation Graph

The pattern is supported for GPUs with NVIDIA Ampere architecture and newer.

In cuDNN version >= 9.1.0, we have also introduced Fused Flash Attention bprop in the native FP8 data type, supported on NVIDIA Hopper GPUs. In addition to the standard bprop graph, there are additional dequantization scales, quantization scales, and absolute max (amax) calculations. The current FP8 Flash Attention bprop support corresponds to the FP8 Flash Attention fprop support.

In the following diagram, red tensors indicate FP8 datatype tensors and black tensors are in FP32 datatype.

FP8 Flash bprop Operation Graph
Limitations For The FP8 Fused Flash Attention bprop Specific Tensors

dQ, dK, and dV tensor:

  • All tensors must be either E4M3 or E5M2 data type.

  • The last dimension corresponding to the hidden dim per head must be 128.

  • The stride for the last dimension corresponding to the hidden dim per head should be 1.

dO tensor:

  • Data type must be either E4M3 or E5M2.

  • The last dimension corresponding to the hidden dim per head must be 128.

  • The stride for the last dimension corresponding to the hidden dim per head should be 1.

  • The layout of the tensor is required to be the same as the O tensor.

Specialized Pre-Compiled Engines

The pre-compiled specialized engines target and optimize for a specialized graph pattern with a ragged support surface. Because of this targeting, these graphs do not require runtime compilation.

In most cases, the specialized patterns are just special cases of the generic patterns used in the runtime fusion engines, but there are some cases where the specialized pattern does not fit any of the generic patterns. If your graph pattern matches a specialized pattern, you will get at least a pattern matching engine, and you might also get runtime fusion engines as another option.

Currently, the following patterns are supported by the pattern matching engines. Some nodes are optional. Optional nodes are indicated by dashed outlines.

ConvBNfprop

The ConvBNfprop pattern is illustrated in the following figure. Its restrictions and options include:

  • The three pointwise nodes scale, bias, and ReLU are optional.

  • X, Z, W, s 1, b 1 must all be of FP16 data type.

  • Z needs to be of shape [N, C, H, W] with NHWC packed layout.

  • W needs to be of shape [K, C, R, S] with KRSC packed layout.

  • s 1 and b 1 need to be of shape [1, C, 1, 1] with NHWC packed layout.

  • Only ReLU activation is supported.

  • All of the intermediate tensors need to be virtual, except Y needs to be non-virtual.

  • I/O pointers should be 16 bytes aligned.

  • This pattern is only supported on devices with compute capability >= 8.0 (with the exception of NVIDIA Ada Lovelace architecture, 8.9).

  • On devices with compute capability >= 9.0, we only support two patterns:

    • the full pattern: scale + bias + ReLU + Conv + GenStats, and

    • the partial pattern: Conv + GenStats.

ConvBNfprop, A Pre-Compiled Engine, Fuses ConvolutionFwd and GenStats With Several Pointwise Operations

Skip connections are commonly observed in ResNet-like models. To support fusions in skip connections, we support a variant of the pattern above, the DBARCS pattern (short for Dual, Scale, Bias, Add, ReLU, Conv genStats). The limitations and options of the DBARCS pattern include:

  • The pointwise dual scale and dual bias nodes are either both present or not. This is indicated by the dashed block encircling the dual scale and dual bias nodes. In case both the nodes are missing, the dual_X tensor is directly fed as input to the add node.

  • The pointwise nodes scale, bias, add, and ReLU are required nodes.

  • Currently, only supported on Hopper GPUs.

  • For all the other data types, layout and virtualness restrictions of the ConvBNfprop pattern apply to this pattern as well.

  • dual_X, dual_scale, and dual_bias must all be of FP16 data type.

  • dual_scale and dual_bias must be of shape [1,C,1,1] with NHWC packed layout.

  • Intermediate outputs of the ReLU and Conv nodes: Relu_Y and Y are non-virtual. All the other intermediate outputs are virtual.

  • The weight tensor W for the convolution needs to be of shape [K,C,1,1]. Only 1x1 filters with padding 0 are supported for the convolution in the DBARCS pattern.

DBARCS In The convBNfprop Series For Supporting Fusions Across Skip Connections

ConvBNwgrad

The ConvBNwgrad pattern is illustrated in the following figure. Its restrictions and options include:

  • The three pointwise operations are all optional, as indicated by the dashed outlines.

  • Only ReLU activation is supported.

  • X, s 1, b 1, and dy must all be of FP16 datatype.

  • I/O pointers should be 16 bytes aligned.

  • X, s 1, b 1, and dy must all have NHWC packed layouts.

  • All the intermediate tensors need to be virtual.

  • This pattern is only supported on devices with compute capability >= 8.0 (with the exception of NVIDIA Ada Lovelace architecture, 8.9).

  • On devices with compute capability >= 9.0, support is restricted to:

    • the full pattern: scale + bias + ReLU + wgrad.

ConvBNwgrad, A Pre-Compiled Engine, Fuses ConvolutionBwFilter With Several (Optional) Pointwise Operations

ConvBiasAct

The ConvBiasAct pattern is illustrated in the following figure. Its restrictions and options include:

  • \(\alpha_{1}\) and \(\alpha_{2}\) need to be scalars.

  • The activation node is optional.

  • The size of the bias tensor should be [1, K, 1, 1].

  • Internal conversions are not supported. That is, the virtual output between nodes needs to have the same data type as the node's compute type, which should be the same as the epilog type of the convolution node.

  • There are some restrictions on the supported combination of data types, which can be found in the API Reference (refer to cudnnConvolutionBiasActivationForward()).

ConvBiasAct, A Pre-Compiled Engine, Fuses ConvolutionFwd With Several Pointwise Operations

ConvScaleBiasAct

The ConvScaleBiasAct pattern is illustrated in the following figure. Its restrictions and options include:

  • \(\alpha_{1}\), \(\alpha_{2}\), and \(b_{1}\) should have the same data type/layout and can only be FP32.

  • X, W, and Z can only be INT8x4 or INT8x32.

  • The size of the bias tensor should be [1, K, 1, 1].

  • Internal conversions are not supported. That is, the virtual output between nodes needs to be the same as their compute type.

  • Currently, Pointwise:ReLU is the only optional pointwise node.

ConvScaleBiasAct, A Pre-Compiled Engine

This pattern is very similar to ConvBiasAct. The difference is that here, the scales \(\alpha_{1}\) and \(\alpha_{2}\) are tensors, not scalars. If they are scalars, this pattern becomes a normal ConvBiasAct.

DgradDreluBNBwdWeight

The DgradDreluBNBwdWeight pattern is illustrated in the following figure. Its restrictions and options include:

  • Dgrad input dy and W are of FP16 datatypes.

  • Among the batch norm fwd inputs, X_bn is of FP16 data type while the other tensors (mean_bn, invstd_dev_bn, scale_bn, and bias_bn) are FP32.

  • Outputs: dScale, dBias, A, B, C are of FP32 data type.

  • All pointers are 16 byte aligned.

  • This pattern is only supported on devices with compute capability >= 8.0 (with the exception of NVIDIA Ada Lovelace architecture, 8.9).

DgradDreluBNBwdWeight is a pre-compiled engine that can be used in conjunction with the dBNApply pattern to compute the backwards path of batch norm.

DgradDreluBNBwdWeight Pattern For Fusions In The Backward Pass

The BNBwdWeight operation takes in five inputs: X_bn, mean_bn, invstddev_bn, scale_bn, and dy_bn (that is, the output from the ReLUBwd node).

It produces five outputs: gradients of the batch norm scale and bias params, dScale, dBias, and coefficients A, B, C. Note that for illustration purposes, the inputs are duplicated. The inputs on the left and right are however exactly the same.

This pattern is typically used in the computation of the Batch Norm Backward Pass.

When computing the backward pass of batch norm, dScale, dBias, and dX_bn are needed. The DgradDreluBnBwdWeight pattern computes the former two. Using the generated A, B, and C, the dBNApply pattern shown below can then compute the input gradient as dx_bn = A*dy_bn + B*X_bn + C.
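
The elementwise computation is simple enough to sketch directly. The following is an illustrative CPU reference (not a cuDNN API), assuming NHWC-packed data and per-channel A, B, and C coefficients.

    #include <cstddef>

    // Illustrative reference for the dBNApply computation
    // dx_bn = A*dy_bn + B*X_bn + C, with per-channel coefficients A, B, C
    // and NHWC-packed activations (channel index varies fastest).
    void dbnApply(const float* A, const float* B, const float* C,
                  const float* dy_bn, const float* x_bn, float* dx_bn,
                  std::size_t n_elems, std::size_t channels) {
        for (std::size_t i = 0; i < n_elems; ++i) {
            const std::size_t c = i % channels;  // channel index for NHWC-packed data
            dx_bn[i] = A[c] * dy_bn[i] + B[c] * x_bn[i] + C[c];
        }
    }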

dBNApply Pattern For Final Gradient Computation

The dBNApply pattern was initially supported by a pre-compiled static engine but is now supported by the generic runtime fusion engine.

Note that the DgradDreluBNBwdWeight pattern is used in combination with the forward pass pattern ConvBNfprop. Because of performance reasons, the output of batch norm Y_bn, which was calculated in ConvBNfprop (output of scale-bias), needs to be recalculated by DgradDreluBnBwdWeight. The pointwise add node subtracts mean_bn from X_bn, hence the alpha2 parameter for that node should be set to -1.

FP8 Fused Flash Attention

cuDNN supports fused flash attention with input and output data types being in FP8 format. This FP8-specific graph pattern is supported only on Hopper (H100) GPUs.

Support exists for both training (forward and backward pass) and inference in FP8 format. The training forward pass is slightly different from the inference forward pass regarding whether some intermediate tensors are output or not.

Within the NVIDIA Hopper architecture, there are two new FP8 formats: E4M3 and E5M2. Currently, the forward pass is supported only when all inputs and outputs are in E4M3 format. For the backward pass, support exists only when some of the inputs and outputs are in E4M3 and some are in E5M2. More general support for the FP8 formats will be added in future releases.

Due to the limited numerical precision of FP8 data type, for practical use cases, you must scale values computed in FP32 format before storing them in FP8 format, and descale the values stored in FP8 format before performing computations on them. For more information, refer to the Transformer Engine FP8 primer.

The following notation is used in this section.

  • b - number of batches

  • h - number of heads

  • s - maximum length of sequences in a batch

  • d - embedding dimension size of a word in a sequence

Scaling and Descaling

In the context of FP8, scaling refers to multiplying each element of a FP32 tensor by a quantization factor.

The quantization factor is computed as: (Max representable value in the FP8 format) / (Max absolute value seen in the tensor).

For the E4M3 format, the quantization factor is 448.f/ tensor_amax (rounded to the nearest lower power of two).

For the E5M2 format, the quantization factor is 57344.f / tensor_amax (rounded to the nearest lower power of two).

The dequantization factor is the reciprocal of the quantization factor.

Scaling is meant to span the full range of the FP8 format when computing on and storing FP8 values, thereby minimizing precision loss. True values in FP32 format are multiplied by the quantization factor before being stored as scaled values in FP8 format. Results of computations on scaled FP8 values are descaled by multiplying with the dequantization factor to convert them back to their true values in FP32 format.
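
As a concrete illustration of the factors defined above, here is a minimal sketch (a hypothetical helper, not a cuDNN API) that computes the quantization and dequantization factors, including the round-down to the nearest power of two.

    #include <cmath>

    // Quantization factor = (max representable value in the FP8 format) / tensor_amax,
    // rounded to the nearest lower power of two. 448.f corresponds to E4M3,
    // 57344.f to E5M2. The dequantization factor is its reciprocal.
    float fp8QuantizationFactor(float tensorAmax, bool e4m3) {
        const float maxRepresentable = e4m3 ? 448.f : 57344.f;
        const float raw = maxRepresentable / tensorAmax;
        return std::exp2f(std::floor(std::log2f(raw)));  // round down to a power of two
    }

    float fp8DequantizationFactor(float tensorAmax, bool e4m3) {
        return 1.f / fp8QuantizationFactor(tensorAmax, e4m3);
    }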

Scaling and descaling are critical for convergence with the FP8 data type, hence cuDNN only supports graph patterns for FP8 fused attention with the scaling and descaling nodes present.

Unpadded Tensors

In fused flash attention, the length of different sequences in a batch can be different. The cuDNN operation graph supports an unpadded layout where all the sequences of different lengths in a batch are tightly packed. All the word embeddings after the useful length of the sequence are pushed towards the end of all sequences in the layout.

Forward Pass

The following figure shows the cuDNN operation graph for the fused attention forward pass. The same graph supports forward pass for both training and inference. The operation graph pattern is identified as training when M and Zinv tensors are non-virtual. When M and Zinv tensors are virtual, the operation graph pattern is identified as inference.

The FP8 tensors are expected to be scaled, and the matrix multiplication computation is performed on the FP8 tensors in the scaled format. All non matrix multiplication computations are performed in FP32 precision. The output of the FP8 matrix multiplication is converted to real values in FP32 format by multiplying with the descale values.

FP8 Fused Attention Forward Pass Operation Graph
FP8 Fused Attention Forward Pass Input Tensors (tensor name: data type, dimensions)

  • Q: E4M3, [b, h, s, d]

  • K: E4M3, [b, h, s, d]

  • V: E4M3, [b, h, s, d]

  • Attention Scale: FP32 (by value), [1, 1, 1, 1]

  • Descale Q: FP32, [1, 1, 1, 1]

  • Descale K: FP32, [1, 1, 1, 1]

  • Descale V: FP32, [1, 1, 1, 1]

  • Scale S: FP32, [1, 1, 1, 1]

  • Descale S: FP32, [1, 1, 1, 1]

  • Scale O: FP32, [1, 1, 1, 1]

  • RNG Seed: INT64, [1, 1, 1, 1]

  • RNG Offset: INT64, [1, 1, 1, 1]

  • Dropout Probability (p) or Keep Probability (1 - p): FP32, [1, 1, 1, 1]

FP8 Fused Attention Forward Pass Output Tensors (tensor name: data type, dimensions)

  • O: E4M3, [b, h, s, d]

  • Amax_O: FP32, [1, 1, 1, 1]

  • M: FP32 (training only), [b, h, s, 1]

  • Zinv: FP32 (training only), [b, h, s, 1]

FP8 Fused Attention Forward Pass Limitations

Q, K, and V tensors:

  • Required to be interleaved.

  • Packing layout is (b,s,3,h,d).

  • Must be in the E4M3 FP8 data type.

  • Can be in an unpadded tightly packed layout.

  • Either Q, K, V, and O can all be unpadded or none.

O tensor:

  • Must be in (b,s,h,d) layout.

  • Must be in the E4M3 format.

  • Can be in an unpadded tightly packed layout.

  • Either Q, K, V, and O can all be unpadded or none.

All virtual tensors: must be in FP32 format.

Compute precision: the compute precision of all operations must be FP32.

M, Zinv:

  • Must be in FP32 format.

  • Should not be in an unpadded tightly packed layout.

Attention Scale, Quantization (scale), and Dequantization (descale) tensors: must be in FP32 format.

Match between cuDNN API specifications and passed-in device tensors: there is an implicit dependency between the specifications made through the cuDNN API and the device tensors passed in at runtime. The execution of the operation graph is undefined when the passed-in device tensors do not conform to the API specifications.

Dropout probability for inference (p): the dropout probability for forward pass inference must be zero.

Embedding dimension size (d): only an embedding dimension size of 64 is supported.

Maximum sequence length (s): the maximum sequence length must be a multiple of 64, and only maximum sequence lengths up to 512 are supported (a minimal validation sketch follows this list).
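
The following is a minimal sketch of the size limits listed above. It is a hypothetical helper, not a cuDNN API, and covers only the embedding dimension and maximum sequence length rules.

    #include <cstdint>

    // Illustrative check of the FP8 fused attention forward-pass size limits:
    // embedding dimension d must be 64, and the maximum sequence length s must
    // be a multiple of 64 and no larger than 512.
    bool fp8FusedAttentionSizesOk(int64_t d, int64_t maxSeqLen) {
        return d == 64 && maxSeqLen % 64 == 0 && maxSeqLen <= 512;
    }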

Backward Pass

FP8 Fused Attention Backward Pass Operation Graph
FP8 Fused Attention Backward Pass Input Tensors (tensor name: data type, dimensions)

  • Q: E4M3, [b, h, s, d]

  • K: E4M3, [b, h, s, d]

  • V: E4M3, [b, h, s, d]

  • O: E4M3, [b, h, s, d]

  • dO: E5M2, [b, h, s, d]

  • M: FP32, [b, h, s, 1]

  • Zinv: FP32, [b, h, s, 1]

  • Attention Scale: FP32 (by value), [1, 1, 1, 1]

  • Descale Q: FP32, [1, 1, 1, 1]

  • Descale K: FP32, [1, 1, 1, 1]

  • Descale V: FP32, [1, 1, 1, 1]

  • Scale S: FP32, [1, 1, 1, 1]

  • Descale S: FP32, [1, 1, 1, 1]

  • Descale O: FP32, [1, 1, 1, 1]

  • Descale dO: FP32, [1, 1, 1, 1]

  • Scale dS: FP32, [1, 1, 1, 1]

  • Descale dS: FP32, [1, 1, 1, 1]

  • Scale dQ: FP32, [1, 1, 1, 1]

  • Scale dK: FP32, [1, 1, 1, 1]

  • Scale dV: FP32, [1, 1, 1, 1]

  • RNG Seed: INT64, [1, 1, 1, 1]

  • RNG Offset: INT64, [1, 1, 1, 1]

  • Dropout Probability (p) or Keep Probability (1 - p): FP32, [1, 1, 1, 1]

FP8 Fused Attention Backward Pass Output Tensors (tensor name: data type, dimensions)

  • dQ: E5M2, [b, h, s, d]

  • dK: E5M2, [b, h, s, d]

  • dV: E5M2, [b, h, s, d]

  • Amax_dQ: FP32, [1, 1, 1, 1]

  • Amax_dK: FP32, [1, 1, 1, 1]

  • Amax_dV: FP32, [1, 1, 1, 1]

  • Amax_dS: FP32 (the virtual tensor dS is of E5M2 type), [1, 1, 1, 1]

FP8 Fused Attention Backward Pass Limitations

Q, K, and V tensors:

  • Required to be interleaved.

  • Packing layout is (b,s,3,h,d).

  • Must be in the E4M3 FP8 format.

  • Can be in an unpadded tightly packed layout.

  • Either Q, K, V, O, dO, dQ, dK, and dV can all be unpadded or none.

dQ, dK, and dV tensors:

  • Required to be interleaved.

  • Packing layout is (b,s,3,h,d).

  • Must be in the E5M2 FP8 format.

  • Can be in an unpadded tightly packed layout.

  • Either Q, K, V, O, dO, dQ, dK, and dV can all be unpadded or none.

O and dO tensor:

  • Must be in (b,s,h,d) layout.

  • Must be in the E4M3 format.

  • Can be in an unpadded tightly packed layout.

  • Either Q, K, V, O, dO, dQ, dK, and dV can all be unpadded or none.

All virtual tensors: must be in FP32 format.

Compute precision: the compute precision of all operations must be FP32.

M, Zinv:

  • Must be in FP32 format.

  • Should not be in an unpadded tightly packed layout.

Attention Scale, Quantization (scale), and Dequantization (descale) tensors: must be in FP32 format.

Match between cuDNN API specifications and passed-in device tensors: there is an implicit dependency between the specifications made through the cuDNN API and the device tensors passed in at runtime. The execution of the operation graph is undefined when the passed-in device tensors do not conform to the API specifications.

Embedding dimension size (d): only an embedding dimension size of 64 is supported.

Maximum sequence length (s): the maximum sequence length must be a multiple of 64, and only maximum sequence lengths up to 512 are supported.

Mapping with Backend Descriptors

For readability, the operations used in this section are abbreviated. The mapping with the actual backend descriptors can be found in this table:

Notations and Backend Descriptors (each notation used in this section is followed by the corresponding backend descriptor)

Pointwise:scale

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_MUL and with operand B broadcasting into operand X

Pointwise:bias

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_ADD and with operand B broadcasting into operand X

Pointwise:add

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_ADD and with operand B with same dimensions as X

Pointwise:mul

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_MUL and with operand B with same dimensions as X

Pointwise:ReLU

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_RELU_FWD

Pointwise:ReLUBwd

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_RELU_BWD

Pointwise:tanh

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_TANH_FWD

Pointwise:sigmoid

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_SIGMOID_FWD

Pointwise:ELU

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_ELU_FWD

Pointwise:{ReLU,tanh,sigmoid,ELU}

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with one of the following modes: CUDNN_POINTWISE_RELU_FWD, CUDNN_POINTWISE_TANH_FWD, CUDNN_POINTWISE_SIGMOID_FWD, CUDNN_POINTWISE_ELU_FWD

MatMul

CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR

ConvolutionFwd

CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR

ConvolutionBwFilter

CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_FILTER_DESCRIPTOR

ConvolutionBwData

CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_DATA_DESCRIPTOR

GenStats

CUDNN_BACKEND_OPERATION_GEN_STATS_DESCRIPTOR

ResampleFwd

CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR

Reduction

CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR

BnBwdWeight

CUDNN_BACKEND_OPERATION_BN_BWD_WEIGHTS_DESCRIPTOR

NormForward

CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR

NormBackward

CUDNN_BACKEND_OPERATION_NORM_BACKWARD_DESCRIPTOR

BOOLEAN / packed-BOOLEAN

CUDNN_DATA_BOOLEAN: As described in the cuDNN API Reference, this type implies that eight boolean values are packed in a single byte, with the lowest index on the right (that is, least significant bit). packed-BOOLEAN and BOOLEAN are used interchangeably, where the former is used to emphasize and remind the user about the semantics.

INT8

CUDNN_DATA_INT8

FP8

CUDNN_DATA_FP8_E4M3 or CUDNN_DATA_FP8_E5M2

FP16

CUDNN_DATA_HALF

BF16

CUDNN_DATA_BFLOAT16

FP32

CUDNN_DATA_FLOAT

TF32

A tensor core operation mode used to accelerate floating point convolutions or MatMuls. It can be used for operations with compute type CUDNN_DATA_FLOAT on NVIDIA Ampere architecture or later, and can be disabled by setting the environment variable NVIDIA_TF32_OVERRIDE=0.
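
As an aside, the packed-BOOLEAN layout described in the table above (eight boolean values per byte, with the lowest index stored in the least significant bit) can be illustrated with a small sketch; the helper below is hypothetical and not part of cuDNN.

    #include <cstddef>
    #include <cstdint>

    // Illustration of the packed-BOOLEAN layout: eight booleans per byte,
    // lowest index in the least significant bit.
    bool getPackedBoolean(const uint8_t* data, std::size_t index) {
        return (data[index / 8] >> (index % 8)) & 1u;
    }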