Graph API
The cuDNN library provides a declarative programming model for describing computation as a graph of operations. This graph API was introduced in cuDNN 8.0 to provide a more flexible API, especially with the growing importance of operation fusion.
The user starts by building a graph of operations. At a high level, the user is describing a dataflow graph of operations on tensors. Given a finalized graph, the user then selects and configures an engine that can execute that graph. There are several methods for selecting and configuring engines, which have tradeoffs with respect to easeofuse, runtime overhead, and engine performance.
The graph API has two entry points:
Backend Descriptor Types (lowest level entry point into the graph API)
NVIDIA cuDNN Frontend API (convenience layer on top of the C backend API)
We expect that most users prefer the cuDNN frontend API because:
It is less verbose without loss of control  all functionality accessible through the backend API is also accessible through the frontend API.
It adds functionality on top of the backend API, like errata filters and autotuning.
It is open source.
In either case (that is, the backend or frontend API), the high level concepts are the same.
Key Concepts
As mentioned previously, the key concepts in the graph API are:
These concepts are covered in the subsections below. Later we’ll go through an example to tie them all together.
Operations and Operation Graphs
An operation graph is a dataflow graph of operations on tensors. It is meant to be a mathematical specification and is decoupled from the underlying engines that can implement it, as there may be more than one engine available for a given graph.
I/O tensors connect the operations implicitly, for example, an operation A may produce a tensor X, which is then consumed by operation B, implying that operation B depends on operation A.
Engines and Engine Configurations
For a given operation graph, there are some number of engines that are candidates for implementing that graph. The typical way to query for a list of candidate engines is through a heuristics query, covered below.
An engine has knobs for configuring properties of the engine, like tile size (refer to cudnnBackendKnobType_t).
Heuristics
A heuristic is a way to get a list of engine configurations that are intended to be sorted from the most performant to least performant for the given operation graph. There are three modes:
CUDNN_HEUR_MODE_A
 intended to be fast and be able to handle most operation graph patterns. It returns a list of engine configs ranked by the expected performance.
CUDNN_HEUR_MODE_B
 intended to be more generally accurate than mode A, but with the tradeoff of higher CPU latency to return the list of engine configs. The underlying implementation may fall back to the mode A heuristic in cases where we know mode A can do better.
CUDNN_HEUR_MODE_FALLBACK
 intended to be fast and provide functional fallbacks without expectation of optimal performance.
The recommended workflow is to query either mode A or B and check for support. The first engine config with support is expected to have the best performance.
You can “autotune”, that is, iterate over the list and time for each engine config and choose the best one for a particular problem on a particular device. The cuDNN frontend API provides a convenient function, cudnnFindPlan()
, which does this.
If all the engine configs are not supported, then use the mode fallback to find the functional fallbacks.
Expert users may also want to filter engine configs based on properties of the engine, such as numerical notes, behavior notes, or adjustable knobs. Numerical notes inform the user about the numerical properties of the engine such as whether it does datatype down conversion at the input or during output reduction. The behavior notes can signal something about the underlying implementation like whether or not it uses runtime compilation. The adjustable knobs allow fine grained control of the engine’s behavior and performance.
Graph API Example with Operation Fusion
The following example implements a fusion of convolution, bias, and activation.
Creating Operation and Tensor Descriptors to Specify the Graph Dataflow
First, create three cuDNN backend operation descriptors.
In the following figure, the user specified one forward convolution operation (using CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR
), a pointwise operation for the bias addition (using CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
with mode CUDNN_POINTWISE_ADD
), and a pointwise operation for the ReLU activation (using CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
with mode CUDNN_POINTWISE_RELU_FWD
). Refer to the Backend Descriptor Types for more details on setting the attributes of these descriptors. For an example of how a forward convolution can be set up, refer to the Setting Up An Operation Graph For A Grouped Convolution use case.
You should also create tensor descriptors for the inputs and outputs of all of the operations in the graph. The graph dataflow is implied by the assignment of tensors, for example, by specifying the backend tensor Tmp0
as both the output of the convolution operation and the input of the bias operation, cuDNN infers that the dataflow runs from the convolution into the bias. The same applies to tensor Tmp1
. If the user doesn’t need the intermediate results Tmp0
and Tmp1
for any other use, then the user can specify them to be virtual tensors, so the memory I/Os can later be optimized out.
Graphs with more than one operation node do not support inplace operations (that is, where any of the input UIDs matches any of the output UIDs). Such inplace operations are considered cyclic in later graph analysis and deemed unsupported. Inplace operations are supported for singlenode graphs.
The operation descriptors can be created and passed into cuDNN in any order, as the tensor UIDs are enough to determine the dependencies in the graph.
Finalizing The Operation Graph
Second, the user finalizes the operation graph. As part of finalization, cuDNN performs the dataflow analysis to establish the dependency relationship between operations and connect the edges, as illustrated in the following figure. In this step, cuDNN performs various checks to confirm the validity of the graph.
Configuring An Engine That Can Execute The Operation Graph
Third, given the finalized operation graph, the user must select and configure an engine to execute that graph, which results in an execution plan. As mentioned in Heuristics, the typical way to do this is:
Query heuristics mode A or B.
Look for the first engine config with functional support (or autotune all the engine configs with functional support).
If no engine config was found in (2), try querying the fallback heuristic for more options.
Executing The Engine
Finally, with the execution plan constructed and when it comes time to run it, the user should construct the backend variant pack by providing the workspace pointer, an array of UIDs, and an array of device pointers. The UIDs and the pointers should be in the corresponding order. With the handle, the execution plan and variant pack, the execution API can be called and the computation is carried out on the GPU.
Supported Graph Patterns
The cuDNN Graph API supports a set of graph patterns. These patterns are supported by a large number of engines, each with their own support surfaces. These engines are grouped into four different classes, as reflected by the following four subsections: precompiled single operation engines, generic runtime fusion engines, specialized runtime fusion engines, and specialized precompiled fusion engines. The specialized engines, whether they use runtime compilation or precompilation, are targeted to a set of important use cases, and thus have a fairly limited set of patterns they currently support. Over time, we expect to support more of those use cases with the generic runtime fusion engines, whenever practical.
Since these engines have some overlap in the patterns they support, a given pattern may result in zero, one, or more engines.
PreCompiled Single Operation Engines
One basic class of engines includes precompiled engines that support an operation graph with just one operation; specifically: ConvolutionFwd
, ConvolutionBwdFilter
, ConvolutionBwdData
, or ConvolutionBwBias
. Their more precise support surface can be found in the NVIDIA cuDNN API Reference.
ConvolutionBwdData
ConvolutionBwdData
computes the convolution data gradient of the tensor dy
. In addition, it uses scaling factors ɑ and ꞵ to blend this result with the previous output. This graph operation is similar to cudnnConvolutionBackwardData().
ConvolutionBwdFilter
ConvolutionBwdFilter
computes the convolution filter gradient of the tensor dy
. In addition, it uses scaling factors ɑ and ꞵ to blend this result with the previous output. This graph operation is similar to cudnnConvolutionBackwardFilter().
ConvolutionFwd
ConvolutionFwd
computes the convolution of X with filter data W. In addition, it uses scaling factors ɑ and ꞵ to blend this result with the previous output. This graph operation is similar to cudnnConvolutionForward().
NormalizationBackward
NormalizationBackward
computes the gradient dX
and the scale and bias gradients dScale
and dBias
. This operation supports multiple modes which are set by the attribute CUDNN_ATTR_OPERATION_NORM_BWD_MODE
. The precompiled engines support instance and layer norm backward while batch norm backward is supported by a specialized runtime compiled engine (refer to DReluForkDBn). The mean and variance saved during the forward training pass is passed as input to the NormBackward
operation.
Node and Other Attributes 
Instance Normalization Backward 
Layer Normalization Backward 
RMS Normalization Backward 






[N, C, (D), H, W], input, I type 
[N, C, (D), H, W], input, I type 
[N, C, (D), H, W], input, I type 

[N,C,(1),1,1], input, compute type 
[N,1,(1),1,1], input, compute type 
N/A 

[N,C,(1),1,1], input, compute type 
[N,1,(1),1,1], input, compute type 
[N,1,(1),1,1], input, compute type 

[1,C,(1),1,1], input weight, W type 
[1,C,(D),H,W], input weight, W type 
[1,C,(D),H,W], input weight, W type 

[N, C, (D), H, W], input, O type 
[N, C, (D), H, W], input, O type 
[N, C, (D), H, W], input, O type 

[N, C, (D), H, W], output, I type 
[N, C, (D), H, W], output, I type 
[N, C, (D), H, W], output, I type 

[1,C,(1),1,1], output, W type 
[1,C,(D),H,W], output, W type 
[1,C,(D),H,W], output, W type 

[1,C,(1),1,1], output, W type 
[1,C,(D),H,W], output, W type 
Optional 




Supported layout 
NC(D)HW, N(D)HWC 
NC(D)HW, N(D)HWC 
NC(D)HW, N(D)HWC 
Supported I and O types 
FP16, FP32, BF16 
FP16, FP32, BF16 
FP16, FP32, BF16 
Supported compute type 
FP32 
FP32 
FP32 
Supported W types 
FP32 
FP16, FP32, BF16 
FP16, FP32, BF16 
Alignment requirements for I/O type 
8 bytes aligned 
16 bytes aligned 
16 bytes aligned 
For each operation, all applicable tensors must have the same layout. Neither mixed I/O types, nor mixed compute types are supported.
Layer Norm and RMS Norm are also supported by specialized runtime compiled engines, CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3
for normalization backward operations. For these more performant engines, sizeof(Itype) >= sizeof(Otype)
restriction applies.
NormalizationForward
NormalizationForward
computes the normalization output Y
from the input X
. This operation is used in both the inference and training phase. The phases are distinguished by the attribute CUDNN_ATTR_OPERATION_NORM_FWD_PHASE
.
This operation supports different normalization modes which are set by the attribute CUDNN_ATTR_OPERATION_NORM_FWD_MODE
. The dashed lines indicate optional inputs, which are typically used in the batch norm mode of this operation. Currently, instance norm and layer norm are supported via both precompiled and specialized runtime compiled engines while batch norm and RMS norm are supported using specialized runtime compiled engines (refer to BnAddRelu).
Node and Other Attributes 
Instance Normalization Forward 
Layer Normalization Forward 
RMS Normalization Forward 






[N, C, (D), H, W], input, I type 
[N, C, (D), H, W], input, I type 
[N, C, (D), H, W], input, I type 

[N,C,(1),1,1], output, compute type, only applicable to 
[N,1,(1),1,1], output, compute type, only applicable to 
N/A 

[N,C,(1),1,1], output, compute type, only applicable to 
[N,1,(1),1,1], output, compute type, only applicable to 
[N,1,(1),1,1], output, compute type,only applicable to 

[1,C,(1),1,1], input weight, W type 
[1,C,(D),H,W], input weight, W type 
[1,C,(D),H,W], input weight, W type 

[1,C,(1),1,1], input weight, W type 
[1,C,(D),H,W], input weight, W type 
Optional (no bias by default) 

[N, C, (D), H, W], output, O type 
[N, C, (D), H, W], output, O type 
[N, C, (D), H, W], output, O type 

[1,1,1,1], input, constant 
[1,1,1,1], input, constant 
[1,1,1,1], input, constant 




Supported 



Supported layout 
NC(D)HW, N(D)HWC 
NC(D)HW, N(D)HWC 
NC(D)HW, N(D)HWC 
Supported I and O types 
FP16, FP32, BF16 
FP16, FP32, BF16 
FP16, FP32, BF16 
Supported compute type 
FP32 
FP32 
FP32 
Supported weight types, W type 
FP32 
FP16, FP32, BF16 
FP16, FP32, BF16 
Alignment requirements for I/O type 
8 bytes aligned 
16 bytes aligned 
16 bytes aligned 
For each operation, all applicable tensors must have the same layout.
Layer Norm and RMS Norm are also supported by specialized runtime compiled engines, CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3
with fmode
set to CUDNN_NORM_FWD_TRAINING
and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 4
with fmode
set to CUDNN_NORM_FWD_INFERENCE
. For these more performant engines, sizeof(Itype) >= sizeof(Otype)
restriction applies. For nonruntime compiled engines, W type is the compute type.
Generic Runtime Fusion Engines
The engines documented in the previous section support singleop patterns. Of course, for fusion to be interesting, the graph needs to support multiple operations. And ideally, we want the supported patterns to be flexible to cover a diverse set of use cases. To accomplish this generality, cuDNN has runtime fusion engines that generate the kernel (or kernels) at runtime based on the graph pattern. This section outlines the patterns supported by these runtime fusion engines (that is, engines with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION
behavioral note).
We can think of the support surface as covering the following generic patterns:
Matmul
fusions: \(g_{2}\left( C=Matmul\left( A=g_{1A} \left( inputs \right), B=g_{1B} \right(inputs)), inputs \right)\)ConvolutionFwd
fusions: \(g_{2}\left( Y=ConvolutionFwd\left( X=g_{1} \left( inputs \right), W\right), inputs \right)\)ConvolutionBwdFilter
fusions: \(g_{2}\left( dw=ConvolutionBwdFiler\left( dy, X=g_{1} \right(inputs)), inputs \right)\)ConvolutionBwdData
fusions: \(g_{2}\left( dx=ConvolutionBwdData\left( dy=g_{1} \left( inputs \right), W \right), inputs \right)\)Pointwise
fusions: \(g_{2}\left( inputs \right)\)
Note
g _{1} (including g _{1A} and g _{1B}) indicates fusion operations applied to the inputs of the
matmul
andconvolution
operation.g _{2} indicates fusion operations apply to the output of the
matmul
andconvolution
operation.g _{2} can have more than one output.
The fusion patterns in g _{1} will be referred to as the mainloop fusion, and the fusion patterns in g _{2} will be referred to as the epilogue fusion.
The arrow going into g _{2} can go into any of g _{2} nodes and does not necessarily need to feed into a root node.
The abbreviated notations for operations are used in the diagrams and throughout the text for visualization purposes.
Support Surface
The generic runtime fusion engine includes three independent support surfaces indexed as 90, 80, and 70. A cuDNN graph that fulfills the requirements of at least one support surface will be able to be executed by the generic runtime fusion engine. The following table lists a summary of the features in each support surface. For best performance, we recommend targeting the highest indexed support surface possible and fall back to lower indexed support surfaces if needed.
Feature 
Support Surface 90 
Support Surface 80 
Support Surface 70 

Compute Capability 
>= 9.0 
>= 8.0 
>= 7.0 

Supported 
Supported 
Supported 

Supported 
Supported 
Supported 

Supported 
Not Supported 
Supported 

Partially Supported 
Not Supported 
Supported 

Not Supported 
Not Supported 
Supported 
FP8 
Supported 
Supported for Compute Capability >= 8.9 
Not Supported 
g _{1} (Mainloop) Fusions 
Supported 
Supported 
Partially Supported 
g _{2} (Epilogue) Fusions 
Supported 
Supported 
Supported 
Supported 
Supported 
Not Supported 

Grouped Convolution 
Supported 
Supported 
Not Supported 
The detailed supported features of each support surface are listed in the following subsections.
Support Surface 90
Compute Capability
NVIDIA GPUs with compute capability 9.0 are supported.
Generic Limitations
Strided
ConvolutionBwdData
fusions are not supported.Pointwise
andReduction
fusions are not supported.
Advanced Matmul/Convolution Variations
Mixed input precision
Matmul
,ConvolutionFwd
, andConvolutionBwdData
fusions are supported.Grouped
ConvolutionFwd
,ConvolutionBwdFilter
, andConvolutionBwdData
fusions are supported.
I/O and Intermediate Data Type
The input tensor data type can be any of
{FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}
.The input tensor data type of
Matmul
,ConvolutionFwd
,ConvolutionBwdFilter
, andConvolutionBwdData
operations can be any of{FLOAT, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}
.The output tensor data type can be any of
{INT64, FLOAT, INT32, HALF, BFLOAT16, INT8, UINT8, FP8_E4M3, FP8_E5M2}
.The output data types of
CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR
operations can only beFLOAT
.The intermediate virtual tensor data type can be any of
{FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}
, and this intermediate storage type is obeyed by the codegenerator. Generally,FP32
is recommended.
Compute Data Type
Compute data type can be either
FP32
orINT32
forCUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
operations.Compute data type can only be
FP32
forCUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR
operations.The support surface of the compute data type of
Matmul
,ConvolutionFwd
,ConvolutionBwdFilter
, andConvolutionBwdData
operations depends on the input data types of the operation. The combinatory support surface is listed in the following table.













Mainloop Fusions: g :sub:`1`
g _{1} is a directed acyclic graph (DAG) that can consist of zero or any number of the
CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
operations.All the input tensors must have an alignment of 128 bits. For grouped
ConvolutionFwd
,ConvolutionBwdFilter
, andConvolutionBwdData
fusions, the alignment requirement is per group.All the intermediate tensors must be virtual.
The support surface of dimension and layout are listed in the table below.
Pattern 
Dimension 
Layout 



All tensors can be in either fully packed rowmajor layout or fully packed columnmajor layout. 


All tensors must be in a fully packed NHWC layout. 


All tensors can be in either fully packed NHWC layout or fully packed CHWN layout. In general, both dy and X tensors being in CHWN layout can have the best performance. 


All tensors can be in either fully packed NHWC layout or fully packed CHWN layout. In general, dy tensor being in NHWC layout and W tensor being in CHWN layout can have the best performance. 
Epilogue Fusion: g :sub:`2`
g _{2} is a directed acyclic graph (DAG) that can consist of zero or any number of the
CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
operation and zero or oneCUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR
operation.CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR
operation can only be the exit node of g _{2}.All the input and output tensors must have an alignment of 8 bits. For grouped
ConvolutionFwd
,ConvolutionBwdFilter
, andConvolutionBwdData
fusions, the alignment requirement is per group.In
CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
operations, the tensors being broadcasted cannot be placed as the first input.The support surface of dimension and layout are listed in the table below.
Pattern 
Dimension 
Layout 






All tensors must be in a fully packed NHWC layout. 


All tensors must be in a fully packed NHWC layout. 


All tensors must be in a fully packed NHWC layout. 

Not Supported 
Not Supported 
Support Surface 80
Compute Capability
NVIDIA GPUs with compute capability 8.0, 8.6. 8.7, 8.9, and 9.0 are supported.
Generic Limitations
ConvolutionBwdFilter
fusions are not supported.ConvolutionBwdData
fusions are not supported.Pointwise
andReduction
fusions are not supported.
Advanced Matmul/Convolution Variations
Mixed input precision
Matmul
andConvolutionFwd
fusions are supported.Grouped
ConvolutionFwd
fusions are supported.
I/O and Intermediate Data Type
The input tensor data type can be any of
{FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}
.The input tensor data type of
Matmul
andConvolutionFwd
operations can be any of{FLOAT, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}
.FP8_E4M3
andFP8_E5M2
input tensor data type ofMatmul
andConvolutionFwd
operations are only available with compute capability 8.9.The output tensor data type can be any of
{INT64, FLOAT, INT32, HALF, BFLOAT16, INT8, UINT8, FP8_E4M3, FP8_E5M2}
.The output data types of
CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR
operations can only beFLOAT
.The intermediate virtual tensor data type can be any of
{FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}
, and this intermediate storage type is obeyed by the codegenerator. Generally,FP32
is recommended.FP8_E4M3
andFP8_E5M2
input, output, and intermediate tensor data type is only available with compute capability 8.9 and 9.0.
Compute Data Type
Compute data type can be either
FP32
orINT32
forCUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
operations.Compute data type can only be
FP32
forCUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR
operations.The support surface of the compute data type of
Matmul
,ConvolutionFwd
,ConvolutionBwdFilter
, andConvolutionBwdData
operations depends on the input data types of the operation. The combinatory support surface is listed in the following table.


Note 






Available with compute capability 8.9 only 








Mainloop Fusions: g :sub:`1`
g _{1} is a directed acyclic graph (DAG) that can consist of zero or any number of the
CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
operation.All the input tensors must have an alignment of 32 bits. For ConvolutionFwd fusions with no operations in g _{1}, the input tensors can have an alignment of 8 bits. For grouped
ConvolutionFwd
fusions, the alignment requirement is per group.All the intermediate tensors must be virtual.
The support surface of dimension and layout are listed in the table below.
Pattern 
Dimension 
Layout 



All tensors can be in either fully packed rowmajor layout or fully packed columnmajor layout 


All tensors must be in a fully packed NHWC layout. 

Not Supported 
Not Supported 

Not Supported 
Not Supported 
Epilogue Fusion: g :sub:`2`
g _{2} is a directed acyclic graph (DAG) that can consist of zero or any number of the
CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
operation and zero or oneCUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR
operation.CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR
operation can only be the exit node of g _{2}.All the input and output tensors must have an alignment of 8 bits. For grouped
ConvolutionFwd
fusions, the alignment requirement is per group.In
CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
operations, the tensors being broadcasted cannot be placed as the first input.The support surface of dimension and layout are listed in the table below.
Pattern 
Dimension 
Layout 






All tensors must be in a fully packed NHWC layout. 

Not Supported 
Not Supported 

Not Supported 
Not Supported 

Not Supported 
Not Supported 
Support Surface 70
Support Surface of Compute Capability
NVIDIA GPUs with compute capability 7.0, 7.2, 7.5, 8.0, 8.6, 8.7, 8.9, and 9.0 are supported.
I/O and Intermediate Data Type
The input tensor data type can be any of
{FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}
.The input tensor data type of
Matmul
,ConvolutionFwd
,ConvolutionBwdFilter
, andConvolutionBwdData
operations can be any of{FLOAT, HALF, BFLOAT16, INT8}
.The output tensor data type can be any of
{INT64, FLOAT, INT32, HALF, BFLOAT16, INT8, UINT8, FP8_E4M3, FP8_E5M2, BOOLEAN}
.The output data types of
CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR
operations can only beFLOAT
.The intermediate virtual tensor data type can be any of
{FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2, BOOLEAN}
, and this intermediate storage type is obeyed by the codegenerator. Generally,FP32
is recommended.FP8_E4M3
andFP8_E5M2
data types are only allowed in purePointwise
andReduction
fusions.
Compute Data Type
Compute data type can be
FP32
orBOOLEAN
forCUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
operations.Compute data type can only be
FP32
forCUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR
operations.The support surface of the compute data type of
Matmul
,ConvolutionFwd
,ConvolutionBwdFilter
, andConvolutionBwdData
operations depends on the input data types of the operation. The combinatory support surface is listed in the following table.


Note 



Not available for 


Available with compute capability 8.9 only 








Mainloop Fusions: g :sub:`1`
g _{1} is a directed acyclic graph (DAG) that can consist of zero or any number of the following operations:
CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
CUDNN_BACKEND_OPERATION_CONCAT_DESCRIPTOR
CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR
CUDNN_BACKEND_OPERATION_CONCAT_DESCRIPTOR
orCUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR
operations, if present, should be before anyPointwise
operations.For compute capability < 8.0, g _{1} is not supported.
All the input tensors must have an alignment of 32 bits.
All the intermediate tensors must be virtual.
In
CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
operations, the tensors being broadcasted cannot be placed as the first input.The support surface of dimension and layout are listed in the table below.
Pattern 
Dimension 
Layout 






All tensors must be in a fully packed NHWC layout. 


All tensors must be in a fully packed NHWC layout. 


All tensors must be in a fully packed NHWC layout. 
Epilogue Fusion: g :sub:`2`
g _{2} is a directed acyclic graph (DAG) that can consist of zero or any number of the following operations:
CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR
CUDNN_BACKEND_OPERATION_RESAMPLE_BWD_DESCRIPTOR
CUDNN_BACKEND_OPERATION_GEN_STATS_DESCRIPTOR
CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR
and zero or one CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR
operation.
CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR
operation can only be the exit node of g _{2}.CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR
operations, if present, must be the final nodes in g _{2}. Hence,CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR
operations cannot be used in conjunction with theCUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR
operation.The input tensor to a
CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR
orCUDNN_BACKEND_OPERATION_RESAMPLE_BWD_DESCRIPTOR
operation should not be produced by another operation within this graph, but should come from global memory. These two operations cannot be used in theMatmul
,ConvolutionBwdFilter
, andConvolutionBwdData
fusions, and are only supported with compute capability >= 7.5.All the input and output tensors must have an alignment of 32 bits, except the output of a
CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
operation can have an alignment of 8 bits.In
CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
operations, the tensors being broadcasted cannot be placed as the first input.The support surface of dimension and layout are listed in the table below.
Pattern 
Dimension 
Layout 






All tensors must be in a fully packed NHWC layout. 


All tensors must be in a fully packed NHWC layout. 


All tensors must be in a fully packed NHWC layout. 



Operation Specific Constraints for the Runtime Fusion Engines
Every operation in the supported generic patterns of the runtime fusion engines is subject to a few specific constraints regarding their parameter surface. The following subsections document these.
Note that these constraints are in addition to (1) any constraints mentioned in the Backend Descriptor Types, and (2) limitations in relation to other operations in the directed acyclic graph (DAG), as mentioned in the Support Surface section.
Matmul
This operation represents matrixmatrix multiplication: A * B = C. For complete details on the interface, refer to the CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR section.
Convolutions
There are three operation nodes that represent different types of convolutions namely:
ConvolutionFwd
This operation represents forward convolution, that is, computing the response tensor of image tensor convoluted with filter tensor. For complete details on the interface, as well as general constraints, refer to the CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR section.
ConvolutionBwdFilter
This operation represents convolution backward filters, that is, computing filter gradients from a response and an image tensor. For complete details on the interface, as well as general constraints, refer to the CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_FILTER_DESCRIPTOR section.
ConvolutionBwdData
This operation represents convolution backward data, that is, computing input data gradients from a response and a filter tensor. For complete details on the interface, as well as general constraints, refer to the CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_DATA_DESCRIPTOR section.
Input Tensor Attribute Name 
Output Tensor Attribute Name 











The FP8 data type since NVIDIA Ada Lovelace architecture has two variants: CUDNN_DATA_FP8_E4M3
and CUDNN_DATA_FP8_E5M2
as I/O data types. Using them as inputs to the operation will result in FP8 Tensor Cores being used. The precision of the accumulation inside the FP8 Tensor Cores is controlled by the compute type, which may have one of two possible values: CUDNN_DATA_FLOAT
and CUDNN_DATA_FAST_FLOAT_FOR_FP8
.
CUDNN_DATA_FAST_FLOAT_FOR_FP8
is faster and sufficiently accurate for inference or the forward pass of training. However, for FP8 training backward pass computations (that is, computing weight and activation gradients), we recommend choosing the more accurate CUDNN_DATA_FLOAT
compute type to preserve a higher level of accuracy which may be necessary for some models.
Operation 
Recommended I/O Type 
Recommended Compute Type 
















Pointwise
Represents a pointwise operation that implements the equation Y = op (alpha1 * X)
or Y = op (alpha1 * X, alpha2 * B)
. Refer to the CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR and CUDNN_BACKEND_POINTWISE_DESCRIPTOR sections for more information and general constraints.
The following tables list the constraints for Pointwise
operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when these operations are used in the runtime fusion engines.
Attribute 
Requirement 

Tensor data type for 
For all operators, all data types are supported. 






Attribute 
Requirement 

Tensor data type for 







GenStats
Represents an operation that generates perchannel statistics. Refer to the CUDNN_BACKEND_OPERATION_GEN_STATS_DESCRIPTOR section for more information and general constraints.
The following table lists the constraints for GenStats operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when GenStats operations are used in the runtime fusion engines.
Attribute 
Requirement 

Tensor data type for 

Tensor shape for 
Both should be of shape [1, C, 1, 1] for 2D conv or [1, C, 1, 1, 1] for 3D conv. 
Tensor data type for 



Tensor layout for 
NHWC fully packed 
Reduction
This operation represents reducing values of a tensor in one or more dimensions. Refer to the CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR section for more information and general constraints.
The following table lists constraints for Reduction forward operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when Reduction operations are used in the runtime fusion engines.
Attribute 
Requirement 

Tensor data type for 



Tensor layout for 
NHWC/NDHWC/BMN fully packed 


ResampleFwd
This operation represents resampling of the spatial dimensions of an image to a desired value. Resampling is supported in both directions, upsampling and downsampling. Downsampling represents the standard operation of pooling, commonly used in convolutional neural networks. Refer to the CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR section for more information and general constraints.
The following are constraints for ResampleFwd
operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when ResampleFwd
operations are used in the runtime fusion engines.
We allow a choice amongst four modes for resample. All modes have the following common support specifications:
Supported layout: NHWC or NDHWC, NCHW or NCDHW
Spatial dimensions supported: 2 or 3
Input dimensions supported: 4 or 5
Packed boolean data type is not supported.
If specified, the index tensor dimension should be equal to the response tensor dimension.
When the tensor format is NCHW/NCDHW, the following additional restrictions apply:
Upsampling is not supported.
Int64_t
indices are not supported.Only supports symmetric padding using the prepadding backend API.
There are some mode specific restrictions also. The following tables list the values that are allowed for particular parameters. For the parameters not listed, we allow any value which is mathematically correct.
The following downsampling modes are supported:
CUDNN_RESAMPLE_AVGPOOL_INCLUDE_PADDING
CUDNN_RESAMPLE_AVGPOOL_EXCLUDE_PADDING
CUDNN_RESAMPLE_MAXPOOL
Attribute 
Average Pooling 
Max Pooling 













For the upsampling modes, CUDNN_RESAMPLE_NEAREST
is not supported for any combination of parameters. CUDNN_RESAMPLE_BILINEAR
has the following support specifications.
Attribute 
Bilinear 

Input dimensions 
Equal to 








Data type for 









Resampling Index Tensor Dump for Training
For maxpooling resampling mode, an index tensor can be provided to be used as a mask for backpropagation.
Values in the index tensors are:
Zeroindexed rowmajor position of maximum value of input tensor in the resampling window.
In case of multiple input pixels with maximum value, the first index in a lefttoright toptobottom scan is selected.
An example of index element selection:
Select an appropriate element size for the index tensor. As a reference, any element size such that the maximum zeroindexed window position fits should be sufficient.
ResampleBwd
This operation represents backward resampling of the spatial dimensions of an output response to a desired value. Resampling is supported in both directions, upsampling and downsampling. Backwards downsampling represents the standard operation of backward pooling, commonly used in convolutional neural networks. Refer to the CUDNN_BACKEND_OPERATION_RESAMPLE_BWD_DESCRIPTOR section for more information and general constraints.
The following are constraints for Resample backward operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when Resample backward operations are used in the runtime fusion engines.
We allow a choice amongst four modes for resample. All modes have the following common support specifications:
Supported layout: NHWC or NDHWC, NCHW or NCDHW
Spatial dimensions supported: 2 or 3
Input dimensions supported: 4 or 5
For layout NHWC or NDHWC:
The index tensor should be provided for only max pooling mode, and should adhere to the format described in the Resampling Index Tensor Dump for Training section.
The index tensor dimensions should be equal to the input gradient tensor dimensions.
For layout NCHW or NCDHW:
X, Y, and DY are required when max pooling mode is used.
Int64_t
indices are not supported.
There are some mode specific restrictions also. The following tables list the values that are allowed for particular parameters. For the parameters not listed, we allow any value which is mathematically correct.
The following backward downsampling modes are supported:
CUDNN_RESAMPLE_AVGPOOL_INCLUDE_PADDING
CUDNN_RESAMPLE_AVGPOOL_EXCLUDE_PADDING
CUDNN_RESAMPLE_MAXPOOL
Attribute 
Average Pooling 
Max Pooling 













Backward upsampling modes are currently not supported.
Examples of Supported Patterns
The following sections provide examples of supported patterns, in order of increasing complexity. We employ the same color scheme as in the overall pattern to aid in identifying the structure of g _{1} (blue) and g _{2} (purple).
For illustration purposes, we abbreviated the operations used. For a full mapping to the actual backend descriptors, refer to the Mapping with Backend Descriptors.
Single Operation
The following example illustrates a convolution operation without any operations before or after it. This means, g _{1} and g _{2}, are empty graphs.
Pointwise Operations After Convolution 1
In this example, g _{2} consists of a sequential set of two Pointwise
operations after the convolution.
Pointwise Operations After Convolution 2
Similar to the previous example, g _{2} consists of a sequential set of multiple Pointwise
operations.
Pointwise Operations Before Matrix Multiplication
Pointwise
operations can also precede a convolution or matrix multiplication, that is, g _{1} is composed of Pointwise
operations.
Convolution Producer Node in Middle of DAG
The following pattern shows g _{1} as a DAG of Pointwise
operations feeding into a convolution. In addition, g _{2} is a DAG consisting of two Pointwise
operations. Note that the convolution is being consumed in the middle of g _{2} as opposed to g _{2} first node. This is a valid pattern.
Mixed Input Precision Matmul and Convolution
Mixed input precision for matmuls and convolutions is implemented as a special case of mainloop fusion. Inputs may have different data types and will be converted to the desired data types serving as the inputs of the matmul
or convolution
operation by a Pointwise:Identity
operation. The following pattern shows g _{1} as a DAG of Pointwise:Identity
operation converting the input data type of tensor A into a matmul
operation. This is a valid pattern.
Specialized Runtime Fusion Engines
The specialized runtime fusion engines target and optimize specialized graph patterns that commonly occur in popular deep learning models. These engines offer limited flexibility regarding supported fusion patterns, supported data types, and supported tensor layouts. Long term, these patterns are expected to be more generic.
The following sections highlight the supported patterns.
BnAddRelu
In ResNetlike vision models, batch normalization followed by ReLU activation is a commonly occurring pattern. The BNAddRelu
fusion pattern, supported using a runtime compiled engine, aims to optimize this recurring operation graph. It also supports single node multiGPU batch normalization for speeding up batch norm computation in multiGPU systems. The pattern is intended for use in the forward pass during the training phase. The full pattern BNAddRelu
with the add node is used in cases where there are skip connections in the model.
The pattern is illustrated in the following diagram and its options and limitations include:
The pointwise nodes:
Add
,ReLU
, andGT
(greater than) are optional.All tensors should be in NHWC packed layout format.
Both 4D and 5D tensors are supported.
Only ReLU activation is supported.
The attribute
CUDNN_ATTR_OPERATION_NORM_FWD_MODE
for the norm forward operation must be set toCUDNN_BATCH_NORM
.The attribute
CUDNN_ATTR_OPERATION_NORM_FWD_PHASE
for the norm forward operation must be set toCUDNN_NORM_FWD_TRAINING
.The batch norm input tensors:
Scale
,Bias
,Input_running_mean
, andInput_running_var
must be offloat
data type.The batch norm output tensors:
output_running_mean
,output_running_var
,mean
, andInvVariance
must be offloat
data type.The batch norm input tensor
BN_x
, residual inputZ
and output tensorY
can be any of{FP32, FP16, BF16}
data types. ForFP16
andBF16
data types, the channel count C for the tensors must be a multiple of 8 while for float data type the channel count must be a multiple of 4.These patterns are supported on devices with compute capability >= 8.0.
In case of single node multiGPU batch norm, each GPU computes the local statistics based on its input data and writes out the local statistics to the peerTensors
. Each peerTensor
resides on a separate GPU on the node and is used for reading and writing local statistics from the peer GPUs. This is followed by a global stats computation phase where each GPU aggregates the statistics from the peers and computes the global mean and variance for the batch norm output computation on its local data. Apart from the options and limitations listed above, the following additional restrictions apply for using multiGPU batch norm:
The attribute
CUDNN_ATTR_OPERATION_NORM_FWD_PEER_STAT_DESCS
of theNormForward
operation must be set.The size of the
peerTensors
vector should be equal to the number of GPUs in the node participating in the batch norm computation.The maximum size of the
peerTensors
vector is 32.Each GPU should operate on the same size of input data [N,C,H,W].
The size of each
peerTensor
in thepeerTensors
vector should be equal tonum_gpu * 4 * C
where C is the channel count of theBN_x
tensor andnum_gpu
is the number of GPUs in the node participating in the batch norm computation.All the elements of each tensor in the
peerTensors
vector should bememset
to0
before passing that tensor in the variant pack.
DReluForkDBn
Similar to the BnAddRelu
pattern, the DReluForkDBn
pattern also targets ResNetlike vision networks. It is intended to be used in backpropagation during the training phase. The DReluForkDBn
pattern is supported through a runtime compiled engine that usually complements the BnAddRelu
pattern. It also supports single node multiGPU batch normalization for speeding up batch norm backward computation in multiGPU systems.
The pattern is illustrated in the following diagram and its options and limitations include:
The pointwise node
dRelu
is optional.The intermediate tensor
dZ
can be virtual or nonvirtual.All tensors should be in NHWC packed layout format.
Both 4D and 5D tensors are supported.
Only
dRelu
activation is supported.Bitmask tensor input is needed for the
dRelu
node.The attribute
CUDNN_ATTR_OPERATION_NORM_BWD_MODE
for the norm backward operation must be set toCUDNN_BATCH_NORM
.The batch norm backward input tensors:
Scale
,Mean
,InvVariance
and the output tensorsdScale
anddBias
must be offloat
data type.
dRelu
input tensordY
, batch norm backward inputBN_x
, bias gradientdZ
, and output tensordX
can be any of{FP32, FP16, BF16}
data types. ForFP16
andBF16
data types, the channel count C for the tensors must be a multiple of 8 while forfloat
data type the channel count must be a multiple of 4.These patterns are supported on devices with compute capability >= 8.0.
The single node multiGPU version of this pattern is typically used for dScale
and dBias
gradient aggregation across GPUs. For using the multiGPU version, the attribute CUDNN_ATTR_OPERATION_NORM_BWD_PEER_STAT_DESCS
of the NormBackward
operation must be set. Other restrictions for the peerTensors
vector listed in the previous section apply for this pattern as well.
Fused Attention fprop
MhaFprop
fusions \(O=matmul\left( S=g_{4} \left( P=matmul\left( Q, g_{3}\left( K \right) \right), V \right)\right)\) have been added to the runtime fusion engine to serve patterns that are commonly used in attention. These patterns can be used in BERT, T5, and so on.
There are two key differences to the flash fused attention patterns described in later sections:
Input sizes supported contain small sequence lengths (<= 512).
The operation graph is flexible to switch between different types of masks, different operations between the two matrix multiplications, and so on.
g _{3} can be an empty graph or a single scale operation with the scale being a scalar value (CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
with mode CUDNN_POINTWISE_MUL
).
g _{4} can be empty or the combination of the following DAGs of cuDNN operations. Each of these DAGs is optional, as shown by the dotted line.
The combination has to obey the order in which we present them. For example, if you want to use the padding mask and softmax, the padding mask has to appear before softmax.
These operations are commonly used in attention. In the following diagram, we depict how to create a DAG for each of the operations. In later versions, we will be expanding the possible DAGs for g _{3} and g _{4}.
Padding Mask
Causal Mask
Softmax
Dropout
g _{4} is capable of storing an intermediate tensor to global memory marked as S
, which can be used for fused multihead attention bprop
. Both DAG:Softmax
and DAG:Dropout
have this capability. Set S
as the output from the last DAG in the graph.
The tensor descriptor marked as S
must have the CUDNN_ATTR_TENSOR_REORDERING_MODE
set to CUDNN_TENSOR_REORDERING_F16x16
. This is because the tensor is stored in a special format and can only be consumed by fused attention bprop
.
There is an additional option of generating the mask on the user end and passing it directly to the pointwise multiplier. The mask needs to be of I/O data type FP16/BF16 and S
will store the mask in the sign bit to communicate to bprop
.
Limitation 


Matmul 

Pointwise operations in g _{3} and g _{4} 
Compute type must be 
Reduction operations in g _{3} and g _{4} 
I/O types and compute types must be 
RNG operation in g _{3} and g _{4} 

Layout requirements of Mhafprop
fusions include:
All I/O tensors must have 4 dimensions, with the first two denoting the batch dimensions. The usage of rank4 tensors in
matmul
ops can be read from the Backend Descriptor Types documentation.The contracting dimension (dimension
K
) for the first matmul must be 64.The noncontracting dimension (dimensions
M
andN
) for the first matmul must be less than or equal to 512. In inference mode, any sequence length is functional. For training, support exists only for multiples of 64.The last dimension (corresponding to hidden dimensions) in
Q
,V
, andO
is expected to have stride1
.For the
K
tensor, the stride is expected to be1
for the 2nd last dimension.The
S
tensor is expected to haveCUDNN_ATTR_TENSOR_REORDERING_MODE
set toCUDNN_TENSOR_REORDERING_F16x16
.
Fused Attention bprop
MhaBprop
fusions are executed in a fused pattern in a single kernel.
\(dV=matmul\left( g_{5}\left( S \right), dO \right)\) \(dS=matmul\left( dO, VT\right)\) \(dQ=matmul\left( g_{6}\left( dS \right), K \right)\) \(dK=matmul\left( Q, g_{7}\left( dS \right)\right)\)
cuDNN supports the corresponding backpropagation graph for fused attention. This can be used together with the fused attention fprop
graph to perform training on models that have similar architectures to BERT and T5. This is not compatible with the flash fused attention bprop
operation graph.
g _{5}, g _{6}, and g _{7} can only support a fixed DAG. We are working towards generalizing these graphs.
g _{6} represents the backward pass of softmax and masking, to get dP
.
There are options for the Mask DAG that you can optin. You can either use the padding/causal mask, general mask as an input, or not do any masking.
dBias
DAG is useful to calculate the bprop
of the relative positional encoding and is optional and available for you to optin.
g _{7} is the transpose of dP the output of g _{6}.
Limitation 


Matmul 

Pointwise operations in g _{5}, g _{6}, and g _{7} 
Compute type must be 
Reduction operations in g _{5}, g _{6}, and g _{7} 
I/O types and compute types must be 
Layout requirements of Mhabprop
fusions include:
All I/O tensors must have 4 dimensions, with the first two denoting the batch dimensions. The usage of rank4 tensors in
matmul
ops can be read from the Backend Descriptor Types documentation.The contracting dimension (dimension
K
) for the second matmul must be 64.The contracting dimension (dimension
K
) for the first, second, and third matmul must be less than or equal to 512 and a multiple of 64.The last dimension (corresponding to hidden dimensions) in
Q
,K
,V
,O
, anddO
is expected to have stride1
.The
S
anddP
tensors are expected to haveCUDNN_ATTR_TENSOR_REORDERING_MODE
set toCUDNN_TENSOR_REORDERING_F16x16
.
Fused Flash Attention fprop
cuDNN supports flash fused attention to perform scale dot product attention commonly used in models like GPT, BERT, and so on. The general pattern supported by this engine is BMMSoftmaxBMM with many other optional features that you can opt into. You can choose to create the graph by yourself or use the custom sdpa
node in cuDNN frontend. Using the frontend node will make opting into the different options like causal mask, dropout, alibi masking, and so on, very easy.
Presoftmax optional DAGs cover multiple options for the users to configure:
pointwise
Multiply
node for attention scale after the first matmulpointwise
Add
node for the relative positional encoding to add a bias after the first matmulDifferent masking options like causal masking, padding masking, sliding window attention, and alibi masking. Users can choose multiple masking schemes together or no masking.
pointwise
Multiply
node that accepts a full tensor that can be used as a custom mask generated by the userpointwise nodes that represent activation functions like
CUDNN_POINTWISE_TANH_FWD
Postsoftmax optional DAGs cover multiple options for the users to configure:
pointwise
Multiply
node with a RNG node to signify dropoutpointwise
Multiply
node with a user generated tensor acting as the dropout mask
All these DAGs are optional. A user can enable them depending on the cuDNN API they are targeting. If using the sdpa
node in cuDNN frontend, they can set the provided API options to true
, for example use_causal_mask(True)
and internally, the frontend will add the correct graph automatically. While using the graph API directly, users can add the corresponding graph of the operations they want into the cuDNN graph.
The compound operations for example: Causal Mask, Sliding Window Mask, Softmax, and so on, can be represented using the following operation graphs in cuDNN.
Causal Mask
Padding Mask
Sliding Window Mask
Alibi Mask
Softmax
Dropout
Limitation 




Softmax stats 




INT32 or INT64 scalar in host or GPU 

Attention scale can be FP16/BF16/FP32. 
Inference mode can be turned on by passing the Softmax stats as a virtual tensor and setting the RNG node probability to 0.0f
. The pattern is supported for GPUs with NVIDIA Ampere architecture and newer.
cuDNN also supports Fused Flash Attention in FP8 data type supported on NVIDIA Hopper GPUs. In addition to the standard fprop
graph, there are additional dequantization scales, quantization scales, and absolute max (amax) calculations. The current FP8 support is a subset of the features supported in BF16 support. We are actively working on expanding the support for the FP8 kernels. For more information about the scales and amax, refer to the Specialized PreCompiled Engines, FP8 Fused Flash Attention section.
In the following diagram, red tensors indicate FP8 datatype tensors and black tensors are in FP32 datatype.
Presoftmax optional DAGs cover multiple options for you to configure:
pointwise
Multiply
node for attention scale after the first matmulMasking options include causal masking and no masking
Postsoftmax optional DAGs cover multiple options for you to configure:
Currently there is no support for dropout
Limitation 




Softmax stats 




Attention scale can be FP32. 
Fused Flash Attention bprop
cuDNN supports the corresponding backpropagation graph for fused flash attention. This can be used together with the fprop
graph to perform training on Large Language Models (LLMs).
All the options mentioned in fprop
are applicable in the bprop
graph as well. The corresponding bprop
frontend node contains the same options and can be configured to do the bprop
. Users opting in for the graph API, again need to add the graphs of the operations they want. The graph shown below is for a standard attention layer in GPT with causal masking and dropout.
Note
The
bprop
support for activation functions has not been added to NVIDIA Ampere GPUs; it only exists on NVIDIA Hopper GPUs.
For Grouped Query Attention (GQA) and Multi Query Attention (MQA), you can configure an additional reduction node for dK
and dV
, which reduces the tensor from the full number of heads (Q
heads) to the actual K
and V
heads.
For the input and output tensors, the limitations from the fprop
graph are carried over. For the bprop
specific tensors, the limitations are as follows:
Limitation 




Softmax sum 





The pattern is supported for GPUs with NVIDIA Ampere architecture and newer.
cuDNN also supports Fused Flash Attention bprop in native FP8 data type supported on NVIDIA Hopper GPUs. In addition to the standard bprop
graph, there are additional dequantization scales, quantization scales, and absolute max (amax) calculations. The current FP8 Flash Attention bprop
support is corresponding to the FP8 Flash Attention fprop
support.
In the following diagram, red tensors indicate FP8 datatype tensors and black tensors are in FP32 datatype.
Limitation 






Specialized PreCompiled Engines
The precompiled specialized engines target and optimize for a specialized graph pattern with a ragged support surface. Because of this targeting, these graphs do not require runtime compilation.
In most cases, the specialized patterns are just special cases of the generic patterns used in the runtime fusion engines, but there are some cases where the specialized pattern does not fit any of the generic patterns. If your graph pattern matches a specialized pattern, you will get at least a pattern matching engine, and you might also get runtime fusion engines as another option.
Currently, the following patterns are supported by the pattern matching engines. Some nodes are optional. Optional nodes are indicated by dashed outlines.
ConvBNfprop
The ConvBNfprop
pattern is illustrated in the following figure. Its restrictions and options include:
The three pointwise nodes scale, bias, and ReLU are optional.
X, Z, W, s _{1}, b :sub:
1
must all be of FP16 data type.Z needs to be of shape [N, C, H, W] with NHWC packed layout.
W needs to be of shape [K, C, R, S] with KRSC packed layout.
s _{1}, b :sub:
1
need to be of shape [1, C, 1, 1] with NHWC packed layout.Only ReLU activation is supported.
All of the intermediate tensors need to be virtual, except Y needs to be nonvirtual.
I/O pointers should be 16 bytes aligned.
This pattern is only supported on devices with compute capability >= 8.0 (with the exception of NVIDIA Ada Lovelace architecture, 8.9).
On devices with compute capability >= 9.0, we only support two patterns:
the full pattern: scale + bias + ReLU + Conv + GenStats, and
the partial pattern: Conv + GenStats.
Skip connections are commonly observed in ResNetlike models. To support fusions in skip connections, we support a variant of the pattern above, the DBARCS pattern (short for Dual, Scale, Bias, Add, ReLU, Conv genStats). The limitations and options of the DBARCS pattern include:
The pointwise dual scale and dual bias nodes are either both present or not. This is indicated by the dashed block encircling the dual scale and dual bias nodes. In case both the nodes are missing, the
dual_X
tensor is directly fed as input to the add node.The pointwise nodes scale, bias, add, and ReLU are required nodes.
Currently, only supported on Hopper GPUs.
For all the other data types, layout and virtualness restrictions of the
ConvBNfprop
pattern apply to this pattern as well.
dual_X
,dual_scale
, anddual_bias
must all be of FP16 data type.
dual_scale
anddual_bias
must be of shape [1,C,1,1] with NHWC packed layout.Intermediate outputs of the ReLU and Conv nodes:
Relu_Y
andY
are nonvirtual. All the other intermediate outputs are virtual.The weight tensor W for the convolution needs to be of shape [K,C,1,1]. Only 1x1 filters with padding 0 are supported for the convolution in the DBARCS pattern.
ConvBNwgrad
The ConvBNwgrad
pattern is illustrated in the following figure. Its restrictions and options include:
The three pointwise operations are all optional, as indicated by the dashed outlines.
Only ReLU activation is supported.
X, s _{1}, b :sub:
1
, anddy
must all be of FP16 datatype.I/O pointers should be 16 bytes aligned.
X, s _{1}, b :sub:
1
, anddy
must all have NHWC packed layouts.All the intermediate tensors need to be virtual.
This pattern is only supported on devices with compute capability >= 8.0 (with the exception of NVIDIA Ada Lovelace architecture, 8.9).
On devices with compute capability >= 9.0, support is restricted to:
the full pattern: scale + bias + ReLU +
wgrad
.
ConvBiasAct
The ConvBiasAct
pattern is illustrated in the following figure. Its restrictions and options include:
\(\alpha_{1}\) and \(\alpha_{2}\) need to be scalars.
The activation node is optional.
The size of the bias tensor should be [1, K, 1, 1].
Internal conversions are not supported. That is, the virtual output between nodes needs to have the same data type as the nodes compute type, which should be the same as the epilog type of the convolution node.
There are some restrictions on the supported combination of data types, which can be found in the API Reference (refer to cudnnConvolutionBiasActivationForward()).
ConvScaleBiasAct
The ConvScaleBiasAct
pattern is illustrated in the following figure. Its restrictions and options include:
\(\alpha_{1}\), \(\alpha_{2}\), and \(b_{1}\) should have the same data type/layout and can only be FP32.
X, W, and Z can only be INT8x4 or INT8x32.
The size of the bias tensor should be [1, K, 1, 1].
Internal conversions are not supported. That is, the virtual output between nodes needs to be the same as their compute type.
Currently,
Pointwise:ReLU
is the only optional pointwise node.
This pattern is very similar as ConvBiasAct
. The difference is that here, the scales \(\alpha_{1}\) and \(\alpha_{2}\) are tensors, not scalars. If they are scalars, this pattern becomes a normal ConvBiasAct
.
DgradDreluBNBwdWeight
The DgradDreluBNBwdWeight
pattern is illustrated in the following figure. Its restrictions and options include:
Dgrad input
dy
and W are of FP16 datatypes.Batch norm fwd inputs,
X_bn
is of FP16 datatype while the other tensorsmean_bn
,invstd_dev_bn
,scale_bn
, andbias_bn
are FP32.Outputs:
dScale
,dBias
, A, B, C are of FP32 data type.All pointers are 16 byte aligned.
This pattern is only supported on devices with compute capability >= 8.0 (with the exception of NVIDIA Ada Lovelace architecture, 8.9).
DgradDreluBNBwdWeight
is a precompiled engine that can be used in conjunction with the dBNApply
pattern to compute the backwards path of batch norm.
The BNBwdWeight
operation takes in five inputs: X_bn
, mean_bn
, invstddev_bn
, scale_bn
, and dy_bn
(that is, the output from the ReLUBwd
node).
It produces five outputs: gradients of the batch norm scale and bias params, dScale
, dBias
, and coefficients A, B, C. Note that for illustration purposes, the inputs are duplicated. The inputs on the left and right are however exactly the same.
This pattern is typically used in the computation of the Batch Norm Backward Pass.
When computing the backward pass of batch norm, dScale
, dBias
, and dX_bn
are needed. The DgradDreluBnBwdWeight
pattern computes the former two. Using the generated A, B, and C we can use the following dBNApply
pattern to compute dX
, the input gradient, as follows dx_bn = A*dy_bn + B*X_bn +C
.
The dBNApply
pattern was initially supported by a precompiled static engine but is now supported by the generic runtime fusion engine.
Note that the DgradDreluBNBwdWeight
pattern is used in combination with the forward pass pattern ConvBNfprop
. Because of performance reasons, the output of batch norm Y_bn
, which was calculated in ConvBNfprop
(output of scalebias), needs to be recalculated by DgradDreluBnBwdWeight
. The pointwise add node subtracts mean_bn
from X_bn
, hence the alpha2
parameter for that node should be set to 1
.
FP8 Fused Flash Attention
cuDNN supports fused flash attention with input and output data types being in FP8 format. This FP8specific graph pattern is supported only on Hopper (H100) GPUs.
Support exists for both training (forward and backward pass) and inference in FP8 format. The training forward pass is slightly different from the inference forward pass regarding whether some intermediate tensors are output or not.
Within the NVIDIA Hopper architecture, there are two new FP8 formats: E4M3 and E5M2. Currently, for forward pass, only when all the inputs and outputs are in E4M3 format is supported. For the backward pass, the support is only when some of the inputs and outputs are in E4M3 and some in E5M2. More general support for the FP8 formats will be added in future releases.
Due to the limited numerical precision of FP8 data type, for practical use cases, you must scale values computed in FP32 format before storing them in FP8 format, and descale the values stored in FP8 format before performing computations on them. For more information, refer to the Transformer Engine FP8 primer.
The following notation is used in this section.
b  number of batches
h  number of heads
s  maximum length of sequences in a batch
d  embedding dimension size of a word in a sequence
Scaling and Descaling
In the context of FP8, scaling refers to multiplying each element of a FP32 tensor by a quantization factor.
The quantization factor is computed as: (Max representable value in the FP8 format) / (Max absolute value seen in the tensor).
For the E4M3 format, the quantization factor is 448.f/ tensor_amax
(rounded to the nearest lower power of two).
For the E5M2 format, the quantization factor is 57344.f / tensor_amax
(rounded to the nearest lower power of two).
The dequantization factor is the reciprocal of the quantization factor.
The meaning behind scaling is to spawn the full range of the FP8 format when computing on FP8 values and storing FP8 values, thereby, minimizing the precision loss. True values in FP32 format are multiplied by the quantization factor before storing them as scaled values in FP8 format. Computations on scaled values in FP8 format are descaled by multiplying with the dequantization factor to convert them back to their true values in FP32 format.
Scaling and descaling are critical for convergence with the FP8 data type, hence cuDNN only supports graph patterns for FP8 fused attention with the scaling and descaling nodes present.
Unpadded Tensors
In fused flash attention, the length of different sequences in a batch can be different. The cuDNN operation graph supports an unpadded layout where all the sequences of different lengths in a batch are tightly packed. All the word embeddings after the useful length of the sequence are pushed towards the end of all sequences in the layout.
Forward Pass
The following figure shows the cuDNN operation graph for the fused attention forward pass. The same graph supports forward pass for both training and inference. The operation graph pattern is identified as training when M
and Zinv
tensors are nonvirtual. When M
and Zinv
tensors are virtual, the operation graph pattern is identified as inference.
The FP8 tensors are expected to be scaled and the matrix multiplication computation is performed on the FP8 tensors in the scaled format. All non matrix multiplication computations are performed in FP32 precision. The output of the FP8 matrix multiplication is converted to real values in FP32 by format multiplying with the descale values.
Tensor Name 
Data Type 
Dimensions 


E4M3 
[b, h, s, d] 

E4M3 
[b, h, s, d] 

E4M3 
[b, h, s, d] 
Attention Scale 
FP32 (by value) 
[1, 1, 1, 1] 
Descale 
FP32 
[1, 1, 1, 1] 
Descale 
FP32 
[1, 1, 1, 1] 
Descale 
FP32 
[1, 1, 1, 1] 
Scale 
FP32 
[1, 1, 1, 1] 
Descale 
FP32 
[1, 1, 1, 1] 
Scale 
FP32 
[1, 1, 1, 1] 
RNG Seed 
INT64 
[1, 1, 1, 1] 
RNG Offset 
INT64 
[1, 1, 1, 1] 
Dropout Probability (p) or Keep Probability (1  p) 
FP32 
[1, 1, 1, 1] 
Tensor Name 
Data Type 
Dimensions 


E4M3 
[b, h, s, d] 

FP32 
[1, 1, 1, 1] 

FP32 (training only) 
[b, h, s, 1] 

FP32 (training only) 
[b, h, s, 1] 
Limitation 






All virtual tensors 
All virtual tensors must be in FP32 format. 
Compute precision 
The compute precision of all operations must be FP32. 


Attention Scale, Quantization (scale) and Dequantization (descale) tensors 
Must be in FP32 format. 
Match between cuDNN API specifications and passed in device tensors 
There is an implicit dependency between specifications done through cuDNN API and device tensors passed in runtime. The execution of the operation graph is undefined when the passed in device tensors do not conform with the API specifications. 
Dropout probability for inference (p) 
The dropout probability for forward pass inference must be zero. 
Embedding dimension size (d) 
Embedding dimension size of only 64 is supported. 
Maximum sequence length (s) 
Maximum sequence length sizes must be a multiple of 64. In addition, maximum sequence lengths sizes upto only 512 are supported. 
Backward Pass
Tensor Name 
Data Type 
Dimensions 


E4M3 
[b, h, s, d] 

E4M3 
[b, h, s, d] 

E4M3 
[b, h, s, d] 

E4M3 
[b, h, s, d] 

E5M2 
[b, h, s, d] 

FP32 
[b, h, s, 1] 

FP32 
[b, h, s, 1] 
Attention Scale 
FP32 (by value) 
[1, 1, 1, 1] 
Descale 
FP32 
[1, 1, 1, 1] 
Descale 
FP32 
[1, 1, 1, 1] 
Descale 
FP32 
[1, 1, 1, 1] 
Scale 
FP32 
[1, 1, 1, 1] 
Descale 
FP32 
[1, 1, 1, 1] 
Descale 
FP32 
[1, 1, 1, 1] 
Descale 
FP32 
[1, 1, 1, 1] 
Scale 
FP32 
[1, 1, 1, 1] 
Descale 
FP32 
[1, 1, 1, 1] 
Scale 
FP32 
[1, 1, 1, 1] 
Scale 
FP32 
[1, 1, 1, 1] 
Scale 
FP32 
[1, 1, 1, 1] 
RNG Seed 
INT64 
[1, 1, 1, 1] 
RNG Offset 
INT64 
[1, 1, 1, 1] 
Dropout Probability (p) or Keep Probability (1  p) 
FP32 
[1, 1, 1, 1] 
Tensor Name 
Data Type 
Dimensions 


E5M2 
[b, h, s, d] 

E5M2 
[b, h, s, d] 

E5M2 
[b, h, s, d] 

FP32 
[1, 1, 1, 1] 

FP32 
[1, 1, 1, 1] 

FP32 
[1, 1, 1, 1] 

FP32 (virtual tensor 
[1, 1, 1, 1] 
Limitation 








All virtual tensors 
All virtual tensors must be in FP32 format. 
Compute precision 
The compute precision of all operations must be FP32. 


Attention Scale, Quantization (scale) and Dequantization (descale) tensors 
Must be in FP32 format. 
Match between cuDNN API specifications and passed in device tensors 
There is an implicit dependency between specifications done through cuDNN API and device tensors passed in runtime. The execution of the operation graph is undefined when the passed in device tensors do not conform with the API specifications. 
Embedding dimension size (d) 
Embedding dimension size of only 64 is supported. 
Maximum sequence length (s) 
Maximum sequence length sizes must be a multiple of 64. In addition, maximum sequence lengths sizes upto only 512 are supported. 
Mapping with Backend Descriptors
For readability, the operations used in this section are abbreviated. The mapping with the actual backend descriptors can be found in this table:
Notations Used In This Section 
Backend Descriptor 






















































A tensor core operation mode used to accelerate floating point convolutions or matmuls. This can be used for an operation with compute type 