Graph API

The cuDNN library provides a declarative programming model for describing computation as a graph of operations. This graph API was introduced in cuDNN 8.0 to provide a more flexible API, especially with the growing importance of operation fusion.

The user starts by building a graph of operations. At a high level, the user is describing a dataflow graph of operations on tensors, which typically represents a partition of the user’s full network graph, which they would like to offload to a GPU kernel (or small set of kernels). Given a finalized graph, the user then selects and configures an engine that can execute that graph. There are several methods for selecting and configuring engines, which have tradeoffs with respect to ease-of-use, runtime overhead, and engine performance.

The graph API has two entry points:

  • the NVIDIA cuDNN backend API, a low-level C API

  • the NVIDIA cuDNN frontend API, an open source C++ convenience layer built on top of the backend API

We expect that most users prefer the cuDNN frontend API because:

  • It is less verbose without loss of control - all functionality accessible through the backend API is also accessible through the frontend API.

  • It adds functionality on top of the backend API, like errata filters and autotuning.

  • It is open source.

In either case (that is, the backend or frontend API), the high level concepts are the same.

Key Concepts

As mentioned previously, the key concepts in the graph API are:

  • operations and operation graphs

  • engines and engine configurations

  • heuristics

These concepts are covered in the subsections below. Later we’ll go through an example to tie them all together.

Operations and Operation Graphs

An operation graph is a dataflow graph of operations on tensors. It is meant to be a mathematical specification and is decoupled from the underlying engines that can implement it, as there may be more than one engine available for a given graph.

I/O tensors connect the operations implicitly, for example, an operation A may produce a tensor X, which is then consumed by operation B, implying that operation B depends on operation A.

Engines and Engine Configurations

For a given operation graph, there are some number of engines that are candidates for implementing that graph. The typical way to query for a list of candidate engines is through a heuristics query, covered below.

An engine has knobs for configuring properties of the engine, like tile size (refer to cudnnBackendKnobType_t).
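As an illustrative sketch only, knobs are expressed in the backend API as knob-choice descriptors attached to an engine config; the engine index, the knob type, and its value below are placeholders, and error handling and cleanup are omitted:

```cpp
// Sketch: wrap engine 0 of a finalized operation graph in an engine config and
// set one knob on it. "opGraph" is a finalized CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR.
#include <cudnn.h>

void configureEngineWithKnob(cudnnBackendDescriptor_t opGraph) {
    int64_t globalIndex = 0;  // engine index, normally taken from a heuristics result

    cudnnBackendDescriptor_t engine;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINE_DESCRIPTOR, &engine);
    cudnnBackendSetAttribute(engine, CUDNN_ATTR_ENGINE_OPERATION_GRAPH,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &opGraph);
    cudnnBackendSetAttribute(engine, CUDNN_ATTR_ENGINE_GLOBAL_INDEX,
                             CUDNN_TYPE_INT64, 1, &globalIndex);
    cudnnBackendFinalize(engine);

    // A knob choice pairs a knob type with a value (refer to cudnnBackendKnobType_t).
    cudnnBackendDescriptor_t knob;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_KNOB_CHOICE_DESCRIPTOR, &knob);
    cudnnBackendKnobType_t knobType = CUDNN_KNOB_TYPE_SPLIT_K;  // example knob only
    int64_t knobValue = 1;                                      // example value only
    cudnnBackendSetAttribute(knob, CUDNN_ATTR_KNOB_CHOICE_KNOB_TYPE,
                             CUDNN_TYPE_KNOB_TYPE, 1, &knobType);
    cudnnBackendSetAttribute(knob, CUDNN_ATTR_KNOB_CHOICE_KNOB_VALUE,
                             CUDNN_TYPE_INT64, 1, &knobValue);
    cudnnBackendFinalize(knob);

    // The engine config binds the engine together with its knob choices.
    cudnnBackendDescriptor_t engCfg;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &engCfg);
    cudnnBackendSetAttribute(engCfg, CUDNN_ATTR_ENGINECFG_ENGINE,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &engine);
    cudnnBackendSetAttribute(engCfg, CUDNN_ATTR_ENGINECFG_KNOB_CHOICES,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &knob);
    cudnnBackendFinalize(engCfg);
    // engCfg can now be used to finalize an execution plan.
}
```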

Other Runtime Concepts

Native CUDA Graph API

For select engines, cuDNN now provides a way to directly build a CUDA graph (not to be confused with a cuDNN graph) from an execution plan and a set of variant pointers. This is a more flexible alternative to using CUDA graph capture, as it allows the updating of an existing CUDA graph with new variant pack pointers. Like other CUDA graphs, the generated graph can be executed arbitrarily many times from different streams, and can be embedded as a child node of a larger CUDA graph.

This functionality consists of two backend API functions: one that populates a new CUDA graph from an execution plan and a variant pack, and one that updates a previously populated CUDA graph with new variant pack pointers.

Engines supporting these APIs have the behavior note CUDNN_BEHAVIOR_NOTE_SUPPORTS_CUDA_GRAPH_NATIVE_API.
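A rough sketch of how this might be used follows. The exact prototypes of cudnnBackendPopulateCudaGraph() and cudnnBackendUpdateCudaGraph() should be confirmed against the NVIDIA cuDNN API Reference; `plan`, `variantPack`, and `newVariantPack` are assumed to be finalized backend descriptors, and error handling is omitted:

```cpp
// Sketch (assumed cuDNN 9 native CUDA graph API; verify signatures in the API Reference).
#include <cudnn.h>
#include <cuda_runtime.h>

void runViaNativeCudaGraph(cudnnHandle_t handle,
                           cudnnBackendDescriptor_t plan,
                           cudnnBackendDescriptor_t variantPack,
                           cudnnBackendDescriptor_t newVariantPack,
                           cudaStream_t stream) {
    // Build a CUDA graph directly from the execution plan and variant pack.
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);
    cudnnBackendPopulateCudaGraph(handle, plan, variantPack, graph);

    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, 0);
    cudaGraphLaunch(graphExec, stream);              // launch as many times as needed

    // Later, swap in new device pointers without rebuilding the graph topology.
    cudnnBackendUpdateCudaGraph(handle, plan, newVariantPack, graph);
    cudaGraphExecUpdateResultInfo info;
    cudaGraphExecUpdate(graphExec, graph, &info);    // refresh the instantiated graph
    cudaGraphLaunch(graphExec, stream);
}
```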

Heuristics

A heuristic is a way to get a list of engine configurations that are intended to be sorted from the most performant to least performant for the given operation graph. There are three modes:

  • CUDNN_HEUR_MODE_A - intended to be fast and be able to handle most operation graph patterns. It returns a list of engine configs ranked by the expected performance.

  • CUDNN_HEUR_MODE_B - intended to be more generally accurate than mode A, but with the tradeoff of higher CPU latency to return the list of engine configs. The underlying implementation may fall back to the mode A heuristic in cases where we know mode A can do better.

  • CUDNN_HEUR_MODE_FALLBACK - intended to be fast and provide functional fallbacks without expectation of optimal performance.

The recommended workflow is to query either mode A or B and check for support. The first engine config with support is expected to have the best performance.

You can also “auto-tune”, that is, iterate over the list, time each engine config, and choose the best one for a particular problem on a particular device. The cuDNN frontend API provides a convenient function, cudnnFindPlan(), which does this.

If none of the engine configs are supported, use the fallback mode to find functional fallbacks.
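A minimal backend-API sketch of this workflow (query mode A, then take the first engine config whose execution plan finalizes) might look as follows; the helper name and the fixed capacity of eight configs are illustrative, and error handling and cleanup are omitted:

```cpp
// Sketch: heuristics query followed by "first supported config wins".
#include <cudnn.h>

cudnnBackendDescriptor_t pickFirstSupportedPlan(cudnnHandle_t handle,
                                                cudnnBackendDescriptor_t opGraph) {
    // 1. Run the heuristics query on the finalized operation graph.
    cudnnBackendDescriptor_t heur;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR, &heur);
    cudnnBackendSetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &opGraph);
    cudnnBackendHeurMode_t mode = CUDNN_HEUR_MODE_A;
    cudnnBackendSetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_MODE,
                             CUDNN_TYPE_HEUR_MODE, 1, &mode);
    cudnnBackendFinalize(heur);

    // 2. Retrieve up to eight ranked engine configs (illustrative capacity).
    cudnnBackendDescriptor_t cfgs[8];
    for (auto &cfg : cfgs)
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &cfg);
    int64_t returned = 0;
    cudnnBackendGetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 8, &returned, cfgs);

    // 3. The first config whose plan finalizes is expected to be the fastest supported one.
    for (int64_t i = 0; i < returned; ++i) {
        cudnnBackendDescriptor_t plan;
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR, &plan);
        cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
                                 CUDNN_TYPE_HANDLE, 1, &handle);
        cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
                                 CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &cfgs[i]);
        if (cudnnBackendFinalize(plan) == CUDNN_STATUS_SUCCESS)
            return plan;                        // supported: use this plan
        cudnnBackendDestroyDescriptor(plan);    // not supported: try the next config
    }
    return nullptr;  // in real code, retry with CUDNN_HEUR_MODE_FALLBACK
}
```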

Expert users may also want to filter engine configs based on properties of the engine, such as numerical notes, behavior notes, or adjustable knobs. Numerical notes inform the user about the numerical properties of the engine such as whether it does datatype down conversion at the input or during output reduction. The behavior notes can signal something about the underlying implementation like whether or not it uses runtime compilation. The adjustable knobs allow fine grained control of the engine’s behavior and performance.

Graph API Example with Operation Fusion

The following example implements a fusion of convolution, bias, and activation.

Creating Operation and Tensor Descriptors to Specify the Graph Dataflow

First, create three cuDNN backend operation descriptors.

In the following figure, the user specified one forward convolution operation (using CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR), a pointwise operation for the bias addition (using CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_ADD), and a pointwise operation for the ReLU activation (using CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_RELU_FWD). Refer to the Backend Descriptor Types for more details on setting the attributes of these descriptors. For an example of how a forward convolution can be set up, refer to the Setting Up An Operation Graph For A Grouped Convolution use case.

You should also create tensor descriptors for the inputs and outputs of all of the operations in the graph. The graph dataflow is implied by the assignment of tensors, for example, by specifying the backend tensor Tmp0 as both the output of the convolution operation and the input of the bias operation, cuDNN infers that the dataflow runs from the convolution into the bias. The same applies to tensor Tmp1. If the user doesn’t need the intermediate results Tmp0 and Tmp1 for any other use, then the user can specify them to be virtual tensors, so the memory I/Os can later be optimized out.

  • Graphs with more than one operation node do not support in-place operations (that is, where any of the input UIDs matches any of the output UIDs). Such in-place operations are considered cyclic in later graph analysis and deemed unsupported. In-place operations are supported for single-node graphs.

  • The operation descriptors can be created and passed into cuDNN in any order, as the tensor UIDs are enough to determine the dependencies in the graph.

A Set of Operation Descriptors the User Passes to the Operation Graph
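For illustration, a virtual intermediate tensor such as Tmp0 could be declared through the backend API roughly as follows; the dimensions, strides, data type, and UID are placeholders, and error handling is omitted:

```cpp
// Sketch: declare a virtual (intermediate) backend tensor descriptor.
#include <cudnn.h>

cudnnBackendDescriptor_t makeVirtualTensor(int64_t uid) {
    int64_t dims[]    = {8, 64, 56, 56};                  // example NCHW dimensions
    int64_t strides[] = {64 * 56 * 56, 1, 56 * 64, 64};   // NHWC-packed strides
    cudnnDataType_t dtype = CUDNN_DATA_HALF;
    int64_t alignment = 16;
    bool isVirtual = true;   // intermediate result: no device memory will be bound to it

    cudnnBackendDescriptor_t t;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_TENSOR_DESCRIPTOR, &t);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_UNIQUE_ID, CUDNN_TYPE_INT64, 1, &uid);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_DATA_TYPE, CUDNN_TYPE_DATA_TYPE, 1, &dtype);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_DIMENSIONS, CUDNN_TYPE_INT64, 4, dims);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_STRIDES, CUDNN_TYPE_INT64, 4, strides);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT, CUDNN_TYPE_INT64, 1, &alignment);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_IS_VIRTUAL, CUDNN_TYPE_BOOLEAN, 1, &isVirtual);
    cudnnBackendFinalize(t);
    return t;   // pass the same descriptor as the convolution output and the bias input
}
```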

Finalizing The Operation Graph

Second, the user finalizes the operation graph. As part of finalization, cuDNN performs the dataflow analysis to establish the dependency relationship between operations and connect the edges, as illustrated in the following figure. In this step, cuDNN performs various checks to confirm the validity of the graph.

The Operation Graph after Finalization
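A minimal sketch of this step with the backend API, assuming the three operation descriptors from the previous step (names are illustrative, error handling omitted):

```cpp
// Sketch: assemble and finalize the operation graph for conv -> bias -> activation.
#include <cudnn.h>

cudnnBackendDescriptor_t buildOpGraph(cudnnHandle_t handle,
                                      cudnnBackendDescriptor_t convOp,
                                      cudnnBackendDescriptor_t biasOp,
                                      cudnnBackendDescriptor_t actOp) {
    cudnnBackendDescriptor_t ops[] = {convOp, biasOp, actOp};  // any order is fine

    cudnnBackendDescriptor_t opGraph;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR, &opGraph);
    cudnnBackendSetAttribute(opGraph, CUDNN_ATTR_OPERATIONGRAPH_OPS,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 3, ops);
    cudnnBackendSetAttribute(opGraph, CUDNN_ATTR_OPERATIONGRAPH_HANDLE,
                             CUDNN_TYPE_HANDLE, 1, &handle);
    // Finalization runs the dataflow analysis and validity checks described above.
    cudnnBackendFinalize(opGraph);
    return opGraph;
}
```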

Configuring An Engine That Can Execute The Operation Graph

Third, given the finalized operation graph, the user must select and configure an engine to execute that graph, which results in an execution plan. As mentioned in Heuristics, the typical way to do this is:

  1. Query heuristics mode A or B.

  2. Look for the first engine config with functional support (or auto-tune all the engine configs with functional support).

  3. If no engine config was found in (2), try querying the fallback heuristic for more options.

Executing The Engine


Finally, with the execution plan constructed, when it comes time to run it, the user constructs the backend variant pack by providing the workspace pointer, an array of UIDs, and an array of device pointers. The UIDs and the pointers must be in corresponding order. With the handle, the execution plan, and the variant pack, the execution API can be called and the computation is carried out on the GPU.
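A sketch of this last step with the backend API follows; the UIDs and buffer names are placeholders that must match the non-virtual tensors declared in the graph, and error handling is omitted:

```cpp
// Sketch: bind device pointers by UID and launch the execution plan.
// workspace must be at least CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE bytes.
#include <cudnn.h>

void runPlan(cudnnHandle_t handle, cudnnBackendDescriptor_t plan,
             void *x, void *w, void *y, void *workspace) {
    int64_t uids[] = {'x', 'w', 'y'};   // must match the tensor UIDs in the graph
    void   *ptrs[] = { x,   w,   y };   // same order as the UIDs

    cudnnBackendDescriptor_t varPack;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_VARIANT_PACK_DESCRIPTOR, &varPack);
    cudnnBackendSetAttribute(varPack, CUDNN_ATTR_VARIANT_PACK_UNIQUE_IDS,
                             CUDNN_TYPE_INT64, 3, uids);
    cudnnBackendSetAttribute(varPack, CUDNN_ATTR_VARIANT_PACK_DATA_POINTERS,
                             CUDNN_TYPE_VOID_PTR, 3, ptrs);
    cudnnBackendSetAttribute(varPack, CUDNN_ATTR_VARIANT_PACK_WORKSPACE,
                             CUDNN_TYPE_VOID_PTR, 1, &workspace);
    cudnnBackendFinalize(varPack);

    // The computation is enqueued on the stream associated with the handle.
    cudnnBackendExecute(handle, plan, varPack);
}
```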

Supported Graph Patterns

The cuDNN Graph API supports a set of graph patterns. These patterns are supported by a large number of engines, each with their own support surfaces. These engines are grouped into four different classes, as reflected by the following four subsections: pre-compiled single operation engines, generic runtime fusion engines, specialized runtime fusion engines, and specialized pre-compiled fusion engines. The specialized engines, whether they use runtime compilation or pre-compilation, are targeted to a set of important use cases, and thus have a fairly limited set of patterns they currently support. Over time, we expect to support more of those use cases with the generic runtime fusion engines, whenever practical.

Since these engines have some overlap in the patterns they support, a given pattern may be supported by zero, one, or more engines.

Pre-Compiled Single Operation Engines

One basic class of engines includes pre-compiled engines that support an operation graph with just one operation; specifically: ConvolutionFwd, ConvolutionBwdFilter, ConvolutionBwdData, or ConvolutionBwdBias. Their more precise support surface can be found in the NVIDIA cuDNN API Reference.

ConvolutionBwdData

ConvolutionBwdData computes the convolution data gradient of the tensor dy. In addition, it uses scaling factors α and β to blend this result with the previous output. This graph operation is similar to cudnnConvolutionBackwardData().

ConvolutionBwdData Engine
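In equation form (with dx_prev denoting the prior contents of the output buffer), the blending described above is:

\(dx = \alpha \cdot \text{ConvolutionBwdData}(dy, W) + \beta \cdot dx_{prev}\)

The same α and β blending applies to ConvolutionBwdFilter and ConvolutionFwd below.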

ConvolutionBwdFilter

ConvolutionBwdFilter computes the convolution filter gradient of the tensor dy. In addition, it uses scaling factors α and β to blend this result with the previous output. This graph operation is similar to cudnnConvolutionBackwardFilter().

ConvolutionBwdFilter Engine

ConvolutionFwd

ConvolutionFwd computes the convolution of X with filter data W. In addition, it uses scaling factors α and β to blend this result with the previous output. This graph operation is similar to cudnnConvolutionForward().

ConvolutionFwd Engine

NormalizationBackward

NormalizationBackward computes the gradient dX and the scale and bias gradients dScale and dBias. This operation supports multiple modes which are set by the attribute CUDNN_ATTR_OPERATION_NORM_BWD_MODE. Currently, layer norm, batch norm, and RMS norm are all supported in both the GRAPH_JIT_ONLY and FULL library configs (refer to DReluForkDBn). Instance norm is only supported in the FULL library config. The mean and variance saved during the forward training pass are passed as input to the NormBackward operation.

NormalizationBackward Engine
Instance Norm, Layer Norm, and RMS Norm For NormalizationBackward

| Node and Other Attributes | Instance Normalization Backward | Layer Normalization Backward | RMS Normalization Backward |
| --- | --- | --- | --- |
| operation | normBwd | normBwd | normBwd |
| X | [N, C, (D), H, W], input, I type | [N, C, (D), H, W], input, I type | [N, C, (D), H, W], input, I type |
| Mean | [N, C, (1), 1, 1], input, compute type | [N, 1, (1), 1, 1], input, compute type | N/A |
| InvVariance | [N, C, (1), 1, 1], input, compute type | [N, 1, (1), 1, 1], input, compute type | [N, 1, (1), 1, 1], input, compute type |
| Scale | [1, C, (1), 1, 1], input weight, W type | [1, C, (D), H, W], input weight, W type | [1, C, (D), H, W], input weight, W type |
| DY | [N, C, (D), H, W], input, O type | [N, C, (D), H, W], input, O type | [N, C, (D), H, W], input, O type |
| DX | [N, C, (D), H, W], output, I type | [N, C, (D), H, W], output, I type | [N, C, (D), H, W], output, I type |
| Dscale | [1, C, (1), 1, 1], output, W type | [1, C, (D), H, W], output, W type | [1, C, (D), H, W], output, W type |
| Dbias | [1, C, (1), 1, 1], output, W type | [1, C, (D), H, W], output, W type | Optional |
| mode | CUDNN_INSTANCE_NORM | CUDNN_LAYER_NORM | CUDNN_RMS_NORM |
| Supported layout | NC(D)HW, N(D)HWC | NC(D)HW, N(D)HWC | NC(D)HW, N(D)HWC |
| Supported I and O types | FP16, FP32, BF16 | FP16, FP32, BF16 | FP16, FP32, BF16 |
| Supported compute type | FP32 | FP32 | FP32 |
| Supported W types | FP32 | FP16, FP32, BF16 | FP16, FP32, BF16 |
| Alignment requirements for I/O type | 8 bytes aligned | 16 bytes aligned | 16 bytes aligned |

For each operation, all applicable tensors must have the same layout. Neither mixed I/O types nor mixed compute types are supported.

Layer Norm and RMS Norm are also supported by specialized runtime compiled engines with CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 for normalization backward operations. For these more performant engines, the sizeof(Itype) >= sizeof(Otype) restriction applies.

NormalizationForward

NormalizationForward computes the normalization output Y from the input X. This operation is used in both the inference and training phase. The phases are distinguished by the attribute CUDNN_ATTR_OPERATION_NORM_FWD_PHASE.

NormalizationForward Engine

This operation supports different normalization modes which are set by the attribute CUDNN_ATTR_OPERATION_NORM_FWD_MODE. The dashed lines indicate optional inputs, which are typically used in the batch norm mode of this operation. Currently, layer norm, batch norm, and RMS norm are all supported in the GRAPH_JIT_ONLY and FULL library configs, using runtime compiled kernels (refer to BnAddRelu). Instance norm is supported in the FULL library config only.

Instance Norm, Layer Norm, and RMS Norm For NormalizationForward

| Node and Other Attributes | Instance Normalization Forward | Layer Normalization Forward | RMS Normalization Forward |
| --- | --- | --- | --- |
| operation | normFwd | normFwd | normFwd |
| X | [N, C, (D), H, W], input, I type | [N, C, (D), H, W], input, I type | [N, C, (D), H, W], input, I type |
| Mean | [N, C, (1), 1, 1], output, compute type, only applicable to fmode CUDNN_NORM_FWD_TRAINING | [N, 1, (1), 1, 1], output, compute type, only applicable to fmode CUDNN_NORM_FWD_TRAINING | N/A |
| InvVariance | [N, C, (1), 1, 1], output, compute type, only applicable to fmode CUDNN_NORM_FWD_TRAINING | [N, 1, (1), 1, 1], output, compute type, only applicable to fmode CUDNN_NORM_FWD_TRAINING | [N, 1, (1), 1, 1], output, compute type, only applicable to fmode CUDNN_NORM_FWD_TRAINING |
| Scale | [1, C, (1), 1, 1], input weight, W type | [1, C, (D), H, W], input weight, W type | [1, C, (D), H, W], input weight, W type |
| Bias | [1, C, (1), 1, 1], input weight, W type | [1, C, (D), H, W], input weight, W type | Optional (no bias by default) |
| Y | [N, C, (D), H, W], output, O type | [N, C, (D), H, W], output, O type | [N, C, (D), H, W], output, O type |
| epsilonDesc | [1, 1, 1, 1], input, constant | [1, 1, 1, 1], input, constant | [1, 1, 1, 1], input, constant |
| mode | CUDNN_INSTANCE_NORM | CUDNN_LAYER_NORM | CUDNN_RMS_NORM |
| Supported fmode | CUDNN_NORM_FWD_TRAINING, CUDNN_NORM_FWD_INFERENCE | CUDNN_NORM_FWD_TRAINING, CUDNN_NORM_FWD_INFERENCE | CUDNN_NORM_FWD_TRAINING, CUDNN_NORM_FWD_INFERENCE |
| Supported layout | NC(D)HW, N(D)HWC | NC(D)HW, N(D)HWC | NC(D)HW, N(D)HWC |
| Supported I and O types | FP16, FP32, BF16 | FP16, FP32, BF16 | FP16, FP32, BF16 |
| Supported compute type | FP32 | FP32 | FP32 |
| Supported weight types, W type | FP32 | FP16, FP32, BF16 | FP16, FP32, BF16 |
| Alignment requirements for I/O type | 8 bytes aligned | 16 bytes aligned | 16 bytes aligned |

For each operation, all applicable tensors must have the same layout.

Layer Norm and RMS Norm are also supported by specialized runtime compiled engines with CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 when fmode is set to CUDNN_NORM_FWD_TRAINING and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 4 when fmode is set to CUDNN_NORM_FWD_INFERENCE. For these more performant engines, the sizeof(Itype) >= sizeof(Otype) restriction applies. For non-runtime compiled engines, the W type is the compute type.

Generic Runtime Fusion Engines

The engines documented in the previous section support single-op patterns. Of course, for fusion to be interesting, the graph needs to support multiple operations. And ideally, we want the supported patterns to be flexible to cover a diverse set of use cases. To accomplish this generality, cuDNN has runtime fusion engines that generate the kernel (or kernels) at runtime based on the graph pattern. This section outlines the patterns supported by these runtime fusion engines (that is, engines with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION behavioral note).

We can think of the support surface as covering the following generic patterns:

  1. Matmul fusions: \(g_{2}\left( C = \text{Matmul}\left( A = g_{1A}\left( \text{inputs} \right), B = g_{1B}\left( \text{inputs} \right) \right), \text{inputs} \right)\)

  2. ConvolutionFwd fusions: \(g_{2}\left( Y = \text{ConvolutionFwd}\left( X = g_{1}\left( \text{inputs} \right), W \right), \text{inputs} \right)\)

  3. ConvolutionBwdFilter fusions: \(g_{2}\left( dw = \text{ConvolutionBwdFilter}\left( dy, X = g_{1}\left( \text{inputs} \right) \right), \text{inputs} \right)\)

  4. ConvolutionBwdData fusions: \(g_{2}\left( dx = \text{ConvolutionBwdData}\left( dy = g_{1}\left( \text{inputs} \right), W \right), \text{inputs} \right)\)

  5. Pointwise fusions: \(g_{2}\left( \text{inputs} \right)\)

Graphical Representation of the Generic Patterns Supported by the Runtime Fusion Engines

Note

  • g 1 (including g 1A and g 1B) indicates fusion operations applied to the inputs of the matmul and convolution operation.

  • g 2 indicates fusion operations applied to the output of the matmul and convolution operation.

  • g 2 can have more than one output.

  • The fusion patterns in g 1 will be referred to as the mainloop fusion, and the fusion patterns in g 2 will be referred to as the epilogue fusion.

  • The arrow going into g 2 can go into any of g 2 nodes and does not necessarily need to feed into a root node.

  • The abbreviated notations for operations are used in the diagrams and throughout the text for visualization purposes.

Support Surface

The generic runtime fusion engines include three independent support surfaces, indexed as 90, 80, and 70. A cuDNN graph that fulfills the requirements of at least one support surface can be executed by the generic runtime fusion engines. The following table summarizes the features of each support surface. For best performance, we recommend targeting the highest indexed support surface possible and falling back to lower indexed support surfaces only if needed.

Summary of Supported Features of Each Support Surface

| Feature | Support Surface 90 | Support Surface 80 | Support Surface 70 |
| --- | --- | --- | --- |
| Compute Capability | >= 9.0 | >= 8.0 | >= 7.0 |
| Matmul Fusions | Supported | Supported | Supported |
| ConvolutionFwd Fusions | Supported | Supported | Supported |
| ConvolutionBwdFilter Fusions | Supported | Not Supported | Supported |
| ConvolutionBwdData Fusions | Partially Supported | Not Supported | Supported |
| Pointwise and Reduction Fusions | Not Supported | Not Supported | Supported |
| FP8 Matmul and Convolution Operations | Supported | Supported for Compute Capability >= 8.9 | Not Supported |
| g 1 (Mainloop) Fusions | Supported | Supported | Partially Supported |
| g 2 (Epilogue) Fusions | Supported | Supported | Supported |
| Mixed Input Precision Matmul/Convolution | Supported | Supported | Not Supported |
| Grouped Convolution | Supported | Supported | Not Supported |

The detailed supported features of each support surface are listed in the following subsections.

Support Surface 90

Compute Capability

  • NVIDIA GPUs with compute capability 9.0 are supported.

Generic Limitations

  • Strided ConvolutionBwdData fusions are not supported.

  • Pointwise and Reduction fusions are not supported.

Advanced Matmul/Convolution Variations

  • Mixed input precision Matmul, ConvolutionFwd, and ConvolutionBwdData fusions are supported.

  • Grouped ConvolutionFwd, ConvolutionBwdFilter, and ConvolutionBwdData fusions are supported.

I/O and Intermediate Data Type

  • The input tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}.

  • The input tensor data type of Matmul, ConvolutionFwd, ConvolutionBwdFilter, and ConvolutionBwdData operations can be any of {FLOAT, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}.

  • The output tensor data type can be any of {INT64, FLOAT, INT32, HALF, BFLOAT16, INT8, UINT8, FP8_E4M3, FP8_E5M2}.

  • The output data types of CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations can only be FLOAT.

  • The intermediate virtual tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}, and this intermediate storage type is obeyed by the code-generator. Generally, FP32 is recommended.

Compute Data Type

  • Compute data type can be either FP32 or INT32 for CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations.

  • Compute data type can only be FP32 for CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations.

  • The support surface of the compute data type of Matmul, ConvolutionFwd, ConvolutionBwdFilter, and ConvolutionBwdData operations depends on the input data types of the operation. The combinatory support surface is listed in the following table.

Combinatory Support Surface of Input Data Type and Compute Data Type of Matmul and Convolution Operations

| Matmul / Convolution Operation Input Data Type | Matmul / Convolution Operation Compute Data Type |
| --- | --- |
| INT8 | INT32 |
| FP8_E4M3, FP8_E5M2 | FLOAT, FAST_FLOAT_FOR_FP8 |
| HALF | FLOAT, HALF |
| BFLOAT16 | FLOAT |
| FLOAT | FLOAT |

Mainloop Fusions: g 1

  • g 1 is a directed acyclic graph (DAG) that can consist of zero or any number of the CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations.

  • All the input tensors must have an alignment of 128 bits. For grouped ConvolutionFwd, ConvolutionBwdFilter, and ConvolutionBwdData fusions, the alignment requirement is per group.

  • All the intermediate tensors must be virtual.

  • The support surface of dimension and layout are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 1

Matmul fusions
  • Dimension:
    • Tensor A must have dimension dim[B, M, K] or dim[1, M, K].
    • Dimensions of input tensors to g 1A can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, K], dim[B, 1, K], dim[1, M, K], or dim[B, M, K].
    • Tensor B must have dimension dim[B, K, N] or dim[1, K, N].
    • Dimensions of input tensors to g 1B can be dim[1, 1, 1], dim[B, 1, 1], dim[1, 1, N], dim[B, 1, N], dim[1, K, 1], dim[B, K, 1], dim[1, K, N], or dim[B, K, N].
  • Layout:
    • All tensors can be in either row-major or column-major layout.
    • The leading dimension must be fully packed.

ConvolutionFwd fusions
  • Dimension:
    • Tensor X must have dimension dim[N, C, (D,) H, W].
    • Dimensions of tensors operating with X can be dim[1, 1, (1,) 1, 1], dim[1, C, (1,) 1, 1], or dim[N, C, (D,) H, W]. Exception: dim[1, C, (1,) 1, 1] is not compatible in grouped ConvolutionFwd fusions.
    • Tensor W must have dimension dim[K, C, (T,) R, S].
  • Layout:
    • All tensors must be in NHWC layout.
    • The leading dimension must be fully packed.

ConvolutionBwdFilter fusions
  • Dimension:
    • Tensor dy must have dimension dim[N, K, (O,) P, Q].
    • Tensor X must have dimension dim[N, C, (D,) H, W].
    • Fusion operation in g 1 is not supported.
  • Layout:
    • All tensors can be in either NHWC or CHWN layout.
    • For INT8, FP8_E4M3, and FP8_E5M2 data types, dy or X being in NHWC layout may result in low performance.
    • The leading dimension must be fully packed.

ConvolutionBwdData fusions
  • Dimension:
    • Tensor dy must have dimension dim[N, K, (O,) P, Q].
    • Dimensions of tensors operating with dy can be dim[1, 1, (1,) 1, 1], dim[1, K, (1,) 1, 1], or dim[N, K, (O,) P, Q].
    • Tensor W must have dimension dim[K, C, (T,) R, S].
  • Layout:
    • All tensors can be in either NHWC or CHWN layout.
    • For INT8, FP8_E4M3, and FP8_E5M2 data types, dy being in CHWN layout or W being in NHWC layout may result in low performance.
    • The leading dimension must be fully packed.

Epilogue Fusion: g 2

  • g 2 is a directed acyclic graph (DAG) that can consist of zero or any number of the CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operation and zero or one CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation.

  • CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation can only be the exit node of g 2.

  • All the input and output tensors must have an alignment of 8 bits. For grouped ConvolutionFwd, ConvolutionBwdFilter, and ConvolutionBwdData fusions, the alignment requirement is per group.

  • In CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations, the tensors being broadcasted cannot be placed as the first input.

  • The support surface of dimension and layout are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 2

Matmul fusions
  • Dimension:
    • Tensor C must have dimension dim[B, M, N] and be the first input operand of each operation.
    • Dimensions of other input tensors to g 2 can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, N], dim[B, 1, N], dim[1, M, N], or dim[B, M, N].
    • Dimensions of output tensors from g 2 can be dim[B, M, N].
    • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, N], dim[B, 1, N], or dim[1, M, N].
  • Layout:
    • All tensors can be in either row-major or column-major layout, but they need to be the same.
    • The leading dimension must be fully packed.
    • If a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation is present, all tensor layouts must be row-major.

ConvolutionFwd fusions
  • Dimension:
    • Tensor Y must have dimension dim[N, K, (O,) P, Q] and be the first input operand of each operation.
    • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], dim[1, K, (1,) 1, 1], or dim[N, K, (O,) P, Q].
    • Dimensions of output tensors from g 2 can be dim[N, K, (O,) P, Q].
    • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], dim[1, K, (1,) 1, 1], dim[N, K, (1,) 1, 1], or dim[N, 1, (1,) 1, 1].
    • Grouped ConvolutionFwd fusions cannot have a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation in g 2.
  • Layout:
    • All tensors must be in an NHWC layout.
    • The leading dimension must be fully packed.

ConvolutionBwdFilter fusions
  • Dimension:
    • Tensor dw must have dimension dim[K, C, (T,) R, S] and be the first input operand of each operation.
    • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[1, C, (T,) R, S], dim[K, 1, (1,) 1, 1], or dim[K, C, (T,) R, S].
    • Dimensions of output tensors from g 2 can be dim[K, C, (T,) R, S].
    • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[1, C, (T,) R, S], or dim[K, 1, (1,) 1, 1].
    • Grouped ConvolutionBwdFilter fusions cannot have a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation in g 2.
  • Layout:
    • All tensors must be in an NHWC layout.
    • The leading dimension must be fully packed.

ConvolutionBwdData fusions
  • Dimension:
    • Tensor dx must have dimension dim[N, C, (D,) H, W].
    • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[N, 1, (D,) H, W], dim[1, C, (1,) 1, 1], or dim[N, C, (D,) H, W].
    • Dimensions of output tensors from g 2 can be dim[N, C, (D,) H, W].
    • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[N, 1, (D,) H, W], dim[1, C, (1,) 1, 1], dim[N, C, (1,) 1, 1], or dim[N, 1, (1,) 1, 1].
    • Grouped ConvolutionBwdData fusions cannot have g 2.
  • Layout:
    • All tensors must be in an NHWC layout.
    • The leading dimension must be fully packed.

Pointwise and Reduction fusions
  • Not supported.

Support Surface 80

Compute Capability

  • NVIDIA GPUs with compute capability 8.0, 8.6, 8.7, 8.9, and 9.0 are supported.

Generic Limitations

  • ConvolutionBwdFilter fusions are not supported.

  • ConvolutionBwdData fusions are not supported.

  • Pointwise and Reduction fusions are not supported.

Advanced Matmul/Convolution Variations

  • Mixed input precision Matmul and ConvolutionFwd fusions are supported.

  • Grouped ConvolutionFwd fusions are supported.

I/O and Intermediate Data Type

  • The input tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}.

  • The input tensor data type of Matmul and ConvolutionFwd operations can be any of {FLOAT, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}. FP8_E4M3 and FP8_E5M2 input tensor data type of Matmul and ConvolutionFwd operations are only available with compute capability 8.9.

  • The output tensor data type can be any of {INT64, FLOAT, INT32, HALF, BFLOAT16, INT8, UINT8, FP8_E4M3, FP8_E5M2}.

  • The output data types of CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations can only be FLOAT.

  • The intermediate virtual tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}, and this intermediate storage type is obeyed by the code-generator. Generally, FP32 is recommended.

  • FP8_E4M3 and FP8_E5M2 input, output, and intermediate tensor data types are only available with compute capability 8.9 and 9.0.

Compute Data Type

  • Compute data type can be either FP32 or INT32 for CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations.

  • Compute data type can only be FP32 for CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations.

  • The support surface of the compute data type of Matmul, ConvolutionFwd, ConvolutionBwdFilter, and ConvolutionBwdData operations depends on the input data types of the operation. The combinatory support surface is listed in the following table.

Combinatory Support Surface of Input Data Type and Compute Data Type of Matmul and Convolution Operations

| Matmul / Convolution Operation Input Data Type | Matmul / Convolution Operation Compute Data Type | Note |
| --- | --- | --- |
| INT8 | INT32 | |
| FP8_E4M3, FP8_E5M2 | FLOAT, FAST_FLOAT_FOR_FP8 | Available with compute capability 8.9 only |
| HALF | FLOAT, HALF | |
| BFLOAT16 | FLOAT | |
| FLOAT | FLOAT | |

Mainloop Fusions: g 1

  • g 1 is a directed acyclic graph (DAG) that can consist of zero or any number of the CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operation.

  • All the input tensors must have an alignment of 32 bits. For ConvolutionFwd fusions with no operations in g 1, the input tensors can have an alignment of 8 bits. For grouped ConvolutionFwd fusions, the alignment requirement is per group.

  • All the intermediate tensors must be virtual.

  • The support surface of dimension and layout are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 1

Matmul fusions
  • Dimension:
    • Tensor A must have dimension dim[B, M, K] or dim[1, M, K].
    • Dimensions of input tensors to g 1A can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, K], dim[B, 1, K], dim[1, M, K], or dim[B, M, K].
    • Tensor B must have dimension dim[B, K, N] or dim[1, K, N].
    • Dimensions of input tensors to g 1B can be dim[1, 1, 1], dim[B, 1, 1], dim[1, 1, N], dim[B, 1, N], dim[1, K, 1], dim[B, K, 1], dim[1, K, N], or dim[B, K, N].
  • Layout:
    • All tensors can be in either row-major or column-major layout.
    • The leading dimension must be fully packed.

ConvolutionFwd fusions
  • Dimension:
    • Tensor X must have dimension dim[N, C, (D,) H, W].
    • Dimensions of tensors operating with X can be dim[1, 1, (1,) 1, 1], dim[1, C, (1,) 1, 1], or dim[N, C, (D,) H, W]. Exception: dim[1, C, (1,) 1, 1] is not compatible in grouped ConvolutionFwd fusions.
    • Tensor W must have dimension dim[K, C, (T,) R, S].
  • Layout:
    • All tensors must be in an NHWC layout.
    • The leading dimension must be fully packed.

ConvolutionBwdFilter fusions
  • Not supported.

ConvolutionBwdData fusions
  • Not supported.

Epilogue Fusion: g 2

  • g 2 is a directed acyclic graph (DAG) that can consist of zero or any number of the CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operation and zero or one CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation.

  • CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation can only be the exit node of g 2.

  • All the input and output tensors must have an alignment of 8 bits. For grouped ConvolutionFwd fusions, the alignment requirement is per group.

  • In CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations, the tensors being broadcasted cannot be placed as the first input.

  • The support surface of dimension and layout are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 2

Matmul fusions
  • Dimension:
    • Tensor C must have dimension dim[B, M, N] and be the first input operand of each operation.
    • Dimensions of other input tensors to g 2 can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, N], dim[B, 1, N], dim[1, M, N], or dim[B, M, N].
    • Dimensions of output tensors from g 2 can be dim[B, M, N].
    • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, N], dim[B, 1, N], or dim[1, M, N].
  • Layout:
    • All tensors can be in either row-major or column-major layout, but they need to be the same.
    • The leading dimension must be fully packed.
    • If a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation is present, all tensor layouts must be row-major.

ConvolutionFwd fusions
  • Dimension:
    • Tensor Y must have dimension dim[N, K, (O,) P, Q] and be the first input operand of each operation.
    • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], dim[1, K, (1,) 1, 1], or dim[N, K, (O,) P, Q].
    • Dimensions of output tensors from g 2 can be dim[N, K, (O,) P, Q].
    • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], dim[1, K, (1,) 1, 1], or dim[N, 1, (1,) 1, 1].
    • Grouped ConvolutionFwd fusions cannot have a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation in g 2.
  • Layout:
    • All tensors must be in an NHWC layout.
    • The leading dimension must be fully packed.

ConvolutionBwdFilter fusions
  • Not supported.

ConvolutionBwdData fusions
  • Not supported.

Pointwise and Reduction fusions
  • Not supported.

Support Surface 70

Compute Capability

  • NVIDIA GPUs with compute capability 7.0, 7.2, 7.5, 8.0, 8.6, 8.7, 8.9, and 9.0 are supported.

I/O and Intermediate Data Type

  • The input tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}.

  • The input tensor data type of Matmul, ConvolutionFwd, ConvolutionBwdFilter, and ConvolutionBwdData operations can be any of {FLOAT, HALF, BFLOAT16, INT8}.

  • The output tensor data type can be any of {INT64, FLOAT, INT32, HALF, BFLOAT16, INT8, UINT8, FP8_E4M3, FP8_E5M2, BOOLEAN}.

  • The output data types of CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations can only be FLOAT.

  • The intermediate virtual tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2, BOOLEAN}, and this intermediate storage type is obeyed by the code-generator. Generally, FP32 is recommended.

  • FP8_E4M3 and FP8_E5M2 data types are only allowed in pure Pointwise and Reduction fusions.

Compute Data Type

  • Compute data type can be FP32 or BOOLEAN for CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations.

  • Compute data type can only be FP32 for CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations.

  • The support surface of the compute data type of Matmul, ConvolutionFwd, ConvolutionBwdFilter, and ConvolutionBwdData operations depends on the input data types of the operation. The combinatory support surface is listed in the following table.

Combinatory Support Surface of Input Data Type and Compute Data Type of Matmul and Convolution Operations

| Matmul / Convolution Operation Input Data Type | Matmul / Convolution Operation Compute Data Type | Note |
| --- | --- | --- |
| INT8 | INT32 | Not available for ConvolutionBwdFilter and ConvolutionBwdData fusions |
| HALF | FLOAT, HALF | |
| BFLOAT16 | FLOAT | |
| FLOAT | FLOAT | |

Mainloop Fusions: g 1

  • g 1 is a directed acyclic graph (DAG) that can consist of zero or any number of the following operations:

    • CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_CONCAT_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR

  • CUDNN_BACKEND_OPERATION_CONCAT_DESCRIPTOR or CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR operations, if present, should be before any Pointwise operations.

  • For compute capability < 8.0, g 1 is not supported.

  • All the input tensors must have an alignment of 32 bits.

  • All the intermediate tensors must be virtual.

  • In CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations, the tensors being broadcasted cannot be placed as the first input.

  • The support surface of dimension and layout are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 1

Matmul fusions
  • Dimension:
    • Tensor A must have dimension dim[B, M, K] and be the first input operand of each operation.
    • If g 1A is present, tensor A must have the HALF data type; the other input tensors broadcast and operated with tensor A can have any data type.
    • Dimensions of other input tensors to g 1A can be dim[1, 1, 1], dim[B, M, 1], dim[B, 1, K], or dim[B, M, K]. If the input tensor has dimension dim[B, M, K], it must have the HALF data type as well.
    • Tensor B must have dimension dim[B, K, N].
    • Fusion operations in g 1B are not supported.
  • Layout:
    • All tensors can be in either fully packed row-major layout or fully packed column-major layout.
    • All input tensors to g 1A must have the same layout.
    • If the data type of the input tensor to the matmul operation is INT8, all the input tensors to g 1A must be in row-major layout and the tensor B must be in column-major layout.

ConvolutionFwd fusions
  • Dimension:
    • Tensor X must have dimension dim[N, C, (D,) H, W] and be the first input operand of each operation.
    • Tensor W must have dimension dim[K, C, (T,) R, S].
    • Fusion operations on the X tensor can only be a chain of three specific Pointwise operations, in this exact order: CUDNN_POINTWISE_MUL, CUDNN_POINTWISE_ADD, and CUDNN_POINTWISE_RELU_FWD. This specific support is added to realize convolution batch norm fusion use cases.
    • All tensors involved can only be of HALF data type.
    • CUDNN_POINTWISE_MUL and CUDNN_POINTWISE_ADD can only be operated with a tensor with dimension dim[1, C, (1,) 1, 1].
  • Layout:
    • All tensors must be in a fully packed NHWC layout.

ConvolutionBwdFilter fusions
  • Dimension:
    • Tensor dy must have dimension dim[N, K, (O,) P, Q] and be the first input operand of each operation.
    • Tensor X must have dimension dim[N, C, (D,) H, W].
    • Fusion operations on the X tensor can only be a chain of three specific Pointwise operations, in this exact order: CUDNN_POINTWISE_MUL, CUDNN_POINTWISE_ADD, and CUDNN_POINTWISE_RELU_FWD. This specific support is added to realize convolution batch norm fusion use cases.
    • All tensors involved can only be of HALF data type.
    • CUDNN_POINTWISE_MUL and CUDNN_POINTWISE_ADD can only be operated with a tensor with dimension dim[1, C, (1,) 1, 1].
  • Layout:
    • All tensors must be in a fully packed NHWC layout.

ConvolutionBwdData fusions
  • Dimension:
    • Tensor dy must have dimension dim[N, K, (O,) P, Q] and be the first input operand of each operation.
    • Tensor W must have dimension dim[K, C, (T,) R, S].
    • Fusion operations in g 1 are not supported.
  • Layout:
    • All tensors must be in a fully packed NHWC layout.

Epilogue Fusion: g 2

  • g 2 is a directed acyclic graph (DAG) that can consist of zero or any number of the following operations:

    • CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_RESAMPLE_BWD_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_GEN_STATS_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR

and zero or one CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation.

  • CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation can only be the exit node of g 2.

  • CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR operations, if present, must be the final nodes in g 2. Hence, CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR operations cannot be used in conjunction with the CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation.

  • The input tensor to a CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR or CUDNN_BACKEND_OPERATION_RESAMPLE_BWD_DESCRIPTOR operation should not be produced by another operation within this graph, but should come from global memory. These two operations cannot be used in the Matmul, ConvolutionBwdFilter, and ConvolutionBwdData fusions, and are only supported with compute capability >= 7.5.

  • All the input and output tensors must have an alignment of 32 bits, except the output of a CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operation can have an alignment of 8 bits.

  • In CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations, the tensors being broadcasted cannot be placed as the first input.

  • The support surface of dimension and layout are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 2

Matmul fusions
  • Dimension:
    • Tensor C must have dimension dim[B, M, N] and be the first input operand of each operation.
    • Dimensions of other input tensors to g 2 can be dim[1, 1, 1], dim[B, M, 1], dim[B, 1, N], or dim[B, M, N].
    • Dimensions of output tensors from g 2 can be dim[B, M, N].
    • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, 1], dim[B, M, 1], or dim[B, 1, N].
  • Layout:
    • All tensors can be in either fully packed row-major layout or fully packed column-major layout, but they need to be the same.
    • If a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation is present, all tensor layouts must be row-major.

ConvolutionFwd fusions
  • Dimension:
    • Tensor Y must have dimension dim[N, K, (O,) P, Q] and be the first input operand of each operation.
    • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], dim[1, K, (1,) 1, 1], or dim[N, K, (O,) P, Q].
    • Dimensions of output tensors from g 2 can be dim[N, K, (O,) P, Q].
    • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], or dim[1, K, (1,) 1, 1].
  • Layout:
    • All tensors must be in a fully packed NHWC layout.

ConvolutionBwdFilter fusions
  • Dimension:
    • Tensor dw must have dimension dim[K, C, (T,) R, S] and be the first input operand of each operation.
    • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[1, C, (T,) R, S], dim[K, 1, (1,) 1, 1], or dim[K, C, (T,) R, S].
    • Dimensions of output tensors from g 2 can be dim[K, C, (T,) R, S].
    • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[1, C, (T,) R, S], or dim[K, 1, (1,) 1, 1].
  • Layout:
    • All tensors must be in a fully packed NHWC layout.

ConvolutionBwdData fusions
  • Dimension:
    • Tensor dx must have dimension dim[N, C, (D,) H, W] and be the first input operand of each operation.
    • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[N, 1, (D,) H, W], dim[1, C, (1,) 1, 1], or dim[N, C, (D,) H, W].
    • Dimensions of output tensors from g 2 can be dim[N, C, (D,) H, W].
    • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[N, 1, (D,) H, W], or dim[1, C, (1,) 1, 1].
  • Layout:
    • All tensors must be in a fully packed NHWC layout.

Pointwise and Reduction fusions
  • Dimension:
    • If all tensors are 3D, the same dimension requirements as Matmul g 2 apply.
    • If all tensors are 4D or 5D, the same dimension requirements as ConvolutionFwd, ConvolutionBwdFilter, and ConvolutionBwdData g 2 apply.
    • The CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation does not support 3D tensors.
  • Layout:
    • If all tensors are 3D, the same layout requirements as Matmul g 2 apply.
    • If all tensors are 4D or 5D, the same layout requirements as ConvolutionFwd, ConvolutionBwdFilter, and ConvolutionBwdData g 2 apply.

Operation Specific Constraints for the Runtime Fusion Engines

Every operation in the supported generic patterns of the runtime fusion engines is subject to a few specific constraints regarding its parameter surface. The following subsections document these.

Note that these constraints are in addition to (1) any constraints mentioned in the Backend Descriptor Types, and (2) limitations in relation to other operations in the directed acyclic graph (DAG), as mentioned in the Support Surface section.

Matmul

This operation represents matrix-matrix multiplication: A * B = C. For complete details on the interface, refer to the CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR section.
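As a hedged sketch, a Matmul node could be constructed through the backend API as follows; the helper name and the choice of compute type are illustrative, the tensor descriptors are assumed to be finalized elsewhere, and error handling is omitted:

```cpp
// Sketch: build a Matmul operation node C = A * B for the runtime fusion engines.
// aDesc/bDesc/cDesc are finalized backend tensor descriptors of shape
// [B, M, K], [B, K, N], and [B, M, N] respectively.
#include <cudnn.h>

cudnnBackendDescriptor_t makeMatmulOp(cudnnBackendDescriptor_t aDesc,
                                      cudnnBackendDescriptor_t bDesc,
                                      cudnnBackendDescriptor_t cDesc) {
    // Math descriptor holding the compute (accumulation) type.
    cudnnBackendDescriptor_t matmul;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_MATMUL_DESCRIPTOR, &matmul);
    cudnnDataType_t compType = CUDNN_DATA_FLOAT;   // example choice
    cudnnBackendSetAttribute(matmul, CUDNN_ATTR_MATMUL_COMP_TYPE,
                             CUDNN_TYPE_DATA_TYPE, 1, &compType);
    cudnnBackendFinalize(matmul);

    // Operation node wiring A, B, and C together.
    cudnnBackendDescriptor_t op;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR, &op);
    cudnnBackendSetAttribute(op, CUDNN_ATTR_OPERATION_MATMUL_DESC,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &matmul);
    cudnnBackendSetAttribute(op, CUDNN_ATTR_OPERATION_MATMUL_ADESC,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &aDesc);
    cudnnBackendSetAttribute(op, CUDNN_ATTR_OPERATION_MATMUL_BDESC,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &bDesc);
    cudnnBackendSetAttribute(op, CUDNN_ATTR_OPERATION_MATMUL_CDESC,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &cDesc);
    cudnnBackendFinalize(op);
    return op;
}
```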

Convolutions

There are three operation nodes that represent different types of convolutions namely:

ConvolutionFwd

This operation represents forward convolution, that is, computing the response tensor of image tensor convoluted with filter tensor. For complete details on the interface, as well as general constraints, refer to the CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR section.

ConvolutionBwdFilter

This operation represents convolution backward filters, that is, computing filter gradients from a response and an image tensor. For complete details on the interface, as well as general constraints, refer to the CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_FILTER_DESCRIPTOR section.

ConvolutionBwdData

This operation represents convolution backward data, that is, computing input data gradients from a response and a filter tensor. For complete details on the interface, as well as general constraints, refer to the CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_DATA_DESCRIPTOR section.

Tensor Attributes for all Three Operations

| Operation | Input Tensor Attribute Names | Output Tensor Attribute Name |
| --- | --- | --- |
| ConvolutionFwd | CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_X, CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_W | CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_Y |
| ConvolutionBwdFilter | CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_X, CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DY | CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DW |
| ConvolutionBwdData | CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_W, CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DY | CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DX |

The FP8 data type, supported since the NVIDIA Ada Lovelace architecture, has two variants: CUDNN_DATA_FP8_E4M3 and CUDNN_DATA_FP8_E5M2 as I/O data types. Using them as inputs to the operation will result in FP8 Tensor Cores being used. The precision of the accumulation inside the FP8 Tensor Cores is controlled by the compute type, which may have one of two possible values: CUDNN_DATA_FLOAT and CUDNN_DATA_FAST_FLOAT_FOR_FP8.

CUDNN_DATA_FAST_FLOAT_FOR_FP8 is faster and sufficiently accurate for inference or the forward pass of training. However, for FP8 training backward pass computations (that is, computing weight and activation gradients), we recommend choosing the more accurate CUDNN_DATA_FLOAT compute type to preserve a higher level of accuracy which may be necessary for some models.

Recommended Compute Type for FP8 Tensor Computations for Ada Lovelace and Hopper Architecture

| Operation | Recommended I/O Type | Recommended Compute Type |
| --- | --- | --- |
| ConvolutionFwd | CUDNN_DATA_FP8_E4M3 | CUDNN_DATA_FAST_FLOAT_FOR_FP8, CUDNN_DATA_FLOAT |
| ConvolutionBwdData | CUDNN_DATA_FP8_E4M3 | CUDNN_DATA_FLOAT |
| BatchNorm | CUDNN_DATA_FP8_E4M3 | CUDNN_DATA_FLOAT |
| Pooling | CUDNN_DATA_FP8_E4M3, CUDNN_DATA_FP8_E5M2 | CUDNN_DATA_FLOAT |
| Pointwise | CUDNN_DATA_FP8_E4M3, CUDNN_DATA_FP8_E5M2 | CUDNN_DATA_FLOAT |

Pointwise

Represents a pointwise operation that implements the equation Y = op (alpha1 * X) or Y = op (alpha1 * X, alpha2 * B). Refer to the CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR and CUDNN_BACKEND_POINTWISE_DESCRIPTOR sections for more information and general constraints.
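For example, a ReLU node of the form Y = relu(X) might be built as follows; the helper name is illustrative, the tensor descriptors are assumed to be finalized elsewhere, and error handling is omitted:

```cpp
// Sketch: build a ReLU Pointwise operation node, as used in the fusion graphs above.
#include <cudnn.h>

cudnnBackendDescriptor_t makeReluOp(cudnnBackendDescriptor_t xDesc,
                                    cudnnBackendDescriptor_t yDesc) {
    // Pointwise descriptor: selects the operator and its math precision.
    cudnnBackendDescriptor_t pw;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_POINTWISE_DESCRIPTOR, &pw);
    cudnnPointwiseMode_t mode = CUDNN_POINTWISE_RELU_FWD;
    cudnnDataType_t prec = CUDNN_DATA_FLOAT;   // see the constraint tables below
    cudnnBackendSetAttribute(pw, CUDNN_ATTR_POINTWISE_MODE,
                             CUDNN_TYPE_POINTWISE_MODE, 1, &mode);
    cudnnBackendSetAttribute(pw, CUDNN_ATTR_POINTWISE_MATH_PREC,
                             CUDNN_TYPE_DATA_TYPE, 1, &prec);
    cudnnBackendFinalize(pw);

    // Operation node binding the operator to its input and output tensors.
    cudnnBackendDescriptor_t op;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR, &op);
    cudnnBackendSetAttribute(op, CUDNN_ATTR_OPERATION_POINTWISE_PW_DESCRIPTOR,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &pw);
    cudnnBackendSetAttribute(op, CUDNN_ATTR_OPERATION_POINTWISE_XDESC,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &xDesc);
    cudnnBackendSetAttribute(op, CUDNN_ATTR_OPERATION_POINTWISE_YDESC,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &yDesc);
    cudnnBackendFinalize(op);
    return op;
}
```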

The following tables list the constraints for Pointwise operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when these operations are used in the runtime fusion engines.

Constraints for Pointwise Operations for Support Surface 90 and 80

  • Tensor data type for CUDNN_ATTR_OPERATION_POINTWISE_XDESC, CUDNN_ATTR_OPERATION_POINTWISE_YDESC and, if applicable, CUDNN_ATTR_OPERATION_POINTWISE_BDESC and CUDNN_ATTR_OPERATION_POINTWISE_TDESC: for all operators, all data types are supported.

  • CUDNN_ATTR_POINTWISE_MATH_PREC:

    • For any of the logical operators (CUDNN_POINTWISE_LOGICAL_AND, CUDNN_POINTWISE_LOGICAL_OR, and CUDNN_POINTWISE_LOGICAL_NOT), math precision needs to be CUDNN_DATA_INT32.

    • For any of the following operations (CUDNN_POINTWISE_ADD, CUDNN_POINTWISE_ADD_SQUARE, CUDNN_POINTWISE_DIV, CUDNN_POINTWISE_MAX, CUDNN_POINTWISE_MIN, CUDNN_POINTWISE_MOD, CUDNN_POINTWISE_ABS, CUDNN_POINTWISE_CEIL, CUDNN_POINTWISE_FLOOR, CUDNN_POINTWISE_MUL, CUDNN_POINTWISE_SUB, CUDNN_POINTWISE_NEG, CUDNN_POINTWISE_CMP_EQ, CUDNN_POINTWISE_CMP_NEQ, CUDNN_POINTWISE_CMP_GT, CUDNN_POINTWISE_CMP_GE, CUDNN_POINTWISE_CMP_LT, CUDNN_POINTWISE_CMP_LE, CUDNN_POINTWISE_GEN_INDEX, CUDNN_POINTWISE_BINARY_SELECT), math precision can be either CUDNN_DATA_FLOAT or CUDNN_DATA_INT32.

    • For any of the CUDNN_POINTWISE_IDENTITY operations, the math precision can be any data type. However, if the math precision is other than CUDNN_DATA_INT32 or CUDNN_DATA_FLOAT, the input data type, the output data type, and the math precision must be the same.

    • For all other operators, only CUDNN_DATA_FLOAT math precision is supported.

  • CUDNN_ATTR_OPERATION_POINTWISE_ALPHA1: 1.0f

  • CUDNN_ATTR_OPERATION_POINTWISE_ALPHA2: 1.0f

Constraints for Pointwise Operations for Support Surface 70

  • Tensor data type for CUDNN_ATTR_OPERATION_POINTWISE_XDESC, CUDNN_ATTR_OPERATION_POINTWISE_YDESC and, if applicable, CUDNN_ATTR_OPERATION_POINTWISE_BDESC:

    • For any of the logical operators (CUDNN_POINTWISE_LOGICAL_AND, CUDNN_POINTWISE_LOGICAL_OR, and CUDNN_POINTWISE_LOGICAL_NOT), the data type can be any of CUDNN_DATA_INT32, CUDNN_DATA_INT8, or CUDNN_DATA_BOOLEAN.

    • For all other operators, all data types are supported.

  • CUDNN_ATTR_POINTWISE_MATH_PREC:

    • For any of the logical operators (CUDNN_POINTWISE_LOGICAL_AND, CUDNN_POINTWISE_LOGICAL_OR, and CUDNN_POINTWISE_LOGICAL_NOT), math precision needs to be CUDNN_DATA_BOOLEAN.

    • For all other operators, only CUDNN_DATA_FLOAT is supported.

  • CUDNN_ATTR_OPERATION_POINTWISE_ALPHA1: 1.0f

  • CUDNN_ATTR_OPERATION_POINTWISE_ALPHA2: 1.0f

GenStats

Represents an operation that generates per-channel statistics. Refer to the CUDNN_BACKEND_OPERATION_GEN_STATS_DESCRIPTOR section for more information and general constraints.

The following table lists the constraints for GenStats operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when GenStats operations are used in the runtime fusion engines.

Constraints for GenStats Operations

  • Tensor data type for CUDNN_ATTR_OPERATION_GENSTATS_XDESC:

    • Prior to the NVIDIA Ampere architecture GPU: CUDNN_DATA_HALF

    • On NVIDIA Ampere architecture and later: CUDNN_DATA_HALF and CUDNN_DATA_FLOAT

  • Tensor shape for CUDNN_ATTR_OPERATION_GENSTATS_SUMDESC and CUDNN_ATTR_OPERATION_GENSTATS_SQSUMDESC: both should be of shape [1, C, 1, 1] for 2D convolutions or [1, C, 1, 1, 1] for 3D convolutions.

  • Tensor data type for CUDNN_ATTR_OPERATION_GENSTATS_SUMDESC and CUDNN_ATTR_OPERATION_GENSTATS_SQSUMDESC: CUDNN_DATA_FLOAT

  • CUDNN_ATTR_OPERATION_GENSTATS_MATH_PREC: CUDNN_DATA_FLOAT

  • Tensor layout for CUDNN_ATTR_OPERATION_GENSTATS_XDESC, CUDNN_ATTR_OPERATION_GENSTATS_SUMDESC, and CUDNN_ATTR_OPERATION_GENSTATS_SQSUMDESC: NHWC fully packed

Reduction

This operation represents reducing values of a tensor in one or more dimensions. Refer to the CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR section for more information and general constraints.

The following table lists constraints for Reduction forward operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when Reduction operations are used in the runtime fusion engines.

Constraints for Reduction Operations

  • Tensor data type for CUDNN_ATTR_OPERATION_REDUCTION_YDESC: CUDNN_DATA_FLOAT

  • CUDNN_ATTR_REDUCTION_COMP_TYPE: CUDNN_DATA_FLOAT

  • Tensor layout for CUDNN_ATTR_OPERATION_REDUCTION_XDESC and CUDNN_ATTR_OPERATION_REDUCTION_YDESC: NHWC/NDHWC/BMN fully packed

  • CUDNN_ATTR_REDUCTION_OPERATOR: CUDNN_REDUCE_TENSOR_ADD, CUDNN_REDUCE_TENSOR_MIN, and CUDNN_REDUCE_TENSOR_MAX

ResampleFwd

This operation represents resampling of the spatial dimensions of an image to a desired value. Resampling is supported in both directions, upsampling and downsampling. Downsampling represents the standard operation of pooling, commonly used in convolutional neural networks. Refer to the CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR section for more information and general constraints.

The following are constraints for ResampleFwd operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when ResampleFwd operations are used in the runtime fusion engines.

We allow a choice amongst four modes for resample. All modes have the following common support specifications:

  • Supported layout: NHWC or NDHWC, NCHW or NCDHW

  • Spatial dimensions supported: 2 or 3

  • Input dimensions supported: 4 or 5

  • Packed boolean data type is not supported.

  • If specified, the index tensor dimension should be equal to the response tensor dimension.

When the tensor format is NCHW/NCDHW, the following additional restrictions apply:

  • Upsampling is not supported.

  • Int64_t indices are not supported.

  • Only supports symmetric padding using the prepadding backend API.

There are also some mode-specific restrictions. The following tables list the values that are allowed for particular parameters. For the parameters not listed, any mathematically correct value is allowed.

The following downsampling modes are supported:

  • CUDNN_RESAMPLE_AVGPOOL_INCLUDE_PADDING

  • CUDNN_RESAMPLE_AVGPOOL_EXCLUDE_PADDING

  • CUDNN_RESAMPLE_MAXPOOL

Specific Restrictions for the Downsampling Modes

| Attribute | Average Pooling | Max Pooling |
| --- | --- | --- |
| CUDNN_ATTR_RESAMPLE_PADDING_MODE | CUDNN_ZERO_PAD | CUDNN_NEG_INF_PAD |
| CUDNN_ATTR_OPERATION_RESAMPLE_FWD_ALPHA | 1.0 | 1.0 |
| CUDNN_ATTR_OPERATION_RESAMPLE_FWD_BETA | 0.0 | 0.0 |
| CUDNN_ATTR_RESAMPLE_COMP_TYPE | CUDNN_DATA_FLOAT | CUDNN_DATA_FLOAT |

For the upsampling modes, CUDNN_RESAMPLE_NEAREST is not supported for any combination of parameters. CUDNN_RESAMPLE_BILINEAR has the following support specifications.

Specific Restrictions for Upsampling Mode CUDNN_RESAMPLE_BILINEAR

| Attribute | Bilinear |
| --- | --- |
| Input dimensions | Equal to 0.5 x output dimensions |
| CUDNN_ATTR_RESAMPLE_PRE_PADDINGS | 0.5 |
| CUDNN_ATTR_RESAMPLE_POST_PADDINGS | 1 |
| CUDNN_ATTR_RESAMPLE_STRIDES | 0.5 |
| CUDNN_ATTR_RESAMPLE_WINDOW_DIMS | 2 |
| Data type for CUDNN_ATTR_OPERATION_RESAMPLE_FWD_XDESC and CUDNN_ATTR_OPERATION_RESAMPLE_FWD_YDESC | CUDNN_DATA_FLOAT |
| CUDNN_ATTR_RESAMPLE_COMP_TYPE | CUDNN_DATA_FLOAT |
| CUDNN_ATTR_OPERATION_RESAMPLE_FWD_ALPHA | 1.0 |
| CUDNN_ATTR_OPERATION_RESAMPLE_FWD_BETA | 0.0 |
| CUDNN_ATTR_RESAMPLE_PADDING_MODE | CUDNN_EDGE_VAL_PAD |

Resampling Index Tensor Dump for Training

For max-pooling resampling mode, an index tensor can be provided to be used as a mask for backpropagation.

Values in the index tensors are:

  • Zero-indexed row-major position of maximum value of input tensor in the resampling window.

  • In case of multiple input pixels with maximum value, the first index in a left-to-right top-to-bottom scan is selected.

An example of index element selection:

Values in the Index Tensors

Select an appropriate element size for the index tensor. As a reference, any element size such that the maximum zero-indexed window position fits should be sufficient.
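As a small illustration of that sizing rule, the following helper (not part of cuDNN, purely illustrative) picks the smallest element size that can hold the largest zero-indexed position in the pooling window:

```cpp
// Hypothetical helper for choosing the index tensor element size.
#include <cstdint>
#include <cstddef>

inline size_t indexElementBytes(int64_t windowH, int64_t windowW) {
    const int64_t maxPos = windowH * windowW - 1;   // largest zero-indexed row-major position
    if (maxPos <= UINT8_MAX)  return 1;
    if (maxPos <= UINT16_MAX) return 2;
    return 4;
}
// Example: a 3x3 max-pooling window has positions 0..8, so a 1-byte index element suffices.
```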

ResampleBwd

This operation represents backward resampling of the spatial dimensions of an output response to a desired value. Resampling is supported in both directions, upsampling and downsampling. Backwards downsampling represents the standard operation of backward pooling, commonly used in convolutional neural networks. Refer to the CUDNN_BACKEND_OPERATION_RESAMPLE_BWD_DESCRIPTOR section for more information and general constraints.

The following are constraints for Resample backward operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when Resample backward operations are used in the runtime fusion engines.

We allow a choice amongst four modes for resample. All modes have the following common support specifications:

  • Supported layout: NHWC or NDHWC, NCHW or NCDHW

  • Spatial dimensions supported: 2 or 3

  • Input dimensions supported: 4 or 5

For layout NHWC or NDHWC:

  • The index tensor should be provided for only max pooling mode, and should adhere to the format described in the Resampling Index Tensor Dump for Training section.

  • The index tensor dimensions should be equal to the input gradient tensor dimensions.

For layout NCHW or NCDHW:

  • X, Y, and DY are required when max pooling mode is used.

  • Int64_t indices are not supported.

There are also some mode-specific restrictions. The following lists give the values allowed for particular parameters. For parameters not listed, any mathematically correct value is allowed.

The following backward downsampling modes are supported:

  • CUDNN_RESAMPLE_AVGPOOL_INCLUDE_PADDING

  • CUDNN_RESAMPLE_AVGPOOL_EXCLUDE_PADDING

  • CUDNN_RESAMPLE_MAXPOOL

Specific Restrictions for the Backwards Downsampling Modes

  • CUDNN_ATTR_RESAMPLE_PADDING_MODE: CUDNN_ZERO_PAD for average pooling, CUDNN_NEG_INF_PAD for max pooling

  • CUDNN_ATTR_OPERATION_RESAMPLE_BWD_ALPHA: 1.0 for both modes

  • CUDNN_ATTR_OPERATION_RESAMPLE_BWD_BETA: 0.0 for both modes

  • CUDNN_ATTR_RESAMPLE_COMP_TYPE: CUDNN_DATA_FLOAT for both modes

Backward upsampling modes are currently not supported.

Examples of Supported Patterns

The following sections provide examples of supported patterns, in order of increasing complexity. We employ the same color scheme as in the overall pattern to aid in identifying the structure of g 1 (blue) and g 2 (purple).

For illustration purposes, we abbreviated the operations used. For a full mapping to the actual backend descriptors, refer to the Mapping with Backend Descriptors.

Single Operation

The following example illustrates a convolution operation without any operations before or after it. This means g 1 and g 2 are empty graphs.

Runtime Fusion Engines with a Single Operation
Pointwise Operations After Convolution 1

In this example, g 2 consists of a sequential set of two Pointwise operations after the convolution.

ConvolutionFwd Followed by a DAG with Two Operations
Pointwise Operations After Convolution 2

Similar to the previous example, g 2 consists of a sequential set of multiple Pointwise operations.

ConvolutionFwd Followed by a DAG with Three Operations
Pointwise Operations Before Matrix Multiplication

Pointwise operations can also precede a convolution or matrix multiplication, that is, g 1 is composed of Pointwise operations.

Matmul Preceded by a DAG with Two Operations
Convolution Producer Node in Middle of DAG

The following pattern shows g 1 as a DAG of Pointwise operations feeding into a convolution. In addition, g 2 is a DAG consisting of two Pointwise operations. Note that the output of the convolution is consumed in the middle of g 2, not by its first node. This is a valid pattern.

Fusion of Operations Before and After the ``ConvolutionFwd`` Operation. The Output of ``ConvolutionFwd`` can Feed Anywhere in g 2
Mixed Input Precision Matmul and Convolution

Mixed input precision for matmuls and convolutions is implemented as a special case of mainloop fusion. Inputs may have different data types; a Pointwise:Identity operation converts them to the desired data types that serve as inputs to the matmul or convolution operation. The following pattern shows g 1 as a DAG with a Pointwise:Identity operation that converts the data type of tensor A before it feeds the matmul operation. This is a valid pattern.

Mixed Precision matmul

Specialized Runtime Fusion Engines

The specialized runtime fusion engines target and optimize specialized graph patterns that commonly occur in popular deep learning models. These engines offer limited flexibility regarding supported fusion patterns, supported data types, and supported tensor layouts. Long term, these patterns are expected to be more generic.

The following sections highlight the supported patterns.

BnAddRelu

In ResNet-like vision models, batch normalization followed by ReLU activation is a commonly occurring pattern. The BNAddRelu fusion pattern, supported using a runtime compiled engine, aims to optimize this recurring operation graph. It also supports single node multi-GPU batch normalization for speeding up batch norm computation in multi-GPU systems. The pattern is intended for use in the forward pass during the training phase. The full pattern BNAddRelu with the add node is used in cases where there are skip connections in the model.

The pattern is illustrated in the following diagram and its options and limitations include:

  • The pointwise nodes: Add, ReLU, and GT (greater than) are optional.

  • All tensors should be in NHWC packed layout format.

  • Both 4D and 5D tensors are supported.

  • Only ReLU activation is supported.

  • The attribute CUDNN_ATTR_OPERATION_NORM_FWD_MODE for the norm forward operation must be set to CUDNN_BATCH_NORM.

  • The attribute CUDNN_ATTR_OPERATION_NORM_FWD_PHASE for the norm forward operation must be set to CUDNN_NORM_FWD_TRAINING.

  • The batch norm input tensors: Scale, Bias, Input_running_mean, and Input_running_var must be of float data type.

  • The batch norm output tensors: output_running_mean, output_running_var, mean, and InvVariance must be of float data type.

  • The batch norm input tensor BN_x, residual input Z and output tensor Y can be any of {FP32, FP16, BF16} data types. For FP16 and BF16 data types, the channel count C for the tensors must be a multiple of 8 while for float data type the channel count must be a multiple of 4.

  • These patterns are supported on devices with compute capability >= 8.0.

BnAddRelu cuDNN Operation Graph

In case of single node multi-GPU batch norm, each GPU computes the local statistics based on its input data and writes out the local statistics to the peerTensors. Each peerTensor resides on a separate GPU on the node and is used for reading and writing local statistics from the peer GPUs. This is followed by a global stats computation phase where each GPU aggregates the statistics from the peers and computes the global mean and variance for the batch norm output computation on its local data. Apart from the options and limitations listed above, the following additional restrictions apply for using multi-GPU batch norm:

  • The attribute CUDNN_ATTR_OPERATION_NORM_FWD_PEER_STAT_DESCS of the NormForward operation must be set.

  • The size of the peerTensors vector should be equal to the number of GPUs in the node participating in the batch norm computation.

  • The maximum size of the peerTensors vector is 32.

  • Each GPU should operate on the same size of input data [N,C,H,W].

  • The size of each peerTensor in the peerTensors vector should be equal to num_gpu * 4 * C where C is the channel count of the BN_x tensor and num_gpu is the number of GPUs in the node participating in the batch norm computation.

  • All the elements of each tensor in the peerTensors vector should be memset to 0 before passing that tensor in the variant pack, as shown in the sketch following this list.
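As a minimal sketch of the sizing and initialization rules above, assuming the peer statistics are stored as float elements (consistent with the float statistics tensors listed earlier), one peerTensor could be allocated and zeroed as follows; the helper name is illustrative and error handling is omitted.

```cpp
// Minimal sketch: allocate and zero one peerTensor of num_gpu * 4 * C elements.
// Assumes float elements; error handling omitted for brevity.
#include <cuda_runtime.h>
#include <cstdint>

float* allocPeerTensor(int64_t numGpus, int64_t channels) {
    const size_t bytes = static_cast<size_t>(numGpus) * 4 * channels * sizeof(float);
    float* devPtr = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&devPtr), bytes);
    cudaMemset(devPtr, 0, bytes);  // all elements must be zero before use in the variant pack
    return devPtr;
}
```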

Single Node Multi-GPU Batch Norm

DReluForkDBn

Similar to the BnAddRelu pattern, the DReluForkDBn pattern also targets ResNet-like vision networks. It is intended to be used in backpropagation during the training phase. The DReluForkDBn pattern is supported through a runtime compiled engine that usually complements the BnAddRelu pattern. It also supports single node multi-GPU batch normalization for speeding up batch norm backward computation in multi-GPU systems.

The pattern is illustrated in the following diagram and its options and limitations include:

  • The pointwise node dRelu is optional.

  • The intermediate tensor dZ can be virtual or non-virtual.

  • All tensors should be in NHWC packed layout format.

  • Both 4D and 5D tensors are supported.

  • Only dRelu activation is supported.

  • Bitmask tensor input is needed for the dRelu node.

  • The attribute CUDNN_ATTR_OPERATION_NORM_BWD_MODE for the norm backward operation must be set to CUDNN_BATCH_NORM.

  • The batch norm backward input tensors: Scale, Mean, InvVariance and the output tensors dScale and dBias must be of float data type.

  • dRelu input tensor dY, batch norm backward input BN_x, bias gradient dZ, and output tensor dX can be any of {FP32, FP16, BF16} data types. For FP16 and BF16 data types, the channel count C for the tensors must be a multiple of 8 while for float data type the channel count must be a multiple of 4.

  • These patterns are supported on devices with compute capability >= 8.0.

DReluForkDBn cuDNN Operation Graph

The single node multi-GPU version of this pattern is typically used for dScale and dBias gradient aggregation across GPUs. For using the multi-GPU version, the attribute CUDNN_ATTR_OPERATION_NORM_BWD_PEER_STAT_DESCS of the NormBackward operation must be set. Other restrictions for the peerTensors vector listed in the previous section apply for this pattern as well.

Fused Attention fprop

Mha-Fprop fusions \(O=matmul\left( S=g_{4}\left( P=matmul\left( Q, g_{3}\left( K \right) \right) \right), V \right)\) have been added to the runtime fusion engine to serve patterns that are commonly used in attention. These patterns can be used in BERT, T5, and so on.

There are two key differences to the flash fused attention patterns described in later sections:

  1. Supported input sizes are limited to small sequence lengths (<= 512).

  2. The operation graph is flexible to switch between different types of masks, different operations between the two matrix multiplications, and so on.

Mha-fprop cuDNN Operation Graph

g 3 can be an empty graph or a single scale operation with the scale being a scalar value (CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_MUL).

g 4 can be empty or the combination of the following DAGs of cuDNN operations. Each of these DAGs is optional, as shown by the dotted line.

DAGs of cuDNN operations

The combination has to obey the order in which we present them. For example, if you want to use the padding mask and softmax, the padding mask has to appear before softmax.

These operations are commonly used in attention. In the following diagram, we depict how to create a DAG for each of the operations. In later versions, we will be expanding the possible DAGs for g 3 and g 4.

Padding Mask

cuDNN graph depicting DAG:Padding Mask

Causal Mask

cuDNN graph depicting DAG:Causal Mask

Softmax

cuDNN graph depicting DAG:Softmax

Dropout

cuDNN graph depicting DAG:Dropout

g 4 is capable of storing an intermediate tensor to global memory marked as S, which can be used for fused multi-head attention bprop. Both DAG:Softmax and DAG:Dropout have this capability. Set S as the output from the last DAG in the graph.

The tensor descriptor marked as S must have the CUDNN_ATTR_TENSOR_REORDERING_MODE set to CUDNN_TENSOR_REORDERING_F16x16. This is because the tensor is stored in a special format and can only be consumed by fused attention bprop.
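For example, when creating the S tensor through the backend API, the reordering mode can be set on the tensor descriptor before it is finalized. The following is a minimal sketch; the dimensions, strides, unique id, and alignment are placeholders, and status checking is omitted.

```cpp
// Minimal sketch: create a 4-D tensor descriptor and mark it with the F16x16
// reordering mode through the backend API.  Dimensions, strides, unique id,
// and alignment are placeholders; error checking omitted.
#include <cudnn.h>
#include <cstdint>

cudnnBackendDescriptor_t makeSTensor(int64_t b, int64_t h, int64_t sQ, int64_t sKV) {
    cudnnBackendDescriptor_t sDesc;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_TENSOR_DESCRIPTOR, &sDesc);

    int64_t dims[4]    = {b, h, sQ, sKV};
    int64_t strides[4] = {h * sQ * sKV, sQ * sKV, sKV, 1};  // packed, last dim stride 1
    int64_t uid        = 101;                               // placeholder unique id
    int64_t alignment  = 16;
    cudnnDataType_t dtype = CUDNN_DATA_HALF;
    cudnnBackendTensorReordering_t reorder = CUDNN_TENSOR_REORDERING_F16x16;

    cudnnBackendSetAttribute(sDesc, CUDNN_ATTR_TENSOR_DATA_TYPE,
                             CUDNN_TYPE_DATA_TYPE, 1, &dtype);
    cudnnBackendSetAttribute(sDesc, CUDNN_ATTR_TENSOR_DIMENSIONS,
                             CUDNN_TYPE_INT64, 4, dims);
    cudnnBackendSetAttribute(sDesc, CUDNN_ATTR_TENSOR_STRIDES,
                             CUDNN_TYPE_INT64, 4, strides);
    cudnnBackendSetAttribute(sDesc, CUDNN_ATTR_TENSOR_UNIQUE_ID,
                             CUDNN_TYPE_INT64, 1, &uid);
    cudnnBackendSetAttribute(sDesc, CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT,
                             CUDNN_TYPE_INT64, 1, &alignment);
    cudnnBackendSetAttribute(sDesc, CUDNN_ATTR_TENSOR_REORDERING_MODE,
                             CUDNN_TYPE_TENSOR_REORDERING_MODE, 1, &reorder);
    cudnnBackendFinalize(sDesc);
    return sDesc;
}
```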

There is an additional option of generating the mask on the user end and passing it directly to the pointwise multiplier. The mask needs to be of I/O data type FP16/BF16 and S will store the mask in the sign bit to communicate to bprop.

Limitations Of Mha-fprop Fusions

  • Matmul:

    • Compute type for both matmul ops must be float.

    • Input tensors must have data type FP16 or BF16.

    • Output tensors must have data type FP16, BF16, or FP32 (TF32).

  • Pointwise operations in g 3 and g 4: compute type must be FP32 (TF32).

  • Reduction operations in g 3 and g 4: I/O types and compute types must be FP32 (TF32).

  • RNG operation in g 3 and g 4:

    • Data type of yTensor must be FP32 (TF32).

    • The CUDNN_TYPE_RNG_DISTRIBUTION must be CUDNN_RNG_DISTRIBUTION_BERNOULLI.

Layout requirements of Mha-fprop fusions include:

  • All I/O tensors must have 4 dimensions, with the first two denoting the batch dimensions. The usage of rank-4 tensors in matmul ops can be read from the Backend Descriptor Types documentation.

  • The contracting dimension (dimension K) for the first matmul must be 64.

  • The non-contracting dimension (dimensions M and N) for the first matmul must be less than or equal to 512. In inference mode, any sequence length is functional. For training, support exists only for multiples of 64.

  • The last dimension (corresponding to hidden dimensions) in Q, V, and O is expected to have stride 1.

  • For the K tensor, the stride is expected to be 1 for the 2nd last dimension.

  • The S tensor is expected to have CUDNN_ATTR_TENSOR_REORDERING_MODE set to CUDNN_TENSOR_REORDERING_F16x16.

Fused Attention bprop

Mha-Bprop fusions are executed in a fused pattern in a single kernel.

\(dV=matmul\left( g_{5}\left( S \right), dO \right)\)

\(dS=matmul\left( dO, V^{T} \right)\)

\(dQ=matmul\left( g_{6}\left( dS \right), K \right)\)

\(dK=matmul\left( Q, g_{7}\left( dS \right) \right)\)

cuDNN supports the corresponding backpropagation graph for fused attention. This can be used together with the fused attention fprop graph to perform training on models that have similar architectures to BERT and T5. This is not compatible with the flash fused attention bprop operation graph.

Mha-bprop cuDNN Operation Graph

g 5, g 6, and g 7 can only support a fixed DAG. We are working towards generalizing these graphs.

cuDNN Graph Depicting g 5

g 6 represents the backward pass of softmax and masking, to get dP.

cuDNN Graph Depicting g 6

There are options for the Mask DAG that you can opt into. You can use the padding/causal mask, a general mask passed as an input, or no masking at all.

cuDNN Graph Depicting Mask DAG

The dBias DAG is useful for calculating the bprop of the relative positional encoding; it is optional and available for you to opt into.

cuDNN Graph Depicting dBias DAG

g 7 is the transpose of dP, the output of g 6.

cuDNN Graph Depicting g 7
Limitations Of Mha-bprop Fusions

  • Matmul:

    • Compute type for both matmul ops must be float.

    • Input tensors must have data type FP16 or BF16.

    • Output tensors must have data type FP16, BF16, or FP32 (TF32).

  • Pointwise operations in g 5, g 6, and g 7: compute type must be FP32 (TF32).

  • Reduction operations in g 5, g 6, and g 7: I/O types and compute types must be FP32 (TF32).

Layout requirements of Mha-bprop fusions include:

  • All I/O tensors must have 4 dimensions, with the first two denoting the batch dimensions. The usage of rank-4 tensors in matmul ops can be read from the Backend Descriptor Types documentation.

  • The contracting dimension (dimension K) for the second matmul must be 64.

  • The contracting dimension (dimension K) for the first, second, and third matmul must be less than or equal to 512 and a multiple of 64.

  • The last dimension (corresponding to hidden dimensions) in Q, K, V, O, and dO is expected to have stride 1.

  • The S and dP tensors are expected to have CUDNN_ATTR_TENSOR_REORDERING_MODE set to CUDNN_TENSOR_REORDERING_F16x16.

Fused Flash Attention fprop

cuDNN supports flash fused attention to perform scaled dot product attention, commonly used in models like GPT, BERT, and so on. The general pattern supported by this engine is BMM-Softmax-BMM with many other optional features that you can opt into. You can choose to create the graph by yourself or use the custom sdpa node in the cuDNN frontend. Using the frontend node makes opting into the different options, like causal mask, dropout, and alibi masking, very easy.

Snapshot of support of Fused Flash Attention ``fprop``

The K-cache and V-cache inputs can be non-virtual tensors, or can optionally be composed of paged cache load operations.

Pre-softmax optional DAGs cover multiple options for the users to configure:

  • pointwise Multiply node for attention scale after the first matmul

  • pointwise Add node for the relative positional encoding to add a bias after the first matmul

  • Different masking options like causal masking, padding masking, sliding window attention, and alibi masking. Users can choose multiple masking schemes together or no masking.

  • pointwise Multiply node that accepts a full tensor that can be used as a custom mask generated by the user

  • pointwise nodes that represent activation functions like CUDNN_POINTWISE_TANH_FWD

Post-softmax optional DAGs cover multiple options for the users to configure:

  • pointwise Multiply node with a RNG node to signify dropout

  • pointwise Multiply node with a user generated tensor acting as the dropout mask

All these DAGs are optional. A user can enable them depending on the cuDNN API they are targeting. When using the sdpa node in the cuDNN frontend, they can set the provided API options to true, for example use_causal_mask(True), and the frontend will internally add the correct graph automatically. When using the graph API directly, users can add the corresponding graph of the operations they want into the cuDNN graph.
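As a sketch of the frontend route, the following uses the cudnn-frontend v1 C++ graph API to build a flash attention graph and opt into the causal mask. The shapes, strides, and scale are placeholders, and the setter names follow the cudnn-frontend project's documented API; verify them against the frontend version in use.

```cpp
// Sketch only: flash attention graph through the cudnn-frontend v1 C++ API,
// opting into the causal mask.  Placeholder shapes and scale.
#include <cudnn_frontend.h>
#include <cmath>
#include <cstdint>

namespace fe = cudnn_frontend;

void buildSdpaGraph(int64_t b, int64_t h, int64_t s, int64_t d) {
    fe::graph::Graph graph;
    graph.set_io_data_type(fe::DataType_t::HALF)
         .set_intermediate_data_type(fe::DataType_t::FLOAT)
         .set_compute_data_type(fe::DataType_t::FLOAT);

    auto make = [&](const char* name) {
        return graph.tensor(fe::graph::Tensor_attributes()
                                .set_name(name)
                                .set_dim({b, h, s, d})
                                .set_stride({h * s * d, s * d, d, 1}));  // packed, last dim stride 1
    };
    auto Q = make("Q"), K = make("K"), V = make("V");

    // Opt into options through SDPA attributes instead of hand-building the
    // masking/softmax/dropout DAGs.
    auto options = fe::graph::SDPA_attributes()
                       .set_name("flash_attention")
                       .set_is_inference(true)
                       .set_causal_mask(true)
                       .set_attn_scale(1.0f / std::sqrt(static_cast<float>(d)));

    auto [O, stats] = graph.sdpa(Q, K, V, options);
    (void)stats;           // null in inference mode
    O->set_output(true);
    // ... validate, build plans, and execute the graph as usual.
}
```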

Compound operations, for example, Causal Mask, Sliding Window Mask, and Softmax, can be represented using the following operation graphs in cuDNN.

Causal Mask

Flash ``fprop`` Causal Mask Operation Graph

Padding Mask

Flash ``fprop`` Padding Mask Operation Graph

Sliding Window Mask

Flash ``fprop`` Sliding Window Mask Operation Graph

Alibi Mask

Flash ``fprop`` Alibi Mask Operation Graph

Softmax

Flash fprop Softmax Operation Graph

Dropout

Flash fprop Dropout Operation Graph

Paged KV caches

Flash fprop Paged Attention Operation Graph
Limitations For The Input And The Output Non-Virtual Tensors

  • Q, K^T, and V tensor:

    • All tensors must be either FP16 or BF16 data type.

    • Contracting dimension for Q must be a multiple of 8 with a maximum value of 128 for Ampere GPUs and 256 for Hopper GPUs.

    • Non-contracting dimension for V must be a multiple of 8 with a maximum value of 128 for Ampere GPUs and 256 for Hopper GPUs.

    • Contracting dimension for Q and K^T needs to have stride 1 in the layout.

    • Non-contracting dimension for V needs to have stride 1 in the layout.

    • The second dimension in K^T corresponding to the number of heads can be a factor of the number of heads of Q.

    • The second dimension in V corresponding to the number of heads can be a factor of the number of heads of Q.

  • Tensors for paged attention (container_K, container_V, page_table_K, page_table_V):

    • Both containers have [b,h,block_size,d] dimensions.

    • block_size must be a power of two.

    • Both page_tables have [b,1,ceil(s_kv/block_size),1] dimensions, where s_kv is the maximum sequence size in the associated container.

    • Both page_tables must have the INT32 data type.

  • Softmax stats:

    • Data type must be FP32.

    • Data must be in row major format.

  • O tensor:

    • Data type must be either FP16 or BF16.

    • The stride for the last dimension corresponding to the hidden dim per head should be 1.

  • Seed and Offset: INT32 or INT64 scalar in host or GPU memory.

  • Scale to Pointwise: attention scale can be FP16/BF16/FP32.

Inference mode can be turned on by passing the Softmax stats as a virtual tensor and setting the RNG node probability to 0.0f. The pattern is supported for GPUs with NVIDIA Ampere architecture and newer.

cuDNN also supports Fused Flash Attention in the FP8 data type on NVIDIA Hopper GPUs. In addition to the standard fprop graph, there are additional dequantization scales, quantization scales, and absolute maximum (amax) calculations. The current FP8 support is a subset of the features supported with BF16. We are actively working on expanding support for the FP8 kernels.

Due to the limited numerical precision of FP8 data type, for practical use cases, you must scale values computed in FP32 format before storing them in FP8 format, and descale the values stored in FP8 format before performing computations on them. For more information, refer to the Transformer Engine FP8 primer.

Scaling and Descaling

In the context of FP8, scaling refers to multiplying each element of an FP32 tensor by a quantization factor.

The quantization factor is computed as: (Max representable value in the FP8 format) / (Max absolute value seen in the tensor).

For the E4M3 format, the quantization factor is 448.f / tensor_amax (rounded to the nearest lower power of two).

For the E5M2 format, the quantization factor is 57344.f / tensor_amax (rounded to the nearest lower power of two).

The dequantization factor is the reciprocal of the quantization factor.
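A minimal sketch of this factor computation (plain arithmetic, not a cuDNN API call) follows; the struct and function names are illustrative.

```cpp
// Illustrative helper: FP8 quantization/dequantization factors, computed as the
// maximum representable value divided by the tensor amax, rounded down to a
// power of two.
#include <cmath>

struct Fp8Scales {
    float quantize;    // multiply FP32 values by this before storing as FP8
    float dequantize;  // multiply results by this to recover true FP32 values
};

Fp8Scales computeScales(float tensorAmax, bool isE4M3) {
    const float maxRepresentable = isE4M3 ? 448.f : 57344.f;
    float q = maxRepresentable / tensorAmax;
    q = std::exp2(std::floor(std::log2(q)));  // nearest lower power of two
    return {q, 1.f / q};
}
```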

The purpose of scaling is to span the full range of the FP8 format when computing on FP8 values and storing FP8 values, thereby minimizing precision loss. True values in FP32 format are multiplied by the quantization factor before storing them as scaled values in FP8 format. Computations on scaled values in FP8 format are descaled by multiplying with the dequantization factor to convert them back to their true values in FP32 format.

Scaling and descaling are critical for convergence with the FP8 data type, hence cuDNN only supports graph patterns for FP8 fused attention with the scaling and descaling nodes present.

In the following diagram, red tensors indicate FP8 datatype tensors and black tensors are in FP32 datatype.

FP8 Flash fprop Operation Graph

Pre-softmax optional DAGs cover multiple options for you to configure:

  • pointwise Multiply node for attention scale after the first matmul

  • Masking options include causal masking and no masking

Post-softmax optional DAGs cover multiple options for you to configure:

  • Currently there is no support for dropout

Limitations For The Input And The Output Non-Virtual Tensors for FP8 Flash Attention

  • Q, K^T, and V tensor:

    • All tensors must be either E4M3 or E5M2 data type.

    • Contracting dimension for Q must be a multiple of 16 with a maximum value of 256.

    • Non-contracting dimension for V must be a multiple of 16 with a maximum value of 256.

    • Contracting dimension for Q and K^T needs to have stride 1 in the layout.

    • Non-contracting dimension for V needs to have stride 1 in the layout.

    • The second dimension in K^T corresponding to the number of heads can be a factor of the number of heads of Q.

    • The second dimension in V corresponding to the number of heads can be a factor of the number of heads of Q.

  • Softmax stats:

    • Data type must be FP32.

    • Data must be in row major format.

  • O tensor:

    • Data type must be either E4M3 or E5M2.

    • The stride for the last dimension corresponding to the hidden dim per head should be 1.

  • Scale to Pointwise: attention scale can be FP32.

  • Dequantization scales (DeScale Q, DeScale K, DeScale V, DeScale S) and quantization scales (ScaleS, ScaleO):

    • Data type must be FP32.

    • Scalar values with dimension [1,1,1,1] and stride [1,1,1,1].

    • Can reside on either the host or the GPU.

  • Amax values (Amax_O, Amax_S):

    • Data type must be FP32.

    • Scalar values with dimension [1,1,1,1] and stride [1,1,1,1].

    • Must be a GPU tensor.

We recommend using the cuDNN frontend scaled dot product attention nodes for cuDNN fused flash attention kernels. The following samples for the cuDNN frontend are available:

  • Attention Python samples

  • Attention C++ samples

For more information about cuDNN frontend scaled dot product attention, refer to Attention in the NVIDIA cudnn-frontend project documentation.

Fused Flash Attention bprop

cuDNN supports the corresponding backpropagation graph for fused flash attention. This can be used together with the fprop graph to perform training on Large Language Models (LLMs).

All the options mentioned in fprop are applicable in the bprop graph as well. The corresponding bprop frontend node contains the same options and can be configured to do the bprop. Users opting for the graph API again need to add the graphs of the operations they want. The graph shown below is for a standard attention layer in GPT with causal masking.

Note

The bprop support for activation functions has not been added to NVIDIA Ampere GPUs; it only exists on NVIDIA Hopper GPUs.

For Grouped Query Attention (GQA) and Multi Query Attention (MQA), you can configure an additional reduction node for dK and dV, which reduces the tensor from the full number of heads (Q heads) to the actual K and V heads.

For the input and output tensors, the limitations from the fprop graph are carried over. For the bprop specific tensors, the limitations are as follows:

Limitations For The bprop Specific Tensors

  • dQ, dK, and dV tensor:

    • All tensors must be either FP16 or BF16 data type.

    • The last dimension corresponding to the hidden dim per head must be a multiple of 8 with a maximum value of 128 for Ampere GPUs and 256 for Hopper GPUs.

    • The stride for the last dimension corresponding to the hidden dim per head should be 1.

  • Softmax sum:

    • Data type must be FP32.

    • Data must be in row major format.

  • dO tensor:

    • Data type must be either FP16 or BF16.

    • The last dimension corresponding to the hidden dim per head must be a multiple of 8 with a maximum value of 128 for Ampere GPUs and 256 for Hopper GPUs.

    • The stride for the last dimension corresponding to the hidden dim per head should be 1.

    • The layout of the tensor is required to be the same as the O tensor.

  • dqAccum tensor:

    • Data type must be FP32.

    • The tensor must be memset to zero before passing to cuDNN.

    • Data must be in row major format.

Flash bprop cuDNN Operation Graph

The pattern is supported for GPUs with NVIDIA Ampere architecture and newer.

cuDNN also supports Fused Flash Attention bprop in the native FP8 data type on NVIDIA Hopper GPUs. In addition to the standard bprop graph, there are additional dequantization scales, quantization scales, and absolute maximum (amax) calculations. The current FP8 Flash Attention bprop support corresponds to the FP8 Flash Attention fprop support.

In the following diagram, red tensors indicate FP8 datatype tensors and black tensors are in FP32 datatype.

FP8 Flash bprop Operation Graph
Limitations For The FP8 Fused Flash Attention bprop Specific Tensors

  • dQ, dK, and dV tensor:

    • All tensors must be either E4M3 or E5M2 data type.

    • The last dimension corresponding to the hidden dim per head must be 128.

    • The stride for the last dimension corresponding to the hidden dim per head should be 1.

  • dO tensor:

    • Data type must be either E4M3 or E5M2.

    • The last dimension corresponding to the hidden dim per head must be 128.

    • The stride for the last dimension corresponding to the hidden dim per head should be 1.

    • The layout of the tensor is required to be the same as the O tensor.

Specialized Pre-Compiled Engines

The pre-compiled specialized engines target and optimize for a specialized graph pattern with a ragged support surface. Because of this targeting, these graphs do not require runtime compilation.

In most cases, the specialized patterns are just special cases of the generic patterns used in the runtime fusion engines, but there are some cases where the specialized pattern does not fit any of the generic patterns. If your graph pattern matches a specialized pattern, you will get at least a pattern matching engine, and you might also get runtime fusion engines as another option.

Currently, the following patterns are supported by the pattern matching engines. Some nodes are optional. Optional nodes are indicated by dashed outlines.

ConvBNfprop

The ConvBNfprop pattern is illustrated in the following figure. Its restrictions and options include:

  • The three pointwise nodes scale, bias, and ReLU are optional.

  • X, Z, W, s 1, and b 1 must all be of FP16 data type.

  • Z needs to be of shape [N, C, H, W] with NHWC packed layout.

  • W needs to be of shape [K, C, R, S] with KRSC packed layout.

  • s 1 and b 1 need to be of shape [1, C, 1, 1] with NHWC packed layout.

  • Only ReLU activation is supported.

  • All of the intermediate tensors need to be virtual, except Y, which needs to be non-virtual.

  • I/O pointers should be 16 bytes aligned.

  • This pattern is only supported on devices with compute capability >= 8.0 (with the exception of NVIDIA Ada Lovelace architecture, 8.9).

  • On devices with compute capability >= 9.0, we only support two patterns:

    • the full pattern: scale + bias + ReLU + Conv + GenStats, and

    • the partial pattern: Conv + GenStats.

ConvBNfprop, A Pre-Compiled Engine, Fuses ConvolutionFwd and GenStats With Several Pointwise Operations

Skip connections are commonly observed in ResNet-like models. To support fusions in skip connections, we support a variant of the pattern above, the DBARCS pattern (short for Dual, Scale, Bias, Add, ReLU, Conv genStats). The limitations and options of the DBARCS pattern include:

  • The pointwise dual scale and dual bias nodes are either both present or not. This is indicated by the dashed block encircling the dual scale and dual bias nodes. In case both the nodes are missing, the dual_X tensor is directly fed as input to the add node.

  • The pointwise nodes scale, bias, add, and ReLU are required nodes.

  • Currently, only supported on Hopper GPUs.

  • All other data type, layout, and virtualness restrictions of the ConvBNfprop pattern apply to this pattern as well.

  • dual_X, dual_scale, and dual_bias must all be of FP16 data type.

  • dual_scale and dual_bias must be of shape [1,C,1,1] with NHWC packed layout.

  • Intermediate outputs of the ReLU and Conv nodes: Relu_Y and Y are non-virtual. All the other intermediate outputs are virtual.

  • The weight tensor W for the convolution needs to be of shape [K,C,1,1]. Only 1x1 filters with padding 0 are supported for the convolution in the DBARCS pattern.

DBARCS In The convBNfprop Series For Supporting Fusions Across Skip Connections

ConvBNwgrad

The ConvBNwgrad pattern is illustrated in the following figure. Its restrictions and options include:

  • The three pointwise operations are all optional, as indicated by the dashed outlines.

  • Only ReLU activation is supported.

  • X, s 1, b 1, and dy must all be of FP16 data type.

  • I/O pointers should be 16 bytes aligned.

  • X, s 1, b 1, and dy must all have NHWC packed layouts.

  • All the intermediate tensors need to be virtual.

  • This pattern is only supported on devices with compute capability >= 8.0 (with the exception of NVIDIA Ada Lovelace architecture, 8.9).

  • On devices with compute capability >= 9.0, support is restricted to:

    • the full pattern: scale + bias + ReLU + wgrad.

ConvBNwgrad, A Pre-Compiled Engine, Fuses ConvolutionBwdFilter With Several (Optional) Pointwise Operations

ConvBiasAct

The ConvBiasAct pattern is illustrated in the following figure. Its restrictions and options include:

  • \(\alpha_{1}\) and \(\alpha_{2}\) need to be scalars.

  • The activation node is optional.

  • The size of the bias tensor should be [1, K, 1, 1].

  • Internal conversions are not supported. That is, the virtual output between nodes needs to have the same data type as the node's compute type, which should be the same as the epilog type of the convolution node.

  • There are some restrictions on the supported combination of data types, which can be found in the API Reference (refer to cudnnConvolutionBiasActivationForward()).

ConvBiasAct, A Pre-Compiled Engine, Fuses ConvolutionFwd With Several Pointwise Operations

ConvScaleBiasAct

The ConvScaleBiasAct pattern is illustrated in the following figure. Its restrictions and options include:

  • \(\alpha_{1}\), \(\alpha_{2}\), and \(b_{1}\) should have the same data type/layout and can only be FP32.

  • X, W, and Z can only be INT8x4 or INT8x32.

  • The size of the bias tensor should be [1, K, 1, 1].

  • Internal conversions are not supported. That is, the virtual output between nodes needs to be the same as their compute type.

  • Currently, Pointwise:ReLU is the only optional pointwise node.

ConvScaleBiasAct, A Pre-Compiled Engine

This pattern is very similar to ConvBiasAct. The difference is that here, the scales \(\alpha_{1}\) and \(\alpha_{2}\) are tensors, not scalars. If they are scalars, this pattern becomes a normal ConvBiasAct.

DgradDreluBNBwdWeight

The DgradDreluBNBwdWeight pattern is illustrated in the following figure. Its restrictions and options include:

  • Dgrad input dy and W are of FP16 data type.

  • Among the batch norm fwd inputs, X_bn is of FP16 data type while the other tensors mean_bn, invstd_dev_bn, scale_bn, and bias_bn are FP32.

  • Outputs: dScale, dBias, A, B, C are of FP32 data type.

  • All pointers are 16 byte aligned.

  • This pattern is only supported on devices with compute capability >= 8.0 (with the exception of NVIDIA Ada Lovelace architecture, 8.9).

DgradDreluBNBwdWeight is a pre-compiled engine that can be used in conjunction with the dBNApply pattern to compute the backwards path of batch norm.

DgradDreluBNBwdWeight Pattern For Fusions In The Backward Pass

The BNBwdWeight operation takes in five inputs: X_bn, mean_bn, invstddev_bn, scale_bn, and dy_bn (that is, the output from the ReLUBwd node).

It produces five outputs: gradients of the batch norm scale and bias params, dScale, dBias, and coefficients A, B, C. Note that for illustration purposes, the inputs are duplicated. The inputs on the left and right are however exactly the same.

This pattern is typically used in the computation of the Batch Norm Backward Pass.

When computing the backward pass of batch norm, dScale, dBias, and dX_bn are needed. The DgradDreluBnBwdWeight pattern computes the former two. Using the generated A, B, and C coefficients, the following dBNApply pattern computes dX, the input gradient, as dx_bn = A*dy_bn + B*X_bn + C.
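The following sketch writes this computation out elementwise; it is purely illustrative, and A, B, and C are shown as scalar coefficients for brevity even though in the actual pattern they are tensors.

```cpp
// Illustrative only: the dBNApply computation dx_bn = A*dy_bn + B*X_bn + C,
// with A, B, and C shown as scalars for brevity.
#include <vector>
#include <cstddef>

void dBNApply(const std::vector<float>& dy_bn, const std::vector<float>& X_bn,
              float A, float B, float C, std::vector<float>& dx_bn) {
    dx_bn.resize(dy_bn.size());
    for (std::size_t i = 0; i < dy_bn.size(); ++i) {
        dx_bn[i] = A * dy_bn[i] + B * X_bn[i] + C;
    }
}
```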

dBNApply Pattern For Final Gradient Computation

The dBNApply pattern was initially supported by a pre-compiled static engine but is now supported by the generic runtime fusion engine.

Note that the DgradDreluBNBwdWeight pattern is used in combination with the forward pass pattern ConvBNfprop. Because of performance reasons, the output of batch norm Y_bn, which was calculated in ConvBNfprop (output of scale-bias), needs to be recalculated by DgradDreluBnBwdWeight. The pointwise add node subtracts mean_bn from X_bn, hence the alpha2 parameter for that node should be set to -1.
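As a sketch, the alpha values on that pointwise add operation descriptor can be set through the backend API as follows; only the alpha attributes are shown, and the rest of the operation setup is omitted.

```cpp
// Sketch: set alpha1 = 1 and alpha2 = -1 on a pointwise add operation
// descriptor so that it computes X_bn - mean_bn.  Only the alpha attributes
// are shown; error checking and the remaining setup are omitted.
#include <cudnn.h>

void setSubtractAlphas(cudnnBackendDescriptor_t pointwiseAddOp) {
    float alpha1 = 1.0f;   // scales X_bn
    float alpha2 = -1.0f;  // scales mean_bn, turning the add into a subtraction
    cudnnBackendSetAttribute(pointwiseAddOp, CUDNN_ATTR_OPERATION_POINTWISE_ALPHA1,
                             CUDNN_TYPE_FLOAT, 1, &alpha1);
    cudnnBackendSetAttribute(pointwiseAddOp, CUDNN_ATTR_OPERATION_POINTWISE_ALPHA2,
                             CUDNN_TYPE_FLOAT, 1, &alpha2);
}
```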

FP8 Fused Flash Attention

cuDNN supports fused flash attention with input and output data types in FP8 format through a pre-compiled engine, but with limited shape support and a maximum sequence length of 512. Our general guidance is to use the specialized Fused Flash Attention fprop and Fused Flash Attention bprop runtime fusion engines for FP8 datatype support.

Mapping with Backend Descriptors

For readability, the operations used in this section are abbreviated. The mapping with the actual backend descriptors can be found in this table:

Notations and Backend Descriptors

Each notation used in this section is listed below, followed by the backend descriptor (or data type) it maps to.

Pointwise:scale

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_MUL and with operand B broadcasting into operand X

Pointwise:bias

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_ADD and with operand B broadcasting into operand X

Pointwise:add

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_ADD and with operand B with same dimensions as X

Pointwise:mul

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_MUL and with operand B with same dimensions as X

Pointwise:ReLU

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_RELU_FWD

Pointwise:ReLUBwd

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_RELU_BWD

Pointwise:tanh

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_TANH_FWD

Pointwise:sigmoid

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_SIGMOID_FWD

Pointwise:ELU

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_ELU_FWD

Pointwise:{ReLU,tanh,sigmoid,ELU}

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with one of the following modes: CUDNN_POINTWISE_RELU_FWD, CUDNN_POINTWISE_TANH_FWD, CUDNN_POINTWISE_SIGMOID_FWD, CUDNN_POINTWISE_ELU_FWD

matmul

CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR

ConvolutionFwd

CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR

ConvolutionBwdFilter

CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_FILTER_DESCRIPTOR

ConvolutionBwdData

CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_DATA_DESCRIPTOR

GenStats

CUDNN_BACKEND_OPERATION_GEN_STATS_DESCRIPTOR

ResampleFwd

CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR

Reduction

CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR

BnBwdWeight

CUDNN_BACKEND_OPERATION_BN_BWD_WEIGHTS_DESCRIPTOR

NormForward

CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR

NormBackward

CUDNN_BACKEND_OPERATION_NORM_BACKWARD_DESCRIPTOR

BOOLEAN / packed-BOOLEAN

CUDNN_DATA_BOOLEAN: As described in the cuDNN API Reference, this type implies that eight boolean values are packed in a single byte, with the lowest index on the right (that is, least significant bit). packed-BOOLEAN and BOOLEAN are used interchangeably, where the former is used to emphasize and remind the user about the semantics.

INT8

CUDNN_DATA_INT8

FP8

CUDNN_DATA_FP8_E4M3 or CUDNN_DATA_FP8_E5M2

FP16

CUDNN_DATA_HALF

BF16

CUDNN_DATA_BFLOAT16

FP32

CUDNN_DATA_FLOAT

TF32

A tensor core operation mode used to accelerate floating point convolutions or matmuls. It can be used for an operation with compute type CUDNN_DATA_FLOAT on NVIDIA Ampere architecture or later, and can be disabled globally with the environment variable NVIDIA_TF32_OVERRIDE=0.