Graph API

The cuDNN library provides a declarative programming model for describing computation as a graph of operations. This graph API was introduced in cuDNN 8.0 to provide a more flexible API, especially with the growing importance of operation fusion.

The user starts by building a graph of operations. At a high level, the user is describing a dataflow graph of operations on tensors. Given a finalized graph, the user then selects and configures an engine that can execute that graph. There are several methods for selecting and configuring engines, which have tradeoffs with respect to ease-of-use, runtime overhead, and engine performance.

The graph API has two entry points:

  • The cuDNN backend API, the lowest-level entry point into the graph API.

  • The cuDNN frontend API, an open-source convenience layer built on top of the backend API.

We expect that most users prefer the cuDNN frontend API because:

  • It is less verbose without loss of control - all functionality accessible through the backend API is also accessible through the frontend API.

  • It adds functionality on top of the backend API, like errata filters and autotuning.

  • It is open source.

In either case (that is, the backend or frontend API), the high level concepts are the same.

Key Concepts

As mentioned previously, the key concepts in the graph API are:

  • Operations and operation graphs

  • Engines and engine configurations

  • Heuristics

These concepts are covered in the subsections below. Later we’ll go through an example to tie them all together.

Operations and Operation Graphs

An operation graph is a dataflow graph of operations on tensors. It is meant to be a mathematical specification and is decoupled from the underlying engines that can implement it, as there may be more than one engine available for a given graph.

I/O tensors connect the operations implicitly, for example, an operation A may produce a tensor X, which is then consumed by operation B, implying that operation B depends on operation A.

Engines and Engine Configurations

For a given operation graph, there are some number of engines that are candidates for implementing that graph. The typical way to query for a list of candidate engines is through a heuristics query, covered below.

An engine has knobs for configuring properties of the engine, like tile size (refer to cudnnBackendKnobType_t).

Heuristics

A heuristic is a way to get a list of engine configurations that are intended to be sorted from the most performant to least performant for the given operation graph. There are three modes:

  • CUDNN_HEUR_MODE_A - intended to be fast and be able to handle most operation graph patterns. It returns a list of engine configs ranked by the expected performance.

  • CUDNN_HEUR_MODE_B - intended to be more generally accurate than mode A, but with the tradeoff of higher CPU latency to return the list of engine configs. The underlying implementation may fall back to the mode A heuristic in cases where we know mode A can do better.

  • CUDNN_HEUR_MODE_FALLBACK - intended to be fast and provide functional fallbacks without expectation of optimal performance.

The recommended workflow is to query either mode A or B and check for support. The first engine config with support is expected to have the best performance.
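
A minimal sketch of this query through the backend API is shown below, assuming opGraph is an already finalized operation graph descriptor; error handling and descriptor cleanup are omitted for brevity.

```cpp
// Minimal sketch (cuDNN backend API): query heuristics for a finalized
// operation graph and return the ranked engine configs.
#include <cudnn.h>
#include <vector>

std::vector<cudnnBackendDescriptor_t> queryEngineConfigs(
        cudnnBackendDescriptor_t opGraph, int64_t maxConfigs) {
    cudnnBackendDescriptor_t heur;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR, &heur);
    cudnnBackendSetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &opGraph);
    cudnnBackendHeurMode_t mode = CUDNN_HEUR_MODE_A;  // or CUDNN_HEUR_MODE_B
    cudnnBackendSetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_MODE,
                             CUDNN_TYPE_HEUR_MODE, 1, &mode);
    cudnnBackendFinalize(heur);

    // The results come back ordered from most to least performant (expected).
    std::vector<cudnnBackendDescriptor_t> cfgs(maxConfigs);
    for (auto &cfg : cfgs)
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &cfg);
    int64_t returned = 0;
    cudnnBackendGetAttribute(heur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, maxConfigs,
                             &returned, cfgs.data());
    cfgs.resize(returned);

    cudnnBackendDestroyDescriptor(heur);
    return cfgs;
}
```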

You can “auto-tune”, that is, iterate over the list, time each engine config, and choose the best one for a particular problem on a particular device. The cuDNN frontend API provides a convenient function, cudnnFindPlan(), which does this.

If none of the engine configs returned by mode A or B is supported, use the fallback mode to find functional fallbacks.

Expert users may also want to filter engine configs based on properties of the engine, such as numerical notes, behavior notes, or adjustable knobs. Numerical notes inform the user about the numerical properties of the engine such as whether it does datatype down conversion at the input or during output reduction. The behavior notes can signal something about the underlying implementation like whether or not it uses runtime compilation. The adjustable knobs allow fine grained control of the engine’s behavior and performance.
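
As a rough sketch of such filtering, the following hypothetical helper uses the backend API to read the behavior notes of the engine behind an engine config and reports whether that engine relies on runtime compilation; numerical notes and knobs can be queried in the same way through CUDNN_ATTR_ENGINE_NUMERICAL_NOTE and CUDNN_ATTR_ENGINE_KNOB_INFO. Status checks are omitted.

```cpp
// Sketch: does the engine behind `engineCfg` use runtime compilation?
#include <cudnn.h>

bool usesRuntimeCompilation(cudnnBackendDescriptor_t engineCfg) {
    // Pull the engine descriptor out of the engine config.
    cudnnBackendDescriptor_t engine;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINE_DESCRIPTOR, &engine);
    int64_t count = 0;
    cudnnBackendGetAttribute(engineCfg, CUDNN_ATTR_ENGINECFG_ENGINE,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &count, &engine);

    // Read its behavior notes.
    cudnnBackendBehaviorNote_t notes[16];
    int64_t numNotes = 0;
    cudnnBackendGetAttribute(engine, CUDNN_ATTR_ENGINE_BEHAVIOR_NOTE,
                             CUDNN_TYPE_BEHAVIOR_NOTE, 16, &numNotes, notes);
    cudnnBackendDestroyDescriptor(engine);

    for (int64_t i = 0; i < numNotes; ++i)
        if (notes[i] == CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION) return true;
    return false;
}
```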

Graph API Example with Operation Fusion

The following example implements a fusion of convolution, bias, and activation.

Creating Operation and Tensor Descriptors to Specify the Graph Dataflow

First, create three cuDNN backend operation descriptors.

In the following figure, the user specified one forward convolution operation (using CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR), a pointwise operation for the bias addition (using CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_ADD), and a pointwise operation for the ReLU activation (using CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_RELU_FWD). Refer to the Backend Descriptor Types for more details on setting the attributes of these descriptors. For an example of how a forward convolution can be set up, refer to the Setting Up An Operation Graph For A Grouped Convolution use case.

You should also create tensor descriptors for the inputs and outputs of all of the operations in the graph. The graph dataflow is implied by the assignment of tensors, for example, by specifying the backend tensor Tmp0 as both the output of the convolution operation and the input of the bias operation, cuDNN infers that the dataflow runs from the convolution into the bias. The same applies to tensor Tmp1. If the user doesn’t need the intermediate results Tmp0 and Tmp1 for any other use, then the user can specify them to be virtual tensors, so the memory I/Os can later be optimized out.
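
A minimal sketch of creating one such backend tensor descriptor is shown below; Tmp0 and Tmp1 from this example would be created the same way with the virtual flag set. The UIDs are arbitrary values chosen by the user and only need to be unique within the graph. Status checks are omitted.

```cpp
// Sketch: create and finalize a backend tensor descriptor.
#include <cudnn.h>

cudnnBackendDescriptor_t makeTensor(int64_t uid, const int64_t dims[4],
                                    const int64_t strides[4],
                                    cudnnDataType_t dtype, bool isVirtual) {
    cudnnBackendDescriptor_t t;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_TENSOR_DESCRIPTOR, &t);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_UNIQUE_ID, CUDNN_TYPE_INT64, 1, &uid);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_DATA_TYPE, CUDNN_TYPE_DATA_TYPE, 1, &dtype);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_DIMENSIONS, CUDNN_TYPE_INT64, 4, dims);
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_STRIDES, CUDNN_TYPE_INT64, 4, strides);
    int64_t alignment = 16;  // byte alignment of the device pointer
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT, CUDNN_TYPE_INT64, 1, &alignment);
    bool virt = isVirtual;   // virtual tensors (for example Tmp0, Tmp1) never touch global memory
    cudnnBackendSetAttribute(t, CUDNN_ATTR_TENSOR_IS_VIRTUAL, CUDNN_TYPE_BOOLEAN, 1, &virt);
    cudnnBackendFinalize(t);
    return t;
}
```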

  • Graphs with more than one operation node do not support in-place operations (that is, where any of the input UIDs matches any of the output UIDs). Such in-place operations are considered cyclic in later graph analysis and deemed unsupported. In-place operations are supported for single-node graphs.

  • The operation descriptors can be created and passed into cuDNN in any order, as the tensor UIDs are enough to determine the dependencies in the graph.

A Set of Operation Descriptors the User Passes to the Operation Graph

Finalizing The Operation Graph

Second, the user finalizes the operation graph. As part of finalization, cuDNN performs the dataflow analysis to establish the dependency relationship between operations and connect the edges, as illustrated in the following figure. In this step, cuDNN performs various checks to confirm the validity of the graph.
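
A sketch of this step with the backend API, assuming convOp, biasOp, and actOp are the three finalized operation descriptors from the previous step (status checks omitted):

```cpp
// Sketch: collect the operations into an operation graph and finalize it.
#include <cudnn.h>

cudnnBackendDescriptor_t makeOpGraph(cudnnHandle_t handle,
                                     cudnnBackendDescriptor_t convOp,
                                     cudnnBackendDescriptor_t biasOp,
                                     cudnnBackendDescriptor_t actOp) {
    cudnnBackendDescriptor_t ops[] = {convOp, biasOp, actOp};  // any order is fine
    cudnnBackendDescriptor_t graph;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR, &graph);
    cudnnBackendSetAttribute(graph, CUDNN_ATTR_OPERATIONGRAPH_OPS,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 3, ops);
    cudnnBackendSetAttribute(graph, CUDNN_ATTR_OPERATIONGRAPH_HANDLE,
                             CUDNN_TYPE_HANDLE, 1, &handle);
    cudnnBackendFinalize(graph);  // dataflow analysis and validity checks happen here
    return graph;
}
```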

The Operation Graph after Finalization

Configuring An Engine That Can Execute The Operation Graph

Third, given the finalized operation graph, the user must select and configure an engine to execute that graph, which results in an execution plan. As mentioned in Heuristics, the typical way to do this is:

  1. Query heuristics mode A or B.

  2. Look for the first engine config with functional support (or auto-tune all the engine configs with functional support).

  3. If no engine config was found in (2), try querying the fallback heuristic for more options.
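
A sketch of step 2 with the backend API is shown below: attempt to finalize an execution plan from each engine config in heuristic order and keep the first one that succeeds; if none does, the caller can fall back to step 3.

```cpp
// Sketch: build an execution plan from the first functionally supported config.
#include <cudnn.h>
#include <vector>

cudnnBackendDescriptor_t buildFirstSupportedPlan(
        cudnnHandle_t handle, const std::vector<cudnnBackendDescriptor_t> &engineCfgs) {
    for (cudnnBackendDescriptor_t cfg : engineCfgs) {
        cudnnBackendDescriptor_t plan;
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR, &plan);
        cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
                                 CUDNN_TYPE_HANDLE, 1, &handle);
        cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
                                 CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &cfg);
        // Finalization fails (for example with CUDNN_STATUS_NOT_SUPPORTED) if this
        // config cannot execute the graph; fall through to the next candidate.
        if (cudnnBackendFinalize(plan) == CUDNN_STATUS_SUCCESS) return plan;
        cudnnBackendDestroyDescriptor(plan);
    }
    return nullptr;  // caller should retry with CUDNN_HEUR_MODE_FALLBACK
}
```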

Executing The Engine


Finally, with the execution plan constructed and when it comes time to run it, the user should construct the backend variant pack by providing the workspace pointer, an array of UIDs, and an array of device pointers. The UIDs and the pointers should be in the corresponding order. With the handle, the execution plan and variant pack, the execution API can be called and the computation is carried out on the GPU.
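
A sketch of that final step with the backend API (uids[i] must correspond to devPtrs[i]; status checks omitted):

```cpp
// Sketch: bind device pointers to tensor UIDs in a variant pack and execute the plan.
#include <cudnn.h>

void runPlan(cudnnHandle_t handle, cudnnBackendDescriptor_t plan,
             const int64_t *uids, void **devPtrs, int64_t numTensors, void *workspace) {
    cudnnBackendDescriptor_t varPack;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_VARIANT_PACK_DESCRIPTOR, &varPack);
    cudnnBackendSetAttribute(varPack, CUDNN_ATTR_VARIANT_PACK_UNIQUE_IDS,
                             CUDNN_TYPE_INT64, numTensors, uids);
    cudnnBackendSetAttribute(varPack, CUDNN_ATTR_VARIANT_PACK_DATA_POINTERS,
                             CUDNN_TYPE_VOID_PTR, numTensors, devPtrs);
    cudnnBackendSetAttribute(varPack, CUDNN_ATTR_VARIANT_PACK_WORKSPACE,
                             CUDNN_TYPE_VOID_PTR, 1, &workspace);
    cudnnBackendFinalize(varPack);

    cudnnBackendExecute(handle, plan, varPack);  // carries out the computation on the GPU

    cudnnBackendDestroyDescriptor(varPack);
}
```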

Supported Graph Patterns

The cuDNN Graph API supports a set of graph patterns. These patterns are supported by a large number of engines, each with their own support surfaces. These engines are grouped into four different classes, as reflected by the following four subsections: pre-compiled single operation engines, generic runtime fusion engines, specialized runtime fusion engines, and specialized pre-compiled fusion engines. The specialized engines, whether they use runtime compilation or pre-compilation, are targeted to a set of important use cases, and thus have a fairly limited set of patterns they currently support. Over time, we expect to support more of those use cases with the generic runtime fusion engines, whenever practical.

Since these engines have some overlap in the patterns they support, a given pattern may result in zero, one, or more engines.

Pre-Compiled Single Operation Engines

One basic class of engines includes pre-compiled engines that support an operation graph with just one operation; specifically: ConvolutionFwd, ConvolutionBwFilter, ConvolutionBwData, or ConvolutionBwBias. Their more precise support surface can be found in the NVIDIA cuDNN API Reference.

ConvolutionBwData

ConvolutionBwData computes the convolution data gradient of the tensor dy. In addition, it uses scaling factors α and β to blend this result with the previous output. This graph operation is similar to cudnnConvolutionBackwardData().
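
Written out, the blending follows the usual cuDNN alpha/beta convention (here \(dx_{prev}\) denotes the prior contents of the output buffer); the same form applies to ConvolutionBwFilter and ConvolutionFwd below:

\(dx = \alpha \cdot convolutionBwData\left( dy, W \right) + \beta \cdot dx_{prev}\)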

ConvolutionBwData Engine

ConvolutionBwFilter

ConvolutionBwFilter computes the convolution filter gradient of the tensor dy. In addition, it uses scaling factors α and β to blend this result with the previous output. This graph operation is similar to cudnnConvolutionBackwardFilter().

ConvolutionBwFilter Engine

ConvolutionFwd

ConvolutionFwd computes the convolution of X with filter data W. In addition, it uses scaling factors α and β to blend this result with the previous output. This graph operation is similar to cudnnConvolutionForward().

ConvolutionFwd Engine

NormalizationBackward

NormalizationBackward computes the gradient dX and the scale and bias gradients dScale and dBias. This operation supports multiple modes, which are set by the attribute CUDNN_ATTR_OPERATION_NORM_BWD_MODE. The precompiled engines support instance and layer norm backward, while batch norm backward is supported by a specialized runtime compiled engine (refer to DReluForkDBn). The mean and variance saved during the forward training pass are passed as input to the NormBackward operation.

NormalizationBackward Engine
Instance Norm, Layer Norm, and RMS Norm For NormalizationBackward

| Node and Other Attributes | Instance Normalization Backward | Layer Normalization Backward | RMS Normalization Backward |
|---|---|---|---|
| operation | normBwd | normBwd | normBwd |
| X | [N, C, (D), H, W], input, I type | [N, C, (D), H, W], input, I type | [N, C, (D), H, W], input, I type |
| Mean | [N, C, (1), 1, 1], input, compute type | [N, 1, (1), 1, 1], input, compute type | N/A |
| InvVariance | [N, C, (1), 1, 1], input, compute type | [N, 1, (1), 1, 1], input, compute type | [N, 1, (1), 1, 1], input, compute type |
| Scale | [1, C, (1), 1, 1], input weight, W type | [1, C, (D), H, W], input weight, W type | [1, C, (D), H, W], input weight, W type |
| DY | [N, C, (D), H, W], input, O type | [N, C, (D), H, W], input, O type | [N, C, (D), H, W], input, O type |
| DX | [N, C, (D), H, W], output, I type | [N, C, (D), H, W], output, I type | [N, C, (D), H, W], output, I type |
| Dscale | [1, C, (1), 1, 1], output, W type | [1, C, (D), H, W], output, W type | [1, C, (D), H, W], output, W type |
| Dbias | [1, C, (1), 1, 1], output, W type | [1, C, (D), H, W], output, W type | Optional |
| mode | CUDNN_INSTANCE_NORM | CUDNN_LAYER_NORM | CUDNN_RMS_NORM |
| Supported layout | NC(D)HW, N(D)HWC | NC(D)HW, N(D)HWC | NC(D)HW, N(D)HWC |
| Supported I and O types | FP16, FP32, BF16 | FP16, FP32, BF16 | FP16, FP32, BF16 |
| Supported compute type | FP32 | FP32 | FP32 |
| Supported W types | FP32 | FP16, FP32, BF16 | FP16, FP32, BF16 |
| Alignment requirements for I/O type | 8 bytes aligned | 16 bytes aligned | 16 bytes aligned |

For each operation, all applicable tensors must have the same layout. Neither mixed I/O types, nor mixed compute types are supported.

Layer Norm and RMS Norm are also supported by specialized runtime compiled engines, with CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 for normalization backward operations. For these more performant engines, the restriction sizeof(Itype) >= sizeof(Otype) applies.

NormalizationForward

NormalizationForward computes the normalization output Y from the input X. This operation is used in both the inference and training phase. The phases are distinguished by the attribute CUDNN_ATTR_OPERATION_NORM_FWD_PHASE.

NormalizationForward Engine

This operation supports different normalization modes which are set by the attribute CUDNN_ATTR_OPERATION_NORM_FWD_MODE. The dashed lines indicate optional inputs, which are typically used in the batch norm mode of this operation. Currently, instance norm and layer norm are supported via both precompiled and specialized runtime compiled engines while batch norm and RMS norm are supported using specialized runtime compiled engines (refer to BnAddRelu).
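
For reference, a standard formulation of the forward computation (the usual definitions these modes implement, not taken verbatim from the cuDNN specification) is:

\(\hat{x} = \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}}, \quad Y = Scale \cdot \hat{x} + Bias\)

where \(\mu\) and \(\sigma^{2}\) are computed per sample and channel over the spatial dimensions for instance norm, and per sample over the channel and spatial dimensions for layer norm. RMS norm omits the mean subtraction and normalizes by \(\sqrt{mean(x^{2}) + \epsilon}\) instead.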

Instance Norm, Layer Norm, and RMS Norm For NormalizationForward

| Node and Other Attributes | Instance Normalization Forward | Layer Normalization Forward | RMS Normalization Forward |
|---|---|---|---|
| operation | normFwd | normFwd | normFwd |
| X | [N, C, (D), H, W], input, I type | [N, C, (D), H, W], input, I type | [N, C, (D), H, W], input, I type |
| Mean | [N, C, (1), 1, 1], output, compute type, only applicable to fmode CUDNN_NORM_FWD_TRAINING | [N, 1, (1), 1, 1], output, compute type, only applicable to fmode CUDNN_NORM_FWD_TRAINING | N/A |
| InvVariance | [N, C, (1), 1, 1], output, compute type, only applicable to fmode CUDNN_NORM_FWD_TRAINING | [N, 1, (1), 1, 1], output, compute type, only applicable to fmode CUDNN_NORM_FWD_TRAINING | [N, 1, (1), 1, 1], output, compute type, only applicable to fmode CUDNN_NORM_FWD_TRAINING |
| Scale | [1, C, (1), 1, 1], input weight, W type | [1, C, (D), H, W], input weight, W type | [1, C, (D), H, W], input weight, W type |
| Bias | [1, C, (1), 1, 1], input weight, W type | [1, C, (D), H, W], input weight, W type | Optional (no bias by default) |
| Y | [N, C, (D), H, W], output, O type | [N, C, (D), H, W], output, O type | [N, C, (D), H, W], output, O type |
| epsilonDesc | [1, 1, 1, 1], input, constant | [1, 1, 1, 1], input, constant | [1, 1, 1, 1], input, constant |
| mode | CUDNN_INSTANCE_NORM | CUDNN_LAYER_NORM | CUDNN_RMS_NORM |
| Supported fmode | CUDNN_NORM_FWD_TRAINING, CUDNN_NORM_FWD_INFERENCE | CUDNN_NORM_FWD_TRAINING, CUDNN_NORM_FWD_INFERENCE | CUDNN_NORM_FWD_TRAINING, CUDNN_NORM_FWD_INFERENCE |
| Supported layout | NC(D)HW, N(D)HWC | NC(D)HW, N(D)HWC | NC(D)HW, N(D)HWC |
| Supported I and O types | FP16, FP32, BF16 | FP16, FP32, BF16 | FP16, FP32, BF16 |
| Supported compute type | FP32 | FP32 | FP32 |
| Supported weight types, W type | FP32 | FP16, FP32, BF16 | FP16, FP32, BF16 |
| Alignment requirements for I/O type | 8 bytes aligned | 16 bytes aligned | 16 bytes aligned |

For each operation, all applicable tensors must have the same layout.

Layer Norm and RMS Norm are also supported by specialized runtime compiled engines: CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 3 with fmode set to CUDNN_NORM_FWD_TRAINING, and CUDNN_ATTR_ENGINE_GLOBAL_INDEX = 4 with fmode set to CUDNN_NORM_FWD_INFERENCE. For these more performant engines, the restriction sizeof(Itype) >= sizeof(Otype) applies. For non-runtime compiled engines, W type is the compute type.

Generic Runtime Fusion Engines

The engines documented in the previous section support single-op patterns. Of course, for fusion to be interesting, the graph needs to support multiple operations. And ideally, we want the supported patterns to be flexible to cover a diverse set of use cases. To accomplish this generality, cuDNN has runtime fusion engines that generate the kernel (or kernels) at runtime based on the graph pattern. This section outlines the patterns supported by these runtime fusion engines (that is, engines with CUDNN_BEHAVIOR_NOTE_RUNTIME_COMPILATION behavioral note).

We can think of the support surface as covering the following generic patterns:

  1. MatMul fusions: \(g_{2}\left( C = matmul\left( A = g_{1A}\left( inputs \right),\ B = g_{1B}\left( inputs \right) \right), inputs \right)\)

  2. ConvolutionFwd fusions: \(g_{2}\left( Y = convolutionFwd\left( X = g_{1}\left( inputs \right), W \right), inputs \right)\)

  3. ConvolutionBwFilter fusions: \(g_{2}\left( dw = convolutionBwFilter\left( dy, X = g_{1}\left( inputs \right) \right), inputs \right)\)

  4. ConvolutionBwData fusions: \(g_{2}\left( dx = convolutionBwData\left( dy = g_{1}\left( inputs \right), W \right), inputs \right)\)

  5. Pointwise fusions: \(g_{2}\left( inputs \right)\)

Graphical Representation of the Generic Patterns Supported by the Runtime Fusion Engines

Note

  • g 1 (including g 1A and g 1B) indicates fusion operations applied to the inputs of the MatMul and Convolution operation.

  • g 2 indicates fusion operations applied to the output of the MatMul and Convolution operation.

  • g 2 can have more than one output.

  • The fusion patterns in g 1 will be referred to as the mainloop fusion, and the fusion patterns in g 2 will be referred to as the epilogue fusion.

  • The arrow going into g 2 can go into any of g 2 nodes and does not necessarily need to feed into a root node.

  • The abbreviated notations for operations are used in the diagrams and throughout the text for visualization purposes.
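
For example, the convolution, bias, and activation fusion described earlier is an instance of pattern 2 with an empty g 1 and with g 2 being the bias addition followed by the ReLU:

\(Y = g_{2}\left( convolutionFwd\left( X, W \right) \right) = ReLU\left( convolutionFwd\left( X, W \right) + b \right)\)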

Support Surface

The generic runtime fusion engine includes three independent support surfaces, indexed 90, 80, and 70. A cuDNN graph can be executed by the generic runtime fusion engine once it fully satisfies the requirements of at least one support surface. The following table summarizes the features of each support surface. Generally, we recommend building cuDNN graphs that target the support surface with the largest index, and using the lower-indexed support surfaces as fallbacks.

Summary of Supported Features of Each Support Surface

| Feature | Support Surface 90 | Support Surface 80 | Support Surface 70 |
|---|---|---|---|
| Compute Capability | >= 9.0 | >= 8.0 | >= 7.0 |
| MatMul Fusions | Supported | Supported | Supported |
| ConvolutionFwd Fusions | Supported | Supported | Supported |
| ConvolutionBwFilter Fusions | Supported | Not Supported | Supported |
| ConvolutionBwData Fusions | Partially Supported | Not Supported | Supported |
| Pointwise and Reduction Fusions | Not Supported | Not Supported | Supported |
| FP8 MatMul and Convolution Operations | Supported | Supported for Compute Capability >= 8.9 | Not Supported |
| g 1 (Mainloop) Fusions | Supported | Supported | Partially Supported |
| g 2 (Epilogue) Fusions | Supported | Supported | Supported |
| Mixed Input GEMM | Supported | Supported | Not Supported |
| Grouped Convolution | Supported | Supported | Not Supported |

The detailed features of each support surface are listed in the following subsections.

Support Surface 90

Compute Capability

  • NVIDIA GPUs with compute capability 9.0 are supported.

Generic Limitations

  • Strided ConvolutionBwData fusions are not supported.

  • Pointwise and Reduction fusions are not supported.

Advanced Supports

  • Mixed input MatMul, ConvolutionFwd, and ConvolutionBwData fusions are supported.

  • Grouped ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData fusions are supported.

I/O and Intermediate Data Type

  • The input tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}.

  • The input tensor data type of MatMul, ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData operations can be any of {FLOAT, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}.

  • The output tensor data type can be any of {INT64, FLOAT, INT32, HALF, BFLOAT16, INT8, UINT8, FP8_E4M3, FP8_E5M2}.

  • The output data types of CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations can only be FLOAT.

  • The intermediate virtual tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}, and this intermediate storage type is obeyed by the code-generator. Generally, FP32 is recommended.

Compute Data Type

  • Compute data type can be either FP32 or INT32 for CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations.

  • Compute data type can only be FP32 for CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations.

  • The support surface of the compute data type of MatMul, ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData operations depends on the input data types of the operation. The combinatory support surface is listed in the following table.

Combinatory Support Surface of Input Data Type and Compute Data Type of MatMul and Convolution Operations

| MatMul / Convolution Operation Input Data Type | MatMul / Convolution Operation Compute Data Type |
|---|---|
| INT8 | INT32 |
| FP8_E4M3, FP8_E5M2 | FLOAT, FAST_FLOAT_FOR_FP8 |
| HALF | FLOAT, HALF |
| BFLOAT16 | FLOAT |
| FLOAT | FLOAT |

Mainloop Fusions: g 1

  • g 1 is a directed acyclic graph (DAG) that can consist of zero or any number of the CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations.

  • All the input tensors must have an alignment of 128 bits. For grouped ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData fusions, the alignment requirement is per group.

  • All the intermediate tensors must be virtual.

  • The supported dimensions and layouts are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 1

Pattern: MatMul fusions

Dimension:

  • Tensor A must have dimension dim[B, M, K] or dim[1, M, K].

  • Dimensions of input tensors to g 1A can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, K], dim[B, 1, K], dim[1, M, K], or dim[B, M, K].

  • Tensor B must have dimension dim[B, K, N] or dim[1, K, N].

  • Dimensions of input tensors to g 1B can be dim[1, 1, 1], dim[B, 1, 1], dim[1, 1, N], dim[B, 1, N], dim[1, K, 1], dim[B, K, 1], dim[1, K, N], or dim[B, K, N].

Layout:

  • All tensor layouts can be either row-major or column-major, and do not need to be the same.

Pattern: ConvolutionFwd fusions

Dimension:

  • Tensor X must have dimension dim[N, C, (D,) H, W].

  • Dimensions of tensors operating with X can be dim[1, 1, (1,) 1, 1], dim[1, C, (1,) 1, 1], or dim[N, C, (D,) H, W]. Exception: dim[1, C, (1,) 1, 1] is not compatible in grouped ConvolutionFwd fusions.

  • Tensor W must have dimension dim[K, C, (T,) R, S].

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwFilter fusions

Dimension:

  • Tensor dy must have dimension dim[N, K, (O,) P, Q].

  • Tensor X must have dimension dim[N, C, (D,) H, W].

  • Fusion operation in g 1 is not supported.

Layout:

  • When the input data type of the ConvolutionBwFilter operation is HALF or BFLOAT16, all tensors must be fully packed NHWC layout.

  • When the input data type of the ConvolutionBwFilter operation is FLOAT, INT8, FP8_E4M3, or FP8_E5M2, all tensors must be fully packed CHWN layout.

Pattern: ConvolutionBwData fusions

Dimension:

  • Tensor dy must have dimension dim[N, K, (O,) P, Q].

  • Dimensions of tensors operating with dy can be dim[1, 1, (1,) 1, 1], dim[1, K, (1,) 1, 1], or dim[N, K, (O,) P, Q].

  • Tensor W must have dimension dim[K, C, (T,) R, S].

Layout:

  • When the input data type of the ConvolutionBwData operation is HALF or BFLOAT16, all tensors must be fully packed NHWC layout.

  • When the input data type of the ConvolutionBwData operation is FLOAT, INT8, FP8_E4M3, or FP8_E5M2, all input tensors to g 1 must be fully packed NHWC layout, and the W tensor must be fully packed CHWN layout.

Epilogue Fusion: g 2

  • g 2 is a directed acyclic graph (DAG) that can consist of zero or any number of the CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operation and zero or one CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation.

  • CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation can only be the exit node of g 2.

  • All the input and output tensors must have an alignment of 32 bits. For grouped ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData fusions, the alignment requirement is per group.

  • In CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations, the tensors being broadcasted cannot be placed as the first input.

  • The supported dimensions and layouts are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 2

Pattern: MatMul fusions

Dimension:

  • Tensor C must have dimension dim[B, M, N].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, N], dim[B, 1, N], dim[1, M, N], or dim[B, M, N].

  • Dimensions of output tensors from g 2 can be dim[B, M, N].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, N], dim[B, 1, N], or dim[1, M, N].

Layout:

  • All tensor layouts can be either row-major or column-major, but need to be the same.

  • If a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation is present, all tensor layouts must be row-major.

Pattern: ConvolutionFwd fusions

Dimension:

  • Tensor Y must have dimension dim[N, K, (O,) P, Q].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], dim[1, K, (1,) 1, 1], or dim[N, K, (O,) P, Q].

  • Dimensions of output tensors from g 2 can be dim[N, K, (O,) P, Q].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], dim[1, K, (1,) 1, 1], dim[N, K, (1,) 1, 1], or dim[N, 1, (1,) 1, 1].

  • Grouped ConvolutionFwd fusions cannot have a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation in g 2.

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwFilter fusions

Dimension:

  • Tensor dw must have dimension dim[K, C, (T,) R, S].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[1, C, (T,) R, S], dim[K, 1, (1,) 1, 1], or dim[K, C, (T,) R, S].

  • Dimensions of output tensors from g 2 can be dim[K, C, (T,) R, S].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[1, C, (T,) R, S], or dim[K, 1, (1,) 1, 1].

  • Grouped ConvolutionBwFilter fusions cannot have a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation in g 2.

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwData fusions

Dimension:

  • Tensor dx must have dimension dim[N, C, (D,) H, W].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[N, 1, (D,) H, W], dim[1, C, (1,) 1, 1], or dim[N, C, (D,) H, W].

  • Dimensions of output tensors from g 2 can be dim[N, C, (D,) H, W].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[N, 1, (D,) H, W], dim[1, C, (1,) 1, 1], dim[N, C, (1,) 1, 1], or dim[N, 1, (1,) 1, 1].

  • Grouped ConvolutionBwData fusions cannot have g 2.

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: Pointwise and Reduction fusions

Dimension: Not Supported

Layout: Not Supported

Support Surface 80

Compute Capability

  • NVIDIA GPUs with compute capability 8.0, 8.6, 8.7, 8.9, and 9.0 are supported.

Generic Limitations

  • ConvolutionBwFilter fusions are not supported.

  • ConvolutionBwData fusions are not supported.

  • Pointwise and Reduction fusions are not supported.

Advanced Supports

  • Mixed input MatMul and ConvolutionFwd fusions are supported.

  • Grouped ConvolutionFwd fusions are supported.

I/O and Intermediate Data Type

  • The input tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}.

  • The input tensor data type of MatMul and ConvolutionFwd operations can be any of {FLOAT, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}. FP8_E4M3 and FP8_E5M2 input tensor data type of MatMul and ConvolutionFwd operations are only available with compute capability 8.9.

  • The output tensor data type can be any of {INT64, FLOAT, INT32, HALF, BFLOAT16, INT8, UINT8, FP8_E4M3, FP8_E5M2}.

  • The output data types of CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations can only be FLOAT.

  • The intermediate virtual tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}, and this intermediate storage type is obeyed by the code-generator. Generally, FP32 is recommended.

  • FP8_E4M3 and FP8_E5M2 input, output, and intermediate tensor data type is only available with compute capability 8.9 and 9.0.

Compute Data Type

  • Compute data type can be either FP32 or INT32 for CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations.

  • Compute data type can only be FP32 for CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations.

  • The support surface of the compute data type of MatMul, ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData operations depends on the input data types of the operation. The combinatory support surface is listed in the following table.

Combinatory Support Surface of Input Data Type and Compute Data Type of MatMul and Convolution Operations

| MatMul / Convolution Operation Input Data Type | MatMul / Convolution Operation Compute Data Type | Note |
|---|---|---|
| INT8 | INT32 | |
| FP8_E4M3, FP8_E5M2 | FLOAT, FAST_FLOAT_FOR_FP8 | Available with compute capability 8.9 only |
| HALF | FLOAT, HALF | |
| BFLOAT16 | FLOAT | |
| FLOAT | FLOAT | |

Mainloop Fusions: g 1

  • g 1 is a directed acyclic graph (DAG) that can consist of zero or any number of the CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operation.

  • All the input tensors must have an alignment of 128 bits. For grouped ConvolutionFwd fusions, the alignment requirement is per group.

  • All the intermediate tensors must be virtual.

  • The supported dimensions and layouts are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 1

Pattern: MatMul fusions

Dimension:

  • Tensor A must have dimension dim[B, M, K] or dim[1, M, K].

  • Dimensions of input tensors to g 1A can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, K], dim[B, 1, K], dim[1, M, K], or dim[B, M, K].

  • Tensor B must have dimension dim[B, K, N] or dim[1, K, N].

  • Dimensions of input tensors to g 1B can be dim[1, 1, 1], dim[B, 1, 1], dim[1, 1, N], dim[B, 1, N], dim[1, K, 1], dim[B, K, 1], dim[1, K, N], or dim[B, K, N].

Layout:

  • All tensor layouts can be either row-major or column-major, and do not need to be the same.

Pattern: ConvolutionFwd fusions

Dimension:

  • Tensor X must have dimension dim[N, C, (D,) H, W].

  • Dimensions of tensors operating with X can be dim[1, 1, (1,) 1, 1], dim[1, C, (1,) 1, 1], or dim[N, C, (D,) H, W]. Exception: dim[1, C, (1,) 1, 1] is not compatible in grouped ConvolutionFwd fusions.

  • Tensor W must have dimension dim[K, C, (T,) R, S].

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwFilter fusions

Dimension: Not Supported

Layout: Not Supported

Pattern: ConvolutionBwData fusions

Dimension: Not Supported

Layout: Not Supported

Epilogue Fusion: g 2

  • g 2 is a directed acyclic graph (DAG) that can consist of zero or any number of the CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operation and zero or one CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation.

  • CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation can only be the exit node of g 2.

  • All the input and output tensors must have an alignment of 32 bits. For grouped ConvolutionFwd fusions, the alignment requirement is per group.

  • In CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations, the tensors being broadcasted cannot be placed as the first input.

  • The supported dimensions and layouts are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 2

Pattern: MatMul fusions

Dimension:

  • Tensor C must have dimension dim[B, M, N].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, N], dim[B, 1, N], dim[1, M, N], or dim[B, M, N].

  • Dimensions of output tensors from g 2 can be dim[B, M, N].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, 1], dim[B, 1, 1], dim[1, M, 1], dim[B, M, 1], dim[1, 1, N], dim[B, 1, N], or dim[1, M, N].

Layout:

  • All tensor layouts can be either row-major or column-major, but need to be the same.

  • If a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation is present, all tensor layouts must be row-major.

Pattern: ConvolutionFwd fusions

Dimension:

  • Tensor Y must have dimension dim[N, K, (O,) P, Q].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], dim[1, K, (1,) 1, 1], or dim[N, K, (O,) P, Q].

  • Dimensions of output tensors from g 2 can be dim[N, K, (O,) P, Q].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], dim[1, K, (1,) 1, 1], or dim[N, 1, (1,) 1, 1].

  • Grouped ConvolutionFwd fusions cannot have a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation in g 2.

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwFilter fusions

Dimension: Not Supported

Layout: Not Supported

Pattern: ConvolutionBwData fusions

Dimension: Not Supported

Layout: Not Supported

Pattern: Pointwise and Reduction fusions

Dimension: Not Supported

Layout: Not Supported

Support Surface 70

Compute Capability

  • NVIDIA GPUs with compute capability 7.0, 7.2, 7.5, 8.0, 8.6, 8.7, 8.9, and 9.0 are supported.

I/O and Intermediate Data Type

  • The input tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2}.

  • The input tensor data type of MatMul, ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData operations can be any of {FLOAT, HALF, BFLOAT16, INT8}.

  • The output tensor data type can be any of {INT64, FLOAT, INT32, HALF, BFLOAT16, INT8, UINT8, FP8_E4M3, FP8_E5M2, BOOLEAN}.

  • The output data types of CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations can only be FLOAT.

  • The intermediate virtual tensor data type can be any of {FLOAT, INT32, HALF, BFLOAT16, INT8, FP8_E4M3, FP8_E5M2, BOOLEAN}, and this intermediate storage type is obeyed by the code-generator. Generally, FP32 is recommended.

  • FP8_E4M3 and FP8_E5M2 data types are only allowed in pure Pointwise and Reduction fusions.

Compute Data Type

  • Compute data type can be FP32 or BOOLEAN for CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations.

  • Compute data type can only be FP32 for CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operations.

  • The support surface of the compute data type of MatMul, ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData operations depends on the input data types of the operation. The combinatory support surface is listed in the following table.

Combinatory Support Surface of Input Data Type and Compute Data Type of MatMul and Convolution Operations

| MatMul / Convolution Operation Input Data Type | MatMul / Convolution Operation Compute Data Type | Note |
|---|---|---|
| INT8 | INT32 | Not available for ConvolutionBwFilter and ConvolutionBwData fusions |
| HALF | FLOAT, HALF | |
| BFLOAT16 | FLOAT | |
| FLOAT | FLOAT | |

Mainloop Fusions: g 1

  • g 1 is a directed acyclic graph (DAG) that can consist of zero or any number of the following operations:

    • CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_CONCAT_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR

  • CUDNN_BACKEND_OPERATION_CONCAT_DESCRIPTOR or CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR operations, if present, should be before any Pointwise operations.

  • For compute capability < 8.0, g 1 is not supported.

  • All the input tensors must have an alignment of 32 bits.

  • All the intermediate tensors must be virtual.

  • In CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations, the tensors being broadcasted cannot be placed as the first input.

  • The supported dimensions and layouts are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 1

Pattern: MatMul fusions

Dimension:

  • Tensor A must have dimension dim[B, M, K].

  • If g 1 is present, tensor A must have HALF data type; the other input tensors that are broadcast and operated with tensor A can have any data type.

  • Dimensions of other input tensors to g 1A can be dim[1, 1, 1], dim[B, M, 1], dim[B, 1, K], or dim[B, M, K]. If the input tensor has dimension dim[B, M, K], it must have data type HALF as well.

  • Tensor B must have dimension dim[B, K, N].

  • Fusion operation in g 1B is not supported.

Layout:

  • All tensor layouts can be either row-major or column-major.

  • All input tensors to g 1A must have the same layout.

  • If the data type of the input tensor to the MatMul operation is INT8, all the input tensors to g 1A must be in row-major and the tensor B must be in column-major.

Pattern: ConvolutionFwd fusions

Dimension:

  • Tensor X must have dimension dim[N, C, (D,) H, W].

  • Tensor W must have dimension dim[K, C, (T,) R, S].

  • Fusion operations on the X tensor can only be a chain of three specific Pointwise operations, in this exact order: CUDNN_POINTWISE_MUL, CUDNN_POINTWISE_ADD, and CUDNN_POINTWISE_RELU_FWD. This specific support is added to realize convolution batch norm fusion use cases.

  • All tensors involved can only be HALF data type.

  • CUDNN_POINTWISE_MUL and CUDNN_POINTWISE_ADD can only be operated with a tensor with dimension dim[1, C, (1,) 1, 1].

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwFilter fusions

Dimension:

  • Tensor dy must have dimension dim[N, K, (O,) P, Q].

  • Tensor X must have dimension dim[N, C, (D,) H, W].

  • Fusion operations on the X tensor can only be a chain of three specific Pointwise operations, in this exact order: CUDNN_POINTWISE_MUL, CUDNN_POINTWISE_ADD, and CUDNN_POINTWISE_RELU_FWD. This specific support is added to realize convolution batch norm fusion use cases.

  • All tensors involved can only be HALF data type.

  • CUDNN_POINTWISE_MUL and CUDNN_POINTWISE_ADD can only be operated with a tensor with dimension dim[1, C, (1,) 1, 1].

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwData fusions

Dimension:

  • Tensor dy must have dimension dim[N, K, (O,) P, Q].

  • Tensor W must have dimension dim[K, C, (T,) R, S].

  • Fusion operation in g 1 is not supported.

Layout:

  • All tensors must be fully packed NHWC layout.

Epilogue Fusion: g 2

  • g 2 is a directed acyclic graph (DAG) that can consist of zero or one CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation and zero or any number of the following operations:

    • CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_RESAMPLE_BWD_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_GEN_STATS_DESCRIPTOR

    • CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR

  • CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation can only be the exit node of g 2.

  • CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR operations, if present, must be the final nodes in g 2. Hence, CUDNN_BACKEND_OPERATION_SIGNAL_DESCRIPTOR operations cannot be used in conjunction with the CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation.

  • The input tensor to a CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR or CUDNN_BACKEND_OPERATION_RESAMPLE_BWD_DESCRIPTOR operation should not be produced by another operation within this graph, but should come from global memory. These two operations cannot be used in the MatMul, ConvolutionBwFilter, and ConvolutionBwData fusions, and are only supported with compute capability >= 7.5.

  • All the input and output tensors must have an alignment of 32 bits, except the output of a CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operation can have an alignment of 8 bits.

  • In CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR operations, the tensors being broadcasted cannot be placed as the first input.

  • The supported dimensions and layouts are listed in the table below.

Support Surface of Dimension and Layout for Tensors in g 2

Pattern: MatMul fusions

Dimension:

  • Tensor C must have dimension dim[B, M, N].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, 1], dim[B, M, 1], dim[B, 1, N], or dim[B, M, N].

  • Dimensions of output tensors from g 2 can be dim[B, M, N].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, 1], dim[B, M, 1], or dim[B, 1, N].

Layout:

  • All tensor layouts can be either row-major or column-major, but need to be the same.

  • If a CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation is present, all tensor layouts must be row-major.

Pattern: ConvolutionFwd fusions

Dimension:

  • Tensor Y must have dimension dim[N, K, (O,) P, Q].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], dim[1, K, (1,) 1, 1], or dim[N, K, (O,) P, Q].

  • Dimensions of output tensors from g 2 can be dim[N, K, (O,) P, Q].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[N, 1, (O,) P, Q], or dim[1, K, (1,) 1, 1].

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwFilter fusions

Dimension:

  • Tensor dw must have dimension dim[K, C, (T,) R, S].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[1, C, (T,) R, S], dim[K, 1, (1,) 1, 1], or dim[K, C, (T,) R, S].

  • Dimensions of output tensors from g 2 can be dim[K, C, (T,) R, S].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[1, C, (T,) R, S], or dim[K, 1, (1,) 1, 1].

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: ConvolutionBwData fusions

Dimension:

  • Tensor dx must have dimension dim[N, C, (D,) H, W].

  • Dimensions of other input tensors to g 2 can be dim[1, 1, (1,) 1, 1], dim[N, 1, (D,) H, W], dim[1, C, (1,) 1, 1], or dim[N, C, (D,) H, W].

  • Dimensions of output tensors from g 2 can be dim[N, C, (D,) H, W].

  • If the last operation in g 2 is CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR, the dimension of the last output tensor can be dim[1, 1, (1,) 1, 1], dim[N, 1, (D,) H, W], or dim[1, C, (1,) 1, 1].

Layout:

  • All tensors must be fully packed NHWC layout.

Pattern: Pointwise and Reduction fusions

Dimension:

  • If all tensors are 3D, the same dimension requirements as MatMul g 2 apply.

  • If all tensors are 4D or 5D, the same dimension requirements as ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData g 2 apply.

  • The CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR operation does not support 3D tensors.

Layout:

  • If all tensors are 3D, the same layout requirements as MatMul g 2 apply.

  • If all tensors are 4D or 5D, the same layout requirements as ConvolutionFwd, ConvolutionBwFilter, and ConvolutionBwData g 2 apply.

Operation Specific Constraints for the Runtime Fusion Engines

Every operation in the supported generic patterns of the runtime fusion engines is subject to a few specific constraints regarding their parameter surface. The following subsections document these.

Note that these constraints are in addition to (1) any constraints mentioned in the Backend Descriptor Types, and (2) limitations in relation to other operations in the directed acyclic graph (DAG), as mentioned in the Support Surface section.

MatMul

This operation represents matrix-matrix multiplication: A * B = C. For complete details on the interface, refer to the CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR section.

Convolutions

There are three operation nodes that represent different types of convolutions, namely:

ConvolutionFwd

This operation represents forward convolution, that is, computing the response tensor of an image tensor convolved with a filter tensor. For complete details on the interface, as well as general constraints, refer to the CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR section.

ConvolutionBwFilter

This operation represents convolution backward filters, that is, computing filter gradients from a response and an image tensor. For complete details on the interface, as well as general constraints, refer to the CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_FILTER_DESCRIPTOR section.

ConvolutionBwData

This operation represents convolution backward data, that is, computing input data gradients from a response and a filter tensor. For complete details on the interface, as well as general constraints, refer to the CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_DATA_DESCRIPTOR section.

Tensor Attributes for all Three Operations

| Operation | Input Tensor Attribute Names | Output Tensor Attribute Name |
|---|---|---|
| ConvolutionFwd | CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_X, CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_W | CUDNN_ATTR_OPERATION_CONVOLUTION_FORWARD_Y |
| ConvolutionBwFilter | CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_X, CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DY | CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_FILTER_DW |
| ConvolutionBwData | CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_W, CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DY | CUDNN_ATTR_OPERATION_CONVOLUTION_BWD_DATA_DX |

The FP8 data type, available since the NVIDIA Ada Lovelace architecture, has two variants that can be used as I/O data types: CUDNN_DATA_FP8_E4M3 and CUDNN_DATA_FP8_E5M2. Using them as inputs to the operation will result in FP8 Tensor Cores being used. The precision of the accumulation inside the FP8 Tensor Cores is controlled by the compute type, which may have one of two possible values: CUDNN_DATA_FLOAT and CUDNN_DATA_FAST_FLOAT_FOR_FP8.

CUDNN_DATA_FAST_FLOAT_FOR_FP8 is faster and sufficiently accurate for inference or the forward pass of training. However, for FP8 training backward pass computations (that is, computing weight and activation gradients), we recommend choosing the more accurate CUDNN_DATA_FLOAT compute type to preserve a higher level of accuracy which may be necessary for some models.
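
As a small sketch (assuming convDesc is a not-yet-finalized CUDNN_BACKEND_CONVOLUTION_DESCRIPTOR), the accumulation precision is selected through the convolution descriptor's compute type:

```cpp
// Sketch: pick the FP8 accumulation precision for a convolution descriptor.
#include <cudnn.h>

void setFp8ConvComputeType(cudnnBackendDescriptor_t convDesc, bool backwardPass) {
    // Faster accumulation for inference / forward training; full FP32 for gradients.
    cudnnDataType_t compType =
        backwardPass ? CUDNN_DATA_FLOAT : CUDNN_DATA_FAST_FLOAT_FOR_FP8;
    cudnnBackendSetAttribute(convDesc, CUDNN_ATTR_CONVOLUTION_COMP_TYPE,
                             CUDNN_TYPE_DATA_TYPE, 1, &compType);
}
```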

Recommended Compute Type for FP8 Tensor Computations for Ada Lovelace and Hopper Architecture

| Operation | Recommended I/O Type | Recommended Compute Type |
|---|---|---|
| ConvolutionFwd | CUDNN_DATA_FP8_E4M3 | CUDNN_DATA_FAST_FLOAT_FOR_FP8, CUDNN_DATA_FLOAT |
| ConvolutionBwData | CUDNN_DATA_FP8_E4M3 | CUDNN_DATA_FLOAT |
| BatchNorm | CUDNN_DATA_FP8_E4M3 | CUDNN_DATA_FLOAT |
| Pooling | CUDNN_DATA_FP8_E4M3, CUDNN_DATA_FP8_E5M2 | CUDNN_DATA_FLOAT |
| Pointwise | CUDNN_DATA_FP8_E4M3, CUDNN_DATA_FP8_E5M2 | CUDNN_DATA_FLOAT |

Pointwise

Represents a pointwise operation that implements the equation Y = op (alpha1 * X) or Y = op (alpha1 * X, alpha2 * B). Refer to the CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR and CUDNN_BACKEND_POINTWISE_DESCRIPTOR sections for more information and general constraints.

The following tables list the constraints for Pointwise operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when these operations are used in the runtime fusion engines.
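
As a minimal sketch of how such a Pointwise node is assembled with the backend API (here a ReLU from xDesc to yDesc; for two-input modes such as CUDNN_POINTWISE_ADD a BDESC is set as well; status checks omitted):

```cpp
// Sketch: a Pointwise ReLU node built from a pointwise descriptor plus an
// operation descriptor that wires it to its input and output tensors.
#include <cudnn.h>

cudnnBackendDescriptor_t makeReluOp(cudnnBackendDescriptor_t xDesc,
                                    cudnnBackendDescriptor_t yDesc) {
    // The pointwise descriptor holds the mode and the math precision.
    cudnnBackendDescriptor_t pw;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_POINTWISE_DESCRIPTOR, &pw);
    cudnnPointwiseMode_t mode = CUDNN_POINTWISE_RELU_FWD;
    cudnnBackendSetAttribute(pw, CUDNN_ATTR_POINTWISE_MODE,
                             CUDNN_TYPE_POINTWISE_MODE, 1, &mode);
    cudnnDataType_t prec = CUDNN_DATA_FLOAT;  // math precision required here for ReLU
    cudnnBackendSetAttribute(pw, CUDNN_ATTR_POINTWISE_MATH_PREC,
                             CUDNN_TYPE_DATA_TYPE, 1, &prec);
    cudnnBackendFinalize(pw);

    // The operation descriptor connects the pointwise descriptor to its I/O tensors.
    cudnnBackendDescriptor_t op;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR, &op);
    cudnnBackendSetAttribute(op, CUDNN_ATTR_OPERATION_POINTWISE_PW_DESCRIPTOR,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &pw);
    cudnnBackendSetAttribute(op, CUDNN_ATTR_OPERATION_POINTWISE_XDESC,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &xDesc);
    cudnnBackendSetAttribute(op, CUDNN_ATTR_OPERATION_POINTWISE_YDESC,
                             CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &yDesc);
    cudnnBackendFinalize(op);
    return op;
}
```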

Constraints for Pointwise Operations for Support Surface 90 and 80

Attribute: Tensor data type for CUDNN_ATTR_OPERATION_POINTWISE_XDESC, CUDNN_ATTR_OPERATION_POINTWISE_YDESC and, if applicable, CUDNN_ATTR_OPERATION_POINTWISE_BDESC, CUDNN_ATTR_OPERATION_POINTWISE_TDESC

Requirement: For all operators, all data types are supported.

Attribute: CUDNN_ATTR_POINTWISE_MATH_PREC

Requirement:

  • For any of the logical operators (CUDNN_POINTWISE_LOGICAL_AND, CUDNN_POINTWISE_LOGICAL_OR, and CUDNN_POINTWISE_LOGICAL_NOT), math precision needs to be CUDNN_DATA_INT32.

  • For any of the following operations (CUDNN_POINTWISE_ADD, CUDNN_POINTWISE_ADD_SQUARE, CUDNN_POINTWISE_DIV, CUDNN_POINTWISE_MAX, CUDNN_POINTWISE_MIN, CUDNN_POINTWISE_MOD, CUDNN_POINTWISE_ABS, CUDNN_POINTWISE_CEIL, CUDNN_POINTWISE_FLOOR, CUDNN_POINTWISE_MUL, CUDNN_POINTWISE_SUB, CUDNN_POINTWISE_NEG, CUDNN_POINTWISE_CMP_EQ, CUDNN_POINTWISE_CMP_NEQ, CUDNN_POINTWISE_CMP_GT, CUDNN_POINTWISE_CMP_GE, CUDNN_POINTWISE_CMP_LT, CUDNN_POINTWISE_CMP_LE, CUDNN_POINTWISE_GEN_INDEX, CUDNN_POINTWISE_BINARY_SELECT), math precision can be either CUDNN_DATA_FLOAT or CUDNN_DATA_INT32.

  • For any of the CUDNN_POINTWISE_IDENTITY operations, the math precision can be any data type. However, if the math precision is other than CUDNN_DATA_INT32 or CUDNN_DATA_FLOAT, the input data type, the output data type, and the math precision must be the same.

  • For all other operators, only CUDNN_DATA_FLOAT math precision is supported.

Attribute: CUDNN_ATTR_OPERATION_POINTWISE_ALPHA1

Requirement: 1.0f

Attribute: CUDNN_ATTR_OPERATION_POINTWISE_ALPHA2

Requirement: 1.0f

Constraints for Pointwise Operations for Support Surface 70

Attribute: Tensor data type for CUDNN_ATTR_OPERATION_POINTWISE_XDESC, CUDNN_ATTR_OPERATION_POINTWISE_YDESC and, if applicable, CUDNN_ATTR_OPERATION_POINTWISE_BDESC

Requirement:

  • For any of the logical operators (CUDNN_POINTWISE_LOGICAL_AND, CUDNN_POINTWISE_LOGICAL_OR, and CUDNN_POINTWISE_LOGICAL_NOT), data type can be any of CUDNN_DATA_INT32, CUDNN_DATA_INT8, or CUDNN_DATA_BOOLEAN.

  • For all other operators, all data types are supported.

Attribute: CUDNN_ATTR_POINTWISE_MATH_PREC

Requirement:

  • For any of the logical operators (CUDNN_POINTWISE_LOGICAL_AND, CUDNN_POINTWISE_LOGICAL_OR, and CUDNN_POINTWISE_LOGICAL_NOT), math precision needs to be CUDNN_DATA_BOOLEAN.

  • For all other operators, only CUDNN_DATA_FLOAT is supported.

Attribute: CUDNN_ATTR_OPERATION_POINTWISE_ALPHA1

Requirement: 1.0f

Attribute: CUDNN_ATTR_OPERATION_POINTWISE_ALPHA2

Requirement: 1.0f

GenStats

Represents an operation that generates per-channel statistics. Refer to the CUDNN_BACKEND_OPERATION_GEN_STATS_DESCRIPTOR section for more information and general constraints.

The following table lists the constraints for GenStats operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when GenStats operations are used in the runtime fusion engines.

Constraints for GenStats Operations

| Attribute | Requirement |
|---|---|
| Tensor data type for CUDNN_ATTR_OPERATION_GENSTATS_XDESC | CUDNN_DATA_HALF (prior to the NVIDIA Ampere architecture GPU); CUDNN_DATA_HALF and CUDNN_DATA_FLOAT (NVIDIA Ampere architecture and later) |
| Tensor shape for CUDNN_ATTR_OPERATION_GENSTATS_SUMDESC and CUDNN_ATTR_OPERATION_GENSTATS_SQSUMDESC | Both should be of shape [1, C, 1, 1] for 2D conv or [1, C, 1, 1, 1] for 3D conv |
| Tensor data type for CUDNN_ATTR_OPERATION_GENSTATS_SUMDESC and CUDNN_ATTR_OPERATION_GENSTATS_SQSUMDESC | CUDNN_DATA_FLOAT |
| CUDNN_ATTR_POINTWISE_MATH_PREC | CUDNN_DATA_FLOAT |
| Tensor layout for CUDNN_ATTR_OPERATION_GENSTATS_XDESC, CUDNN_ATTR_OPERATION_GENSTATS_SUMDESC, and CUDNN_ATTR_OPERATION_GENSTATS_SQSUMDESC | NHWC fully packed |

Reduction

This operation represents reducing values of a tensor in one or more dimensions. Refer to the CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR section for more information and general constraints.

The following table lists constraints for Reduction forward operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when Reduction operations are used in the runtime fusion engines.

Constraints for Reduction Operations

| Attribute | Requirement |
|---|---|
| Tensor data type for CUDNN_ATTR_OPERATION_REDUCTION_YDESC | CUDNN_DATA_FLOAT |
| CUDNN_ATTR_REDUCTION_COMP_TYPE | CUDNN_DATA_FLOAT |
| Tensor layout for CUDNN_ATTR_OPERATION_REDUCTION_XDESC and CUDNN_ATTR_OPERATION_REDUCTION_YDESC | NHWC/NDHWC/BMN fully packed |
| CUDNN_ATTR_REDUCTION_OPERATOR | CUDNN_REDUCE_TENSOR_ADD, CUDNN_REDUCE_TENSOR_MIN, and CUDNN_REDUCE_TENSOR_MAX |

ResampleFwd

This operation represents resampling of the spatial dimensions of an image to a desired value. Resampling is supported in both directions, upsampling and downsampling. Downsampling represents the standard operation of pooling, commonly used in convolutional neural networks. Refer to the CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR section for more information and general constraints.

The following are constraints for ResampleFwd operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when ResampleFwd operations are used in the runtime fusion engines.

We allow a choice amongst four modes for resample. All modes have the following common support specifications:

  • Supported layout: NHWC or NDHWC, NCHW or NCDHW

  • Spatial dimensions supported: 2 or 3

  • Input dimensions supported: 4 or 5

  • Packed boolean data type is not supported.

  • If specified, the index tensor dimension should be equal to the response tensor dimension.

When the tensor format is NCHW/NCDHW, the following additional restrictions apply:

  • Upsampling is not supported.

  • Int64_t indices are not supported.

  • Only supports symmetric padding using the prepadding backend API.

There are also some mode-specific restrictions. The following tables list the values that are allowed for particular parameters. For the parameters not listed, we allow any value which is mathematically correct.

The following downsampling modes are supported:

  • CUDNN_RESAMPLE_AVGPOOL_INCLUDE_PADDING

  • CUDNN_RESAMPLE_AVGPOOL_EXCLUDE_PADDING

  • CUDNN_RESAMPLE_MAXPOOL

Specific Restrictions for the Downsampling Modes

| Attribute | Average Pooling | Max Pooling |
|---|---|---|
| CUDNN_ATTR_RESAMPLE_PADDING_MODE | CUDNN_ZERO_PAD | CUDNN_NEG_INF_PAD |
| CUDNN_ATTR_OPERATION_RESAMPLE_FWD_ALPHA | 1.0 | 1.0 |
| CUDNN_ATTR_OPERATION_RESAMPLE_FWD_BETA | 0.0 | 0.0 |
| CUDNN_ATTR_RESAMPLE_COMP_TYPE | CUDNN_DATA_FLOAT | CUDNN_DATA_FLOAT |

For the upsampling modes, CUDNN_RESAMPLE_NEAREST is not supported for any combination of parameters. CUDNN_RESAMPLE_BILINEAR has the following support specifications.

Specific Restrictions for Upsampling Mode CUDNN_RESAMPLE_BILINEAR

| Attribute | Bilinear |
|---|---|
| Input dimensions | Equal to 0.5 x output dimensions |
| CUDNN_ATTR_RESAMPLE_PRE_PADDINGS | 0.5 |
| CUDNN_ATTR_RESAMPLE_POST_PADDINGS | 1 |
| CUDNN_ATTR_RESAMPLE_STRIDES | 0.5 |
| CUDNN_ATTR_RESAMPLE_WINDOW_DIMS | 2 |
| Data type for CUDNN_ATTR_OPERATION_RESAMPLE_FWD_XDESC and CUDNN_ATTR_OPERATION_RESAMPLE_FWD_YDESC | CUDNN_DATA_FLOAT |
| CUDNN_ATTR_RESAMPLE_COMP_TYPE | CUDNN_DATA_FLOAT |
| CUDNN_ATTR_OPERATION_RESAMPLE_FWD_ALPHA | 1.0 |
| CUDNN_ATTR_OPERATION_RESAMPLE_FWD_BETA | 0.0 |
| CUDNN_ATTR_RESAMPLE_PADDING_MODE | CUDNN_EDGE_VAL_PAD |

Resampling Index Tensor Dump for Training

For max-pooling resampling mode, an index tensor can be provided to be used as a mask for backpropagation.

Values in the index tensors are:

  • Zero-indexed row-major position of maximum value of input tensor in the resampling window.

  • In case of multiple input pixels with maximum value, the first index in a left-to-right top-to-bottom scan is selected.

An example of index element selection:

Values in the Index Tensors

Select an appropriate element size for the index tensor. As a reference, any element size such that the maximum zero-indexed window position fits should be sufficient; for example, an 8-bit index is enough for a 3x3 resampling window, whose largest position is 8.

ResampleBwd

This operation represents backward resampling of the spatial dimensions of an output response to a desired value. Resampling is supported in both directions, upsampling and downsampling. Backwards downsampling represents the standard operation of backward pooling, commonly used in convolutional neural networks. Refer to the CUDNN_BACKEND_OPERATION_RESAMPLE_BWD_DESCRIPTOR section for more information and general constraints.

The following are constraints for Resample backward operations, in addition to the general constraints listed above, and any constraints listed in the Support Surface section, in relation to other operations. Note that these additional constraints only apply when Resample backward operations are used in the runtime fusion engines.

We allow a choice amongst four modes for resample. All modes have the following common support specifications:

  • Supported layout: NHWC or NDHWC, NCHW or NCDHW

  • Spatial dimensions supported: 2 or 3

  • Input dimensions supported: 4 or 5

For layout NHWC or NDHWC:

  • The index tensor should be provided for only max pooling mode, and should adhere to the format described in the Resampling Index Tensor Dump for Training section.

  • The index tensor dimensions should be equal to the input gradient tensor dimensions.

For layout NCHW or NCDHW:

  • X, Y, and DY are required when max pooling mode is used.

  • Int64_t indices are not supported.

There are also some mode-specific restrictions. The following tables list the values that are allowed for particular parameters. For the parameters not listed, we allow any value which is mathematically correct.

The following backward downsampling modes are supported:

  • CUDNN_RESAMPLE_AVGPOOL_INCLUDE_PADDING

  • CUDNN_RESAMPLE_AVGPOOL_EXCLUDE_PADDING

  • CUDNN_RESAMPLE_MAXPOOL

Specific Restrictions for the Backwards Downsampling Modes

| Attribute | Average Pooling | Max Pooling |
|---|---|---|
| CUDNN_ATTR_RESAMPLE_PADDING_MODE | CUDNN_ZERO_PAD | CUDNN_NEG_INF_PAD |
| CUDNN_ATTR_OPERATION_RESAMPLE_BWD_ALPHA | 1.0 | 1.0 |
| CUDNN_ATTR_OPERATION_RESAMPLE_BWD_BETA | 0.0 | 0.0 |
| CUDNN_ATTR_RESAMPLE_COMP_TYPE | CUDNN_DATA_FLOAT | CUDNN_DATA_FLOAT |

Backward upsampling modes are currently not supported.

Examples of Supported Patterns

The following sections provide examples of supported patterns, in order of increasing complexity. We employ the same color scheme as in the overall pattern to aid in identifying the structure of g 1 (blue) and g 2 (purple).

For illustration purposes, we abbreviated the operations used. For a full mapping to the actual backend descriptors, refer to the Mapping with Backend Descriptors.

Single Operation

The following example illustrates a convolution operation without any operations before or after it. This means, g 1 and g 2, are empty graphs.

Runtime Fusion Engines with a Single Operation
Pointwise Operations After Convolution 1

In this example, g 2 consists of a sequential set of two Pointwise operations after the convolution.

ConvolutionFwd Followed by a DAG with Two Operations
Pointwise Operations After Convolution 2

Similar to the previous example, g 2 consists of a sequential set of multiple Pointwise operations.

ConvolutionFwd Followed by a DAG with Three Operations
Pointwise Operations Before Matrix Multiplication

Pointwise operations can also precede a convolution or matrix multiplication, that is, g 1 is composed of Pointwise operations.

MatMul Preceded by a DAG with Two Operations
Convolution Producer Node in Middle of DAG

The following pattern shows g 1 as a DAG of Pointwise operations feeding into a convolution. In addition, g 2 is a DAG consisting of two Pointwise operations. Note that the output of the convolution is consumed in the middle of g 2, rather than by its first node. This is a valid pattern.

Fusion of Operations Before and After the ConvolutionFwd operation. In addition, we Observe that the Output of ConvolutionFwd can Feed Anywhere in g 2.
Mixed Input MatMul

The mixed input GEMM is implemented as a special case of the mainloop fusion. Inputs may have different data types; they are converted by a Pointwise:Identity operation to the desired data types that serve as the inputs of the MatMul or Convolution operation. The following pattern shows g 1 as a DAG with a Pointwise:Identity operation that converts the data type of input tensor A before it feeds the MatMul operation. This is a valid pattern.

Mixed Precision MatMul

Specialized Runtime Fusion Engines

The specialized runtime fusion engines target and optimize specialized graph patterns that commonly occur in popular deep learning models. These engines offer limited flexibility regarding supported fusion patterns, supported data types, and supported tensor layouts. Long term, these patterns are expected to be more generic.

The following sections highlight the supported patterns.

BnAddRelu

In ResNet-like vision models, batch normalization followed by ReLU activation is a commonly occurring pattern. The BNAddRelu fusion pattern, supported using a runtime compiled engine, aims to optimize this recurring operation graph. It also supports single node multi-GPU batch normalization for speeding up batch norm computation in multi-GPU systems. The pattern is intended for use in the forward pass during the training phase. The full pattern BNAddRelu with the add node is used in cases where there are skip connections in the model.

The pattern is illustrated in the following diagram and its options and limitations include:

  • The pointwise nodes: Add, ReLU, and GT (greater than) are optional.

  • All tensors should be in NHWC packed layout format.

  • Both 4D and 5D tensors are supported.

  • Only ReLU activation is supported.

  • The attribute CUDNN_ATTR_OPERATION_NORM_FWD_MODE for the norm forward operation must be set to CUDNN_BATCH_NORM.

  • The attribute CUDNN_ATTR_OPERATION_NORM_FWD_PHASE for the norm forward operation must be set to CUDNN_NORM_FWD_TRAINING.

  • The batch norm input tensors: Scale, Bias, Input_running_mean, and Input_running_var must be of float data type.

  • The batch norm output tensors: output_running_mean, output_running_var, mean, and InvVariance must be of float data type.

  • The batch norm input tensor BN_x, residual input Z, and output tensor Y can be any of {FP32, FP16, BF16} data types. For FP16 and BF16 data types, the channel count C for the tensors must be a multiple of 8, while for float data type the channel count must be a multiple of 4 (a small check is sketched after this list).

  • These patterns are supported on devices with compute capability >= 8.0.
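
As a quick illustration of the channel-count rule in the list above, a hypothetical check (not a cuDNN API) might look like this:

    #include <cstdint>

    // Illustrative check of the BnAddRelu channel-count rule described above:
    // FP16/BF16 tensors need the channel count C to be a multiple of 8,
    // FP32 tensors need a multiple of 4. Hypothetical helper, not a cuDNN API.
    bool bnAddReluChannelCountOk(int64_t c, bool isFp16OrBf16) {
        return isFp16OrBf16 ? (c % 8 == 0) : (c % 4 == 0);
    }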

BnAddRelu cuDNN Operation Graph

In case of single node multi-GPU batch norm, each GPU computes the local statistics based on its input data and writes out the local statistics to the peerTensors. Each peerTensor resides on a separate GPU on the node and is used for reading and writing local statistics from the peer GPUs. This is followed by a global stats computation phase where each GPU aggregates the statistics from the peers and computes the global mean and variance for the batch norm output computation on its local data. Apart from the options and limitations listed above, the following additional restrictions apply for using multi-GPU batch norm:

  • The attribute CUDNN_ATTR_OPERATION_NORM_FWD_PEER_STAT_DESCS of the NormForward operation must be set.

  • The size of the peerTensors vector should be equal to the number of GPUs in the node participating in the batch norm computation.

  • The maximum size of the peerTensors vector is 32.

  • Each GPU should operate on the same size of input data [N,C,H,W].

  • The size of each peerTensor in the peerTensors vector should be equal to num_gpu * 4 * C, where C is the channel count of the BN_x tensor and num_gpu is the number of GPUs in the node participating in the batch norm computation (a worked example follows this list).

  • All the elements of each tensor in the peerTensors vector should be memset to 0 before passing that tensor in the variant pack.
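
The following is a small worked example of the peerTensor sizing and zeroing rules above. It is illustrative only (not a cuDNN API); the GPU count and channel count are made-up values, and a real peerTensor would be a device buffer cleared with cudaMemset rather than a host vector.

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Worked example of the peerTensor sizing rule: each peerTensor holds
    // num_gpu * 4 * C float elements and must be zeroed before it is passed
    // in the variant pack.
    int main() {
        const int64_t numGpus  = 8;    // GPUs in the node participating in batch norm (assumed)
        const int64_t channels = 256;  // channel count C of the BN_x tensor (assumed)
        const int64_t elemsPerPeerTensor = numGpus * 4 * channels;  // 8192 float elements
        std::vector<float> peerTensorHostMirror(elemsPerPeerTensor);
        // Zero all elements, as required before passing the tensor in the variant pack.
        std::memset(peerTensorHostMirror.data(), 0,
                    peerTensorHostMirror.size() * sizeof(float));
        return 0;
    }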

Single Node Multi-GPU Batch Norm

DReluForkDBn

Similar to the BnAddRelu pattern, the DReluForkDBn pattern also targets ResNet-like vision networks. It is intended to be used in backpropagation during the training phase. The DReluForkDBn pattern is supported through a runtime compiled engine that usually complements the BnAddRelu pattern. It also supports single node multi-GPU batch normalization for speeding up batch norm backward computation in multi-GPU systems.

The pattern is illustrated in the following diagram and its options and limitations include:

  • The pointwise node dRelu is optional.

  • The intermediate tensor dZ can be virtual or non-virtual.

  • All tensors should be in NHWC packed layout format.

  • Both 4D and 5D tensors are supported.

  • Only dRelu activation is supported.

  • Bitmask tensor input is needed for the dRelu node.

  • The attribute CUDNN_ATTR_OPERATION_NORM_BWD_MODE for the norm backward operation must be set to CUDNN_BATCH_NORM.

  • The batch norm backward input tensors: Scale, Mean, InvVariance and the output tensors dScale and dBias must be of float data type.

  • dRelu input tensor dY, batch norm backward input BN_x, bias gradient dZ, and output tensor dX can be any of {FP32, FP16, BF16} data types. For FP16 and BF16 data types, the channel count C for the tensors must be a multiple of 8 while for float data type the channel count must be a multiple of 4.

  • These patterns are supported on devices with compute capability >= 8.0.

DReluForkDBn cuDNN Operation Graph

The single node multi-GPU version of this pattern is typically used for dScale and dBias gradient aggregation across GPUs. For using the multi-GPU version, the attribute CUDNN_ATTR_OPERATION_NORM_BWD_PEER_STAT_DESCS of the NormBackward operation must be set. Other restrictions for the peerTensors vector listed in the previous section apply for this pattern as well.

Fused Attention fprop

Mha-Fprop fusions \(O = matmul\left( S = g_{4}\left( P = matmul\left( Q, g_{3}\left( K \right) \right) \right), V \right)\) have been added to the runtime fusion engine to serve patterns that are commonly used in attention. These patterns can be used in BERT, T5, and so on.

There are two key differences to the flash fused attention patterns described in later sections:

  1. Input sizes supported contain small sequence lengths (<= 512).

  2. The operation graph is flexible to switch between different types of masks, different operations between the two matrix multiplications, and so on.

Mha-fprop cuDNN Operation Graph

g 3 can be an empty graph or a single scale operation with the scale being a scalar value (CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_MUL).

g 4 can be empty or the combination of the following DAGs of cuDNN operations. Each of these DAGs is optional, as shown by the dotted line.

DAGs of cuDNN operations

The combination has to obey the order in which we present them. For example, if you want to use the padding mask and softmax, the padding mask has to appear before softmax.

These operations are commonly used in attention. In the following diagram, we depict how to create a DAG for each of the operations. In later versions, we will be expanding the possible DAGs for g 3 and g 4.

Padding Mask

cuDNN graph depicting DAG:Padding Mask

Causal Mask

cuDNN graph depicting DAG:Causal Mask

Softmax

cuDNN graph depicting DAG:Softmax

Dropout

cuDNN graph depicting DAG:Dropout

g 4 is capable of storing an intermediate tensor to global memory marked as S, which can be used for fused multi-head attention bprop. Both DAG:Softmax and DAG:Dropout have this capability. Set S as the output from the last DAG in the graph.

The tensor descriptor marked as S must have the CUDNN_ATTR_TENSOR_REORDERING_MODE set to CUDNN_TENSOR_REORDERING_F16x16. This is because the tensor is stored in a special format and can only be consumed by fused attention bprop.
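
As a rough sketch of what this looks like against the backend API, the following sets the reordering mode on an already-created tensor descriptor. This is an assumption-laden illustration, not documentation of the API: error handling is omitted, the attribute type enum name is our assumption, and the remaining tensor attributes (dimensions, strides, UID, alignment) are assumed to be set elsewhere before finalization.

    #include <cudnn.h>

    // Hedged sketch: mark a tensor descriptor with the F16x16 reordering mode
    // required for the S tensor above. Assumes tensorDesc was created with
    // cudnnBackendCreateDescriptor(CUDNN_BACKEND_TENSOR_DESCRIPTOR, ...).
    void markTensorF16x16(cudnnBackendDescriptor_t tensorDesc) {
        cudnnBackendTensorReordering_t reorder = CUDNN_TENSOR_REORDERING_F16x16;
        cudnnBackendSetAttribute(tensorDesc,
                                 CUDNN_ATTR_TENSOR_REORDERING_MODE,
                                 CUDNN_TYPE_TENSOR_REORDERING_MODE,
                                 1, &reorder);
    }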

There is an additional option of generating the mask on the user end and passing it directly to the pointwise multiplier. The mask needs to be of I/O data type FP16/BF16 and S will store the mask in the sign bit to communicate to bprop.

Limitations Of Mha-fprop Fusions

MatMul:

  • Compute type for both MatMul ops must be float.

  • Input tensors must have data type FP16 or BF16.

  • Output tensors must have data type FP16, BF16, or FP32 (TF32).

Pointwise operations in g 3 and g 4: compute type must be FP32 (TF32).

Reduction operations in g 3 and g 4: I/O types and compute types must be FP32 (TF32).

RNG operation in g 3 and g 4:

  • Data type of yTensor must be FP32 (TF32).

  • The CUDNN_TYPE_RNG_DISTRIBUTION must be CUDNN_RNG_DISTRIBUTION_BERNOULLI.

Layout requirements of Mha-fprop fusions include:

  • All I/O tensors must have 4 dimensions, with the first two denoting the batch dimensions. The usage of rank-4 tensors in MatMul ops can be read from the Backend Descriptor Types documentation.

  • The contracting dimension (dimension K) for the first MatMul must be 64.

  • The non-contracting dimensions (M and N) for the first MatMul must be less than or equal to 512. In inference mode, any sequence length is functional. For training, support exists only for multiples of 64 (a small validation sketch follows this list).

  • The last dimension (corresponding to hidden dimensions) in Q, V, and O is expected to have stride 1.

  • For the K tensor, the stride is expected to be 1 for the 2nd last dimension.

  • The S tensor is expected to have CUDNN_ATTR_TENSOR_REORDERING_MODE set to CUDNN_TENSOR_REORDERING_F16x16.
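
The following is a minimal validation sketch for the first-MatMul constraints listed above. It is a hypothetical helper, not a cuDNN API, and only covers the dimension rules, not the stride or data-type requirements.

    #include <cstdint>

    // Illustrative check of the first-MatMul constraints for Mha-fprop fusions:
    // contracting dimension K must be 64, sequence lengths (M, N) are limited
    // to 512, and training additionally requires them to be multiples of 64.
    bool mhaFpropDimsSupported(int64_t m, int64_t n, int64_t k, bool training) {
        if (k != 64) return false;
        if (m > 512 || n > 512) return false;
        if (training && (m % 64 != 0 || n % 64 != 0)) return false;
        return true;
    }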

Fused Attention bprop

Mha-Bprop fusions are executed in a fused pattern in a single kernel.

\(dV = matmul\left( g_{5}\left( S \right), dO \right)\)

\(dS = matmul\left( dO, V^{T} \right)\)

\(dQ = matmul\left( g_{6}\left( dS \right), K \right)\)

\(dK = matmul\left( Q, g_{7}\left( dS \right) \right)\)

cuDNN supports the corresponding backpropagation graph for fused attention. This can be used together with the fused attention fprop graph to perform training on models that have similar architectures to BERT and T5. This is not compatible with the flash fused attention bprop operation graph.

Mha-bprop cuDNN Operation Graph

g 5, g 6, and g 7 can only support a fixed DAG. We are working towards generalizing these graphs.

cuDNN Graph Depicting g 5

g 6 represents the backward pass of softmax and masking, to get dP.

cuDNN Graph Depicting g 6

There are several options for the Mask DAG that you can opt into: use the padding/causal mask, supply a general mask as an input, or do no masking at all.

cuDNN Graph Depicting Mask DAG

The dBias DAG is useful for calculating the bprop of the relative positional encoding; it is optional and available for you to opt into.

cuDNN Graph Depicting dBias DAG

g 7 is the transpose of dP, the output of g 6.

cuDNN Graph Depicting g 7
Limitations Of Mha-bprop Fusions

MatMul:

  • Compute type for both MatMul ops must be float.

  • Input tensors must have data type FP16 or BF16.

  • Output tensors must have data type FP16, BF16, or FP32 (TF32).

Pointwise operations in g 5, g 6, and g 7: compute type must be FP32 (TF32).

Reduction operations in g 5, g 6, and g 7: I/O types and compute types must be FP32 (TF32).

Layout requirements of Mha-bprop fusions include:

  • All I/O tensors must have 4 dimensions, with the first two denoting the batch dimensions. The usage of rank-4 tensors in MatMul ops can be read from the Backend Descriptor Types documentation.

  • The contracting dimension (dimension K) for the second MatMul must be 64.

  • The contracting dimension (dimension K) for the first, second, and third MatMul must be less than or equal to 512 and a multiple of 64.

  • The last dimension (corresponding to hidden dimensions) in Q, K, V, O, and dO is expected to have stride 1.

  • The S and dP tensors are expected to have CUDNN_ATTR_TENSOR_REORDERING_MODE set to CUDNN_TENSOR_REORDERING_F16x16.

Fused Flash Attention fprop

cuDNN supports flash fused attention to perform scaled dot product attention, commonly used in models like GPT, BERT, and so on. The general pattern supported by this engine is BMM-Softmax-BMM, with many other optional features that you can opt into. You can choose to create the graph by yourself or use the custom sdpa node in the cuDNN frontend. Using the frontend node makes opting into the different options, like causal mask, dropout, alibi masking, and so on, very easy.

Flash fprop cuDNN Operation Graph

Pre-softmax optional DAGs cover multiple options for the users to configure:

  • pointwise Multiply node for attention scale after the first matmul

  • pointwise Add node for the relative positional encoding to add a bias after the first matmul

  • Different masking options like causal masking, padding masking, sliding window attention, and alibi masking. Users can choose multiple masking schemes together or no masking.

  • pointwise Multiply node that accepts a full tensor that can be used as a custom mask generated by the user

Post-softmax optional DAGs cover multiple options for the users to configure:

  • pointwise Multiply node with a RNG node to signify dropout

  • pointwise Multiply node with a user generated tensor acting as the dropout mask

All these DAGs are optional, and you can enable them depending on the cuDNN API you are targeting. If you use the sdpa node in the cuDNN frontend, you can set the provided API options to true, for example use_causal_mask(True), and the frontend internally adds the correct graph automatically. When using the graph API directly, you add the corresponding graph of the operations you want into the cuDNN graph.

Compound operations, for example, Causal Mask, Sliding Window Mask, Softmax, and so on, can be represented using the following operation graphs in cuDNN.

Snapshot of support of Fused Flash Attention fprop

Causal Mask

Flash fprop Causal Mask Operation Graph

Padding Mask

Flash fprop Padding Mask Operation Graph

Sliding Window Mask

Flash fprop Sliding Window Mask Operation Graph

Alibi Mask

Flash fprop Alibi Mask Operation Graph

Softmax

Flash fprop Softmax Operation Graph

Dropout

Flash fprop Dropout Operation Graph
Limitations For The Input And The Output Non-Virtual Tensors

Q, K^T, and V tensor:

  • All tensors must be either FP16 or BF16 data type.

  • Contracting dimension for Q must be a multiple of 8 with maximum value 256.

  • Non-contracting dimension for V must be a multiple of 8 with maximum value 256.

  • Contracting dimension for Q and K^T needs to have stride 1 in the layout.

  • Non-contracting dimension for V needs to have stride 1 in the layout.

  • The second dimension in K^T corresponding to the number of heads can be a factor of the number of heads of Q.

  • The second dimension in V corresponding to the number of heads can be a factor of the number of heads of Q.

Softmax stats:

  • Data type must be FP32.

  • Data must be in row major format.

O tensor:

  • Data type must be either FP16 or BF16.

  • The stride for the last dimension corresponding to the hidden dim per head should be 1.

Seed and Offset: INT32 or INT64 scalar in host or GPU.

Scale to Pointwise: attention scale can be FP16/BF16/FP32.

Inference mode can be turned on by passing the Softmax stats as a virtual tensor and setting the RNG node probability to 0.0f. The pattern is supported for GPUs with NVIDIA Ampere architecture and newer.

In cuDNN version >= 9.1.0, we have also introduced Fused Flash Attention in the FP8 data type, supported on NVIDIA Hopper GPUs. In addition to the standard fprop graph, there are additional dequantization scales, quantization scales, and absolute max (amax) calculations. The current FP8 support is a subset of the features available with BF16. We are actively working on expanding the support for the FP8 kernels. For more information about the scales and amax, refer to the Specialized Pre-Compiled Engines, FP8 Fused Flash Attention section.

In the following diagram, red tensors indicate FP8 datatype tensors and black tensors are in FP32 datatype.

FP8 Flash fprop Operation Graph

Pre-softmax optional DAGs cover multiple options for you to configure:

  • pointwise Multiply node for attention scale after the first MatMul

  • Masking options include causal masking and no masking

Post-softmax optional DAGs cover multiple options for you to configure:

  • Currently there is no support for dropout

Limitations For The Input And The Output Non-Virtual Tensors for FP8 Flash Attention

Q, K^T, and V tensor:

  • All tensors must be either E4M3 or E5M2 data type.

  • Contracting dimension for Q must be a multiple of 16 with maximum value of 128.

  • Non-contracting dimension for V must be a multiple of 16 with maximum value of 128.

  • Contracting dimension for Q and K^T needs to have stride 1 in the layout.

  • Non-contracting dimension for V needs to have stride 1 in the layout.

  • The second dimension in K^T corresponding to the number of heads can be a factor of the number of heads of Q.

  • The second dimension in V corresponding to the number of heads can be a factor of the number of heads of Q.

Softmax stats:

  • Data type must be FP32.

  • Data must be in row major format.

O tensor:

  • Data type must be either E4M3 or E5M2.

  • The stride for the last dimension corresponding to the hidden dim per head should be 1.

Scale to Pointwise: attention scale can be FP32.

Fused Flash Attention bprop

cuDNN supports the corresponding backpropagation graph for fused flash attention. This can be used together with the fprop graph to perform training on Large Language Models (LLMs).

All the options mentioned in fprop are applicable in the bprop graph as well. The corresponding bprop frontend node contains the same options and can be configured to perform the bprop. Users opting for the graph API again need to add the graphs of the operations they want. The graph shown below is for a standard attention layer in GPT with causal masking and dropout.

For Grouped Query Attention (GQA) and Multi Query Attention (MQA), you can configure an additional reduction node for dK and dV, which reduces the tensor from the full number of heads (Q heads) to the actual K and V heads.

For the input and output tensors, the limitations from the fprop graph are carried over. For the bprop specific tensors, the limitations are as follows:

Limitations For The bprop Specific Tensors

dQ, dK, and dV tensor:

  • All tensors must be either FP16 or BF16 data type.

  • The last dimension corresponding to the hidden dim per head must be a multiple of 8 with maximum value 128.

  • The stride for the last dimension corresponding to the hidden dim per head should be 1.

Softmax sum:

  • Data type must be FP32.

  • Data must be in row major format.

dO tensor:

  • Data type must be either FP16 or BF16.

  • The last dimension corresponding to the hidden dim per head must be a multiple of 8 with maximum value 128.

  • The stride for the last dimension corresponding to the hidden dim per head should be 1.

  • The layout of the tensor is required to be the same as the O tensor.

dqAccum tensor:

  • Data type must be FP32.

  • The tensor must be memset to zero before passing to cuDNN.

  • Data must be in row major format.

Flash bprop cuDNN Operation Graph

The pattern is supported for GPUs with NVIDIA Ampere architecture and newer.

In cuDNN version >= 9.1.0, we have also introduced Fused Flash Attention bprop in the native FP8 data type, supported on NVIDIA Hopper GPUs. In addition to the standard bprop graph, there are additional dequantization scales, quantization scales, and absolute max (amax) calculations. The current FP8 Flash Attention bprop support corresponds to the FP8 Flash Attention fprop support.

In the following diagram, red tensors indicate FP8 datatype tensors and black tensors are in FP32 datatype.

FP8 Flash bprop Operation Graph
Limitations For The FP8 Fused Flash Attention bprop Specific Tensors

dQ, dK, and dV tensor:

  • All tensors must be either E4M3 or E5M2 data type.

  • The last dimension corresponding to the hidden dim per head must be 128.

  • The stride for the last dimension corresponding to the hidden dim per head should be 1.

dO tensor:

  • Data type must be either E4M3 or E5M2.

  • The last dimension corresponding to the hidden dim per head must be 128.

  • The stride for the last dimension corresponding to the hidden dim per head should be 1.

  • The layout of the tensor is required to be the same as the O tensor.

Specialized Pre-Compiled Engines

The pre-compiled specialized engines target and optimize for a specialized graph pattern with a ragged support surface. Because of this targeting, these graphs do not require runtime compilation.

In most cases, the specialized patterns are just special cases of the generic patterns used in the runtime fusion engines, but there are some cases where the specialized pattern does not fit any of the generic patterns. If your graph pattern matches a specialized pattern, you will get at least a pattern matching engine, and you might also get runtime fusion engines as another option.

Currently, the following patterns are supported by the pattern matching engines. Some nodes are optional. Optional nodes are indicated by dashed outlines.

ConvBNfprop

The ConvBNfprop pattern is illustrated in the following figure. Its restrictions and options include:

  • The three pointwise nodes scale, bias, and ReLU are optional.

  • X, Z, W, s 1, b 1 must all be of FP16 data type.

  • Z needs to be of shape [N, C, H, W] with NHWC packed layout.

  • W needs to be of shape [K, C, R, S] with KRSC packed layout.

  • s 1 and b 1 need to be of shape [1, C, 1, 1] with NHWC packed layout.

  • Only ReLU activation is supported.

  • All of the intermediate tensors need to be virtual, except Y needs to be non-virtual.

  • I/O pointers should be 16 bytes aligned.

  • This pattern is only supported on devices with compute capability >= 8.0 (with the exception of NVIDIA Ada Lovelace architecture, 8.9).

  • On devices with compute capability >= 9.0, we only support two patterns:

    • the full pattern: scale + bias + ReLU + Conv + GenStats, and

    • the partial pattern: Conv + GenStats.

ConvBNfprop, A Pre-Compiled Engine, Fuses ConvolutionFwd and GenStats With Several Pointwise Operations

Skip connections are commonly observed in ResNet-like models. To support fusions in skip connections, we support a variant of the pattern above, the DBARCS pattern (short for Dual, Scale, Bias, Add, ReLU, Conv genStats). The limitations and options of the DBARCS pattern include:

  • The pointwise dual scale and dual bias nodes are either both present or not. This is indicated by the dashed block encircling the dual scale and dual bias nodes. In case both the nodes are missing, the dual_X tensor is directly fed as input to the add node.

  • The pointwise nodes scale, bias, add, and ReLU are required nodes.

  • Currently, only supported on Hopper GPUs.

  • For all the other data types, layout and virtualness restrictions of the ConvBNfprop pattern apply to this pattern as well.

  • dual_X, dual_scale, and dual_bias must all be of FP16 data type.

  • dual_scale and dual_bias must be of shape [1,C,1,1] with NHWC packed layout.

  • Intermediate outputs of the ReLU and Conv nodes: Relu_Y and Y are non-virtual. All the other intermediate outputs are virtual.

  • The weight tensor W for the convolution needs to be of shape [K,C,1,1]. Only 1x1 filters with padding 0 are supported for the convolution in the DBARCS pattern.

DBARCS In The convBNfprop Series For Supporting Fusions Across Skip Connections

ConvBNwgrad

The ConvBNwgrad pattern is illustrated in the following figure. Its restrictions and options include:

  • The three pointwise operations are all optional, as indicated by the dashed outlines.

  • Only ReLU activation is supported.

  • X, s 1, b 1, and dy must all be of FP16 datatype.

  • I/O pointers should be 16 bytes aligned.

  • X, s 1, b 1, and dy must all have NHWC packed layouts.

  • All the intermediate tensors need to be virtual.

  • This pattern is only supported on devices with compute capability >= 8.0 (with the exception of NVIDIA Ada Lovelace architecture, 8.9).

  • On devices with compute capability >= 9.0, support is restricted to:

    • the full pattern: scale + bias + ReLU + wgrad.

ConvBNwgrad, A Pre-Compiled Engine, Fuses ConvolutionBwFilter With Several (Optional) Pointwise Operations

ConvBiasAct

The ConvBiasAct pattern is illustrated in the following figure. Its restrictions and options include:

  • \(\alpha_{1}\) and \(\alpha_{2}\) need to be scalars.

  • The activation node is optional.

  • The size of the bias tensor should be [1, K, 1, 1].

  • Internal conversions are not supported. That is, the virtual output between nodes needs to have the same data type as the node's compute type, which should be the same as the epilog type of the convolution node.

  • There are some restrictions on the supported combination of data types, which can be found in the API Reference (refer to cudnnConvolutionBiasActivationForward()).

ConvBiasAct, A Pre-Compiled Engine, Fuses ConvolutionFwd With Several Pointwise Operations

ConvScaleBiasAct

The ConvScaleBiasAct pattern is illustrated in the following figure. Its restrictions and options include:

  • \(\alpha_{1}\), \(\alpha_{2}\), and \(b_{1}\) should have the same data type/layout and can only be FP32.

  • X, W, and Z can only be INT8x4 or INT8x32.

  • The size of the bias tensor should be [1, K, 1, 1].

  • Internal conversions are not supported. That is, the virtual output between nodes needs to be the same as their compute type.

  • Currently, Pointwise:ReLU is the only optional pointwise node.

ConvScaleBiasAct, A Pre-Compiled Engine

This pattern is very similar to ConvBiasAct. The difference is that here, the scales \(\alpha_{1}\) and \(\alpha_{2}\) are tensors, not scalars. If they are scalars, this pattern becomes a normal ConvBiasAct.

DgradDreluBNBwdWeight

The DgradDreluBNBwdWeight pattern is illustrated in the following figure. Its restrictions and options include:

  • Dgrad input dy and W are of FP16 datatypes.

  • Among the batch norm fwd inputs, X_bn is of FP16 data type while the other tensors (mean_bn, invstd_dev_bn, scale_bn, and bias_bn) are FP32.

  • Outputs: dScale, dBias, A, B, C are of FP32 data type.

  • All pointers are 16 byte aligned.

  • This pattern is only supported on devices with compute capability >= 8.0 (with the exception of NVIDIA Ada Lovelace architecture, 8.9).

DgradDreluBNBwdWeight is a pre-compiled engine that can be used in conjunction with the dBNApply pattern to compute the backwards path of batch norm.

DgradDreluBNBwdWeight Pattern For Fusions In The Backward Pass

The BNBwdWeight operation takes in five inputs: X_bn, mean_bn, invstddev_bn, scale_bn, and dy_bn (that is, the output from the ReLUBwd node).

It produces five outputs: gradients of the batch norm scale and bias params, dScale, dBias, and coefficients A, B, C. Note that for illustration purposes, the inputs are duplicated. The inputs on the left and right are however exactly the same.

This pattern is typically used in the computation of the Batch Norm Backward Pass.

When computing the backward pass of batch norm, dScale, dBias, and dX_bn are needed. The DgradDreluBnBwdWeight pattern computes the former two. Using the generated A, B, and C, the dBNApply pattern shown below can then compute the input gradient as dx_bn = A*dy_bn + B*X_bn + C.
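
The elementwise computation is simple enough to sketch directly. The following is an illustrative CPU reference (not a cuDNN API), assuming NHWC-packed data and per-channel A, B, and C coefficients.

    #include <cstddef>

    // Illustrative reference for the dBNApply computation
    // dx_bn = A*dy_bn + B*X_bn + C, with per-channel coefficients A, B, C
    // and NHWC-packed activations (channel index varies fastest).
    void dbnApply(const float* A, const float* B, const float* C,
                  const float* dy_bn, const float* x_bn, float* dx_bn,
                  std::size_t n_elems, std::size_t channels) {
        for (std::size_t i = 0; i < n_elems; ++i) {
            const std::size_t c = i % channels;  // channel index for NHWC-packed data
            dx_bn[i] = A[c] * dy_bn[i] + B[c] * x_bn[i] + C[c];
        }
    }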

dBNApply Pattern For Final Gradient Computation

The dBNApply pattern was initially supported by a pre-compiled static engine but is now supported by the generic runtime fusion engine.

Note that the DgradDreluBNBwdWeight pattern is used in combination with the forward pass pattern ConvBNfprop. Because of performance reasons, the output of batch norm Y_bn, which was calculated in ConvBNfprop (output of scale-bias), needs to be recalculated by DgradDreluBnBwdWeight. The pointwise add node subtracts mean_bn from X_bn, hence the alpha2 parameter for that node should be set to -1.

FP8 Fused Flash Attention

cuDNN supports fused flash attention with input and output data types being in FP8 format. This FP8-specific graph pattern is supported only on Hopper (H100) GPUs.

Support exists for both training (forward and backward pass) and inference in FP8 format. The training forward pass is slightly different from the inference forward pass regarding whether some intermediate tensors are output or not.

Within the NVIDIA Hopper architecture, there are two new FP8 formats: E4M3 and E5M2. Currently, the forward pass is supported only when all inputs and outputs are in E4M3 format. For the backward pass, support exists only when some of the inputs and outputs are in E4M3 and some are in E5M2. More general support for the FP8 formats will be added in future releases.

Due to the limited numerical precision of FP8 data type, for practical use cases, you must scale values computed in FP32 format before storing them in FP8 format, and descale the values stored in FP8 format before performing computations on them. For more information, refer to the Transformer Engine FP8 primer.

The following notation is used in this section.

  • b - number of batches

  • h - number of heads

  • s - maximum length of sequences in a batch

  • d - embedding dimension size of a word in a sequence

Scaling and Descaling

In the context of FP8, scaling refers to multiplying each element of a FP32 tensor by a quantization factor.

The quantization factor is computed as: (Max representable value in the FP8 format) / (Max absolute value seen in the tensor).

For the E4M3 format, the quantization factor is 448.f/ tensor_amax (rounded to the nearest lower power of two).

For the E5M2 format, the quantization factor is 57344.f / tensor_amax (rounded to the nearest lower power of two).

The dequantization factor is the reciprocal of the quantization factor.

Scaling is meant to span the full range of the FP8 format when computing on and storing FP8 values, thereby minimizing precision loss. True values in FP32 format are multiplied by the quantization factor before being stored as scaled values in FP8 format. Results of computations on scaled FP8 values are descaled by multiplying with the dequantization factor to convert them back to their true values in FP32 format.
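
As a concrete illustration of the factors defined above, here is a minimal sketch (a hypothetical helper, not a cuDNN API) that computes the quantization and dequantization factors, including the round-down to the nearest power of two.

    #include <cmath>

    // Quantization factor = (max representable value in the FP8 format) / tensor_amax,
    // rounded to the nearest lower power of two. 448.f corresponds to E4M3,
    // 57344.f to E5M2. The dequantization factor is its reciprocal.
    float fp8QuantizationFactor(float tensorAmax, bool e4m3) {
        const float maxRepresentable = e4m3 ? 448.f : 57344.f;
        const float raw = maxRepresentable / tensorAmax;
        return std::exp2f(std::floor(std::log2f(raw)));  // round down to a power of two
    }

    float fp8DequantizationFactor(float tensorAmax, bool e4m3) {
        return 1.f / fp8QuantizationFactor(tensorAmax, e4m3);
    }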

Scaling and descaling are critical for convergence with the FP8 data type, hence cuDNN only supports graph patterns for FP8 fused attention with the scaling and descaling nodes present.

Unpadded Tensors

In fused flash attention, the length of different sequences in a batch can be different. The cuDNN operation graph supports an unpadded layout where all the sequences of different lengths in a batch are tightly packed. All the word embeddings after the useful length of the sequence are pushed towards the end of all sequences in the layout.

Forward Pass

The following figure shows the cuDNN operation graph for the fused attention forward pass. The same graph supports forward pass for both training and inference. The operation graph pattern is identified as training when M and Zinv tensors are non-virtual. When M and Zinv tensors are virtual, the operation graph pattern is identified as inference.

The FP8 tensors are expected to be scaled, and the matrix multiplication computation is performed on the FP8 tensors in the scaled format. All non matrix multiplication computations are performed in FP32 precision. The output of the FP8 matrix multiplication is converted to real values in FP32 format by multiplying with the descale values.

FP8 Fused Attention Forward Pass Operation Graph
FP8 Fused Attention Forward Pass Input Tensors (tensor name: data type, dimensions)

  • Q: E4M3, [b, h, s, d]

  • K: E4M3, [b, h, s, d]

  • V: E4M3, [b, h, s, d]

  • Attention Scale: FP32 (by value), [1, 1, 1, 1]

  • Descale Q: FP32, [1, 1, 1, 1]

  • Descale K: FP32, [1, 1, 1, 1]

  • Descale V: FP32, [1, 1, 1, 1]

  • Scale S: FP32, [1, 1, 1, 1]

  • Descale S: FP32, [1, 1, 1, 1]

  • Scale O: FP32, [1, 1, 1, 1]

  • RNG Seed: INT64, [1, 1, 1, 1]

  • RNG Offset: INT64, [1, 1, 1, 1]

  • Dropout Probability (p) or Keep Probability (1 - p): FP32, [1, 1, 1, 1]

FP8 Fused Attention Forward Pass Output Tensors (tensor name: data type, dimensions)

  • O: E4M3, [b, h, s, d]

  • Amax_O: FP32, [1, 1, 1, 1]

  • M: FP32 (training only), [b, h, s, 1]

  • Zinv: FP32 (training only), [b, h, s, 1]

FP8 Fused Attention Forward Pass Limitations

Q, K, and V tensors:

  • Required to be interleaved.

  • Packing layout is (b,s,3,h,d).

  • Must be in the E4M3 FP8 data type.

  • Can be in an unpadded tightly packed layout.

  • Either Q, K, V, and O can all be unpadded or none.

O tensor:

  • Must be in (b,s,h,d) layout.

  • Must be in the E4M3 format.

  • Can be in an unpadded tightly packed layout.

  • Either Q, K, V, and O can all be unpadded or none.

All virtual tensors: must be in FP32 format.

Compute precision: the compute precision of all operations must be FP32.

M, Zinv:

  • Must be in FP32 format.

  • Should not be in an unpadded tightly packed layout.

Attention Scale, Quantization (scale), and Dequantization (descale) tensors: must be in FP32 format.

Match between cuDNN API specifications and passed-in device tensors: there is an implicit dependency between the specifications made through the cuDNN API and the device tensors passed in at runtime. The execution of the operation graph is undefined when the passed-in device tensors do not conform to the API specifications.

Dropout probability for inference (p): the dropout probability for forward pass inference must be zero.

Embedding dimension size (d): only an embedding dimension size of 64 is supported.

Maximum sequence length (s): the maximum sequence length must be a multiple of 64, and only maximum sequence lengths up to 512 are supported (a minimal validation sketch follows this list).
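
The following is a minimal sketch of the size limits listed above. It is a hypothetical helper, not a cuDNN API, and covers only the embedding dimension and maximum sequence length rules.

    #include <cstdint>

    // Illustrative check of the FP8 fused attention forward-pass size limits:
    // embedding dimension d must be 64, and the maximum sequence length s must
    // be a multiple of 64 and no larger than 512.
    bool fp8FusedAttentionSizesOk(int64_t d, int64_t maxSeqLen) {
        return d == 64 && maxSeqLen % 64 == 0 && maxSeqLen <= 512;
    }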

Backward Pass

FP8 Fused Attention Backward Pass Operation Graph
FP8 Fused Attention Backward Pass Input Tensors (tensor name: data type, dimensions)

  • Q: E4M3, [b, h, s, d]

  • K: E4M3, [b, h, s, d]

  • V: E4M3, [b, h, s, d]

  • O: E4M3, [b, h, s, d]

  • dO: E5M2, [b, h, s, d]

  • M: FP32, [b, h, s, 1]

  • Zinv: FP32, [b, h, s, 1]

  • Attention Scale: FP32 (by value), [1, 1, 1, 1]

  • Descale Q: FP32, [1, 1, 1, 1]

  • Descale K: FP32, [1, 1, 1, 1]

  • Descale V: FP32, [1, 1, 1, 1]

  • Scale S: FP32, [1, 1, 1, 1]

  • Descale S: FP32, [1, 1, 1, 1]

  • Descale O: FP32, [1, 1, 1, 1]

  • Descale dO: FP32, [1, 1, 1, 1]

  • Scale dS: FP32, [1, 1, 1, 1]

  • Descale dS: FP32, [1, 1, 1, 1]

  • Scale dQ: FP32, [1, 1, 1, 1]

  • Scale dK: FP32, [1, 1, 1, 1]

  • Scale dV: FP32, [1, 1, 1, 1]

  • RNG Seed: INT64, [1, 1, 1, 1]

  • RNG Offset: INT64, [1, 1, 1, 1]

  • Dropout Probability (p) or Keep Probability (1 - p): FP32, [1, 1, 1, 1]

FP8 Fused Attention Backward Pass Output Tensors (tensor name: data type, dimensions)

  • dQ: E5M2, [b, h, s, d]

  • dK: E5M2, [b, h, s, d]

  • dV: E5M2, [b, h, s, d]

  • Amax_dQ: FP32, [1, 1, 1, 1]

  • Amax_dK: FP32, [1, 1, 1, 1]

  • Amax_dV: FP32, [1, 1, 1, 1]

  • Amax_dS: FP32 (the virtual tensor dS is of E5M2 type), [1, 1, 1, 1]

FP8 Fused Attention Backward Pass Limitations

Q, K, and V tensors:

  • Required to be interleaved.

  • Packing layout is (b,s,3,h,d).

  • Must be in the E4M3 FP8 format.

  • Can be in an unpadded tightly packed layout.

  • Either Q, K, V, O, dO, dQ, dK, and dV can all be unpadded or none.

dQ, dK, and dV tensors:

  • Required to be interleaved.

  • Packing layout is (b,s,3,h,d).

  • Must be in the E5M2 FP8 format.

  • Can be in an unpadded tightly packed layout.

  • Either Q, K, V, O, dO, dQ, dK, and dV can all be unpadded or none.

O and dO tensor:

  • Must be in (b,s,h,d) layout.

  • Must be in the E4M3 format.

  • Can be in an unpadded tightly packed layout.

  • Either Q, K, V, O, dO, dQ, dK, and dV can all be unpadded or none.

All virtual tensors: must be in FP32 format.

Compute precision: the compute precision of all operations must be FP32.

M, Zinv:

  • Must be in FP32 format.

  • Should not be in an unpadded tightly packed layout.

Attention Scale, Quantization (scale), and Dequantization (descale) tensors: must be in FP32 format.

Match between cuDNN API specifications and passed-in device tensors: there is an implicit dependency between the specifications made through the cuDNN API and the device tensors passed in at runtime. The execution of the operation graph is undefined when the passed-in device tensors do not conform to the API specifications.

Embedding dimension size (d): only an embedding dimension size of 64 is supported.

Maximum sequence length (s): the maximum sequence length must be a multiple of 64, and only maximum sequence lengths up to 512 are supported.

Mapping with Backend Descriptors

For readability, the operations used in this section are abbreviated. The mapping with the actual backend descriptors can be found in this table:

Notations and Backend Descriptors (each notation used in this section is followed by the corresponding backend descriptor)

Pointwise:scale

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_MUL and with operand B broadcasting into operand X

Pointwise:bias

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_ADD and with operand B broadcasting into operand X

Pointwise:add

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_ADD and with operand B with same dimensions as X

Pointwise:mul

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_MUL and with operand B with same dimensions as X

Pointwise:ReLU

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_RELU_FWD

Pointwise:ReLUBwd

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_RELU_BWD

Pointwise:tanh

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_TANH_FWD

Pointwise:sigmoid

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_SIGMOID_FWD

Pointwise:ELU

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with mode CUDNN_POINTWISE_ELU_FWD

Pointwise:{ReLU,tanh,sigmoid,ELU}

CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR with one of the following modes: CUDNN_POINTWISE_RELU_FWD, CUDNN_POINTWISE_TANH_FWD, CUDNN_POINTWISE_SIGMOID_FWD, CUDNN_POINTWISE_ELU_FWD

MatMul

CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR

ConvolutionFwd

CUDNN_BACKEND_OPERATION_CONVOLUTION_FORWARD_DESCRIPTOR

ConvolutionBwFilter

CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_FILTER_DESCRIPTOR

ConvolutionBwData

CUDNN_BACKEND_OPERATION_CONVOLUTION_BACKWARD_DATA_DESCRIPTOR

GenStats

CUDNN_BACKEND_OPERATION_GEN_STATS_DESCRIPTOR

ResampleFwd

CUDNN_BACKEND_OPERATION_RESAMPLE_FWD_DESCRIPTOR

Reduction

CUDNN_BACKEND_OPERATION_REDUCTION_DESCRIPTOR

BnBwdWeight

CUDNN_BACKEND_OPERATION_BN_BWD_WEIGHTS_DESCRIPTOR

NormForward

CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR

NormBackward

CUDNN_BACKEND_OPERATION_NORM_BACKWARD_DESCRIPTOR

BOOLEAN / packed-BOOLEAN

CUDNN_DATA_BOOLEAN: As described in the cuDNN API Reference, this type implies that eight boolean values are packed in a single byte, with the lowest index on the right (that is, least significant bit). packed-BOOLEAN and BOOLEAN are used interchangeably, where the former is used to emphasize and remind the user about the semantics.

INT8

CUDNN_DATA_INT8

FP8

CUDNN_DATA_FP8_E4M3 or CUDNN_DATA_FP8_E5M2

FP16

CUDNN_DATA_HALF

BF16

CUDNN_DATA_BFLOAT16

FP32

CUDNN_DATA_FLOAT

TF32

A tensor core operation mode used to accelerate floating point convolutions or MatMuls. It can be used for operations with compute type CUDNN_DATA_FLOAT on NVIDIA Ampere architecture or later, and can be disabled by setting the environment variable NVIDIA_TF32_OVERRIDE=0.
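
As an aside, the packed-BOOLEAN layout described in the table above (eight boolean values per byte, with the lowest index stored in the least significant bit) can be illustrated with a small sketch; the helper below is hypothetical and not part of cuDNN.

    #include <cstddef>
    #include <cstdint>

    // Illustration of the packed-BOOLEAN layout: eight booleans per byte,
    // lowest index in the least significant bit.
    bool getPackedBoolean(const uint8_t* data, std::size_t index) {
        return (data[index / 8] >> (index % 8)) & 1u;
    }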