Layer Fusion Catalog#

This page catalogs the fusion patterns TensorRT applies at build time. For an overview of why fusion matters and how to inspect fusion decisions in logs, see Enabling Layer Fusion.

Layer Fusion#

TensorRT attempts to perform many different types of optimizations in a network during the build phase. In the first phase, layers are fused whenever possible. Fusions transform the network into a simpler form but preserve the same overall behavior. Internally, many layer implementations have extra parameters and options that are not directly accessible when creating the network. Instead, the fusion optimization step detects supported patterns of operations and fuses multiple layers into one layer with the corresponding internal options set.

Consider the common case of a convolution followed by ReLU activation. Creating a network with these operations involves adding a Convolution layer with addConvolutionNd and following it with an Activation layer using addActivation with an ActivationType of kRELU. The unoptimized graph will contain separate layers for convolution and activation. The internal implementation of convolution supports computing the ReLU function on the output in one step directly from the convolution kernel without requiring a second kernel call. The fusion optimization step will detect the convolution followed by ReLU, verify that the implementation supports the operations, and then fuse them into one layer.

To investigate which fusions have occurred, the builder logs its operations to the logger object provided during construction. Optimization steps are at the kINFO log level. To view these messages, ensure you log them in the ILogger callback.

Fusions are normally handled by creating a new layer with a name containing the names of both of the layers that were fused. For example, a MatrixMultiply layer (InnerProduct) named ip1 is fused with a ReLU Activation layer named relu1 to create a new layer named ip1 + relu1.

Types of Fusions#

The following table summarizes the main fusion patterns. Refer to the lists below for full descriptions and constraints.

| Fusion | Pattern | Key constraints |
|---|---|---|
| ReLU Activation | ReLU → ReLU | Both activations must be ReLU |
| Convolution + ReLU | Conv → ReLU | Any conv type; activation must be ReLU |
| Convolution + GELU | Conv → GELU | FP16 or INT8 I/O; Turing+ GPU |
| Convolution + Clip | Conv → Clip | Any conv type; activation must be Clip |
| Scale + Activation | Scale → Activation | Fused into single activation |
| Convolution + ElementWise | Conv → sum/min/max ElementWise | No broadcasting unless across batch size |
| Padding + Convolution | Padding → Conv/Deconv | Non-negative padding only |
| Shuffle + Reduce | Shuffle (permute only) → Reduce | keepDimensions required on Reduce |
| Shuffle + Shuffle | Shuffle → Shuffle | Reshape fusion only when transpose inverses match |
| Convolution + Scale | Conv → kUNIFORM/kCHANNEL Scale | Disabled if scale has non-constant power |
| Convolution + Generic Activation | Conv → pointwise activation | After pointwise fusion pass |
| Reduce → Pooling | Reduce (kAVG over H and W, CHW input, keepDimensions) | Reduce is replaced by a Pooling layer |
| Convolution + Pooling | Conv (+ optional fused act) → Pooling | Same precision on both layers |
| Depthwise Separable Convolution | Depthwise conv+act → conv+act | INT8 only; compute capability ≥ 7.2 |
| Softmax + Log / TopK | Softmax → Log or TopK | Optional Log fused into Softmax+TopK |
| GELU / L1Norm / L2Norm / LogSum / LogSumExp | Unary + ElementWise + Reduce chains | See Supported Reduction Operation Fusions below |

The following list describes the types of supported fusions in detail.

Supported Layer Fusions
  • ReLU Activation: An Activation layer performing ReLU followed by another Activation layer performing ReLU will be replaced by a single Activation layer.

  • Convolution and ReLU Activation: The Convolution layer can be of any type, and values are not restricted. The Activation layer must be of the ReLU type.

  • Convolution and GELU Activation: The input and output precision should be the same, with both of them FP16 or INT8. The Activation layer must be GELU type. Requires an NVIDIA Turing or later architecture and a supported CUDA toolkit (refer to Support Matrix for the TensorRT 11.0 minimum CUDA version).

  • Convolution and Clip Activation: The Convolution layer can be any type, and values are not restricted. The Activation layer must be Clip type.

  • Scale and Activation: The Scale layer, followed by an Activation layer, can be fused into a single Activation layer.

  • Convolution and ElementWise Operation: A Convolution layer followed by a simple sum, min, or max in an ElementWise layer can be fused into the Convolution layer. The sum must not use broadcasting unless the broadcasting is across the batch size.

  • Padding and Convolution/Deconvolution: If all the padding sizes are non-negative, padding followed by a Convolution or Deconvolution can be fused into a single Convolution/Deconvolution layer.

  • Shuffle and Reduce: A Shuffle layer without reshaping, followed by a Reduce layer, can be fused into a single Reduce layer. The Shuffle layer can perform permutations but cannot perform any reshape operation. The Reduce layer must have keepDimensions set.

  • Shuffle and Shuffle: Each Shuffle layer consists of a transpose, a reshape, and a second transpose. A Shuffle layer followed by another can be replaced by a single Shuffle (or nothing). If both Shuffle layers perform reshape operations, this fusion is only allowed if the second transpose of the first shuffle is the inverse of the first transpose of the second shuffle.

  • Scale: A Scale layer that adds 0, multiplies by 1, or raises to the power of 1 can be erased.

  • Convolution and Scale: A Convolution layer followed by a Scale layer that is kUNIFORM or kCHANNEL can be fused into a single Convolution layer by adjusting the convolution weights. This fusion is disabled if the scale has a non-constant power parameter.

  • Convolution and Generic Activation: This fusion happens after the pointwise fusion described in the Pointwise Fusion section. A Pointwise layer with one input and one output can be called a generic activation layer. A Convolution layer followed by a generic activation layer can be fused into a single Convolution layer.

  • Reduce: A Reduce layer that performs average pooling will be replaced by a Pooling layer. The Reduce layer must have keepDimensions set and must reduce across the H and W dimensions of a CHW-format input (before batching) using the kAVG operation.

  • Convolution and Pooling: The Convolution and Pooling layers must have the same precision. The Convolution layer can already have a fused activation operation from a previous fusion.

  • Depthwise Separable Convolution: A depthwise convolution with activation followed by a convolution with activation can sometimes be fused into a single optimized DepSepConvolution layer. The precision of both convolutions must be INT8, and the device’s compute capability must be 7.2 or later.

  • Softmax and Log: A Softmax layer followed by a Log operation can be fused into a single Softmax layer, provided the Softmax has not already been fused with a previous Log operation.

  • Softmax and TopK: A Softmax layer followed by a TopK layer can be fused into a single layer. The Softmax can optionally include a Log operation.
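
The weight adjustment behind the Convolution and Scale fusion in the list above can be checked numerically. The following is a minimal sketch in plain Python, not TensorRT code. It assumes a 1x1 convolution evaluated at a single pixel, a per-channel (kCHANNEL-style) scale and shift, and a power of 1. The helper names conv1x1, scale_channel, and fold_scale_into_conv are hypothetical and exist only for this illustration.

```python
def conv1x1(x, weights, bias):
    """1x1 convolution at one pixel; x holds one value per input channel."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def scale_channel(y, scale, shift):
    """Per-channel scale with power fixed at 1: y[c] * scale[c] + shift[c]."""
    return [v * s + t for v, s, t in zip(y, scale, shift)]


def fold_scale_into_conv(weights, bias, scale, shift):
    """Return (weights, bias) such that the convolution alone reproduces
    the convolution followed by the per-channel scale."""
    new_w = [[w * s for w in row] for row, s in zip(weights, scale)]
    new_b = [b * s + t for b, s, t in zip(bias, scale, shift)]
    return new_w, new_b


# Illustrative values only: 3 input channels, 2 output channels.
x = [1.0, -2.0, 0.5]
weights = [[0.2, 0.4, -1.0], [1.5, 0.0, 2.0]]
bias = [0.1, -0.3]
scale = [2.0, -0.5]
shift = [1.0, 0.25]

unfused = scale_channel(conv1x1(x, weights, bias), scale, shift)
fused_w, fused_b = fold_scale_into_conv(weights, bias, scale, shift)
fused = conv1x1(x, fused_w, fused_b)

# The two paths should agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(unfused, fused))
print(max_diff)
```

Because the scale is an affine function of each output channel when the power is 1, it can be absorbed into that channel's weights and bias, which is why no separate Scale layer is needed after folding.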

Supported Reduction Operation Fusions
  • GELU: A group of Unary and ElementWise layers representing the following equations can be fused into a single GELU reduction operation.

    \(0.5x \times \left( 1+\tanh\left( \sqrt{\frac{2}{\pi}}\left( x+0.044715x^{3} \right) \right) \right)\)

    Or the alternative representation:

    \(0.5x \times \left( 1+\operatorname{erf}\left( \frac{x}{\sqrt{2}} \right) \right)\)

  • L1Norm: A Unary layer kABS operation followed by a Reduce layer kSUM operation can be fused into a single L1Norm reduction operation.

  • Sum of Squares: A product ElementWise layer with the same input (square operation) followed by a kSUM reduction can be fused into a single square sum reduction operation.

  • L2Norm: A sum of squares operation followed by a kSQRT UnaryOperation can be fused into a single L2Norm reduction operation.

  • LogSum: A Reduce layer kSUM followed by a kLOG UnaryOperation can be fused into a single LogSum reduction operation.

  • LogSumExp: A Unary layer kEXP operation followed by a LogSum fusion can be fused into a single LogSumExp reduction operation.
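
To make the layer chains in the list above concrete, the following sketch writes each pattern as the unfused sequence of operations it stands for, using only the Python standard library. It illustrates the arithmetic and is not TensorRT code; all function names are hypothetical. The tanh form of GELU here uses the coefficient sqrt(2/π) from the commonly published approximation, which is an assumption about the exact constant the pattern matcher expects.

```python
import math


def l1_norm(xs):
    # Unary kABS followed by Reduce kSUM.
    return sum(abs(x) for x in xs)


def sum_of_squares(xs):
    # ElementWise product of the input with itself, then Reduce kSUM.
    return sum(x * x for x in xs)


def l2_norm(xs):
    # Sum of squares followed by Unary kSQRT.
    return math.sqrt(sum_of_squares(xs))


def log_sum(xs):
    # Reduce kSUM followed by Unary kLOG.
    return math.log(sum(xs))


def log_sum_exp(xs):
    # Unary kEXP followed by the LogSum chain.
    return log_sum([math.exp(x) for x in xs])


def gelu_erf(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


values = [3.0, -4.0]
print(l1_norm(values), sum_of_squares(values), l2_norm(values))

# Largest gap between the two GELU forms on a grid over [-5, 5].
grid = [i / 10.0 for i in range(-50, 51)]
gelu_gap = max(abs(gelu_erf(x) - gelu_tanh(x)) for x in grid)
print(gelu_gap)
```

The two GELU forms are close but not identical, so a network exported with one form is numerically slightly different from one exported with the other.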

Pointwise Fusion#

Multiple adjacent Pointwise layers can be fused into a single Pointwise layer to improve performance.

The following types of Pointwise layers are supported, with some limitations:

  • Activation: Every ActivationType is supported.

  • Constant: Only constants with a single value (size == 1) are supported.

  • ElementWise: Every ElementWiseOperation is supported.

  • Pointwise: Pointwise itself is also a Pointwise layer.

  • Scale: Only ScaleMode::kUNIFORM is supported.

  • Unary: Every UnaryOperation is supported.

The size of a fused Pointwise layer is limited, so some Pointwise layers may not be fused.
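
Conceptually, pointwise fusion replaces several element-by-element passes over a tensor with a single pass that applies all the operations to each element in turn. The sketch below models that idea in plain Python under simplifying assumptions: tensors are flat lists, each layer is a one-argument function, and fuse_pointwise and run_layer are hypothetical names rather than TensorRT APIs.

```python
def fuse_pointwise(*ops):
    """Compose per-element functions into one function, applied in order."""
    def fused(value):
        for op in ops:
            value = op(value)
        return value
    return fused


def run_layer(op, tensor):
    """Model one pointwise layer as a full pass over the tensor."""
    return [op(v) for v in tensor]


def add_one(v):
    return v + 1.0


def relu(v):
    return max(v, 0.0)


data = [-3.0, -1.0, 0.5, 2.0]

# Unfused: two passes, with an intermediate tensor between them.
unfused = run_layer(relu, run_layer(add_one, data))

# Fused: one pass, no intermediate tensor.
fused = run_layer(fuse_pointwise(add_one, relu), data)

print(unfused, fused)
```

Both paths produce the same values; the fused path avoids materializing the intermediate result.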

Fusion creates a new layer with a name consisting of the names of both fused layers. For example, an ElementWise layer named add1 is fused with a ReLU Activation layer named relu1, creating a new layer named fusedPointwiseNode(add1, relu1).
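
When reading builder logs, it can help to map a fused layer name back to the original layer names. The following sketch handles only the two naming shapes shown on this page, ip1 + relu1 and fusedPointwiseNode(add1, relu1). Treating these as the only shapes is an assumption, the function name split_fused_name is hypothetical, and original layer names that themselves contain " + " or a comma would be split incorrectly.

```python
import re


def split_fused_name(name):
    """Return the original layer names encoded in a fused layer name.

    Handles "a + b" and "fusedPointwiseNode(a, b)"; any other name is
    returned unchanged as a single-element list.
    """
    match = re.fullmatch(r"fusedPointwiseNode\((.*)\)", name)
    if match:
        return [part.strip() for part in match.group(1).split(",")]
    return [part.strip() for part in name.split(" + ")]


print(split_fused_name("ip1 + relu1"))
print(split_fused_name("fusedPointwiseNode(add1, relu1)"))
print(split_fused_name("conv1"))
```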

Q/DQ Fusion#

Refer to the Explicit Quantization section for suggestions on optimizing INT8 and FP8 networks containing QuantizeLinear and DequantizeLinear layers.