TensorRT Release 7.x.x

TensorRT Release 7.0.0

These are the TensorRT 7.0.0 release notes for Linux and Windows users. This release includes fixes from the previous TensorRT 6.0.1 release as well as the following additional changes. These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

For previous TensorRT release notes, see the TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.
Working with loops

TensorRT supports loop-like constructs, which can be useful for recurrent networks. TensorRT loops support scanning over input tensors, recurrent definitions of tensors, and both “scan outputs” and “last value” outputs. For more information, see Working With Loops in the TensorRT Developer Guide.
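
The following sketch (C++; illustrative only) shows how these pieces fit together. It assumes network is an existing INetworkDefinition*, data is an ITensor* of shape [T, N] to be scanned along axis 0, zeros is an [N]-shaped constant of zeros, and tripCount is a 0-D INT32 tensor holding T; all of these names are hypothetical.

    #include "NvInfer.h"
    using namespace nvinfer1;

    // Sum a [T, N] tensor over its first axis with a TensorRT loop.
    ITensor* sumOverAxis0(INetworkDefinition* network, ITensor* data,
                          ITensor* zeros, ITensor* tripCount)
    {
        ILoop* loop = network->addLoop();

        // Run the loop body T times.
        loop->addTripLimit(*tripCount, TripLimit::kCOUNT);

        // The iterator yields one [N] slice of data per iteration (axis 0).
        ITensor* slice = loop->addIterator(*data)->getOutput(0);

        // Recurrent accumulator, initialized to zeros.
        IRecurrenceLayer* acc = loop->addRecurrence(*zeros);
        ITensor* updated = network->addElementWise(
            *acc->getOutput(0), *slice, ElementWiseOperation::kSUM)->getOutput(0);
        acc->setInput(1, *updated); // feed the new value back into the recurrence

        // "Last value" output: the accumulator after the final iteration.
        return loop->addLoopOutput(*acc->getOutput(0),
                                   LoopOutput::kLAST_VALUE)->getOutput(0);
    }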

ONNX parser with dynamic shapes support

The ONNX parser supports full-dimensions mode only. Your network definition must be created with the explicitBatch flag set. For more information, see Importing An ONNX Model Using The C++ Parser API and Working With Dynamic Shapes in the TensorRT Developer Guide.
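
As a minimal sketch (gLogger, the file name model.onnx, and the "input" tensor name and dimensions are all placeholders):

    #include "NvInfer.h"
    #include "NvOnnxParser.h"
    using namespace nvinfer1;

    IBuilder* builder = createInferBuilder(gLogger);

    // Networks destined for the ONNX parser must use explicit-batch mode.
    const uint32_t explicitBatch =
        1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    INetworkDefinition* network = builder->createNetworkV2(explicitBatch);

    auto parser = nvonnxparser::createParser(*network, gLogger);
    parser->parseFromFile("model.onnx",
                          static_cast<int>(ILogger::Severity::kWARNING));

    // For dynamic input shapes, supply an optimization profile at build time.
    IBuilderConfig* config = builder->createBuilderConfig();
    IOptimizationProfile* profile = builder->createOptimizationProfile();
    profile->setDimensions("input", OptProfileSelector::kMIN, Dims4{1, 3, 224, 224});
    profile->setDimensions("input", OptProfileSelector::kOPT, Dims4{8, 3, 224, 224});
    profile->setDimensions("input", OptProfileSelector::kMAX, Dims4{32, 3, 224, 224});
    config->addOptimizationProfile(profile);
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);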

TensorRT container with OSS

The TensorRT monthly container release now contains pre-built binaries from the TensorRT Open Source Repository. For more information, refer to the TensorRT Container Release Notes, published monthly starting with the 19.12 release.

BERT INT8 and mixed precision optimizations

Some GEMM layers are now followed by GELU activation in the BERT model. Since TensorRT doesn't have IMMA GEMM layers, you can implement those GEMM layers in the BERT network with either IConvolutionLayer or IFullyConnectedLayer, depending on the precision you require. For example, you can leverage IConvolutionLayer with H == W == 1 (CONV1x1) to implement a FullyConnected operation and leverage IMMA math under INT8 mode. TensorRT supports the fusion of Convolution/FullyConnected and GELU. For more information, refer to the TensorRT Best Practices Guide and Adding Custom Layers Using The C++ API in the TensorRT Developer Guide.
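
A minimal sketch of the CONV1x1 substitution (not a drop-in recipe): the helper below assumes input already has shape [N, C, 1, 1] and that kernelWeights and biasWeights hold the GEMM weights; all names are illustrative.

    #include "NvInfer.h"
    using namespace nvinfer1;

    // Implement a FullyConnected (GEMM) operation as a 1x1 convolution so that
    // INT8 IMMA kernels can be selected.
    ITensor* fcAsConv1x1(INetworkDefinition* network, ITensor* input,
                         int outputChannels, Weights kernelWeights,
                         Weights biasWeights)
    {
        IConvolutionLayer* conv = network->addConvolutionNd(
            *input, outputChannels, DimsHW{1, 1}, kernelWeights, biasWeights);
        return conv->getOutput(0);
    }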

Working with Quantized Networks

TensorRT now supports quantized models trained with Quantization Aware Training. Support is limited to symmetrically quantized models, meaning zero_point = 0, expressed with the QuantizeLinear and DequantizeLinear ONNX ops. For more information, see Working With Quantized Networks in the TensorRT Developer Guide and QDQ Fusions in the Best Practices For TensorRT Performance Guide.
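
A rough sketch, assuming (per Working With Quantized Networks) that the network is created in explicit-precision mode; builder, gLogger, and the file name qat_model.onnx are placeholders.

    #include "NvInfer.h"
    #include "NvOnnxParser.h"
    using namespace nvinfer1;

    // Q/DQ models are imported through the ONNX parser with both the explicit
    // batch and explicit precision creation flags set.
    const uint32_t flags =
        (1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)) |
        (1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_PRECISION));
    INetworkDefinition* network = builder->createNetworkV2(flags);
    auto parser = nvonnxparser::createParser(*network, gLogger);
    parser->parseFromFile("qat_model.onnx",
                          static_cast<int>(ILogger::Severity::kWARNING));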

New layers
IFillLayer

The IFillLayer is used to generate an output tensor with the specified mode. For more information, see the C++ class IFillLayer or the Python class IFillLayer.
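
For example, a linspace fill that produces the sequence 0, 1, ..., 9 (a sketch; network is assumed to exist):

    // Generate a 1-D tensor of 10 elements: start at alpha, step by beta.
    Dims d{};
    d.nbDims = 1;
    d.d[0] = 10;
    IFillLayer* fill = network->addFill(d, FillOperation::kLINSPACE);
    fill->setAlpha(0.0); // start value
    fill->setBeta(1.0);  // per-element delta
    ITensor* sequence = fill->getOutput(0);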

IIteratorLayer

The IIteratorLayer enables a loop to iterate over a tensor. A loop is defined by loop boundary layers. For more information, see the C++ class IIteratorLayer or the Python class IIteratorLayer and Working With Loops in the TensorRT Developer Guide.

ILoopBoundaryLayer

Class ILoopBoundaryLayer defines a virtual method getLoop() that returns a pointer to the associated ILoop. For more information, see the C++ class ILoopBoundaryLayer or the Python class ILoopBoundaryLayer and Working With Loops in the TensorRT Developer Guide.

ILoopOutputLayer

The ILoopOutputLayer specifies an output from the loop. For more information, see the C++ class ILoopOutputLayer or the Python class ILoopOutputLayer and Working With Loops in the TensorRT Developer Guide.

IParametricReluLayer

The IParametricReluLayer represents a parametric ReLU operation, that is, a leaky ReLU where the slopes for x < 0 can differ for each element. For more information, see the C++ class IParametricReluLayer or the Python class IParametricReluLayer.
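
For example (a sketch; network, input of shape [N, C, H, W], and a broadcastable slopes tensor of shape [1, C, 1, 1] are assumed to exist):

    // out = x for x >= 0, slope * x for x < 0, with per-channel slopes.
    auto* prelu = network->addParametricReLU(*input, *slopes);
    ITensor* out = prelu->getOutput(0);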

IRecurrenceLayer

The IRecurrenceLayer specifies a recurrent definition. For more information, see the C++ class IRecurrenceLayer or the Python class IRecurrenceLayer and Working With Loops in the TensorRT Developer Guide.

ISelectLayer

The ISelectLayer returns one of two input tensors element-wise, depending on a condition tensor. For more information, see the C++ class ISelectLayer or the Python class ISelectLayer.
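
For example (a sketch; condition is a boolean tensor and thenTensor/elseTensor are tensors of matching shape, all assumed to exist):

    // Element-wise: out[i] = condition[i] ? thenTensor[i] : elseTensor[i]
    ISelectLayer* select = network->addSelect(*condition, *thenTensor, *elseTensor);
    ITensor* out = select->getOutput(0);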

ITripLimitLayer

The ITripLimitLayer specifies how many times the loop iterates. For more information, see the C++ class ITripLimitLayer or the Python class ITripLimitLayer and Working With Loops in the TensorRT Developer Guide.
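
In addition to counted loops (TripLimit::kCOUNT, shown in the loop sketch above), a while-style limit is available. A minimal sketch, assuming loop is an existing ILoop* and keepGoing is a 0-D boolean tensor, typically the output of an IRecurrenceLayer updated inside the loop body:

    // Iterate for as long as keepGoing evaluates to true.
    loop->addTripLimit(*keepGoing, TripLimit::kWHILE);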

New operations

ONNX: Added ConstantOfShape, DequantizeLinear, Equal, Erf, Expand, Greater, GRU, Less, Loop, LRN, LSTM, Not, PRelu, QuantizeLinear, RandomUniform, RandomUniformLike, Range, RNN, Scan, Sqrt, Tile, and Where.

For more information, see the full list of Supported Ops in the Support Matrix.

Boolean tensor support

TensorRT supports boolean tensors, which can be marked as network inputs and outputs. IElementWiseLayer, IUnaryLayer (only kNOT), IShuffleLayer, ITripLimitLayer (only kWHILE), and ISelectLayer support the boolean datatype. Boolean tensors can be used only with FP32 and FP16 precision networks. For more information, refer to the Layers section in the TensorRT Developer Guide.
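
A minimal sketch of producing and consuming boolean tensors, assuming a and b are FP32 tensors of the same shape in an existing network:

    // Comparison layers produce DataType::kBOOL outputs.
    ITensor* isGreater = network->addElementWise(
        *a, *b, ElementWiseOperation::kGREATER)->getOutput(0);

    // Boolean NOT via IUnaryLayer.
    ITensor* notGreater = network->addUnary(
        *isGreater, UnaryOperation::kNOT)->getOutput(0);

    // Boolean tensors drive ISelectLayer: pick b where a <= b, else a.
    ITensor* maxLike = network->addSelect(*notGreater, *b, *a)->getOutput(0);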

Compatibility

Limitations

  • UFF samples, such as sampleUffMNIST, sampleUffSSD, sampleUffPluginV2Ext, sampleUffMaskRCNN, sampleUffFasterRCNN, uff_custom_plugin, and uff_ssd, support TensorFlow 1.x and not models trained with TensorFlow 2.0.

  • Loops and DataType::kBOOL are supported on limited platforms. On platforms without loop support, INetworkDefinition::addLoop returns nullptr. Attempting to build an engine using operations that consume or produce DataType::kBOOL on a platform without support results in validation rejecting the network. For details on which platforms support loops, refer to the Features For Platforms And Software section in the TensorRT Support Matrix.

  • Explicit precision networks with quantized and de-quantized nodes are only supported on devices with hardware INT8 support. Running on devices without hardware INT8 support results in undefined behavior.

Deprecated Features

The following features are deprecated in TensorRT 7.0.0:
  • Backward Compatibility and Deprecation Policy - When a new function, for example foo, is first introduced, there is no explicit version in the name and the version is assumed to be 1. When changing the API of an existing TensorRT function foo (usually to support some new functionality), a new routine fooV<N> is first created, where N represents the Nth version of the function, and the previous version fooV<N-1> remains untouched to ensure backward compatibility. At this point, fooV<N-1> is considered deprecated and should be treated as such by users of TensorRT. For example, createNetworkV2 supersedes the original createNetwork.

    Starting with TensorRT 7, we will be eliminating deprecated APIs according to the following policy.
    • APIs already marked deprecated prior to TensorRT 7 (that is, in TensorRT 6 and older) will be removed in TensorRT 8, the next major release.
    • APIs deprecated in TensorRT <M>, where M is a major version greater than or equal to 7, will be removed in TensorRT <M+2>. This means that deprecated APIs remain functional for two major releases before they are removed.
  • Deprecation of Caffe Parser and UFF Parser - We are deprecating the Caffe Parser and UFF Parser in TensorRT 7. They will be tested and remain functional in the next major release, TensorRT 8, but we plan to remove support in the subsequent major release. Plan to migrate your workflow to tf2onnx, keras2onnx, or TensorFlow-TensorRT (TF-TRT) for deployment.

Fixed Issues

  • You no longer have to build ONNX and TensorFlow from source to work around pybind11 compatibility issues. The TensorRT Python bindings are now built using pybind11 version 2.4.3.

  • Windows users are now able to build applications designed to use the TensorRT refittable engine feature. The issue related to unresolved symbols has been resolved.

  • A virtual destructor has been added to the IPluginFactory class.

Known Issues

  • The UFF parser generates unused IConstantLayer objects that are visible via the method INetworkDefinition::getLayer but optimized away by TensorRT, so an attempt to refit those weights with IRefitter::setWeights will be rejected. Given an IConstantLayer* layer, you can detect whether it is used for execution by checking layer->getOutput(0)->isExecutionTensor().

  • The ONNX parser does not support RNN, LSTM, and GRU nodes when the activation type of the forward pass does not match the activation type of the reverse pass in bidirectional cases.

  • The INT8 calibration does not work with dynamic shapes. To work around this issue, ensure there are two passes in the code, as sketched below:
    1. Build the engine with a fixed-shape input in the first pass, which allows TensorRT to generate the calibration cache.
    2. Create the engine again using the dynamic-shape input; the builder will reuse the calibration cache generated in the first pass.
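
A sketch of the two passes, assuming calibrator is an IInt8Calibrator whose writeCalibrationCache and readCalibrationCache methods persist the cache to a file, fixedShapeNetwork and dynamicShapeNetwork are the two network definitions, and profile holds the min/opt/max dimensions; all names are illustrative.

    // Pass 1: fixed input shapes, so calibration can run and write its cache.
    IBuilderConfig* config1 = builder->createBuilderConfig();
    config1->setFlag(BuilderFlag::kINT8);
    config1->setInt8Calibrator(calibrator);
    ICudaEngine* calibrationEngine =
        builder->buildEngineWithConfig(*fixedShapeNetwork, *config1);

    // Pass 2: dynamic shapes; the calibrator reads the cache written in pass 1.
    IBuilderConfig* config2 = builder->createBuilderConfig();
    config2->setFlag(BuilderFlag::kINT8);
    config2->setInt8Calibrator(calibrator);
    config2->addOptimizationProfile(profile); // min/opt/max dims for dynamic inputs
    ICudaEngine* engine =
        builder->buildEngineWithConfig(*dynamicShapeNetwork, *config2);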