TensorRT Release 7.x.x

TensorRT Release 7.1.3

Attention:

These are the TensorRT 7.1.3 GA release notes. For production use of TensorRT, we recommend using the TensorRT 7.1.3 build for CUDA 10.2. The CUDA 11.0 RC build is a Preview release for early testing and feedback on NVIDIA A100. This release is subject to change based on ongoing performance tuning and functional testing. For feedback, submit a bug on the NVIDIA Developer website.

These release notes are applicable to JetPack users of TensorRT unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 7.x.x release as well as the following additional changes. For previous TensorRT documentation, see the TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.

Working with empty tensors

TensorRT supports empty tensors. A tensor is an empty tensor if it has one or more dimensions with length zero. Zero-length dimensions usually get no special treatment. If a rule works for a dimension of length L for an arbitrary positive value of L, it usually works for L=0 too. For more information, see Working With Empty Tensors in the TensorRT Developer Guide.

Builder layer timing cache

The layer timing cache will cache the layer profiling information during the builder phase. If there are other layers with the same input/output tensor configuration and layer params, then the TensorRT builder will skip profiling and reuse the cached result for the repeated layers. Models with many repeated layers (for example, BERT, WaveGlow, etc...) will see a significant speedup in builder time. The builder flag kDISABLE_TIMING_CACHE can be set if you want to disable this feature. For more information, see Builder Layer Timing Cache in the TensorRT Developer Guide and Initializing The Engine in the Best Practices For TensorRT Performance.
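
The cache is enabled by default; a minimal sketch of turning it off, assuming an existing nvinfer1::IBuilderConfig* named config created from the builder, looks like this:

    // Disable the layer timing cache for this build.
    // `config` is assumed to be a valid IBuilderConfig obtained from the builder.
    config->setFlag(nvinfer1::BuilderFlag::kDISABLE_TIMING_CACHE);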

Pointwise fusion based on code generation

Pointwise fusion was introduced in TensorRT 6.0.1 to fuse multiple adjacent pointwise layers into one single layer. In this release, its implementation has been updated to use code generation and runtime compilation to further improve performance. The code generation and runtime compilation happen during execution plan building. For more information, see the TensorRT Best Practices Guide.

Dilation support for deconvolution

IDeconvolutionLayer now supports a dilation parameter. This is accessible through the C++ API, Python API, and the ONNX parser (see ConvTranspose). For more information, see IDeconvolutionLayer in the TensorRT Developer Guide.

Selecting FP16 and INT8 kernels

TensorRT supports Mixed Precision Inference with FP32, FP16, or INT8 as supported precisions. Depending on the hardware support, you can choose to enable any of these precisions to accelerate inference. You can also run trtexec with the “--best” option, which enables all supported precisions for inference, resulting in the best performance. For more information, see Mixed Precision in the Best Practices For TensorRT Performance.
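
For builds done through the C++ API rather than trtexec, a minimal sketch, assuming an existing nvinfer1::IBuilderConfig* named config and hardware that supports these precisions, is:

    // Allow TensorRT to choose FP16 and INT8 kernels in addition to FP32.
    config->setFlag(nvinfer1::BuilderFlag::kFP16);
    // INT8 additionally requires a calibrator or explicitly set dynamic ranges.
    config->setFlag(nvinfer1::BuilderFlag::kINT8);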

Calibration with dynamic shapes

INT8 calibration with dynamic shapes supports the same functionality as a standard INT8 calibrator but for networks with dynamic shapes. You will need to provide a calibration optimization profile that would be used to set the dimensions for calibration. If a calibration optimization profile is not set, the first network optimization profile will be used as a calibration optimization profile. For more information, see INT8 Calibration With Dynamic Shapes in the TensorRT Developer Guide.
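
A minimal sketch of setting a calibration profile, assuming an existing builder and config and an input tensor named "input" with the illustrative dimensions below, is:

    // Create an optimization profile whose dimensions are used during INT8 calibration.
    nvinfer1::IOptimizationProfile* calibProfile = builder->createOptimizationProfile();
    calibProfile->setDimensions("input", nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4{1, 3, 224, 224});
    calibProfile->setDimensions("input", nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4{1, 3, 224, 224});
    calibProfile->setDimensions("input", nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4{1, 3, 224, 224});
    // Register it as the calibration profile on the builder config.
    config->setCalibrationProfile(calibProfile);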

Algorithm selection

Algorithm selection provides a mechanism to select and report algorithms for different layers in a network. This can also be used to deterministically build TensorRT engines or to reproduce the same implementations for layers in the engine. For more information, see the Algorithm Selection and Determinism And Reproducibility In The Builder topics in the TensorRT Developer Guide.
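
A minimal sketch of a selector that accepts every candidate algorithm (check NvInfer.h in your TensorRT version for the exact method signatures) might look like this:

    // Accept all algorithm choices; reportAlgorithms could log or serialize the final
    // choices so that a later build can reproduce them deterministically.
    class AcceptAllAlgorithms : public nvinfer1::IAlgorithmSelector
    {
    public:
        int32_t selectAlgorithms(const nvinfer1::IAlgorithmContext& context,
            const nvinfer1::IAlgorithm* const* choices, int32_t nbChoices, int32_t* selection) noexcept override
        {
            for (int32_t i = 0; i < nbChoices; ++i)
            {
                selection[i] = i; // keep every candidate implementation for this layer
            }
            return nbChoices;
        }
        void reportAlgorithms(const nvinfer1::IAlgorithmContext* const* contexts,
            const nvinfer1::IAlgorithm* const* choices, int32_t nbAlgorithms) noexcept override
        {
            // Record the algorithms TensorRT actually chose (omitted in this sketch).
        }
    };

The selector is attached to the build with IBuilderConfig::setAlgorithmSelector before the engine is built.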

INT8 calibration

The legacy class IInt8LegacyCalibrator is no longer deprecated. It is provided as a fallback option if the other calibrators yield poor results. A new flag, kCALIBRATION_BEFORE_FUSION, has been added which allows calibration to run before fusion. For more information, see INT8 Calibration Using C++ in the TensorRT Developer Guide.

Quantizing and dequantizing scale layers

A quantizing scale layer can be specified as a scale layer with an output precision type of INT8. Similarly, a dequantizing scale layer can be specified as a scale layer with an output precision type of FP32. Networks must be created with Explicit Precision mode to use these layers. Quantizing and dequantizing (QDQ) scale layers only support per-tensor quantization scales, that is, a single scale per tensor. No shift weights are allowed for the QDQ scale layer because only symmetric quantization is supported. For more information, see Working With Explicit Precision Using C++ in the TensorRT Developer Guide.
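
A minimal sketch of a quantizing scale layer, assuming an existing nvinfer1::IBuilder* named builder (the input name, dimensions, and scale value are illustrative only), is:

    // Create an explicit-batch, explicit-precision network.
    const uint32_t flags = (1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH))
        | (1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_PRECISION));
    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(flags);
    nvinfer1::ITensor* input = network->addInput("input", nvinfer1::DataType::kFLOAT, nvinfer1::Dims4{1, 3, 224, 224});

    // Per-tensor quantization scale; no shift weights (symmetric quantization only).
    float scaleValue = 127.0f / 6.0f; // illustrative: assumes a dynamic range of [-6, 6]
    nvinfer1::Weights scale{nvinfer1::DataType::kFLOAT, &scaleValue, 1};
    nvinfer1::Weights shift{nvinfer1::DataType::kFLOAT, nullptr, 0};
    nvinfer1::Weights power{nvinfer1::DataType::kFLOAT, nullptr, 0};

    nvinfer1::IScaleLayer* quantize = network->addScale(*input, nvinfer1::ScaleMode::kUNIFORM, shift, scale, power);
    quantize->setOutputType(0, nvinfer1::DataType::kINT8); // INT8 output marks this as a quantizing scale layer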

Samples compilation

A new Makefile option TRT_STATIC=1 has been added which allows you to build the TensorRT samples with TensorRT and most dependent libraries statically linked into the sample binary.

Group normalization plugin

A new group normalization plugin has been added. For details on group normalization, refer to the Group Normalization paper.

TF32 support

TF32 is enabled by default for DataType::kFLOAT. On the NVIDIA Ampere architecture-based A100/GA100 GPU, TF32 can speed up networks using FP32, typically with no loss of accuracy. It combines the dynamic range of FP32 with the precision of FP16. TF32 can be disabled via TensorRT or by setting the environment variable NVIDIA_TF32_OVERRIDE=0 when an engine is built. For more information and how to control TF32, see Enabling TF32 Inference Using C++ in the TensorRT Developer Guide. (not applicable for Jetson platforms)
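
Within the builder API, a minimal sketch of disabling TF32, assuming an existing nvinfer1::IBuilderConfig* named config, is:

    // TF32 is on by default; clear the flag to force true FP32 math.
    config->clearFlag(nvinfer1::BuilderFlag::kTF32);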

New plugins
Added new plugins for common operators in the BERT model, including embedding layer normalization, skip layer normalization and multi-head attention.
embLayerNormPlugin
This plugin performs the following two tasks:
  • Embeds an input sequence consisting of token IDs and segment IDs. This consists of token embedding lookup, segment embedding lookup, adding positional embeddings and finally, layer normalization.
  • Preprocesses input masks that are used to mark valid input tokens in sequences that are padded to the target sequence length. It assumes contiguous input masks and encodes the masks as a single number denoting the number of valid elements. This plugin supports FP32 mode and FP16 mode.
skipLayerNormPlugin
This plugin adds a residual tensor, and applies layer normalization, meaning, transforming the mean and standard deviation to beta and gamma, respectively. Optionally, it can add a bias vector before layer normalization. This plugin supports FP32 mode, FP16 mode, and INT8 mode. It may bring a negative impact on the end-to-end prediction accuracy when running under INT8 mode.
bertQKVToContextPlugin
This plugin takes query, key, and value tensors and computes scaled multi-head attention, that is to compute scaled dot product attention scores SoftMax(K' * Q / sqrt(HeadSize)) and return values weighted by these attention scores. This plugin supports FP32 mode, FP16 mode, and INT8 mode. It is optimized for sequence lengths 128 and 384, and INT8 mode is only available for those sequence lengths.
These plugins only support GPUs with compute capability >= 7.0. For more information about these new BERT-related plugins, see TensorRT Open Source Plugins.
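
A hedged sketch of locating one of these plugins through the plugin registry follows; the registered plugin name and version strings are defined by the open source plugin library, and the ones used here are illustrative assumptions:

    #include "NvInferPlugin.h"
    // Register the shipped plugins (assumes `logger` implements nvinfer1::ILogger and that the
    // open source plugin library containing the BERT plugins is linked or loaded).
    initLibNvInferPlugins(&logger, "");
    // Look up a plugin creator by its registered name and version (illustrative strings).
    auto* creator = getPluginRegistry()->getPluginCreator("CustomSkipLayerNormPluginDynamic", "1");
    if (creator != nullptr)
    {
        // Use creator->createPlugin(...) and INetworkDefinition::addPluginV2(...) to insert it.
    }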

New sample
sampleAlgorithmSelector

sampleAlgorithmSelector shows an example of how to use the algorithm selection API based on sampleMNIST. This sample demonstrates the usage of IAlgorithmSelector to deterministically build TensorRT engines. It also shows the usage of IAlgorithmSelector::selectAlgorithms to define heuristics for selection of algorithms. For more information, see Algorithm Selection in the TensorRT Developer Guide and Algorithm Selection API Usage Example Based On sampleMNIST In TensorRT in the TensorRT Samples Support Guide.

onnx_packnet

onnx_packnet is a Python sample which uses TensorRT to perform inference with the PackNet network. PackNet is a self-supervised monocular depth estimation network used in autonomous driving. For more information, refer to TensorRT Inference Of ONNX Models With Custom Layers in the TensorRT Sample Support Guide.

Multi-Instance GPU (MIG)

Multi-instance GPU, or MIG, is a new feature in NVIDIA Ampere GPU architecture that enables user-directed partitioning of a single GPU into multiple smaller GPUs. This improves GPU utilization by enabling the GPU to be shared effectively by parallel compute workloads on bare metal, GPU pass through, or on multiple vGPUs. For more information, refer to Working With Multi-Instance GPU in the TensorRT Developer Guide. (not applicable for Jetson platforms)

Improved ONNX Resize operator support

The ONNX resize modes asymmetric, align_corners, half_pixel, and pytorch_half_pixel are now supported. For more information on these resize modes, see the ONNX Resize Operator Specification.

Compatibility

Limitations

  • TensorRT 7.1 only supports per-tensor quantization scales for both activations and weights in explicit precision mode. No shift weights are allowed for the QDQ scale layer because only symmetric quantization is supported. For more information, refer to Working With Explicit Precision Using C++ in the TensorRT Developer Guide.

Deprecated Features

The following features are deprecated in TensorRT 7.1.3:
  • The fc_plugin_caffe_mnist Python sample has been deprecated. The sample was intended to demonstrate usage of the FCPlugin; however, the FCPlugin is not selected by the sample because there is no default importer for FCPlugin in the Caffe parser.

  • Python 2.7 support has been deprecated. A warning will be emitted when you import the TensorRT bindings for Python 2.7. You should update your application to support Python 3.x to prevent issues with future TensorRT releases. In addition, the legacy Python bindings have been removed. You will need to migrate your application to the new Python bindings if you haven’t done so already. Refer to the Python Migration Guide for more information.

  • Support for CUDA Compute Capability version 3.0 has been removed. Support for CUDA Compute Capability versions 5.0 and lower may be removed in a future release. Specifically:
    CUDA Compute Capability Version and Status:
    • Maxwell SM 5.0 (2014-2017), Supported: GM10X (GeForce 745, 750, 830, 840), Quadro K620, Quadro K1200, Quadro K2200, M5XX, M6XX, M1XXX, M2000
    • Kepler SM 3.7 (2014), Deprecated: GK210 (K8)
    • Kepler SM 3.5 (2013), Deprecated: GK110 (K20), GeForce GTX 780 family, GTX Titan
    • Kepler SM 3.0 (2012), Removed: GK10X GPUs, GeForce 600 series, K10, GRID K1/K2, Quadro K series

  • Many methods of class IBuilder have been deprecated. The following table shows deprecated methods of class IBuilder that have replacements in IBuilder:
    Deprecated IBuilder Method → IBuilder Replacement
    • createNetwork() → createNetworkV2(0)
    • buildCudaEngine(network) → buildEngineWithConfig(network, config)
    • reset(network) → reset()

    The next table shows the deprecated methods of IBuilder that have direct equivalents in class IBuilderConfig with the same name.
    Deprecated IBuilder Methods with Direct Equivalents in IBuilderConfig:
    • setMaxWorkspaceSize
    • getMaxWorkspaceSize
    • setInt8Calibrator
    • setDeviceType
    • getDeviceType
    • isDeviceTypeSet
    • resetDeviceType
    • setDefaultDeviceType
    • getDefaultDeviceType
    • canRunOnDLA
    • setDLACore
    • getDLACore
    • setEngineCapability
    • getEngineCapability

    Timing methods in IBuilder also have replacements in IBuilderConfig, with new names.
    Deprecated IBuilder Method → Replacement In IBuilderConfig
    • setMinFindIterations → setMinTimingIterations
    • getMinFindIterations → getMinTimingIterations
    • setAverageFindIterations → setAvgTimingIterations
    • getAverageFindIterations → getAvgTimingIterations

    Finally, some IBuilder methods related to boolean properties have been replaced with methods for setting/getting flags. For example, these calls on an IBuilder:
    builder.setHalf2Mode(true);
    builder.setInt8Mode(false);
    
    can be replaced with these calls on an IBuilderConfig:
    config.setFlag(BuilderFlag::kFP16);
    config.clearFlag(BuilderFlag::kINT8);

    The following table lists the deprecated methods and the corresponding flag.
    Deprecated IBuilder Method → Corresponding Flag
    • setHalf2Mode, getHalf2Mode, setFp16Mode, getFp16Mode → BuilderFlag::kFP16
    • setInt8Mode, getInt8Mode → BuilderFlag::kINT8
    • setDebugSync → BuilderFlag::kDEBUG
    • setRefittable, getRefittable → BuilderFlag::kREFIT
    • setStrictTypeConstraints, getStrictTypeConstraints → BuilderFlag::kSTRICT_TYPES
    • allowGPUFallback → BuilderFlag::kGPU_FALLBACK

  • The INvPlugin creator function has been deprecated since TensorRT 5.1.x and has now been fully removed. We recommend that users upgrade their plugins to one of the later plugin interfaces. Refer to the Extending TensorRT With Custom Layers section in the TensorRT Developer Guide for more information.

Fixed Issues

  • Fixed memory leaks in engine serialization when UFF models are used.

  • Fixed a crash during engine build for networks with RNNv2 on Windows.

  • Statically linking with the TensorRT library resulted in a segfault in certain cases. This issue is now fixed.

  • Fixed multiple bugs related to dynamic shapes, specifically:
    • padding modes for convolution and deconvolution,
    • engines with multiple optimization profiles, and
    • empty tensors (tensors with zero volume).

Announcements

  • Boolean shape tensors are now supported.

  • NVIDIA TensorRT Inference Server has been renamed to NVIDIA Triton Inference Server. For more information, refer to the Triton Inference Server documentation.

Known Issues

  • In the CUDA 11.0 RC release, there is a known performance regression on some RNN networks:
    • up to 50% on Turing GPUs
    • up to 12% on Pascal and Volta GPUs

  • There is a known performance regression of 30-80% for networks such as ResNet-50 and MobileNet when run in FP16 mode on SM50 devices.

  • The Windows library size is 600 MB bigger than the Linux library size. This will be fixed in the next release.

  • Static compiling of samples with the CentOS7 CUDA 11.0 RC build fails.

  • There are known performance regressions on P100:
    • 50-100% regression on 3D networks like 3D U-Net
    • 17% on Xception in FP16 mode

  • There is a known performance regression for Inception V3 and V4 networks in FP32 mode:
    • up to 60% on V100
    • up to 15% on RTX6000

  • Some fusions are not enabled on Windows with CUDA 11. This means a performance loss of around 10% for networks like YOLO3.

  • The diagram in IRNNv2Layer is incorrect. This will be fixed in a future release.

  • The UFF parser generates unused IConstantLayer objects that are visible via method NetworkDefinition::getLayer but optimized away by TensorRT, so any attempt to refit the weights with IRefitter::setWeights will be rejected. Given an IConstantLayer* layer, you can detect whether it is used for execution by checking: layer->getOutput(0)->isExecutionTensor().
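
    A minimal sketch of this check, assuming an existing nvinfer1::INetworkDefinition* named network, is:

      // Skip constants that TensorRT optimized away before refitting their weights.
      for (int32_t i = 0; i < network->getNbLayers(); ++i)
      {
          nvinfer1::ILayer* layer = network->getLayer(i);
          if (layer->getType() == nvinfer1::LayerType::kCONSTANT
              && !layer->getOutput(0)->isExecutionTensor())
          {
              continue; // unused constant: IRefitter::setWeights would reject it
          }
          // ... otherwise the layer's weights can be refitted as usual ...
      }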

  • The ONNX parser does not support RNN, LSTM, and GRU nodes when the activation type of the forward pass does not match the activation type of the reverse pass in bidirectional cases.

  • Some fusions are not enabled in the following cases:
    • Windows with CUDA 11
    • When the static library is used
    This means there is a performance loss of around 10% for networks like BERT and YOLO3. The performance loss depends on the precision used and the batch size, and can be up to 60% in some cases.

  • Loops and DataType::kBOOL are not supported when the static TensorRT library is used.

  • There is an error in the config.py file included in the sampleUffFasterRCNN sample. Specifically, line 34 in the config file should be changed from dynamic_graph.remove('input_2') to dynamic_graph.remove(dynamic_graph.find_nodes_by_name('input_2')).

  • Updated: June 25, 2020
    When using an RPM file on RedHat for installation, installing cuDNN v8 directly or via TensorRT 7.1.3 will enable users to build their application with cuDNN v8. However, in order for the user to compile an application with cuDNN v7 after cuDNN v8 is installed, the user will need to perform the following steps:
    1. Issue sudo mv /usr/include/cudnn.h /usr/include/cudnn_v8.h.
    2. Issue sudo ln -s /etc/alternatives/libcudnn /usr/include/cudnn.h.
    3. Switch to cuDNN v7 by issuing sudo update-alternatives --config libcudnn and choose cuDNN v7 from the list.

    Steps 1 and 2 are required for the user to be able to switch between v7 and v8 installations. After steps 1 and 2 are performed once, step 3 can be used repeatedly and the user can choose the appropriate cuDNN version to work with. For more information, refer to the Installing From An RPM File and Upgrading From v7 To v8 sections in the cuDNN Installation Guide.

TensorRT Release 7.1.0 Early Access (EA)

These are the TensorRT 7.1.0 Early Access (EA) release notes and are applicable to NVIDIA® Jetson™ Linux for Tegra™ users. This release includes several fixes from the previous TensorRT 6.0.0 and later releases as well as the following additional changes. These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

This preview release is for early testing and feedback; for production use of TensorRT, continue to use TensorRT 7.0.0.

For previous TensorRT documentation, see the TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.
Working with empty tensors

TensorRT supports empty tensors. A tensor is an empty tensor if it has one or more dimensions with length zero. Zero-length dimensions usually get no special treatment. If a rule works for a dimension of length L for an arbitrary positive value of L, it usually works for L=0 too. For more information, see Working With Empty Tensors in the TensorRT Developer Guide.

Builder layer timing cache

The layer timing cache will cache the layer profiling information during the builder phase. If there are other layers with the same input/output tensor configuration and layer params, then the TensorRT builder will skip profiling and reuse the cached result for the repeated layers. Models with many repeated layers (for example, BERT, WaveGlow, etc...) will see a significant speedup in builder time. The builder flag kDISABLE_TIMING_CACHE can be set if you want to disable this feature. For more information, see Builder Layer Timing Cache in the TensorRT Developer Guide and Initializing The Engine in the Best Practices For TensorRT Performance.

Pointwise fusion based on code generation

Pointwise fusion was introduced in TensorRT 6.0.1 to fuse multiple adjacent pointwise layers into one single layer. In this release, its implementation has been updated to use code generation and runtime compilation to further improve performance. The code generation and runtime compilation happen during execution plan building. For more information, see the TensorRT Best Practices Guide.

Dilation support for deconvolution

IDeconvolutionLayer now supports a dilation parameter. This is accessible through the C++ API, Python API, and the ONNX parser (see ConvTranspose). For more information, see IDeconvolutionLayer in the TensorRT Developer Guide.

Selecting FP16 and INT8 kernels

TensorRT supports Mixed Precision Inference with FP32, FP16, or INT8 as supported precisions. Depending on the hardware support, you can choose to enable any of these precisions to accelerate inference. You can also run trtexec with the “--best” option, which enables all supported precisions for inference, resulting in the best performance. For more information, see Mixed Precision in the Best Practices For TensorRT Performance.

Calibration with dynamic shapes

INT8 calibration with dynamic shapes supports the same functionality as a standard INT8 calibrator but for networks with dynamic shapes. You will need to provide a calibration optimization profile that would be used to set the dimensions for calibration. If a calibration optimization profile is not set, the first network optimization profile will be used as a calibration optimization profile. For more information, see INT8 Calibration With Dynamic Shapes in the TensorRT Developer Guide.

Algorithm selection

Algorithm selection provides a mechanism to select and report algorithms for different layers in a network. This can also be used to deterministically build TensorRT engines or to reproduce the same implementations for layers in the engine. For more information, see the Algorithm Selection and Determinism And Reproducibility In The Builder topics in the TensorRT Developer Guide.

New sample

sampleAlgorithmSelector shows an example of how to use the algorithm selection API based on sampleMNIST. This sample demonstrates the usage of IAlgorithmSelector to deterministically build TensorRT engines. It also shows the usage of IAlgorithmSelector::selectAlgorithms to define heuristics for selection of algorithms. For more information, see Algorithm Selection in the TensorRT Developer Guide and Algorithm Selection API Usage Example Based On sampleMNIST In TensorRT in the TensorRT Samples Support Guide.

Compatibility

Deprecated Features

The following features are deprecated in TensorRT 7.1.0:
  • Python 2.7 support has been deprecated. A warning will be emitted when you import the TensorRT bindings for Python 2.7. You should update your application to support Python 3.x to prevent issues with future TensorRT releases. In addition, the legacy Python bindings have been removed. You will need to migrate your application to the new Python bindings if you haven’t done so already. Refer to the Python Migration Guide for more information.

  • Support for CUDA Compute Capability version 3.0 has been removed. Support for CUDA Compute Capability versions 5.0 and lower may be removed in a future release. Specifically:
    CUDA Compute Capability Version and Status:
    • Maxwell SM 5.0 (2014-2017), Supported: GM10X (GeForce 745, 750, 830, 840), Quadro K620, Quadro K1200, Quadro K2200, M5XX, M6XX, M1XXX, M2000
    • Kepler SM 3.7 (2014), Deprecated: GK210 (K8)
    • Kepler SM 3.5 (2013), Deprecated: GK110 (K20), GeForce GTX 780 family, GTX Titan
    • Kepler SM 3.0 (2012), Removed: GK10X GPUs, GeForce 600 series, K10, GRID K1/K2, Quadro K series
  • Many methods of class IBuilder have been deprecated. The following table shows deprecated methods of class IBuilder that have replacements in IBuilder:
    Deprecated IBuilder Method → IBuilder Replacement
    • createNetwork() → createNetworkV2(0)
    • buildCudaEngine(network) → buildEngineWithConfig(network, config)
    • reset(network) → reset()
    The next table shows the deprecated methods of IBuilder that have direct equivalents in class IBuilderConfig with the same name.
    Deprecated IBuilder Methods with Direct Equivalents in IBuilderConfig:
    • setMaxWorkspaceSize
    • getMaxWorkspaceSize
    • setInt8Calibrator
    • setDeviceType
    • getDeviceType
    • isDeviceTypeSet
    • resetDeviceType
    • setDefaultDeviceType
    • getDefaultDeviceType
    • canRunOnDLA
    • setDLACore
    • getDLACore
    • setEngineCapability
    • getEngineCapability
    Timing methods in IBuilder also have replacements in IBuilderConfig, with new names.
    Deprecated IBuilder Method → Replacement In IBuilderConfig
    • setMinFindIterations → setMinTimingIterations
    • getMinFindIterations → getMinTimingIterations
    • setAverageFindIterations → setAvgTimingIterations
    • getAverageFindIterations → getAvgTimingIterations
    Finally, some IBuilder methods related to boolean properties have been replaced with methods for setting/getting flags. For example, these calls on an IBuilder:
    builder.setHalf2Mode(true);
    builder.setInt8Mode(false);
    
    can be replaced with these calls on an IBuilderConfig:
    config.setFlag(BuilderFlag::kFP16);
    config.clearFlag(BuilderFlag::kINT8);
    The following table lists the deprecated methods and the corresponding flag.
    Deprecated IBuilder Method → Corresponding Flag
    • setHalf2Mode, getHalf2Mode, setFp16Mode, getFp16Mode → BuilderFlag::kFP16
    • setInt8Mode, getInt8Mode → BuilderFlag::kINT8
    • setDebugSync → BuilderFlag::kDEBUG
    • setRefittable, getRefittable → BuilderFlag::kREFIT
    • setStrictTypeConstraints, getStrictTypeConstraints → BuilderFlag::kSTRICT_TYPES
    • allowGPUFallback → BuilderFlag::kGPU_FALLBACK
  • The INvPlugin creator function has been deprecated since TensorRT 5.1.x and has now been fully removed. We recommend that users upgrade their plugins to one of the later plugin interfaces. Refer to the Extending TensorRT With Custom Layers section in the TensorRT Developer Guide for more information.

Fixed Issues

  • DLA has restrictions on usage that were previously undocumented. Some programs that might have worked, but violated these restrictions, are now expected to fail at build time. For more information, see Restrictions With DLA and FAQs in the TensorRT Developer Guide.

Announcements

Known Issues

  • The UFF parser generates unused IConstantLayer objects that are visible via method NetworkDefinition::getLayer but optimized away by TensorRT, so any attempt to refit the weights with IRefitter::setWeights will be rejected. Given an IConstantLayer* layer, you can detect whether it is used for execution by checking: layer->getOutput(0)->isExecutionTensor().

  • The ONNX parser does not support RNN, LSTM, and GRU nodes when the activation type of the forward pass does not match the activation type of the reverse pass in bidirectional cases.

TensorRT Release 7.0.0

These are the TensorRT 7.0.0 release notes for Linux and Windows users. This release includes fixes from the previous TensorRT 6.0.1 release as well as the following additional changes. These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

For previous TensorRT release notes, see the TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.
Working with loops

TensorRT supports loop-like constructs, which can be useful for recurrent networks. TensorRT loops support scanning over input tensors, recurrent definitions of tensors, and both “scan outputs” and “last value” outputs. For more information, see Working With Loops in the TensorRT Developer Guide.
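
A minimal sketch of a counted loop that sums slices of an input, assuming an existing explicit-batch nvinfer1::INetworkDefinition* named network and existing ITensor* values data (the tensor to scan), tripCount (a 0-D INT32 tensor), and initial (the starting accumulator value), is:

    nvinfer1::ILoop* loop = network->addLoop();
    loop->addTripLimit(*tripCount, nvinfer1::TripLimit::kCOUNT);      // run a fixed number of iterations
    nvinfer1::IIteratorLayer* iterator = loop->addIterator(*data);    // scan over axis 0 of `data`
    nvinfer1::IRecurrenceLayer* sum = loop->addRecurrence(*initial);  // recurrent accumulator
    nvinfer1::ITensor* next = network->addElementWise(*sum->getOutput(0), *iterator->getOutput(0),
        nvinfer1::ElementWiseOperation::kSUM)->getOutput(0);
    sum->setInput(1, *next);                                          // feed the updated value back into the loop
    loop->addLoopOutput(*sum->getOutput(0), nvinfer1::LoopOutput::kLAST_VALUE); // expose the final sum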

ONNX parser with dynamic shapes support

The ONNX parser supports full-dimensions mode only. Your network definition must be created with the explicitBatch flag set. For more information, see Importing An ONNX Model Using The C++ Parser API and Working With Dynamic Shapes in the TensorRT Developer Guide.
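
A minimal sketch, assuming an existing nvinfer1::IBuilder* named builder, an nvinfer1::ILogger& named logger, and an illustrative model file name, is:

    #include "NvOnnxParser.h"
    // Create a full-dimensions (explicit batch) network and parse the ONNX model into it.
    const auto explicitBatch = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(explicitBatch);
    nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, logger);
    parser->parseFromFile("model.onnx", static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));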

TensorRT container with OSS

The TensorRT monthly container release now contains pre-built binaries from the TensorRT Open Source Repository. For more information, refer to the monthly released TensorRT Container Release Notes starting in 19.12+.

BERT INT8 and mixed precision optimizations

Some GEMM layers are now followed by GELU activation in the BERT model. Since TensorRT doesn’t have IMMA GEMM layers, you can implement those GEMM layers in the BERT network with either IConvolutionLayer or IFullyConnectedLayer layers depending on what precision you require. For example, you can leverage IConvolutionLayer with H == W == 1 (CONV1x1) to implement a FullyConnected operation and leverage IMMA math under INT8 mode. TensorRT supports the fusion of Convolution/FullyConnected and GELU. For more information, refer to TensorRT Best Practices Guide and Adding Custom Layers Using The C++ API in the TensorRT Developer Guide.
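
A minimal sketch of expressing such a GEMM as a 1x1 convolution, assuming an existing network, an ITensor* named input reshaped to NxCx1x1, existing kernel/bias Weights objects, and an illustrative output channel count, is:

    // A FullyConnected-style GEMM written as a 1x1 convolution so that IMMA kernels can be used in INT8 mode.
    const int outputChannels = 768; // illustrative hidden size
    nvinfer1::IConvolutionLayer* fcAsConv =
        network->addConvolutionNd(*input, outputChannels, nvinfer1::DimsHW{1, 1}, kernelWeights, biasWeights);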

Working with Quantized Networks

TensorRT now supports quantized models trained with Quantization Aware Training. Support is limited to symmetrically quantized models, meaning zero_point = 0 using QuantizeLinear and DequantizeLinear ONNX ops. For more information, see Working With Quantized Networks in the TensorRT Developer Guide and QDQ Fusions in the Best Practices For TensorRT Performance Guide.

New layers
IFillLayer

The IFillLayer is used to generate an output tensor with the specified mode. For more information, see the C++ class IFillLayer or the Python class IFillLayer.

IIteratorLayer

The IIteratorLayer enables a loop to iterate over a tensor. A loop is defined by loop boundary layers. For more information, see the C++ class IIteratorLayer or the Python class IIteratorLayer and Working With Loops in the TensorRT Developer Guide.

ILoopBoundaryLayer

Class ILoopBoundaryLayer defines a virtual method getLoop() that returns a pointer to the associated ILoop. For more information, see the C++ class ILoopBoundaryLayer or the Python class ILoopBoundaryLayer and Working With Loops in the TensorRT Developer Guide.

ILoopOutputLayer

The ILoopOutputLayer specifies an output from the loop. For more information, see the C++ class ILoopOutputLayer or the Python class ILoopOutputLayer and Working With Loops in the TensorRT Developer Guide.

IParametricReluLayer

The IParametricReluLayer represents a parametric ReLU operation, meaning, a leaky ReLU where the slopes for x < 0 can be different for each element. For more information, see the C++ class IParametricReluLayer or the Python class IParametricReluLayer.

IRecurrenceLayer

The IRecurrenceLayer specifies a recurrent definition. For more information, see the C++ class IRecurrenceLayer or the Python class IRecurrenceLayer and Working With Loops in the TensorRT Developer Guide.

ISelectLayer

The ISelectLayer returns either of the two inputs depending on the condition. For more information, see the C++ class ISelectLayer or the Python class ISelectLayer.

ITripLimitLayer

The ITripLimitLayer specifies how many times the loop iterates. For more information, see the C++ class ITripLimitLayer or the Python class ITripLimitLayer and Working With Loops in the TensorRT Developer Guide.

New operations

ONNX: Added ConstantOfShape, DequantizeLinear, Equal, Erf, Expand, Greater, GRU, Less, Loop, LRN, LSTM, Not, PRelu, QuantizeLinear, RandomUniform, RandomUniformLike, Range, RNN, Scan, Sqrt, Tile, and Where.

For more information, see the full list of Supported Ops in the Support Matrix.

Boolean tensor support

TensorRT supports boolean tensors which can be marked as network input and output. IElementWiseLayer, IUnaryLayer (only kNOT), IShuffleLayer, ITripLimit (only kWHILE) and ISelectLayer support the boolean datatype. Boolean tensors can be used only with FP32 and FP16 precision networks. For more information, refer to the Layers section in the TensorRT Developer Guide.
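
A minimal sketch of producing and consuming a boolean tensor, assuming an existing network and FP32 ITensor* values a, b, x, and y, is:

    // Elementwise comparison produces a DataType::kBOOL tensor.
    nvinfer1::ITensor* condition =
        network->addElementWise(*a, *b, nvinfer1::ElementWiseOperation::kGREATER)->getOutput(0);
    // ISelectLayer picks elements of x where the condition is true and elements of y otherwise.
    nvinfer1::ISelectLayer* select = network->addSelect(*condition, *x, *y);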

Compatibility

Limitations

  • UFF samples, such as sampleUffMNIST, sampleUffSSD, sampleUffPluginV2Ext, sampleUffMaskRCNN, sampleUffFasterRCNN, uff_custom_plugin, and uff_ssd, support TensorFlow 1.x and not models trained with TensorFlow 2.0.

  • Loops and DataType::kBOOL are supported on limited platforms. On platforms without loop support, INetworkDefinition::addLoop returns nullptr. Attempting to build an engine using operations that consume or produce DataType::kBOOL on a platform without support, results in validation rejecting the network. For details on which platforms are supported with loops, refer to the Features For Platforms And Software section in the TensorRT Support Matrix.

  • Explicit precision networks with quantized and de-quantized nodes are only supported on devices with hardware INT8 support. Running on devices without hardware INT8 support results in undefined behavior.

Deprecated Features

The following features are deprecated in TensorRT 7.0.0:
  • Backward Compatibility and Deprecation Policy - When a new function, for example foo, is first introduced, there is no explicit version in the name and the version is assumed to be 1. When changing the API of an existing TensorRT function foo (usually to support some new functionality), first, a new routine fooV<N> is created where N represents the Nth version of the function and the previous version fooV<N-1> remains untouched to ensure backward compatibility. At this point, fooV<N-1> is considered deprecated, and should be treated as such by users of TensorRT.

    Starting with TensorRT 7, we will be eliminating deprecated API per the following policy.
    • APIs already marked deprecated prior to TensorRT 7 (6 and older) will be removed in the next major release of TensorRT 8.
    • APIs deprecated in TensorRT <M>, where M is the major version greater than or equal to 7, will be removed in TensorRT <M+2>. This means that deprecated APIs remain functional for two major releases before they are removed.
  • Deprecation of Caffe Parser and UFF Parser - We are deprecating Caffe Parser and UFF Parser in TensorRT 7. They will be tested and functional in the next major release of TensorRT 8, but we plan to remove the support in the subsequent major release. Plan to migrate your workflow to use tf2onnx, keras2onnx or TensorFlow-TensorRT (TF-TRT) for deployment.

Fixed Issues

  • You no longer have to build ONNX and TensorFlow from source in order to workaround pybind11 compatibility issues. The TensorRT Python bindings are now built using pybind11 version 2.4.3.

  • Windows users are now able to build applications designed to use the TensorRT refittable engine feature. The issue related to unresolved symbols has been resolved.

  • A virtual destructor has been added to the IPluginFactory class.

Known Issues

  • The UFF parser generates unused IConstantLayer objects that are visible via method NetworkDefinition::getLayer but optimized away by TensorRT, so an attempt to refit the weights with IRefitter::setWeights will be rejected. Given an IConstantLayer* layer, you can detect whether it is used for execution by checking: layer->getOutput(0)->isExecutionTensor().

  • The ONNX parser does not support RNN, LSTM, and GRU nodes when the activation type of the forward pass does not match the activation type of the reverse pass in bidirectional cases.

  • The INT8 calibration does not work with dynamic shapes. To workaround this issue, ensure there are two passes in the code:
    1. Use a fixed shape input to build the engine in the first pass; this allows TensorRT to generate the calibration cache.
    2. Then, create the engine again using the dynamic shape input; the builder will reuse the calibration cache generated in the first pass.