TensorRT Release 7.x.x

TensorRT Release 7.2.2

These are the TensorRT 7.2.2 release notes and are applicable to Windows and Linux x86 users.

These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 7.x.x release as well as the following additional changes. For previous TensorRT documentation, see the TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.
  • Added support for Python 3.8. The Linux tar packages now include TensorRT Python binding wheel files that support Python 3.8.
    Note: TensorFlow 1.15.x does not support Python 3.8. Continue to use an earlier Python version if you require UFF support. Updating the TensorRT samples to support TensorFlow 2.x will be done in a future release.
  • Added the following debugging tools:
    Note: Although these tools are shipped with TensorRT, their utility extends beyond the TensorRT workflow.
    ONNX GraphSurgeon API Reference
    ONNX GraphSurgeon provides a convenient way to create and modify ONNX models. For more information, see ONNX GraphSurgeon API Reference.
    Polygraphy API Reference
    Polygraphy is a toolkit designed to assist in running and debugging deep learning models in various frameworks. For more information, refer to the Polygraphy API.
    PyTorch-Quantization Toolkit User Guide
    PyTorch-Quantization is a toolkit for training and evaluating PyTorch models with simulated quantization. Quantization can be added to the model automatically, or manually, allowing the model to be tuned for accuracy and performance. The quantized model can be exported to ONNX and imported to an upcoming version of TensorRT. For more information, refer to the PyTorch-Quantization Toolkit User Guide.
  • Added instructions and a list of limitations for how to build the TensorRT samples using the TensorRT static libraries, including cuDNN and other CUDA libraries that are statically linked. For more information, refer to the Building Samples Using Static Libraries section in the Working With TensorRT Samples.
  • Added a Quick Start Guide. This guide is a starting point for users who want to try out TensorRT; specifically, this document enables users to quickly deploy and run inference on a finished TensorRT engine.

Compatibility

  • TensorRT 7.2.2 has been tested with the following:
  • This TensorRT release supports CUDA 10.2, 11.0 update 1, 11.1 update 1, and 11.2.
    Note: If you are developing an application that is being compiled with CUDA 11.2 or you are using CUDA 11.2 libraries to run your application, then you must install CUDA 11.1 using either the Debian/RPM packages or using a CUDA 11.1 tar/zip/exe package. NVRTC from CUDA 11.1 is a runtime requirement of TensorRT and must be present to run TensorRT applications. If you are using the network repo installation method, this additional step is not needed.
  • It is suggested that you use TensorRT with a software stack that has been tested, including the cuDNN and cuBLAS versions documented in the Features For Platforms And Software section of the TensorRT Support Matrix. Other semantically compatible releases of cuDNN and cuBLAS can be used; however, other versions may bring performance improvements as well as regressions. In rare cases, functional regressions might also be observed.

Limitations

  • TensorRT 7.2 only supports per-tensor quantization scales for both activations and weights in explicit precision mode. No shift weights are allowed for the QDQ scale layer because only symmetric quantization is supported. For more information, refer to Working With Explicit Precision Using C++ in the TensorRT Developer Guide.
  • Loops and DataType::kBOOL are not supported when the static TensorRT library is used.
  • When using reformat-free I/O, the extent of a tensor in a vectorized dimension might not be a multiple of the vector length. Elements in a partially occupied vector that are not within the tensor are referred to here as vector-padding. For example:
    • On GPU
      • for input tensors, the application shall set vector-padding elements to zero.
      • for output tensors, the value of vector-padding elements is undefined. In a future release, TensorRT will support setting them to zero.
    • On DLA
      • for input tensors, vector-padding elements are ignored.
      • for output tensors, vector-padding elements are unmodified.
  • When running INT8 networks on DLA using TensorRT, operations must be added to the same subgraph to reduce quantization errors across the portion of the network that runs on the DLA; this allows the operations to fuse and retain higher precision for intermediate results. Breaking apart the subgraph in order to inspect intermediate results by marking tensors as network outputs can result in different levels of quantization error because these optimizations are disabled.
  • There is a known issue where TensorRT selects the kLINEAR format when reformat-free I/O is used with vectorized formats and the input/output tensors have only 3 dimensions. The workaround is to add an extra dimension of size 1 to make them 4-dimensional tensors.
  • The IExecutionContext contains shared resources; therefore, calling enqueue or enqueueV2 from the same IExecutionContext object with different CUDA streams concurrently results in undefined behavior. To perform inference concurrently in multiple CUDA streams, use one IExecutionContext per CUDA stream, as in the sketch following this list.
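    Below is a minimal, hedged sketch of the per-stream pattern; the engine and binding arrays (engine, bindings0, bindings1) are placeholders for objects your application already owns, and error checking is omitted.
    #include "NvInfer.h"
    #include <cuda_runtime_api.h>

    // Sketch only: run inference concurrently on two CUDA streams, one IExecutionContext each.
    void inferOnTwoStreams(nvinfer1::ICudaEngine& engine, void** bindings0, void** bindings1)
    {
        cudaStream_t stream0, stream1;
        cudaStreamCreate(&stream0);
        cudaStreamCreate(&stream1);

        nvinfer1::IExecutionContext* ctx0 = engine.createExecutionContext();
        nvinfer1::IExecutionContext* ctx1 = engine.createExecutionContext();

        ctx0->enqueueV2(bindings0, stream0, nullptr); // each context stays on its own stream
        ctx1->enqueueV2(bindings1, stream1, nullptr);

        cudaStreamSynchronize(stream0);
        cudaStreamSynchronize(stream1);

        ctx0->destroy();
        ctx1->destroy();
        cudaStreamDestroy(stream0);
        cudaStreamDestroy(stream1);
    }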

Deprecated Features

The following features are deprecated in TensorRT 7.2.2:

Fixed Issues

  • If you had started with a clean system installation and had not installed the CUDA Toolkit prior to installing the TensorRT samples, then you may have needed to manually install cuda-nvcc-XX-Y and cuda-nvprof-XX-Y, where XX-Y matches the CUDA major and minor version for your desired setup. Without these additional packages, you may have encountered compile errors while building the TensorRT samples. This issue has been fixed in this release.
  • There was up to 23% performance regression on Volta GPUs for some RNN networks. This issue has been fixed in this release. (not applicable for Jetson platforms)
  • There was a known accuracy issue of 3D U-Net networks on NVIDIA Ampere GPUs where TF32 mode is enabled by default. This issue has been fixed in this release.

Announcements

  • Support for Python 2 will be dropped in a future TensorRT release. This means that TensorRT will no longer include wheels for Python 2, and Python samples will not work with Python 2. Ensure you migrate your application to Python version 3.

Known Issues

  • The diagram in IRNNv2Layer is incorrect. This will be fixed in a future release.
  • There is a known issue that graph capture may fail in some cases for IExecutionContext::enqueue() and IExecutionContext::enqueueV2(). For more information, refer to the documentation for IExecutionContext::enqueueV2(), including how to work around this issue.
  • Some fusions are not enabled when the static library is used. This means there is a performance loss of around 10% for networks like BERT and YOLO3 when linking with the static library compared to the dynamic library. The performance loss depends on the precision used and the batch size, and it can be up to 60% in some cases.
  • The UFF parser generates unused IConstantLayer objects that are visible via method NetworkDefinition::getLayer but optimized away by TensorRT, so any attempt to refit those weights with IRefitter::setWeights will be rejected. Given an IConstantLayer* layer, you can detect whether it is used for execution by checking: layer->getOutput(0)->isExecutionTensor().
  • The ONNX parser does not support RNN, LSTM, and GRU nodes when the activation type of the forward pass does not match the activation type of the reverse pass in bidirectional cases.
  • Convolution layers with dynamic shapes and a large range of possible index dimensions in the profile have a known build-time performance issue. This can be bypassed by using IAlgorithmSelector and disabling cudnnConvolution tactics.
  • There is a known performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on Maxwell and Pascal GPUs.
  • There is a known performance regression compared to TensorRT 7.1 for Convolution layers with kernel sizes greater than 5x5. For example, this can lead to up to a 35% performance regression for the VGG16 UFF model compared to TensorRT 7.1. This will be fixed in a future release.
  • If the network contains an ElementWise layer where one operand is a constant with rank > 4, it will be fused into a Scale layer, which does not support per-element scales. This case can occur for Convolution and FullyConnected layers with bias, where ONNX decomposes the bias into an ElementWise layer. To work around this issue, add an Identity layer between the ElementWise layer and the constant to prevent the fusion.
  • Due to limitations with how requirements can be specified with the RPM version supported by RHEL/CentOS 7.x, the cuBLAS development package from CUDA 11.1 is required when you are developing applications using TensorRT and CUDA 11.2. Your build environment can reference cuBLAS 11.2 without issues; this is only a packaging issue and will be resolved with future CUDA versions. Ubuntu does not have this limitation; therefore, cuBLAS 11.1 is not required for CUDA 11.2 development on that OS.
  • Some RNN networks such as Cortana with FP32 precision and batch size of 8 or higher have up to a 20% performance loss with CUDA 11.0 or higher compared to CUDA 10.2.
  • You must have libcublas.so.* present on your system while running an application linked with the TensorRT static library. TensorRT now links to cuBLAS using dlopen() rather than at compiler link time for both the dynamic and static libraries. A solution to this problem will be worked out in a future release so that cuBLAS can be statically linked once again for applications which require the TensorRT static library.

TensorRT Release 7.2.1

These are the TensorRT 7.2.1 release notes and are applicable to Linux x86, Windows x64 and Linux ARM Server Base System Architecture (SBSA) users.

These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 7.x.x release as well as the following additional changes. For previous TensorRT documentation, see the TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.
  • Added support for CUDA 11.1 and GeForce devices with compute capability version 8.6.
  • Added support for Linux ARM Server Base System Architecture (SBSA) users on Ubuntu 18.04.
  • Added instructions for installing TensorRT from a pip wheel file. For step-by-step instructions, refer to the pip Wheel File Installation section in the TensorRT Installation Guide.

Compatibility

  • TensorRT 7.2.1 has been tested with the following:
  • This TensorRT release supports CUDA 10.2, 11.0 update 1, and 11.1.
  • It is suggested that you use TensorRT with a software stack that has been tested, including the cuDNN and cuBLAS versions documented in the Features For Platforms And Software section of the TensorRT Support Matrix. Other semantically compatible releases of cuDNN and cuBLAS can be used; however, other versions may bring performance improvements as well as regressions. In rare cases, functional regressions might also be observed.

Limitations

  • TensorRT 7.2 only supports per-tensor quantization scales for both activations and weights in explicit precision mode. No shift weights are allowed for the QDQ scale layer because only symmetric quantization is supported. For more information, refer to Working With Explicit Precision Using C++ in the TensorRT Developer Guide.
  • Replace IRNNLayer and IRNNv2Layer with loops. IRNNLayer was deprecated in TensorRT 4.0 and will be removed in TensorRT 8.0. IRNNv2Layer was deprecated in TensorRT 7.2.1 and will be removed in TensorRT 9.0. Use the loop API to synthesize a recurrent subnetwork; for an example, see the sampleCharRNN sample, method SampleCharRNNLoop::addLSTMCell. The loop API lets you express general recurrent networks instead of being limited to the prefabricated cells in IRNNLayer and IRNNv2Layer.
  • When using reformat-free I/O, the extent of a tensor in a vectorized dimension might not be a multiple of the vector length. Elements in a partially occupied vector that are not within the tensor are referred to here as vector-padding. For example:
    • On GPU
      • for input tensors, the application shall set vector-padding elements to zero.
      • for output tensors, the value of vector-padding elements is undefined. In a future release, TensorRT will support setting them to zero.
    • On DLA
      • for input tensors, vector-padding elements are ignored.
      • for output tensors, vector-padding elements are unmodified.
  • Loops and DataType::kBOOL are not supported when the static TensorRT library is used.
  • When running INT8 networks on DLA using TensorRT, operations must be added to the same subgraph to reduce quantization errors across the portion of the network that runs on the DLA; this allows the operations to fuse and retain higher precision for intermediate results. Breaking apart the subgraph in order to inspect intermediate results by marking tensors as network outputs can result in different levels of quantization error because these optimizations are disabled.
  • There is a known issue where TensorRT selects the kLINEAR format when reformat-free I/O is used with vectorized formats and the input/output tensors have only 3 dimensions. The workaround is to add an extra dimension of size 1 to make them 4-dimensional tensors.

Deprecated Features

The following features are deprecated in TensorRT 7.2.1:

Fixed Issues

  • A symbol conflict between the cuBLAS static library and the TensorRT plugin static library has been resolved. The Logger class used internally by the TensorRT plugin library has been moved to a namespace to avoid symbol conflicts. Prior to this fix, you may have experienced unexpected crashes during initialization or when exiting your application if you linked with the TensorRT static libraries.
  • There was a known performance regression on P100:
    • 30% regression on 3D networks like 3D U-Net in FP32 mode
    This issue has been fixed in this release. (not applicable for Jetson platforms)
  • For Windows users with CUDA 11.0, some fusions were not enabled. This means there was a performance loss of around 10% - 60% for networks like BERT and YOLO3. The performance loss depends on the precision used and batch size. This issue has been fixed in this release.
  • There was up to a 10% performance regression for Inception V4 networks in FP32 mode on P100 and V100. This issue has been fixed in this release. (not applicable for Jetson platforms)
  • MobileNetV1 and MobileNetV2 networks had up to a 14% performance regression in FP32 mode. This issue has been fixed in this release.

Announcements

  • Support for Python 2 will be dropped in a future TensorRT release. This means that TensorRT will no longer include wheels for Python 2, and Python samples will not work with Python 2. Ensure you migrate your application to Python version 3.

Known Issues

  • The diagram in IRNNv2Layer is incorrect. This will be fixed in a future release.
  • There is a known issue that graph capture may fail in some cases for IExecutionContext::enqueue() and IExecutionContext::enqueueV2(). For more information, refer to the documentation for IExecutionContext::enqueueV2(), including how to work around this issue.
  • There is up to 23% performance regression on Volta GPUs for some RNN networks. (not applicable for Jetson platforms)
  • Some fusions are not enabled when the static library is used. This means there is a performance loss of around 10% for networks like BERT and YOLO3. The performance loss depends on the precision used and the batch size, and it can be up to 60% in some cases.
  • The UFF parser generates unused IConstantLayer objects that are visible via method NetworkDefinition::getLayer but optimized away by TensorRT, so any attempt to refit those weights with IRefitter::setWeights will be rejected. Given an IConstantLayer* layer, you can detect whether it is used for execution by checking: layer->getOutput(0)->isExecutionTensor().
  • The ONNX parser does not support RNN, LSTM, and GRU nodes when the activation type of the forward pass does not match the activation type of the reverse pass in bidirectional cases.
  • There is a known accuracy issue with 3D U-Net networks on NVIDIA Ampere GPUs, where TF32 mode is enabled by default. To work around this issue, TF32 mode can be disabled via TensorRT or by setting the environment variable NVIDIA_TF32_OVERRIDE=0 when an engine is built. For more information and how to control TF32, see Enabling TF32 Inference Using C++ in the TensorRT Developer Guide.
  • Convolution layers with dynamic shapes and large range of possible index dimensions in the profile have a known build time performance issue. This can be bypassed by using IAlgorithmSelector and disabling cudnnConvolution tactics.
  • If you are starting with a clean system installation and you have not installed the CUDA Toolkit prior to installing the TensorRT samples, then you may need to manually install cuda-nvcc-XX-Y and cuda-nvprof-XX-Y, where XX-Y matches the CUDA major and minor version for your desired setup. Without these additional packages, you may encounter compile errors while building the TensorRT samples. These additional dependencies will be corrected in a future release.

TensorRT Release 7.2.0

Attention:

These are the TensorRT 7.2.0 release notes. We recommend that PowerPC users download the TensorRT 7.2.0 build for production use. For Linux and JetPack users, TensorRT 7.2.0 is a Release Candidate (RC). As an RC release, this is a preview for early testing and feedback. For production use of TensorRT, Linux and JetPack users should download TensorRT 7.1.3. The RC release is subject to change based on ongoing performance tuning and functional testing. For feedback, submit a bug on the NVIDIA Developer website.

These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 7.x.x release as well as the following additional changes. For previous TensorRT documentation, see the TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.
FullyConnected Layer optimization
  • Improved performance with Tensor Core in INT8 mode.
  • TensorRT now uses cuBLASLt internally instead of cuBLAS, which decreases the overall runtime memory footprint. Users can revert to the old behavior by using the new setTacticSources API in IBuilderConfig; a sketch follows this list.
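    The following is a minimal, hedged sketch of using this API to prefer cuBLAS over cuBLASLt; the config pointer is assumed to be an existing IBuilderConfig created by your builder.
    #include "NvInfer.h"
    #include <cstdint>

    // Sketch only: re-enable cuBLAS tactics and drop cuBLASLt tactics on an existing config.
    void preferCublasOverCublasLt(nvinfer1::IBuilderConfig* config)
    {
        using nvinfer1::TacticSource;
        nvinfer1::TacticSources sources = config->getTacticSources();
        sources |= 1U << static_cast<uint32_t>(TacticSource::kCUBLAS);       // allow cuBLAS
        sources &= ~(1U << static_cast<uint32_t>(TacticSource::kCUBLAS_LT)); // exclude cuBLASLt
        config->setTacticSources(sources);
    }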

Compatibility

Limitations

  • TensorRT 7.2 only supports per-tensor quantization scales for both activations and weights in explicit precision mode. No shift weights are allowed for the QDQ scale layer because only symmetric quantization is supported. For more information, refer to Working With Explicit Precision Using C++ in the TensorRT Developer Guide.
  • When using reformat-free I/O, the extent of a tensor in a vectorized dimension might not be a multiple of the vector length. Elements in a partially occupied vector that are not within the tensor are referred to here as vector-padding. For example:
    • On GPU
      • for input tensors, the application shall set vector-padding elements to zero.
      • for output tensors, the value of vector-padding elements is undefined. In a future release, TensorRT will support setting them to zero.
    • On DLA
      • for input tensors, vector-padding elements are ignored.
      • for output tensors, vector-padding elements are unmodified.

Fixed Issues

  • When using an RPM file on RedHat for a cuDNN installation, upgrading from cuDNN v7 to cuDNN v8 directly or indirectly via TensorRT 7.1.3 would cause installation errors. This issue has been fixed in the cuDNN 8.0.2 release.

Known Issues

  • There is a known package dependency issue when installing the python-libnvinfer RPM package on RHEL/CentOS 8.x. You will encounter the following error:
    - nothing provides python >= 2.7 needed by python-libnvinfer-7.2.0-1.cuda11.0.ppc64le

    Listed below are two options you can choose from to work around this packaging issue:

    Option 1: Install the RPM package by ignoring the missing dependency.
    # Install TensorRT and Python 2.x first
    sudo yum install tensorrt python2
    # Download the RPM package and install the package directly
    sudo yum install yum-utils
    yumdownloader python-libnvinfer
    sudo rpm -Uvh --nodeps python-libnvinfer-*.rpm
    

    Option 2: Install the TensorRT Python bindings using the Python wheel file.

    An alternative to installing the RPM package for the Python bindings is to instead install the Python wheel file from the TAR package using pip. Refer to step 6 within the Tar File Installation section of the TensorRT Installation Guide.

    The Python 3.x RPM packages are not affected by this dependency issue. This issue will be resolved in the next release.

    (not applicable for Jetson platforms)

  • There is a known performance regression on some RNN networks:
    • up to 12% on Pascal and Turing GPUs
    • up to 20% on Volta GPUs
    (not applicable for Jetson platforms)
  • There is a known performance regression on P100:
    • 30% regression on 3D networks like 3D U-Net in FP32 mode
    (not applicable for Jetson platforms)
  • There is up to a 10% performance regression for Inception V4 networks in FP32 mode on P100 and V100. (not applicable for Jetson platforms)

  • The diagram in IRNNv2Layer is incorrect. This will be fixed in a future release.

  • There is a known issue that graph capture may fail in some cases for IExecutionContext::enqueue() and IExecutionContext::enqueueV2(). For more information, refer to the documentation for IExecutionContext::enqueueV2(), including how to work around this issue.

  • On PowerPC, some RNN networks have up to a 15% performance regression compared to TensorRT 7.0. (not applicable for Jetson platforms)

  • MobileNetV1 and MobileNetV2 networks have up to a 14% performance regression in FP32 mode.

  • Some fusions are not enabled in the following cases:
    • Windows with CUDA 11.0
    • When the static library is used
    This means there is a performance loss of around 10% for networks like BERT and YOLO3. The performance loss depends on precision used and batch size and it can be up to 60% in some cases.
  • Loops and DataType::kBOOL are not supported when the static TensorRT library is used.

  • The UFF parser generates unused IConstantLayer objects that are visible via method NetworkDefinition::getLayer but optimized away by TensorRT, so any attempt to refit the weights with IRefitter::setWeights will be rejected. Given an IConstantLayer* layer, you can detect whether it is used for execution by checking: layer->getOutput(0)->isExecutionTensor().

  • The ONNX parser does not support RNN, LSTM, and GRU nodes when the activation type of the forward pass does not match the activation type of the reverse pass in bidirectional cases.

  • When using concat on the DLA, all inputs to concat must be exact multiples of the vector size (16 for FP16, 32 for INT8). This will be fixed in a future release of TensorRT.

TensorRT Release 7.1.3

Attention:

These are the TensorRT 7.1.3 GA release notes. For production use of TensorRT, we recommend using the TensorRT 7.1.3 build for CUDA 10.2. The CUDA 11.0 RC build is a preview release for early testing and feedback on NVIDIA A100. This release is subject to change based on ongoing performance tuning and functional testing. For feedback, submit a bug on the NVIDIA Developer website.

These release notes are applicable to JetPack users of TensorRT unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 7.x.x release as well as the following additional changes. For previous TensorRT documentation, see the TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.

Working with empty tensors

TensorRT supports empty tensors. A tensor is an empty tensor if it has one or more dimensions with length zero. Zero-length dimensions usually get no special treatment. If a rule works for a dimension of length L for an arbitrary positive value of L, it usually works for L=0 too. For more information, see Working With Empty Tensors in the TensorRT Developer Guide.
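The snippet below is a small, hedged illustration of this; the input name and dimensions are made up for the example, and network is assumed to be an existing INetworkDefinition.
    #include "NvInfer.h"

    // Sketch only: a tensor with a zero-length dimension is an empty tensor.
    void addPossiblyEmptyInput(nvinfer1::INetworkDefinition& network)
    {
        // 0 boxes x 4 coordinates: volume 0, that is, an empty tensor
        network.addInput("boxes", nvinfer1::DataType::kFLOAT, nvinfer1::Dims2{0, 4});
    }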

Builder layer timing cache

The layer timing cache caches layer profiling information during the builder phase. If there are other layers with the same input/output tensor configuration and layer parameters, the TensorRT builder skips profiling and reuses the cached result for the repeated layers. Models with many repeated layers (for example, BERT and WaveGlow) will see a significant speedup in builder time. The builder flag kDISABLE_TIMING_CACHE can be set if you want to disable this feature; a small sketch follows. For more information, see Builder Layer Timing Cache in the TensorRT Developer Guide and Initializing The Engine in the Best Practices For TensorRT Performance.
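The sketch referenced above is shown here, assuming an existing IBuilderConfig reference named config.
    #include "NvInfer.h"

    // Sketch only: opt out of the layer timing cache for this build.
    void disableTimingCache(nvinfer1::IBuilderConfig& config)
    {
        config.setFlag(nvinfer1::BuilderFlag::kDISABLE_TIMING_CACHE);
    }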

Pointwise fusion based on code generation

Pointwise fusion was introduced in TensorRT 6.0.1 to fuse multiple adjacent pointwise layers into one single layer. In this release, its implementation has been updated to use code generation and runtime compilation to further improve performance. The code generation and runtime compilation happen during execution plan building. For more information, see the TensorRT Best Practices Guide.

Dilation support for deconvolution

IDeconvolutionLayer now supports a dilation parameter. This is accessible through the C++ API, the Python API, and the ONNX parser (see ConvTranspose). For more information, see IDeconvolutionLayer in the TensorRT Developer Guide.
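The following is a hedged sketch of setting the new dilation parameter from the C++ API; the output map count, kernel size, and weight handles are illustrative only.
    #include "NvInfer.h"

    // Sketch only: add a 3x3 deconvolution with a dilation of 2 in each spatial dimension.
    // network, input, kernel, and bias are assumed to already exist in the application.
    void addDilatedDeconv(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input,
                          nvinfer1::Weights kernel, nvinfer1::Weights bias)
    {
        auto* deconv = network.addDeconvolutionNd(input, 64, nvinfer1::DimsHW{3, 3}, kernel, bias);
        deconv->setDilationNd(nvinfer1::DimsHW{2, 2});
    }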

Selecting FP16 and INT8 kernels

TensorRT supports Mixed Precision Inference with FP32, FP16, or INT8 as supported precisions. Depending on the hardware support, you can choose to enable any of these precisions to accelerate inference. You can also run trtexec with the --best option, which enables all supported precisions for inference, resulting in the best performance. For more information, see Mixed Precision in the Best Practices For TensorRT Performance.
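For instance, a build configuration can opt into the faster precisions that the platform reports; this is a hedged sketch assuming existing builder and config objects.
    #include "NvInfer.h"

    // Sketch only: enable FP16 and INT8 kernels where the platform supports them.
    void enableFastPrecisions(nvinfer1::IBuilder& builder, nvinfer1::IBuilderConfig& config)
    {
        if (builder.platformHasFastFp16())
            config.setFlag(nvinfer1::BuilderFlag::kFP16);
        if (builder.platformHasFastInt8())
            config.setFlag(nvinfer1::BuilderFlag::kINT8); // INT8 also needs calibration or explicit scales
    }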

Calibration with dynamic shapes

INT8 calibration with dynamic shapes supports the same functionality as a standard INT8 calibrator but for networks with dynamic shapes. You need to provide a calibration optimization profile that will be used to set the dimensions for calibration. If a calibration optimization profile is not set, the first network optimization profile is used as the calibration optimization profile. For more information, see INT8 Calibration With Dynamic Shapes in the TensorRT Developer Guide.
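The sketch below attaches a calibration optimization profile; the input name "input" and the dimensions are placeholders for your network's actual dynamic input.
    #include "NvInfer.h"

    // Sketch only: provide the shapes that INT8 calibration should run with.
    void setCalibrationShapes(nvinfer1::IBuilder& builder, nvinfer1::IBuilderConfig& config)
    {
        nvinfer1::IOptimizationProfile* profile = builder.createOptimizationProfile();
        profile->setDimensions("input", nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4{1, 3, 224, 224});
        profile->setDimensions("input", nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4{8, 3, 224, 224});
        profile->setDimensions("input", nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4{8, 3, 224, 224});
        config.setCalibrationProfile(profile);
    }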

Algorithm selection

Algorithm selection provides a mechanism to select and report algorithms for different layers in a network. It can also be used to deterministically build a TensorRT engine or to reproduce the same implementations for layers in the engine. For more information, see the Algorithm Selection and Determinism And Reproducibility In The Builder topics in the TensorRT Developer Guide.
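As a hedged sketch, an IAlgorithmSelector implementation can be as small as the one below, which keeps every candidate and leaves a hook for logging the builder's final choices; the class name is illustrative.
    #include "NvInfer.h"
    #include <cstdint>

    // Sketch only: an algorithm selector that accepts every candidate implementation.
    class AcceptAllSelector : public nvinfer1::IAlgorithmSelector
    {
    public:
        int32_t selectAlgorithms(const nvinfer1::IAlgorithmContext& context,
                                 const nvinfer1::IAlgorithm* const* choices,
                                 int32_t nbChoices, int32_t* selection) override
        {
            for (int32_t i = 0; i < nbChoices; ++i)
                selection[i] = i; // keep all candidates; the builder then times them as usual
            return nbChoices;
        }
        void reportAlgorithms(const nvinfer1::IAlgorithmContext* const* contexts,
                              const nvinfer1::IAlgorithm* const* choices, int32_t nbAlgorithms) override
        {
            // Inspect or record the algorithms the builder finally chose here.
        }
    };
The selector is attached with config->setAlgorithmSelector(&selector) before the engine is built.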

INT8 calibration

The legacy class IInt8LegacyCalibrator is un-deprecated. It is provided as a fallback option if the other calibrators yield poor results. A new kCALIBRATION_BEFORE_FUSION flag has been added, which allows calibration before fusion. For more information, see INT8 Calibration Using C++ in the TensorRT Developer Guide.

Quantizing and dequantizing scale layers

A quantizing scale layer can be specified as a scale layer with an output precision type of INT8. Similarly, a dequantizing scale layer can be specified as a scale layer with an output precision type of FP32. Networks must be created with Explicit Precision mode to use these layers. Quantizing and dequantizing (QDQ) scale layers only support per-tensor quantization scales, that is, a single scale per tensor. Also, no shift weights are allowed for the QDQ scale layer because only symmetric quantization is supported. For more information, see Working With Explicit Precision Using C++ in the TensorRT Developer Guide.

Samples compilation

A new Makefile option TRT_STATIC=1 has been added which allows you to build the TensorRT samples with TensorRT and most dependent libraries statically linked into the sample binary.

Group normalization plugin

A new group normalization plugin has been added. For details on group normalization, refer to the Group Normalization paper.

TF32 support

TF32 is enabled by default for DataType::kFLOAT. On the NVIDIA Ampere architecture-based A100/GA100 GPU, TF32 can speed up networks using FP32, typically with no loss of accuracy. It combines FP32 dynamic range and format with FP16 precision. TF32 can be disabled via TensorRT or by setting the environment variable NVIDIA_TF32_OVERRIDE=0 when an engine is built. For more information and how to control TF32, see Enabling TF32 Inference Using C++ in the TensorRT Developer Guide. (not applicable for Jetson platforms)
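A minimal sketch of disabling TF32 through the builder configuration, assuming an existing IBuilderConfig reference, is shown below.
    #include "NvInfer.h"

    // Sketch only: TF32 is on by default for DataType::kFLOAT; clear the flag to force FP32 math.
    void disableTf32(nvinfer1::IBuilderConfig& config)
    {
        config.clearFlag(nvinfer1::BuilderFlag::kTF32);
    }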

New plugins
Added new plugins for common operators in the BERT model, including embedding layer normalization, skip layer normalization and multi-head attention.
embLayerNormPlugin
This plugin performs the following two tasks:
  • Embeds an input sequence consisting of token IDs and segment IDs. This consists of token embedding lookup, segment embedding lookup, adding positional embeddings and finally, layer normalization.
  • Preprocesses input masks that are used to mark valid input tokens in sequences that are padded to the target sequence length. It assumes contiguous input masks and encodes the masks as a single number denoting the number of valid elements. This plugin supports FP32 mode and FP16 mode.
skipLayerNormPlugin
This plugin adds a residual tensor, and applies layer normalization, meaning, transforming the mean and standard deviation to beta and gamma, respectively. Optionally, it can add a bias vector before layer normalization. This plugin supports FP32 mode, FP16 mode, and INT8 mode. It may bring a negative impact on the end-to-end prediction accuracy when running under INT8 mode.
bertQKVToContextPlugin
This plugin takes query, key, and value tensors and computes scaled multi-head attention, that is to compute scaled dot product attention scores SoftMax(K' * Q / sqrt(HeadSize)) and return values weighted by these attention scores. This plugin supports FP32 mode, FP16 mode, and INT8 mode. It is optimized for sequence lengths 128 and 384, and INT8 mode is only available for those sequence lengths.
These plugins only support GPUs with compute capability >= 7.0. For more information about these new BERT-related plugins, see TensorRT Open Source Plugins.

New sample
sampleAlgorithmSelector

sampleAlgorithmSelector shows an example of how to use the algorithm selection API based on sampleMNIST. This sample demonstrates the use of IAlgorithmSelector to deterministically build TensorRT engines. It also shows the use of IAlgorithmSelector::selectAlgorithms to define heuristics for the selection of algorithms. For more information, see Algorithm Selection in the TensorRT Developer Guide and Algorithm Selection API Usage Example Based On sampleMNIST In TensorRT in the TensorRT Samples Support Guide.

onnx_packnet

onnx_packnet is a Python sample which uses TensorRT to perform inference with the PackNet network. PackNet is a self-supervised monocular depth estimation network used in autonomous driving. For more information, refer to TensorRT Inference Of ONNX Models With Custom Layers in the TensorRT Sample Support Guide.

Multi-Instance GPU (MIG)

Multi-instance GPU, or MIG, is a new feature in NVIDIA Ampere GPU architecture that enables user-directed partitioning of a single GPU into multiple smaller GPUs. This improves GPU utilization by enabling the GPU to be shared effectively by parallel compute workloads on bare metal, GPU pass through, or on multiple vGPUs. For more information, refer to Working With Multi-Instance GPU in the TensorRT Developer Guide. (not applicable for Jetson platforms)

Improved ONNX Resize operator support

The ONNX resize modes asymmetric, align_corners, half_pixel, and pytorch_half_pixel are now supported. For more information on these resize modes, see the ONNX Resize Operator Specification.

Compatibility

Limitations

  • TensorRT 7.1 only supports per-tensor quantization scales for both activations and weights in explicit precision mode. No shift weights are allowed for the QDQ scale layer because only symmetric quantization is supported. For more information, refer to Working With Explicit Precision Using C++ in the TensorRT Developer Guide.

Deprecated Features

The following features are deprecated in TensorRT 7.1.3:
  • The fc_plugin_caffe_mnist Python sample has been deprecated. The FCPlugin that the sample was intended to demonstrate is not selected by fc_plugin_caffe_mnist because there is no default importer for FCPlugin in the Caffe parser.

  • Python 2.7 support has been deprecated. A warning will be emitted when you import the TensorRT bindings for Python 2.7. You should update your application to support Python 3.x to prevent issues with future TensorRT releases. In addition, the legacy Python bindings have been removed. You will need to migrate your application to the new Python bindings if you haven’t done so already. Refer to the Python Migration Guide for more information.

  • Support for CUDA Compute Capability version 3.0 has been removed. Support for CUDA Compute Capability versions 5.0 and lower may be removed in a future release. Specifically:
    CUDA Compute Capability Version and Status:
    • Maxwell SM 5.0 (2014-2017): GM10X-based GeForce 745, GeForce 750, GeForce 830, GeForce 840; Quadro K620, K1200, K2200; M5XX, M6XX, M1XXX, M2000. Status: Supported
    • Kepler SM 3.7 (2014): GK210 (K8). Status: Deprecated
    • Kepler SM 3.5 (2013): GK110 (K20), GeForce GTX 780 family, GTX Titan. Status: Deprecated
    • Kepler SM 3.0 (2012): GK10X GPUs, GeForce 600 series, K10, GRID K1/K2, Quadro K series. Status: Removed
  • Many methods of class IBuilder have been deprecated. The following table shows deprecated methods of class IBuilder that have replacements in IBuilder:
    Deprecated IBuilder Method → IBuilder Replacement
    • createNetwork() → createNetworkV2(0)
    • buildCudaEngine(network) → buildEngineWithConfig(network, config)
    • reset(network) → reset()
    The next list shows the deprecated methods of IBuilder that have direct equivalents in class IBuilderConfig with the same name:
    • setMaxWorkspaceSize / getMaxWorkspaceSize
    • setInt8Calibrator
    • setDeviceType / getDeviceType / isDeviceTypeSet / resetDeviceType
    • setDefaultDeviceType / getDefaultDeviceType
    • canRunOnDLA
    • setDLACore / getDLACore
    • setEngineCapability / getEngineCapability
    Timing methods in IBuilder also have replacements in IBuilderConfig, with new names.
    Deprecated IBuilder Method → Replacement In IBuilderConfig
    • setMinFindIterations → setMinTimingIterations
    • getMinFindIterations → getMinTimingIterations
    • setAverageFindIterations → setAvgTimingIterations
    • getAverageFindIterations → getAvgTimingIterations
    Finally, some IBuilder methods related to boolean properties have been replaced with methods for setting and getting flags. For example, these calls on an IBuilder:
    builder.setHalf2Mode(true);
    builder.setInt8Mode(false);

    can be replaced with these calls on an IBuilderConfig:
    config.setFlag(BuilderFlag::kFP16);
    config.clearFlag(BuilderFlag::kINT8);
    The following list maps each deprecated method to the corresponding flag.
    Deprecated IBuilder Method → Corresponding Flag
    • setHalf2Mode, setFp16Mode, getHalf2Mode, getFp16Mode → BuilderFlag::kFP16
    • setInt8Mode, getInt8Mode → BuilderFlag::kINT8
    • setDebugSync → BuilderFlag::kDEBUG
    • setRefittable, getRefittable → BuilderFlag::kREFIT
    • setStrictTypeConstraints, getStrictTypeConstraints → BuilderFlag::kSTRICT_TYPES
    • allowGPUFallback → BuilderFlag::kGPU_FALLBACK
  • The INvPlugin creator function has been deprecated since TensorRT 5.1.x and has now been fully removed. We recommend that users upgrade their plugins to one of the later plugin interfaces; refer to the Extending TensorRT With Custom Layers section in the TensorRT Developer Guide for more information.

Fixed Issues

  • Fixed memory leaks in engine serialization when UFF models are used.

  • Fixed a crash during engine build for networks with RNNv2 on Windows.

  • Statically linking with the TensorRT library resulted in a segfault in certain cases. This issue is now fixed.

  • Fixed multiple bugs related to dynamic shapes, specifically:
    • padding modes for convolution and deconvolution,
    • engines with multiple optimization profiles, and
    • empty tensors (tensors with zero volume).

Announcements

  • Boolean shape tensors are now supported.
  • NVIDIA TensorRT Inference Server has been renamed to NVIDIA Triton Inference Server. For more information, refer to the Triton Inference Server documentation.

Known Issues

  • In the CUDA 11.0 RC release, there is a known performance regression on some RNN networks:
    • up to 50% on Turing GPUs
    • up to 12% on Pascal and Volta GPUs
  • There is a known performance regression of between 30% and 80% for networks like ResNet-50 and MobileNet when run in FP16 mode on SM50 devices.

  • The Windows library size is 600 MB bigger than the Linux library size. This will be fixed in the next release.

  • Static compilation of samples fails with the CentOS 7 CUDA 11.0 RC build.

  • There is a known performance regression on P100:
    • 50-100% regression on 3D networks like 3D U-Net
    • 17% on Xception in FP16 mode
  • There is a known performance regression for Inception V3 and V4 networks in FP32 mode:
    • up to 60% on V100
    • up to 15% on RTX6000
  • Some fusions are not enabled on Windows with CUDA 11. This means a performance loss of around 10% for networks like YOLO3.

  • The diagram in IRNNv2Layer is incorrect. This will be fixed in a future release.

  • The UFF parser generates unused IConstantLayer objects that are visible via method NetworkDefinition::getLayer but optimized away by TensorRT, so any attempt to refit the weights with IRefitter::setWeights will be rejected. Given an IConstantLayer* layer, you can detect whether it is used for execution by checking: layer->getOutput(0)->isExecutionTensor().

  • The ONNX parser does not support RNN, LSTM, and GRU nodes when the activation type of the forward pass does not match the activation type of the reverse pass in bidirectional cases.

  • Some fusions are not enabled in the following cases:
    • Windows with CUDA 11
    • When the static library is used
    This means there is a performance loss of around 10% for networks like BERT and YOLO3. The performance loss depends on precision used and batch size and it can be up to 60% in some cases.
  • Loops and DataType::kBOOL are not supported when the static TensorRT library is used.

  • There is an error in the config.py file included in the sampleUffFasterRCNN sample. Specifically, line 34 in the config file should be changed from dynamic_graph.remove('input_2') to dynamic_graph.remove(dynamic_graph.find_nodes_by_name('input_2')).

  • Updated: June 25, 2020
    When using an RPM file on RedHat for installation, installing cuDNN v8 directly or via TensorRT 7.1.3 will enable users to build their application with cuDNN v8. However, in order for the user to compile an application with cuDNN v7 after cuDNN v8 is installed, the user will need to perform the following steps:
    1. Issue sudo mv /usr/include/cudnn.h /usr/include/cudnn_v8.h.
    2. Issue sudo ln -s /etc/alternatives/libcudnn /usr/include/cudnn.h.
    3. Switch to cuDNN v7 by issuing sudo update-alternatives --config libcudnn and choose cuDNN v7 from the list.

    Steps 1 and 2 are required for the user to be able to switch between v7 and v8 installations. After steps 1 and 2 are performed once, step 3 can be used repeatedly and the user can choose the appropriate cuDNN version to work with. For more information, refer to the Installing From An RPM File and Upgrading From v7 To v8 sections in the cuDNN Installation Guide.

TensorRT Release 7.1.2 Release Candidate (RC)

These are the TensorRT 7.1.2 Release Candidate (RC) release notes and are applicable to data center and workstation Linux users. This release includes several fixes from the previous TensorRT 7.x.x release as well as the following additional changes. These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

For previous TensorRT documentation, see the TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.
INT8 calibration

The legacy class IInt8LegacyCalibrator is un-deprecated. It is provided as a fallback option if the other calibrators yield poor results. A new kCALIBRATION_BEFORE_FUSION flag has been added, which allows calibration before fusion. For more information, see INT8 Calibration Using C++ in the TensorRT Developer Guide.

Quantizing and dequantizing scale layers

A quantizing scale layer can be specified as a scale layer with an output precision type of INT8. Similarly, a dequantizing scale layer can be specified as a scale layer with an output precision type of FP32. Networks must be created with Explicit Precision mode to use these layers. Quantizing and dequantizing (QDQ) scale layers only support per-tensor quantization scales, that is, a single scale per tensor. Also, no shift weights are allowed for the QDQ scale layer because only symmetric quantization is supported. For more information, see Working With Explicit Precision Using C++ in the TensorRT Developer Guide.

Samples compilation

A new Makefile option TRT_STATIC=1 has been added which allows you to build the TensorRT samples with TensorRT and most dependent libraries statically linked into the sample binary.

Group normalization plugin

A new group normalization plugin has been added. For details on group normalization, refer to the Group Normalization paper.

TF32 support

TF32 is enabled by default for DataType::kFLOAT. On the NVIDIA Ampere architecture-based A100/GA100 GPU, TF32 can speed up networks using FP32, typically with no loss of accuracy. It combines FP32 dynamic range and format with FP16 precision. TF32 can be disabled via TensorRT or by setting the environment variable NVIDIA_TF32_OVERRIDE=0 when an engine is built. For more information and how to control TF32, see Enabling TF32 Inference Using C++ in the TensorRT Developer Guide. (not applicable for Jetson platforms)

New plugins
Added new plugins for common operators in the BERT model, including embedding layer normalization, skip layer normalization and multi-head attention.
embLayerNormPlugin
This plugin performs the following two tasks:
  • Embeds an input sequence consisting of token IDs and segment IDs. This consists of token embedding lookup, segment embedding lookup, adding positional embeddings and finally, layer normalization.
  • Preprocesses input masks that are used to mark valid input tokens in sequences that are padded to the target sequence length. It assumes contiguous input masks and encodes the masks as a single number denoting the number of valid elements. This plugin supports FP32 mode and FP16 mode.
skipLayerNormPlugin
This plugin adds a residual tensor, and applies layer normalization, meaning, transforming the mean and standard deviation to beta and gamma, respectively. Optionally, it can add a bias vector before layer normalization. This plugin supports FP32 mode, FP16 mode, and INT8 mode. It may bring a negative impact on the end-to-end prediction accuracy when running under INT8 mode.
bertQKVToContextPlugin
This plugin takes query, key, and value tensors and computes scaled multi-head attention, that is to compute scaled dot product attention scores SoftMax(K' * Q / sqrt(HeadSize)) and return values weighted by these attention scores. This plugin supports FP32 mode, FP16 mode, and INT8 mode. It is optimized for sequence lengths 128 and 384, and INT8 mode is only available for those sequence lengths.
These plugins only support GPUs with compute capability >= 7.0. For more information about these new BERT-related plugins, see TensorRT Open Source Plugins.

Compatibility

Limitations

  • TensorRT 7.1 only supports per-tensor quantization scales for both activations and weights in explicit precision mode. No shift weights are allowed for the QDQ scale layer because only symmetric quantization is supported. For more information, refer to Working With Explicit Precision Using C++ in the TensorRT Developer Guide.

Deprecated Features

The following features are deprecated in TensorRT 7.1.2:
  • The fc_plugin_caffe_mnist Python sample has been deprecated. The FCPlugin that the sample was intended to demonstrate is not selected by fc_plugin_caffe_mnist because there is no default importer for FCPlugin in the Caffe parser.

Announcements

Known Issues

  • There is a known issue that graph capture may fail in some cases for IExecutionContext::enqueue() and IExecutionContext::enqueueV2(). For more information, refer to the documentation for IExecutionContext::enqueueV2(), including how to work around this issue.

  • There is a known ~40% performance regression on 3D networks like 3D U-Net.

  • There is a known ~50% performance regression on LSTM autoencoder with BS=8.

  • There is a minor performance regression across a variety of networks that will be fixed in TensorRT 7.1.x GA.

  • The diagram in IRNNv2Layer is incorrect. This will be fixed in TensorRT 7.1.x GA.

  • The UFF parser generates unused IConstantLayer objects that are visible via method NetworkDefinition::getLayer but optimized away by TensorRT, so any attempt to refit the weights with IRefitter::setWeights will be rejected. Given an IConstantLayer* layer, you can detect whether it is used for execution by checking: layer->getOutput(0)->isExecutionTensor().

  • The ONNX parser does not support RNN, LSTM, and GRU nodes when the activation type of the forward pass does not match the activation type of the reverse pass in bidirectional cases.

TensorRT Release 7.1.0 Early Access (EA)

These are the TensorRT 7.1.0 Early Access (EA) release notes and are applicable to NVIDIA® Jetson™ Linux for Tegra™ users. This release includes several fixes from the previous TensorRT 6.0.0 and later releases as well as the following additional changes. These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

This preview release is for early testing and feedback, therefore, for production use of TensorRT, continue to use TensorRT 7.0.0.

For previous TensorRT documentation, see the TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.
Working with empty tensors

TensorRT supports empty tensors. A tensor is an empty tensor if it has one or more dimensions with length zero. Zero-length dimensions usually get no special treatment. If a rule works for a dimension of length L for an arbitrary positive value of L, it usually works for L=0 too. For more information, see Working With Empty Tensors in the TensorRT Developer Guide.

Builder layer timing cache

The layer timing cache will cache the layer profiling information during the builder phase. If there are other layers with the same input/output tensor configuration and layer params, then the TensorRT builder will skip profiling and reuse the cached result for the repeated layers. Models with many repeated layers (for example, BERT, WaveGlow, etc...) will see a significant speedup in builder time. The builder flag kDISABLE_TIMING_CACHE can be set if you want to disable this feature. For more information, see Builder Layer Timing Cache in the TensorRT Developer Guide and Initializing The Engine in the Best Practices For TensorRT Performance.

Pointwise fusion based on code generation

Pointwise fusion was introduced in TensorRT 6.0.1 to fuse multiple adjacent pointwise layers into one single layer. In this release, its implementation has been updated to use code generation and runtime compilation to further improve performance. The code generation and runtime compilation happen during execution plan building. For more information, see the TensorRT Best Practices Guide.

Dilation support for deconvolution

IDeconvolutionLayer now supports a dilation parameter. This is accessible through the C++ API, the Python API, and the ONNX parser (see ConvTranspose). For more information, see IDeconvolutionLayer in the TensorRT Developer Guide.

Selecting FP16 and INT8 kernels

TensorRT supports Mixed Precision Inference with FP32, FP16, or INT8 as supported precisions. Depending on the hardware support, you can choose to enable any of these precisions to accelerate inference. You can also run trtexec with the --best option, which enables all supported precisions for inference, resulting in the best performance. For more information, see Mixed Precision in the Best Practices For TensorRT Performance.

Calibration with dynamic shapes

INT8 calibration with dynamic shapes supports the same functionality as a standard INT8 calibrator but for networks with dynamic shapes. You need to provide a calibration optimization profile that will be used to set the dimensions for calibration. If a calibration optimization profile is not set, the first network optimization profile is used as the calibration optimization profile. For more information, see INT8 Calibration With Dynamic Shapes in the TensorRT Developer Guide.

Algorithm selection

Algorithm selection provides a mechanism to select and report algorithms for different layers in a network. It can also be used to deterministically build a TensorRT engine or to reproduce the same implementations for layers in the engine. For more information, see the Algorithm Selection and Determinism And Reproducibility In The Builder topics in the TensorRT Developer Guide.

New sample

sampleAlgorithmSelector shows an example of how to use the algorithm selection API based on sampleMNIST. This sample demonstrates the use of IAlgorithmSelector to deterministically build TensorRT engines. It also shows the use of IAlgorithmSelector::selectAlgorithms to define heuristics for the selection of algorithms. For more information, see Algorithm Selection in the TensorRT Developer Guide and Algorithm Selection API Usage Example Based On sampleMNIST In TensorRT in the TensorRT Samples Support Guide.

Compatibility

Deprecated Features

The following features are deprecated in TensorRT 7.1.0:
  • Python 2.7 support has been deprecated. A warning will be emitted when you import the TensorRT bindings for Python 2.7. You should update your application to support Python 3.x to prevent issues with future TensorRT releases. In addition, the legacy Python bindings have been removed. You will need to migrate your application to the new Python bindings if you haven’t done so already. Refer to the Python Migration Guide for more information.

  • Support for CUDA Compute Capability version 3.0 has been removed. Support for CUDA Compute Capability versions 5.0 and lower may be removed in a future release. Specifically:
    CUDA Compute Capability Version and Status:
    • Maxwell SM 5.0 (2014-2017): GM10X-based GeForce 745, GeForce 750, GeForce 830, GeForce 840; Quadro K620, K1200, K2200; M5XX, M6XX, M1XXX, M2000. Status: Supported
    • Kepler SM 3.7 (2014): GK210 (K8). Status: Deprecated
    • Kepler SM 3.5 (2013): GK110 (K20), GeForce GTX 780 family, GTX Titan. Status: Deprecated
    • Kepler SM 3.0 (2012): GK10X GPUs, GeForce 600 series, K10, GRID K1/K2, Quadro K series. Status: Removed
  • Many methods of class IBuilder have been deprecated. The following table shows deprecated methods of class IBuilder that have replacements in IBuilder:
    Deprecated IBuilder Method → IBuilder Replacement
    • createNetwork() → createNetworkV2(0)
    • buildCudaEngine(network) → buildEngineWithConfig(network, config)
    • reset(network) → reset()
    The next list shows the deprecated methods of IBuilder that have direct equivalents in class IBuilderConfig with the same name:
    • setMaxWorkspaceSize / getMaxWorkspaceSize
    • setInt8Calibrator
    • setDeviceType / getDeviceType / isDeviceTypeSet / resetDeviceType
    • setDefaultDeviceType / getDefaultDeviceType
    • canRunOnDLA
    • setDLACore / getDLACore
    • setEngineCapability / getEngineCapability
    Timing methods in IBuilder also have replacements in IBuilderConfig, with new names.
    Deprecated IBuilder Method → Replacement In IBuilderConfig
    • setMinFindIterations → setMinTimingIterations
    • getMinFindIterations → getMinTimingIterations
    • setAverageFindIterations → setAvgTimingIterations
    • getAverageFindIterations → getAvgTimingIterations
    Finally, some IBuilder methods related to boolean properties have been replaced with methods for setting and getting flags. For example, these calls on an IBuilder:
    builder.setHalf2Mode(true);
    builder.setInt8Mode(false);

    can be replaced with these calls on an IBuilderConfig:
    config.setFlag(BuilderFlag::kFP16);
    config.clearFlag(BuilderFlag::kINT8);
    The following list maps each deprecated method to the corresponding flag.
    Deprecated IBuilder Method → Corresponding Flag
    • setHalf2Mode, setFp16Mode, getHalf2Mode, getFp16Mode → BuilderFlag::kFP16
    • setInt8Mode, getInt8Mode → BuilderFlag::kINT8
    • setDebugSync → BuilderFlag::kDEBUG
    • setRefittable, getRefittable → BuilderFlag::kREFIT
    • setStrictTypeConstraints, getStrictTypeConstraints → BuilderFlag::kSTRICT_TYPES
    • allowGPUFallback → BuilderFlag::kGPU_FALLBACK
  • The INvPlugin creator function has been deprecated since TensorRT 5.1.x and has now been fully removed. We recommend that users upgrade their plugins to one of the later plugin interfaces; refer to the Extending TensorRT With Custom Layers section in the TensorRT Developer Guide for more information.

Fixed Issues

  • DLA has restrictions on usage that were previously undocumented. Some programs that might have worked, but violated these restrictions, are now expected to fail at build time. For more information, see Restrictions With DLA and FAQs in the TensorRT Developer Guide.

Announcements

Known Issues

  • The UFF parser generates unused IConstantLayer objects that are visible via method NetworkDefinition::getLayer but optimized away by TensorRT, so any attempt to refit the weights with IRefitter::setWeights will be rejected. Given an IConstantLayer* layer, you can detect whether it is used for execution by checking: layer->getOutput(0)->isExecutionTensor().

  • The ONNX parser does not support RNN, LSTM, and GRU nodes when the activation type of the forward pass does not match the activation type of the reverse pass in bidirectional cases.

TensorRT Release 7.0.0

These are the TensorRT 7.0.0 release notes for Linux and Windows users. This release includes fixes from the previous TensorRT 6.0.1 release as well as the following additional changes. These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

For previous TensorRT release notes, see the TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.
Working with loops

TensorRT supports loop-like constructs, which can be useful for recurrent networks. TensorRT loops support scanning over input tensors, recurrent definitions of tensors, and both “scan outputs” and “last value” outputs. For more information, see Working With Loops in the TensorRT Developer Guide.
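As a rough C++ sketch of the loop API (network is assumed to be an existing INetworkDefinition, and input and tripCount existing ITensor pointers; the computation itself is arbitrary), a counted loop that repeatedly adds input to a recurrent state might look like this:

    // Counted loop accumulating a running sum (illustrative only).
    nvinfer1::ILoop* loop = network->addLoop();
    if (loop == nullptr)
    {
        // Loops are unsupported on this platform; see the Limitations section.
        return;
    }
    loop->addTripLimit(*tripCount, nvinfer1::TripLimit::kCOUNT);  // 0-D INT32 trip count

    // Recurrent state: initialized from `input`, updated every iteration.
    nvinfer1::IRecurrenceLayer* rec = loop->addRecurrence(*input);
    nvinfer1::ITensor* state = rec->getOutput(0);

    // Loop body: newState = state + input.
    nvinfer1::ITensor* newState = network
        ->addElementWise(*state, *input, nvinfer1::ElementWiseOperation::kSUM)
        ->getOutput(0);
    rec->setInput(1, *newState);  // back-edge that closes the recurrence

    // Emit the "last value" of the recurrent state as the loop output.
    nvinfer1::ILoopOutputLayer* out =
        loop->addLoopOutput(*state, nvinfer1::LoopOutput::kLAST_VALUE);
    network->markOutput(*out->getOutput(0));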

ONNX parser with dynamic shapes support

The ONNX parser supports full-dimensions mode only. Your network definition must be created with the explicitBatch flag set. For more information, see Importing An ONNX Model Using The C++ Parser API and Working With Dynamic Shapes in the TensorRT Developer Guide.
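For illustration only, a minimal C++ sketch of creating an explicit-batch network, parsing an ONNX file, and registering an optimization profile for one dynamic input follows; builder and gLogger are assumed to already exist, and the file name, tensor name, and dimensions are placeholders.

    // The ONNX parser requires an explicit-batch network definition.
    const auto explicitBatch =
        1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(explicitBatch);

    nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, gLogger);
    parser->parseFromFile("model.onnx",
                          static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));

    // Dynamic shapes: describe the allowed shape range for the input named "input".
    nvinfer1::IOptimizationProfile* profile = builder->createOptimizationProfile();
    profile->setDimensions("input", nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4{1, 3, 224, 224});
    profile->setDimensions("input", nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4{8, 3, 224, 224});
    profile->setDimensions("input", nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4{32, 3, 224, 224});

    nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
    config->addOptimizationProfile(profile);
    nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);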

TensorRT container with OSS

The TensorRT monthly container release now contains pre-built binaries from the TensorRT Open Source Repository. For more information, refer to the monthly TensorRT Container Release Notes starting with the 19.12 release.

BERT INT8 and mixed precision optimizations

Some GEMM layers are now followed by GELU activation in the BERT model. Since TensorRT doesn't have IMMA GEMM layers, you can implement those GEMM layers in the BERT network with either IConvolutionLayer or IFullyConnectedLayer, depending on what precision you require. For example, you can leverage IConvolutionLayer with H == W == 1 (CONV1x1) to implement a FullyConnected operation and leverage IMMA math under INT8 mode. TensorRT supports the fusion of Convolution/FullyConnected and GELU. For more information, refer to the TensorRT Best Practices Guide and Adding Custom Layers Using The C++ API in the TensorRT Developer Guide.
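As an illustrative sketch of the CONV1x1 approach (network and config are assumed to already exist, and all tensor names, shapes, and weights below are placeholder assumptions), a FullyConnected-style GEMM can be expressed as a 1x1 convolution over an activation reshaped so that H == W == 1, which allows IMMA kernels to be selected under INT8:

    // Reshape the activation to NCHW with H == W == 1 so the GEMM becomes a 1x1 convolution.
    nvinfer1::IShuffleLayer* reshape = network->addShuffle(*input);
    reshape->setReshapeDimensions(nvinfer1::Dims4{batchSize, inChannels, 1, 1});

    // The 1x1 convolution implements the FullyConnected GEMM; kernelWeights and
    // biasWeights are assumed to hold the trained GEMM weights in convolution layout.
    nvinfer1::IConvolutionLayer* fcAsConv = network->addConvolutionNd(
        *reshape->getOutput(0), outChannels, nvinfer1::DimsHW{1, 1}, kernelWeights, biasWeights);

    // Enable INT8 in the builder configuration so IMMA math can be used.
    config->setFlag(nvinfer1::BuilderFlag::kINT8);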

Working with Quantized Networks

TensorRT now supports quantized models trained with Quantization Aware Training. Support is limited to symmetrically quantized models, meaning zero_point = 0 using QuantizeLinear and DequantizeLinear ONNX ops. For more information, see Working With Quantized Networks in the TensorRT Developer Guide and QDQ Fusions in the Best Practices For TensorRT Performance Guide.

New layers
IFillLayer

The IFillLayer is used to generate an output tensor with the specified mode. For more information, see the C++ class IFillLayer or the Python class IFillLayer.
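For example, a hedged sketch of a linspace fill (the dimensions and values are arbitrary, and network is assumed to be an existing INetworkDefinition):

    // Generate a 1-D tensor of 10 values starting at 0.0 with a step of 1.0.
    nvinfer1::Dims dims{};
    dims.nbDims = 1;
    dims.d[0] = 10;
    nvinfer1::IFillLayer* fill = network->addFill(dims, nvinfer1::FillOperation::kLINSPACE);
    fill->setAlpha(0.0);  // start value
    fill->setBeta(1.0);   // per-element delta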

IIteratorLayer

The IIteratorLayer enables a loop to iterate over a tensor. A loop is defined by loop boundary layers. For more information, see the C++ class IIteratorLayer or the Python class IIteratorLayer and Working With Loops in the TensorRT Developer Guide.

ILoopBoundaryLayer

Class ILoopBoundaryLayer defines a virtual method getLoop() that returns a pointer to the associated ILoop. For more information, see the C++ class ILoopBoundaryLayer or the Python class ILoopBoundaryLayer and Working With Loops in the TensorRT Developer Guide.

ILoopOutputLayer

The ILoopOutputLayer specifies an output from the loop. For more information, see the C++ class ILoopOutputLayer or the Python class ILoopOutputLayer and Working With Loops in the TensorRT Developer Guide.

IParametricReluLayer

The IParametricReluLayer represents a parametric ReLU operation, meaning, a leaky ReLU where the slopes for x < 0 can be different for each element. For more information, see the C++ class IParametricReluLayer or the Python class IParametricReluLayer.
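For instance, a one-line sketch (input and slopes are assumed to be existing ITensor pointers, with slopes broadcastable to the shape of input):

    // Leaky ReLU whose negative-side slopes are supplied per element by `slopes`.
    nvinfer1::IParametricReLULayer* prelu = network->addParametricReLU(*input, *slopes);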

IRecurrenceLayer

The IRecurrenceLayer specifies a recurrent definition. For more information, see the C++ class IRecurrenceLayer or the Python class IRecurrenceLayer and Working With Loops in the TensorRT Developer Guide.

ISelectLayer

The ISelectLayer returns either of the two inputs depending on the condition. For more information, see the C++ class ISelectLayer or the Python class ISelectLayer.

ITripLimitLayer

The ITripLimitLayer specifies how many times the loop iterates. For more information, see the C++ class ITripLimitLayer or the Python class ITripLimitLayer and Working With Loops in the TensorRT Developer Guide.

New operations

ONNX: Added ConstantOfShape, DequantizeLinear, Equal, Erf, Expand, Greater, GRU, Less, Loop, LRN, LSTM, Not, PRelu, QuantizeLinear, RandomUniform, RandomUniformLike, Range, RNN, Scan, Sqrt, Tile, and Where.

For more information, see the full list of Supported Ops in the Support Matrix.

Boolean tensor support

TensorRT supports boolean tensors, which can be marked as network inputs and outputs. IElementWiseLayer, IUnaryLayer (only kNOT), IShuffleLayer, ITripLimitLayer (only kWHILE), and ISelectLayer support the boolean datatype. Boolean tensors can be used only with FP32 and FP16 precision networks. For more information, refer to the Layers section in the TensorRT Developer Guide.
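A brief sketch combining boolean inputs with ISelectLayer (thenTensor and elseTensor are assumed to be existing ITensor pointers of matching shape; the input name and dimensions are placeholders):

    // Boolean condition tensor declared as a network input.
    nvinfer1::ITensor* cond =
        network->addInput("condition", nvinfer1::DataType::kBOOL, nvinfer1::Dims4{1, 1, 28, 28});

    // Element-wise select: takes thenTensor where cond is true, elseTensor otherwise.
    nvinfer1::ISelectLayer* select = network->addSelect(*cond, *thenTensor, *elseTensor);
    network->markOutput(*select->getOutput(0));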

Compatibility

Limitations

  • UFF samples, such as sampleUffMNIST, sampleUffSSD, sampleUffPluginV2Ext, sampleUffMaskRCNN, sampleUffFasterRCNN, uff_custom_plugin, and uff_ssd, support TensorFlow 1.x and not models trained with TensorFlow 2.0.

  • Loops and DataType::kBOOL are supported on limited platforms. On platforms without loop support, INetworkDefinition::addLoop returns nullptr. Attempting to build an engine using operations that consume or produce DataType::kBOOL on a platform without support results in validation rejecting the network. For details on which platforms support loops, refer to the Features For Platforms And Software section in the TensorRT Support Matrix.

  • Explicit precision networks with quantized and de-quantized nodes are only supported on devices with hardware INT8 support. Running on devices without hardware INT8 support results in undefined behavior.

Deprecated Features

The following features are deprecated in TensorRT 7.0.0:
  • Backward Compatibility and Deprecation Policy - When a new function, for example foo, is first introduced, there is no explicit version in the name and the version is assumed to be 1. When changing the API of an existing TensorRT function foo (usually to support some new functionality), first, a new routine fooV<N> is created where N represents the Nth version of the function and the previous version fooV<N-1> remains untouched to ensure backward compatibility. At this point, fooV<N-1> is considered deprecated, and should be treated as such by users of TensorRT.

    Starting with TensorRT 7, we will be eliminating deprecated APIs according to the following policy.
    • APIs already marked deprecated prior to TensorRT 7 (TensorRT 6 and older) will be removed in the next major release, TensorRT 8.
    • APIs deprecated in TensorRT <M>, where M is a major version greater than or equal to 7, will be removed in TensorRT <M+2>. This means that deprecated APIs remain functional for two major releases before they are removed.
  • Deprecation of Caffe Parser and UFF Parser - We are deprecating the Caffe Parser and the UFF Parser in TensorRT 7. They will be tested and functional in the next major release, TensorRT 8, but we plan to remove the support in the subsequent major release. Plan to migrate your workflow to tf2onnx, keras2onnx, or TensorFlow-TensorRT (TF-TRT) for deployment.

Fixed Issues

  • You no longer have to build ONNX and TensorFlow from source in order to work around pybind11 compatibility issues. The TensorRT Python bindings are now built using pybind11 version 2.4.3.

  • Windows users are now able to build applications designed to use the TensorRT refittable engine feature. The issue related to unresolved symbols has been resolved.

  • A virtual destructor has been added to the IPluginFactory class.

Known Issues

  • The UFF parser generates unused IConstantLayer objects that are visible via method NetworkDefinition::getLayer but optimized away by TensorRT, so an attempt to refit the weights with IRefitter::setWeights will be rejected. Given an IConstantLayer* layer, you can detect whether it is used for execution by checking: layer->getOutput(0)->isExecutionTensor().

  • The ONNX parser does not support RNN, LSTM, and GRU nodes when the activation type of the forward pass does not match the activation type of the reverse pass in bidirectional cases.

  • INT8 calibration does not work with dynamic shapes. To work around this issue, ensure there are two passes in the code (a minimal sketch follows this list):
    1. In the first pass, build the engine using a fixed-shape input so that TensorRT generates the calibration cache.
    2. In the second pass, create the engine again using the dynamic-shape input; the builder will reuse the calibration cache generated in the first pass.
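    A minimal C++ sketch of this two-pass workaround follows; MyCalibrator (an IInt8EntropyCalibrator2 implementation that reads and writes a cache file), buildNetwork(), builder, and profile are all hypothetical helpers assumed to exist.
    // Pass 1: fixed-shape network; INT8 calibration runs and the cache is written.
    MyCalibrator calibrator("calibration.cache");                  // hypothetical calibrator
    nvinfer1::INetworkDefinition* fixedNet = buildNetwork(/*dynamicShapes=*/false);
    nvinfer1::IBuilderConfig* config1 = builder->createBuilderConfig();
    config1->setFlag(nvinfer1::BuilderFlag::kINT8);
    config1->setInt8Calibrator(&calibrator);
    nvinfer1::ICudaEngine* calibEngine = builder->buildEngineWithConfig(*fixedNet, *config1);
    calibEngine->destroy();                                        // built only to produce the cache

    // Pass 2: dynamic-shape network; the calibrator reads the cache instead of recalibrating.
    nvinfer1::INetworkDefinition* dynamicNet = buildNetwork(/*dynamicShapes=*/true);
    nvinfer1::IBuilderConfig* config2 = builder->createBuilderConfig();
    config2->setFlag(nvinfer1::BuilderFlag::kINT8);
    config2->setInt8Calibrator(&calibrator);
    config2->addOptimizationProfile(profile);                      // profile covering the dynamic input
    nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*dynamicNet, *config2);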