TensorRT Release 4.0.1

This TensorRT 4.0.1 General Availability release includes several enhancements and improvements compared to the previously released TensorRT 3.0.4.

Key Features and Enhancements

This TensorRT release includes the following key features and enhancements.

  • TensorRT 4.0.1 GA has been tested with cuDNN 7.1.3 and now requires cuDNN 7.1.x.

  • Support for ONNX 1.0 (Open Neural Network Exchange) has been implemented. ONNX is a standard for representing deep learning models that enables models to be transferred between frameworks. TensorRT can now parse network definitions in ONNX format, in addition to NVCaffe and UFF formats.

  • The Custom Layer API now supports user-defined layers that take half precision, or FP16, inputs and return FP16 outputs.

  • Added support for the MatrixMultiply, Constant, Gather, Ragged SoftMax, Reduce, RNNv2, and TopK layers (for K up to 25).
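As a plain-Python illustration of what the new TopK layer computes (a sketch of the operation only, not the TensorRT API; the function name is made up for illustration): given a set of values, it selects the K largest values together with their indices.

```python
# Sketch of the TopK operation (not TensorRT code): return the K largest
# values and their indices, largest first.
def top_k(values, k):
    # Order indices by value, descending; Python's sort is stable for ties.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    picked = order[:k]
    return [values[i] for i in picked], picked

vals, idx = top_k([0.1, 0.7, 0.3, 0.9], k=2)
# vals == [0.9, 0.7], idx == [3, 1]
```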

  • This release includes optimizations that target recommender systems such as Neural Collaborative Filtering.

  • Many layers now support the ability to broadcast across the batch dimension.
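For context, broadcasting across the batch dimension lets an operand with batch size 1 be combined with one of batch size N as if it were replicated N times. A minimal pure-Python sketch of the semantics (not TensorRT code; the helper name is hypothetical):

```python
# Sketch of batch-dimension broadcasting semantics (not TensorRT code):
# a batch-size-1 operand is logically replicated to match batch size N.
def add_with_batch_broadcast(a, b):
    # a, b: lists of per-batch items; one of them may have length 1.
    n = max(len(a), len(b))
    pick = lambda t, i: t[i] if len(t) == n else t[0]
    return [pick(a, i) + pick(b, i) for i in range(n)]

# A batch of 3 values plus a single bias broadcast across the batch:
print(add_with_batch_broadcast([1.0, 2.0, 3.0], [10.0]))  # [11.0, 12.0, 13.0]
```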

  • In TensorRT 3.0, the Activation layer had rounding and striding issues in INT8 mode that could lower INT8 accuracy. Those issues have been fixed.

  • The C++ samples and Python examples were tested with TensorFlow 1.8 and PyTorch 0.4.0 where applicable.

  • Added sampleOnnxMNIST. This sample shows the conversion of an MNIST network in ONNX format to a TensorRT network.

  • Added sampleNMT. Neural Machine Translation (NMT) using sequence to sequence (seq2seq) models has garnered a lot of attention and is used in various NMT frameworks. sampleNMT is a highly modular sample for inference using the C++ TensorRT API, so you can use it as a reference point in your own projects.

  • Updated sampleCharRNN to use RNNv2 and to convert weights from TensorFlow to TensorRT.

  • Added sampleUffSSD. This sample converts the TensorFlow Single Shot MultiBox Detector (SSD) network to a UFF format and runs it on TensorRT using plugins. This sample also demonstrates how other TensorFlow networks can be preprocessed and converted to UFF format with support of custom plugin nodes.

  • Memory management improvements (see the Memory Management section in the Developer Guide for details).
    • Applications may now provide their own memory for activations and workspace during inference, which is used only while the pipeline is running.
    • An allocator callback is available for all memory allocated on the GPU. In addition, model deserialization is significantly faster (from system memory, up to 10x faster on large models).

Using TensorRT 4.0.1

Ensure you are familiar with the following notes when using this release.
  • The builder methods setHalf2Mode and getHalf2Mode have been superseded by setFp16Mode and getFp16Mode which better represent their intended usage.

  • The sample utility giexec has been renamed to trtexec to be consistent with the product name, TensorRT, which is often shortened to TRT. A compatibility script for users of giexec has been included to help users make the transition.

Deprecated Features

  • The RNN layer type is deprecated in favor of RNNv2; however, it is still available for backwards compatibility.

  • Legacy GIE version defines in NvInfer.h have been removed. They were NV_GIE_MAJOR, NV_GIE_MINOR, NV_GIE_PATCH, and NV_GIE_VERSION. The correct alternatives are NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR, NV_TENSORRT_PATCH, and NV_TENSORRT_VERSION which existed in TensorRT 3.0.4 as well.
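For reference, NV_TENSORRT_VERSION packs the three components into a single integer; the sketch below assumes the usual major*1000 + minor*100 + patch encoding from NvInfer.h (computed here in plain Python for illustration, not TensorRT code):

```python
# Sketch of the NV_TENSORRT_VERSION encoding (assumed to be
# major*1000 + minor*100 + patch, per NvInfer.h); illustrative only.
NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR, NV_TENSORRT_PATCH = 4, 0, 1
NV_TENSORRT_VERSION = (NV_TENSORRT_MAJOR * 1000
                       + NV_TENSORRT_MINOR * 100
                       + NV_TENSORRT_PATCH)
print(NV_TENSORRT_VERSION)  # 4001 for this 4.0.1 release
```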

  • Dimension types are now ignored in the API; however, they are still available for backwards compatibility.

Known Issues

  • If the ONNX parser included with TensorRT is unable to parse your model, then try updating to the latest open source ONNX parser, which may resolve your issue.

  • PyTorch no longer supports Python 3.4 in its current release (0.4.0). Therefore, the TensorRT PyTorch examples will not work when using Python 3 on Ubuntu 14.04.

  • Reshaping a tensor to one with more dimensions than the input tensor is not supported.

  • Reformat has a known memory overwrite issue on Volta when FP16 is used with the Concatenation layer and the Reformat layer.

  • If you have two different CUDA versions of TensorRT installed, such as CUDA 8.0 and CUDA 9.0, or CUDA 9.2 using local repos, then you will need to run an additional command to install the CUDA 8.0 version of TensorRT and prevent it from being upgraded to the CUDA 9.0 or CUDA 9.2 version of TensorRT:
    sudo apt-get install libnvinfer4=4.1.2-1+cuda8.0 \
      libnvinfer-dev=4.1.2-1+cuda8.0
    sudo apt-mark hold libnvinfer4 libnvinfer-dev

  • sampleNMT
    • Performance is not fully optimized

  • sampleUffSSD
    • Some precision loss was observed while running the network in INT8 mode, causing some objects to go undetected in the image. Our general observation is that having at least 500 images for calibration is a good starting point.

  • Performance regressions
    • Compared to earlier TensorRT versions, a 5% slowdown was observed on AlexNet when running on GP102 devices with batch size 2 using the NvCaffeParser.
    • Compared to earlier TensorRT versions, a 5% to 10% slowdown was observed on variants of Inception and some instances of ResNet when using the NvUffParser.

  • The NvUffParser returns the output tensor in the shape specified by the user, and not in NCHW shape as in earlier versions of TensorRT. In other words, the output tensor shape will match the shape of the tensor returned by TensorFlow, for the same network.
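If existing post-processing code still expects per-image CHW data, it can transpose the TensorFlow-shaped (HWC) output itself. A minimal pure-Python sketch using nested lists (illustrative only; real code would typically use numpy.transpose):

```python
# Sketch: convert one HWC-shaped output (as TensorFlow, and now the
# NvUffParser, returns it) to CHW for code that still expects CHW.
def hwc_to_chw(hwc):
    h, w, c = len(hwc), len(hwc[0]), len(hwc[0][0])
    return [[[hwc[y][x][ch] for x in range(w)] for y in range(h)]
            for ch in range(c)]

hwc = [[[1, 2], [3, 4]],          # a 2x2 image with 2 channels (HWC)
       [[5, 6], [7, 8]]]
chw = hwc_to_chw(hwc)
# chw[0] is channel 0: [[1, 3], [5, 7]]; chw[1] is channel 1: [[2, 4], [6, 8]]
```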

  • The Python 3.4 documentation is missing from the Ubuntu 14.04 packages. Refer to the Python 2.7 documentation or view the online Python documentation as an alternative.

  • Some samples do not provide a -h argument to print the sample usage. You can refer to the README.txt file in the sample directory for usage examples. Also, if the data files for some samples cannot be found, the sample will sometimes raise an exception and abort instead of exiting normally.

  • If you have more than one version of the CUDA toolkit installed on your system and the CUDA version for TensorRT is not the latest version of the CUDA toolkit, then you will need to provide an additional argument when compiling the samples. For example, if you have CUDA 9.0 and CUDA 9.2 installed and are using TensorRT for CUDA 9.0:
    make CUDA_INSTALL_DIR=/usr/local/cuda-9.0

  • When you pip uninstall the tensorrtplugins Python package, you may see the following error, which can be ignored:
    OSError: [Errno 2] No such file or directory: '/usr/local/lib/python2.7/dist-packages/tensorrtplugins-4.0.1.0-py2.7-linux-x86_64.egg'

  • Due to a bug in cuDNN 7.1.3, which is the version of cuDNN TensorRT has been validated against, using RNNs with half precision on Kepler GPUs will cause TensorRT to abort. FP16 support is not native on Kepler GPUs; therefore, using any precision other than FP32 is discouraged except for testing.

  • sampleMovieLens is currently limited to running a maximum of 8 concurrent processes on a Titan V and may produce suboptimal engines during parallel execution. The sample will be enhanced in the near future to support a greater degree of concurrency. Additionally, to ensure compatibility with TensorRT, use TensorFlow <= 1.7.0 to train the model. There may be a conflict between the versions of CUDA and/or cuDNN used by TensorRT and TensorFlow 1.7, so we suggest installing the CPU version of TensorFlow 1.7 to complete the sample.
    python -m pip install tensorflow==1.7.0