TensorRT Release 8.x.x

TensorRT Release 8.5.2

These are the TensorRT 8.5.2 Release Notes and are applicable to x86 Linux, Windows, JetPack, and PowerPC Linux users. This release incorporates Arm®-based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Key Features and Enhancements

This TensorRT release includes the following key features and enhancements.
  • Added new plugins: fused Multihead Self-Attention, fused Multihead Cross-Attention, Layer Normalization, Group Normalization, and targeted fusions (such as Split+GeLU) to support the Stable Diffusion demo.
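These plugins are registered with the TensorRT plugin registry and can be looked up by name and version. The following is a minimal Python sketch of such a lookup; the creator name "GroupNorm" and version "1" are illustrative assumptions, not confirmed identifiers, so check the plugin registry or the demo source for the exact names and creation parameters.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    trt.init_libnvinfer_plugins(logger, "")  # registers the shipped plugins

    registry = trt.get_plugin_registry()
    # "GroupNorm"/"1" are illustrative; the actual creator names may differ.
    creator = registry.get_plugin_creator("GroupNorm", "1", "")
    if creator is not None:
        plugin = creator.create_plugin("gn", trt.PluginFieldCollection([]))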

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
  • APIs deprecated in TensorRT 8.4 will be retained until at least 2/2023.
  • APIs deprecated in TensorRT 8.5 will be retained until at least 9/2023.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax, and the mode is chosen automatically based on the shape of the input tensor:
    • the first mode triggers when all non-batch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may result in small accuracy errors. Also, batch sizes greater than 1 are supported only for DLA 3.9.0 and later. Refer to the DLA Supported Layers section in the TensorRT Developer Guide for details.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as “Unable to load library: nvinfer_builder_resource.dll” if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue (see the sketch following this list).
  • CUDA graph capture will capture inputConsumed and profiler events only when using a CUDA 11.x build with a CUDA 11.1 or later driver (r455 or later).
  • The DLA compiler is capable of removing identity transposes, but it cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshape). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes with an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless you merge the transposes together manually in the model definition in advance.
  • In QAT networks, a group convolution that has a Q/DQ pair before but no Q/DQ pair after could previously run in INT8-IN-FP32-OUT mixed precision. However, on NVIDIA Hopper™ architecture GPUs, the required GPU kernels may be missing when the input channel count is small, causing a fallback to FP32-IN-FP32-OUT. This will be fixed in a future release.
  • On PowerPC platforms, samples that depend on TensorFlow, ONNX Runtime, and PyTorch are unable to run due to missing Python module dependencies. These frameworks have not been built for PowerPC and/or published to standard repositories.
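For the RNNv2Layer workspace note above, the workspace limit is raised through the builder configuration. A minimal Python sketch, assuming a 1 GiB limit is appropriate for your model (tune the value as needed):

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()
    # Allow tactics to use up to 1 GiB of scratch workspace (adjust as needed).
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)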

Deprecated and Removed Features

The following features are deprecated in TensorRT 8.5.2:
  • TensorRT 8.5 will be the last release supporting NVIDIA Kepler (SM 3.x) devices. Support for Maxwell (SM 5.x) devices will be dropped in TensorRT 9.0.

Fixed Issues

  • Fixed the accuracy issue of BART with batch size > 1.
  • Memory estimation for dynamic shapes has been improved, which allows some models to run that previously did not.
  • The ONNX parser recognizes the allowzero attribute on Reshape operations for opset 5 and higher, even though the ONNX specification requires it only for opset 14 and higher. Setting this attribute to 1 can correct networks that are incorrect for empty tensors and lets TensorRT analyze the memory requirements for dynamic shapes more accurately.
  • Documentation for getProfileShapeValues() incorrectly cited getProfileShape() as the preferred replacement. Now, the documentation has been corrected to cite getShapeValues() as the replacement.
  • TensorRT header files now compile with GCC when both the -Wnon-virtual-dtor and -Werror compilation options are specified.
  • The ONNX parser was getting incorrect min and max values when they were FP16 numbers. This has been fixed in this release.
  • Improved TensorRT error handling to skip tactics instead of crashing the builder.
  • Fixed a segmentation fault for the EfficientDet network with batch size > 1.
  • Fixed the “could not find any implementation” error for 3D convolutions with a kernel depth of 1.
  • Added an error message in the ONNX parser when training_mode=1 is set in the BatchNormalization operator.
  • Replaced the Python sample for creating custom plugins with a new sample that uses the ONNX parser instead of the UFF parser.
  • Fixed an issue that prohibited some graph fusions and caused performance degradation for Stable Diffusion in FP16 precision.
  • Fixed an issue that caused some nodes within conditional subgraphs to be unsupported in INT8 calibration mode.

Announcements

  • In the next TensorRT release, CUDA toolkit 10.2 support will be dropped.
  • TensorRT 8.5 will be the last release supporting NVIDIA Kepler (SM 3.x) devices. Support for Maxwell (SM 5.x) devices will be dropped in TensorRT 9.0.
  • In the next TensorRT release, cuDNN, cuBLAS, and cuBLASLt tactic sources will be turned off by default in builder profiling. TensorRT plans to remove the cuDNN, cuBLAS, and cuBLASLt dependency in future releases. Use the PreviewFeature flag kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 to evaluate the functional and performance impact of disabling cuBLAS and cuDNN, and report back to TensorRT if there are critical regressions in your use cases (see the sketch following this list).
  • TensorRT Python wheel files before TensorRT 8.5, such as TensorRT 8.4, were published to the NGC PyPI repo. Starting with TensorRT 8.5, Python wheels will instead be published to upstream PyPI. This will make it easier to install TensorRT because it requires no prerequisite steps. Also, the name of the Python package for installation has changed from nvidia-tensorrt to just tensorrt.
  • The C++ and Python API documentation in previous releases was included inside the tar file packaging. This release no longer bundles the documentation inside the tar file because the online documentation can be updated after the release, which avoids mistakes from stale documentation inside the packages.
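The preview feature flag mentioned above is set through the builder configuration. A minimal Python sketch (the C++ equivalent uses IBuilderConfig::setPreviewFeature):

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()
    # Opt in to building without the cuDNN/cuBLAS/cuBLASLt tactic sources.
    config.set_preview_feature(
        trt.PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805, True)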

Known Issues

Functional
  • There is a known issue with very large graphs that causes out-of-memory errors with specific input shapes, even though a larger input shape may run successfully.
  • TensorRT might output wrong results when there are GEMM/Conv/MatMul ops followed by a Reshape op.
  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. The recommendation is to suppress these reports by providing a Valgrind suppression file with the following contents, and to add the option --keep-debuginfo=yes to the Valgrind command line (for example, valgrind --leak-check=full --keep-debuginfo=yes --suppressions=trt.supp <application>, where trt.supp is the suppression file).
    {
       Memory leak errors with dlopen
       Memcheck:Leak
       match-leak-kinds: definite
       ...
       fun:*dlopen*
       ...
    }
    {
       Memory leak errors with nvrtc
       Memcheck:Leak
       match-leak-kinds: definite
       fun:malloc
       obj:*libnvrtc.so*
       ...
    }
  • The Python sample yolov3_onnx has a known issue when installing the requirements with Python 3.10. The recommendation is to use a Python version < 3.10 when running the sample.
  • The auto-tuner assumes that the number of indices returned by INonZeroLayer is half of the number of input elements. Thus, networks that depend on tighter assumptions for correctness may fail to build.
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a “could not find any implementation” error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.
  • One of the deconvolution algorithms sourced from cuDNN exhibits non-deterministic execution. Disabling cuDNN tactics will prevent this algorithm from being chosen (refer to IBuilderConfig::setTacticSources).
  • TensorRT in FP16 mode does not perform cast operations correctly when only the output types are set, but not the layer precisions.
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9 or newer. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9 or newer.
  • You may see the following error after installing TensorRT from the network repo:
    "Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared object file: No such file or directory"
    cuDNN depends on the RPM dependency libcublas.so.11()(64bit); however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
  • When using DLA, an elementwise, unary, or activation layer immediately followed by a scale layer may lead to accuracy degradation in INT8 mode. Note that this is a pre-existing issue also found in previous releases rather than a regression.
  • When using DLA, INT8 convolutions followed by FP16 layers may cause accuracy degradation. In such cases, either change the convolution to FP16 or the subsequent layer to INT8.
  • When using the algorithm selector API, the HWC1 and HWC4 DLA formats are both reported as TensorFormat::kDLA_HWC4.
  • For transformer decoder based models (such as GPT2) with a dynamic sequence length, TensorRT 8.5 requires additional workspace (up to 2x) as compared to previous releases.
Performance
  • There is a ~12% performance drop on NVIDIA Ampere architecture GPUs for the BERT network on Windows systems.
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is an up to 22% performance drop for Jasper networks compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta or NVIDIA Turing GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 5% performance drop for the InceptionV4 network compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 27% performance drop for BART compared to TensorRT 8.2 when running with both FP16 and INT8 precisions enabled on T4. This performance drop can be fixed by disabling the INT8 precision flag.
  • There is an up to 10% performance drop for the SegResNet network compared to TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere architecture GPUs due to a cuDNN regression in the InstanceNormalization plug-in. This will be fixed in a future TensorRT release. You can work around the regression by reverting the cuDNN version to cuDNN 8.2.1.
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to when running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows a 32% drop when the network runs on DLA as compared to when the last SoftMax layer runs on a GPU.
  • There is an up to 20% performance variation between different engines built from the same network for some LSTM networks due to unstable tactic selections.
  • There is an up to 11% performance variation for some LSTM networks during inference depending on the order of CUDA stream creation on NVIDIA Turing GPUs. This will be fixed in r525 drivers.
  • Due to the difference in DLA hardware specification between NVIDIA Orin and Xavier, a relative increase in latency is expected when running DLA FP16 operations involving convolution (which includes deconvolution, fully-connected, and concat) on NVIDIA Orin as compared to running on Xavier. At the same DLA clocks and memory bandwidth, INT8 convolution operations on NVIDIA Orin are expected to be about 4x faster than on Xavier, whereas FP16 convolution operations on NVIDIA Orin are expected to be about 40% slower than on Xavier.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • For transformer-based networks such as BERT and GPT, TensorRT can consume CPU memory up to 10 times the model size during compilation.
  • There is an up to 17% performance regression for DeepASR networks at BS=1 on NVIDIA Turing GPUs.
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is an up to 10-11% performance regression on Xavier compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • On Xavier, DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • There is an up to 5% performance drop for Megatron networks in FP32 precision at batch-size = 1 between CUDA 11.8 and CUDA 10.2 on NVIDIA Volta GPUs. This performance drop does not happen on NVIDIA Turing or later GPUs.
  • There is an up to 23% performance drop between H100 and A100 for some ConvNets in TF32 precision when running at the same SM clock frequency. This will be improved in future TensorRT versions.
  • There is an up to 8% performance drop between H100 and A100 for some transformers, including BERT, BART, T5, and GPT2, in FP16 precision at BS=1 when running at the same SM clock frequency. This will be improved in future TensorRT versions.
  • H100 performance for some ConvNets in TF32 precision is not fully optimized. This will be improved in future TensorRT versions.
  • There is an up to 6% performance drop for ResNeXt-50 QAT networks in INT8, FP16, and FP32 precision at batch-size = 1 compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • H100 performance for some Transformers in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.
  • H100 performance for some ConvNets containing depthwise convolutions (like QuartzNets and EfficientDet-D0) in INT8 precision is not fully optimized. This will be improved in future TensorRT versions.
  • H100 performance for some LSTMs in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.
  • H100 performance for some 3DUnets is not fully optimized. This will be improved in future TensorRT versions.
  • There is an up to 6% performance drop for OpenRoadNet networks in TF32 precision compared to TensorRT 8.4 on NVIDIA Ampere architecture GPUs.
  • There is an up to 6% performance drop for T5 networks in FP32 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs due to a functionality fix.
  • There is an up to 5% performance drop for UNet networks in INT8 precision with explicit quantization on CUDA 11.x compared to CUDA 10.2 on NVIDIA Turing GPUs.
  • There is an up to 6% performance drop for WaveRNN networks in FP16 precision compared to TensorRT 8.4 on CUDA 11.8 on NVIDIA Volta GPUs. Downgrading CUDA to CUDA 11.6 fixes the issue.
  • There is an up to 13% performance drop for Megatron networks in FP16 precision on Tesla T4 GPUs when the kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 preview feature is enabled.
  • There is an up to 16% performance drop for LSTM networks in FP32 precision compared to TensorRT 8.4 on NVIDIA Pascal GPUs.
  • There is an up to 17% performance drop for LSTM on Windows in FP16 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • There is an up to 7% performance drop for Artifact Reduction networks involving Deconvolution ops in INT8 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • With the kFASTER_DYNAMIC_SHAPES_0805 preview feature enabled on GPT-style decoder models, there can be an up to 20% performance regression for odd sequence lengths, compared to TensorRT without the preview feature enabled.

TensorRT Release 8.5.1

These are the TensorRT 8.5.1 Release Notes and are applicable to x86 Linux, Windows, JetPack, and PowerPC Linux users. This release incorporates Arm®-based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Key Features and Enhancements

This TensorRT release includes the following key features and enhancements.
  • Added support for NVIDIA Hopper™ (H100) architecture. TensorRT now supports compute capability 9.0 deep learning kernels for FP32, TF32, FP16, and INT8, using the H100 Tensor Cores and delivering increased MMA throughput over A100. These kernels also benefit from new H100 features such as Asynchronous Transaction Barriers, Tensor Memory Accelerator (TMA), and Thread Block Clusters for increased efficiency.
  • Added support for NVIDIA Ada Lovelace architecture. TensorRT now supports compute capability 8.9 deep learning kernels for FP32, TF32, FP16, and INT8.
  • Ubuntu 22.04 packages are now provided in both the CUDA network repository and in the local repository format starting with this release.
  • Shapes of tensors can now depend on values computed on the GPU. For example, the last dimension of the output tensor from INonZeroLayer depends on how many input values are non-zero. For more information, refer to the Dynamically Shaped Output section in the TensorRT Developer Guide.
  • TensorRT supports named input dimensions. In an ONNX model, two dimensions with the same named dimension parameter are considered equal. For more information, refer to the Named Dimensions section in the TensorRT Developer Guide.
  • TensorRT supports offloading the IShuffleLayer to DLA. Refer to the Layer Support and Restrictions section in the TensorRT Developer Guide for details on the restrictions for running IShuffleLayer on DLA.
  • Added the following layers:
    • The IGatherLayer, ISliceLayer, IConstantLayer, and IConcatenationLayer layers have been updated to support boolean types.
    • INonZeroLayer, INMSLayer (non-max suppression), IOneHotLayer, and IGridSampleLayer.
    For more information, refer to the TensorRT Operator’s Reference.
  • TensorRT supports heuristic-based builder tactic selection. This is controlled by nvinfer1::BuilderFlag::kENABLE_TACTIC_HEURISTIC. For more information, refer to the Tactic Selection Heuristic section in the TensorRT Developer Guide.
  • The builder timing cache has been updated to support transformer-based networks such as BERT and GPT. For more information, refer to the Timing Cache section in the TensorRT Developer Guide.
  • TensorRT supports the RoiAlign ONNX operator through the newly added RoiAlign plug-in. Both opset-10 and opset-16 versions of the operator are supported. For more information about the supported ONNX operators, refer to GitHub.
  • TensorRT supports disabling the use of external tactic sources, including cuDNN and cuBLAS, in the core library, while still allowing plug-ins to use cuDNN and cuBLAS. Set the preview feature flag nvinfer1::PreviewFeature::kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 using IBuilderConfig::setPreviewFeature to control this behavior.
  • TensorRT supports a new preview feature, nvinfer1::PreviewFeature::kFASTER_DYNAMIC_SHAPES_0805, which aims to reduce build time, runtime, and memory requirements for dynamically shaped transformer-based networks.
  • TensorRT supports Lazy Module Loading, a CUDA feature that can significantly reduce the amount of GPU memory consumed. Refer to the CUDA Programming Guide and the Lazy Module Loading section in the TensorRT Developer Guide for more information.
  • TensorRT supports the persistent cache, a CUDA feature that allows data to be cached persistently in L2. Refer to the Persistent Cache Management section in the TensorRT Developer Guide for more information.
  • TensorRT supports a preview feature API, a mechanism that enables you to opt in to specific experimental features. Refer to the Preview Features section in the TensorRT Developer Guide for more information.
  • The following C++ API functions were added:
    • ITensor::setDimensionName()
    • ITensor::getDimensionName()
    • IResizeLayer::setCubicCoeff()
    • IResizeLayer::getCubicCoeff()
    • IResizeLayer::setExcludeOutside()
    • IResizeLayer::getExcludeOutside()
    • IBuilderConfig::setPreviewFeature()
    • IBuilderConfig::getPreviewFeature()
    • ICudaEngine::getTensorShape()
    • ICudaEngine::getTensorDataType()
    • ICudaEngine::getTensorLocation()
    • ICudaEngine::isShapeInferenceIO()
    • ICudaEngine::getTensorIOMode()
    • ICudaEngine::getTensorBytesPerComponent()
    • ICudaEngine::getTensorComponentsPerElement()
    • ICudaEngine::getTensorFormat()
    • ICudaEngine::getTensorFormatDesc()
    • ICudaEngine::getProfileShape()
    • ICudaEngine::getNbIOTensors()
    • ICudaEngine::getIOTensorName()
    • IExecutionContext::getTensorStrides()
    • IExecutionContext::setInputShape()
    • IExecutionContext::getTensorShape()
    • IExecutionContext::setTensorAddress()
    • IExecutionContext::getTensorAddress()
    • IExecutionContext::setInputTensorAddress()
    • IExecutionContext::getOutputTensorAddress()
    • IExecutionContext::inferShapes()
    • IExecutionContext::setInputConsumedEvent()
    • IExecutionContext::getInputConsumedEvent()
    • IExecutionContext::setOutputAllocator()
    • IExecutionContext::getOutputAllocator()
    • IExecutionContext::getMaxOutputSize()
    • IExecutionContext::setTemporaryStorageAllocator()
    • IExecutionContext::getTemporaryStorageAllocator()
    • IExecutionContext::enqueueV3()
    • IExecutionContext::setPersistentCacheLimit()
    • IExecutionContext::getPersistentCacheLimit()
    • IExecutionContext::setNvtxVerbosity()
    • IExecutionContext::getNvtxVerbosity()
    • INetworkDefinition::addOneHot()
    • INetworkDefinition::addNonZero()
    • INetworkDefinition::addGridSample()
    • INetworkDefinition::addNMS()
  • The following C++ classes were added:
    • IOneHotLayer
    • IGridSampleLayer
    • INonZeroLayer
    • INMSLayer
    • IOutputAllocator
  • The following C++ enum values were added:
    • InterpolationMode::kCUBIC
    • FillOperation::kRANDOM_NORMAL
    • BuilderFlag::kREJECT_EMPTY_ALGORITHMS
    • BuilderFlag::kENABLE_TACTIC_HEURISTIC
    • TacticSource::kJIT_CONVOLUTIONS
    • DataType::kUINT8
  • The following C++ enum classes were added:
    • TensorIOMode
    • PreviewFeature
  • The following Python API functions/properties were added (a combined usage sketch of the new tensor-name-based APIs follows this list):
    • ITensor.set_dimension_name()
    • ITensor.get_dimension_name()
    • IResizeLayer.cubic_coeff
    • IResizeLayer.exclude_outside
    • IBuilderConfig.set_preview_feature()
    • IBuilderConfig.get_preview_feature()
    • ICudaEngine.get_tensor_shape()
    • ICudaEngine.get_tensor_dtype()
    • ICudaEngine.get_tensor_location()
    • ICudaEngine.is_shape_inference_io()
    • ICudaEngine.get_tensor_mode()
    • ICudaEngine.get_tensor_bytes_per_component()
    • ICudaEngine.get_tensor_components_per_element()
    • ICudaEngine.get_tensor_format()
    • ICudaEngine.get_tensor_format_desc()
    • ICudaEngine.get_tensor_profile_shape()
    • ICudaEngine.num_io_tensors
    • ICudaEngine.get_tensor_name()
    • IExecutionContext.get_tensor_strides()
    • IExecutionContext.set_input_shape()
    • IExecutionContext.get_tensor_shape()
    • IExecutionContext.set_tensor_address()
    • IExecutionContext.get_tensor_address()
    • IExecutionContext.infer_shapes()
    • IExecutionContext.set_input_consumed_event()
    • IExecutionContext.get_input_consumed_event()
    • IExecutionContext.set_output_allocator()
    • IExecutionContext.get_output_allocator()
    • IExecutionContext.get_max_output_size()
    • IExecutionContext.temporary_allocator
    • IExecutionContext.execute_async_v3()
    • IExecutionContext.persistent_cache_limit
    • IExecutionContext.nvtx_verbosity
    • INetworkDefinition.add_one_hot()
    • INetworkDefinition.add_non_zero()
    • INetworkDefinition.add_grid_sample()
    • INetworkDefinition.add_nms()
  • The following Python classes were added:
    • IOneHotLayer
    • IGridSampleLayer
    • INonZeroLayer
    • INMSLayer
    • IOutputAllocator
  • The following Python enum values were added:
    • InterpolationMode.CUBIC
    • FillOperation.RANDOM_NORMAL
    • BuilderFlag.REJECT_EMPTY_ALGORITHMS
    • BuilderFlag.ENABLE_TACTIC_HEURISTIC
    • TacticSource.JIT_CONVOLUTIONS
    • DataType.UINT8
  • The following Python enum classes were added:
    • TensorIOMode
    • PreviewFeature
  • Removed the TensorRT layers chapter from the TensorRT Developer Guide appendix section and created a standalone TensorRT Operator’s Reference document.
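Many of the additions above replace binding-index-based execution with tensor-name-based execution. The following is a minimal Python sketch of running inference with the new APIs, assuming a serialized engine file named model.engine with one dynamically shaped input named "input"; the file name, tensor name, shape, and the use of the cuda-python package for device allocation are all illustrative assumptions:

    import numpy as np
    import tensorrt as trt
    from cuda import cudart  # assumption: the cuda-python package is installed

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open("model.engine", "rb") as f:  # illustrative engine file
        engine = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()

    # New in 8.5: I/O tensors are enumerated and addressed by name.
    names = [engine.get_tensor_name(i) for i in range(engine.num_io_tensors)]
    context.set_input_shape("input", (1, 3, 224, 224))  # illustrative shape

    buffers = {}
    for name in names:
        shape = context.get_tensor_shape(name)
        dtype = np.dtype(trt.nptype(engine.get_tensor_dtype(name)))
        _, buffers[name] = cudart.cudaMalloc(trt.volume(shape) * dtype.itemsize)
        context.set_tensor_address(name, buffers[name])

    # (copy input data into buffers["input"] with cudaMemcpy before executing)
    _, stream = cudart.cudaStreamCreate()
    context.execute_async_v3(stream)  # replaces the bindings-based enqueue
    cudart.cudaStreamSynchronize(stream)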

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
  • APIs deprecated in TensorRT 8.4 will be retained until at least 2/2023.
  • APIs deprecated in TensorRT 8.5 will be retained until at least 9/2023.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax, and the mode is chosen automatically based on the shape of the input tensor:
    • the first mode triggers when all non-batch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may result in small accuracy errors. Also, batch sizes greater than 1 are supported only for DLA 3.9.0 and later. Refer to the DLA Supported Layers section in the TensorRT Developer Guide for details.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as “Unable to load library: nvinfer_builder_resource.dll” if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using a CUDA 11.x build with a CUDA 11.1 or later driver (r455 or later).
  • The DLA compiler is capable of removing identity transposes, but it cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshape). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes with an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless you merge the transposes together manually in the model definition in advance.
  • In QAT networks, a group convolution that has a Q/DQ pair before but no Q/DQ pair after could previously run in INT8-IN-FP32-OUT mixed precision. However, on NVIDIA Hopper™ architecture GPUs, the required GPU kernels may be missing when the input channel count is small, causing a fallback to FP32-IN-FP32-OUT. This will be fixed in a future release.
  • On PowerPC platforms, samples that depend on TensorFlow, ONNX Runtime, and PyTorch are unable to run due to missing Python module dependencies. These frameworks have not been built for PowerPC and/or published to standard repositories.

Deprecated and Removed Features

The following features are deprecated in TensorRT 8.5.1:
  • TensorRT 8.5 will be the last release supporting NVIDIA Kepler (SM 3.x) devices. Support for Maxwell (SM 5.x) devices will be dropped in TensorRT 9.0.

Fixed Issues

  • TensorRT’s optimizer would sometimes incorrectly retarget Concat layer inputs produced by Cast layers, resulting in data corruption. This has been fixed in this release.
  • There was an up to 5% performance drop for the ShuffleNet network compared to TensorRT 8.2 when running in INT8 precision on NVIDIA Ampere architecture GPUs. This has been fixed in this release.
  • There was an up to 10% performance difference for the WaveRNN network between different OS when running in FP16 precision on NVIDIA Ampere architecture GPUs. This issue has been fixed in this release.
  • When the TensorRT static library was used to build engines and the NVPTXCompiler static library was used outside of the TensorRT core library at the same time, it was possible to trigger a crash of the process in rare cases. This issue has been fixed in this release.
  • There was a known issue where, when ProfilingVerbosity was set to kDETAILED, the enqueueV2() call could take up to 2 ms longer compared to ProfilingVerbosity=kNONE or kLAYER_NAMES_ONLY. Now, you can use the setNvtxVerbosity() API to disable the costly detailed NVTX generation at runtime without rebuilding the engine.
  • There was a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode
    • up to 10% in FP16 mode on NVIDIA Maxwell® and NVIDIA Pascal® GPUs
    This issue has been fixed in this release.
  • When performing PTQ with TensorRT on tensors of rank > 4, some layers could cause an assertion about invalid Region Dims. Empty tensors in the network could also cause a segmentation fault within the INT8 calibration process. These issues were fixed in this release.
  • When performing an L2_Normalization in FP16 precision, undefined behavior could occur from a fusion. This fusion could be disabled by marking the input to the L2_Normalization as a network output. This issue has been fixed in this release.
  • TensorRT used to incorrectly allow subnetworks of the input network with more than 16 I/O tensors to be offloaded to DLA, because intermediate tensors of the subnetwork that were also full-network outputs were not counted as I/O tensors. This has been fixed. You may infrequently experience slightly increased fragmentation.
  • There was a ~19% performance drop on NVIDIA Turing GPUs for the DRIVENet network in FP16 precision compared to TensorRT 8.4. This regression has been fixed in this release.
  • On NVIDIA Hopper GPUs, the QKV plug-in, which is used by the open-source BERT demo, would fail for fixed sequence lengths of 128 and 384 with FP16. This issue has been fixed in this release.
  • The DLA samples can now be compiled using the commands listed in the README.
  • In convolution or GEMM layers, some extra tactic information would be printed out. This issue has been fixed in this release.
  • When using QAT, horizontal fusion of convolutions at the end of the network would incorrectly propagate the quantization scales of the weights, resulting in incorrect outputs. This issue has been fixed in this release.
  • With the TensorRT ONNX runtime, building an engine for a large graph would fail because the implementation of a foreign node could not be found. This issue has been fixed in this release.
  • The BatchedNMSDynamicPlugin supports dynamic batch only, and the behavior was undefined if other dimensions were set to be dynamic. This issue has been fixed in this release.
  • On NVIDIA Hopper GPUs, the QKV plug-in, which is used by the open-source BERT demo, would produce inaccurate results for sequence lengths other than 128 and 384. This issue has been fixed in this release.
  • A new tactic source kJIT_CONVOLUTIONS was added, however, enabling or disabling it had no impact as it was still in development. This issue has been fixed in this release.
  • There was a known issue when using INT8 calibration for networks with ILoop or IIfConditional layers. This issue has been fixed in this release.
  • There was a ~17% performance drop on NVIDIA Ampere GPUs for the inflated 3D video classification network in TF32 precision compared to TensorRT 8.4.
  • There were up to 25% performance drops for various networks on SBSA systems compared to TensorRT 8.4.
  • There was a ~15% performance drop on NVIDIA Ampere GPUs for the ResNet_v2_152 network in TF32 precision compared to TensorRT 8.4.
  • There was a 17% performance drop for networks containing Deconv+Concat or Slice+Deconv patterns.
  • There was a ~9% performance drop on Volta and Turing GPUs for the WaveRNN network.
  • Some networks would see a small increase in deserialization time. This issue has been fixed in this release.
  • When using QAT, horizontal fusion of two or more convolutions that have quantized inputs and non-quantized outputs would result in incorrect weights quantization. This issue has been fixed in this release.
  • When invoking IQuantizeLayer::setAxis with the axis set to -1, the graph optimization process triggered an assertion. This issue has been fixed in this release (a sketch of setting a quantization axis follows this list).
  • If the values (not just dimensions) of an output tensor from a plugin were used to compute a shape, the engine would fail to build. This issue has been fixed in this release.
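For reference, the quantization axis mentioned above is set on the layer returned by INetworkDefinition::addQuantize (add_quantize in Python). A minimal Python sketch with illustrative shapes and scale values:

    import numpy as np
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

    inp = network.add_input("input", trt.float32, (1, 64, 28, 28))
    # Per-channel scales for the channel dimension; values are illustrative.
    scales = network.add_constant((64,), trt.Weights(np.ones(64, np.float32)))
    q = network.add_quantize(inp, scales.get_output(0))
    q.axis = 1  # use a non-negative dimension index for the quantization axis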

Announcements

  • In the next TensorRT release, CUDA toolkit 10.2 support will be dropped.
  • TensorRT 8.5 will be the last release supporting NVIDIA Kepler (SM 3.x) devices. Support for Maxwell (SM 5.x) devices will be dropped in TensorRT 9.0.
  • In the next TensorRT release, cuDNN, cuBLAS, and cuBLASLt tactic sources will be turned off by default in builder profiling. TensorRT plans to remove the cuDNN, cuBLAS, and cuBLASLt dependency in future releases. Use the PreviewFeature flag kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 to evaluate the functional and performance impact of disabling cuBLAS and cuDNN and report back to TensorRT if there are critical regressions in your use cases.
  • TensorRT Python wheel files before TensorRT 8.5, such as TensorRT 8.4, were published to the NGC PyPI repo. Starting with TensorRT 8.5, Python wheels will instead be published to upstream PyPI. This will make it easier to install TensorRT because it requires no prerequisite steps. Also, the name of the Python package for installation has changed from nvidia-tensorrt to just tensorrt.
  • The C++ and Python API documentation in previous releases was included inside the tar file packaging. This release no longer bundles the documentation inside the tar file because the online documentation can be updated after the release, which avoids mistakes from stale documentation inside the packages.

Known Issues

Functional
  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. The recommendation is to suppress these reports by providing a Valgrind suppression file with the following contents, and to add the option --keep-debuginfo=yes to the Valgrind command line.
    {
       Memory leak errors with dlopen
       Memcheck:Leak
       match-leak-kinds: definite
       ...
       fun:*dlopen*
       ...
    }
    {
       Memory leak errors with nvrtc
       Memcheck:Leak
       match-leak-kinds: definite
       fun:malloc
       obj:*libnvrtc.so*
       ...
    }
  • The Python sample yolov3_onnx has a known issue when installing the requirements with Python 3.10. The recommendation is to use a Python version < 3.10 when running the sample.
  • The auto-tuner assumes that the number of indices returned by INonZeroLayer is half of the number of input elements. Thus, networks that depend on tighter assumptions for correctness may fail to build.
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a “could not find any implementation” error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.
  • One of the deconvolution algorithms sourced from cuDNN exhibits non-deterministic execution. Disabling cuDNN tactics will prevent this algorithm from being chosen (refer to IBuilderConfig::setTacticSources).
  • TensorRT in FP16 mode does not perform cast operations correctly when only the output types are set, but not the layer precisions.
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9 or newer. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9 or newer.
  • You may see the following error after installing TensorRT from the network repo:
    "Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared object file: No such file or directory"
    cuDNN depends on the RPM dependency libcublas.so.11()(64bit); however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
  • When using DLA, an elementwise, unary, or activation layer immediately followed by a scale layer may lead to accuracy degradation in INT8 mode. Note that this is a pre-existing issue also found in previous releases rather than a regression.
  • When using DLA, INT8 convolutions followed by FP16 layers may cause accuracy degradation. In such cases, either change the convolution to FP16 or the subsequent layer to INT8.
  • When using the algorithm selector API, the HWC1 and HWC4 DLA formats are both reported as TensorFormat::kDLA_HWC4.
  • For transformer decoder based models (such as GPT2) with a dynamic sequence length, TensorRT 8.5 requires additional workspace (up to 2x) as compared to previous releases.
Performance
  • There is a ~12% performance drop on NVIDIA Ampere architecture GPUs for the BERT network on Windows systems.
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is an up to 22% performance drop for Jasper networks compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta or NVIDIA Turing GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 5% performance drop for the InceptionV4 network compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 27% performance drop for BART compared to TensorRT 8.2 when running with both FP16 and INT8 precisions enabled on T4. This performance drop can be fixed by disabling the INT8 precision flag.
  • There is an up to 10% performance drop for the SegResNet network compared to TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere architecture GPUs due to a cuDNN regression in the InstanceNormalization plug-in. This will be fixed in a future TensorRT release. You can work around the regression by reverting the cuDNN version to cuDNN 8.2.1.
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to when running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows a 32% drop when the network runs on DLA as compared to when the last SoftMax layer runs on a GPU.
  • There is an up to 20% performance variation between different engines built from the same network for some LSTM networks due to unstable tactic selections.
  • There is an up to 11% performance variation for some LSTM networks during inference depending on the order of CUDA stream creation on NVIDIA Turing GPUs. This will be fixed in r525 drivers.
  • Due to the difference in DLA hardware specification between NVIDIA Orin and Xavier, a relative increase in latency is expected when running DLA FP16 operations involving convolution (which includes deconvolution, fully-connected, and concat) on NVIDIA Orin as compared to running on Xavier. At the same DLA clocks and memory bandwidth, INT8 convolution operations on NVIDIA Orin are expected to be about 4x faster than on Xavier, whereas FP16 convolution operations on NVIDIA Orin are expected to be about 40% slower than on Xavier.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • For transformer-based networks such as BERT and GPT, TensorRT can consume CPU memory up to 10 times the model size during compilation.
  • There is an up to 17% performance regression for DeepASR networks at BS=1 on NVIDIA Turing GPUs.
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is an up to 10-11% performance regression on Xavier compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • On Xavier, DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • There is an up to 5% performance drop for Megatron networks in FP32 precision at batch-size = 1 between CUDA 11.8 and CUDA 10.2 on NVIDIA Volta GPUs. This performance drop does not happen on NVIDIA Turing or later GPUs.
  • There is an up to 23% performance drop between H100 and A100 for some ConvNets in TF32 precision when running at the same SM clock frequency. This will be improved in future TensorRT versions.
  • There is an up to 8% performance drop between H100 and A100 for some transformers, including BERT, BART, T5, and GPT2, in FP16 precision at BS=1 when running at the same SM clock frequency. This will be improved in future TensorRT versions.
  • H100 performance for some ConvNets in TF32 precision is not fully optimized. This will be improved in future TensorRT versions.
  • There is an up to 6% performance drop for ResNeXt-50 QAT networks in INT8, FP16, and FP32 precision at batch-size = 1 compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • H100 performance for some Transformers in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.
  • H100 performance for some ConvNets containing depthwise convolutions (like QuartzNets and EfficientDet-D0) in INT8 precision is not fully optimized. This will be improved in future TensorRT versions.
  • H100 performance for some LSTMs in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.
  • H100 performance for some 3DUnets is not fully optimized. This will be improved in future TensorRT versions.
  • There is an up to 6% performance drop for OpenRoadNet networks in TF32 precision compared to TensorRT 8.4 on NVIDIA Ampere architecture GPUs.
  • There is an up to 6% performance drop for T5 networks in FP32 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs due to a functionality fix.
  • There is an up to 5% performance drop for UNet networks in INT8 precision with explicit quantization on CUDA 11.x compared to CUDA 10.2 on NVIDIA Turing GPUs.
  • There is an up to 6% performance drop for WaveRNN networks in FP16 precision compared to TensorRT 8.4 on CUDA 11.8 on NVIDIA Volta GPUs. Downgrading CUDA to CUDA 11.6 fixes the issue.
  • There is an up to 13% performance drop for Megatron networks in FP16 precision on Tesla T4 GPUs when the kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 preview feature is enabled.
  • There is an up to 16% performance drop for LSTM networks in FP32 precision compared to TensorRT 8.4 on NVIDIA Pascal GPUs.
  • There is an up to 17% performance drop for LSTM on Windows in FP16 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • There is an up to 7% performance drop for Artifact Reduction networks involving Deconvolution ops in INT8 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • With the kFASTER_DYNAMIC_SHAPES_0805 preview feature enabled on GPT-style decoder models, there can be an up to 20% performance regression for odd sequence lengths, compared to TensorRT without the preview feature enabled.

TensorRT Release 8.4.3

These are the TensorRT 8.4.3 Release Notes and are applicable to x86 Linux and Windows users. This release incorporates Arm®-based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
  • APIs deprecated in TensorRT 8.4 will be retained until at least 2/2023.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax, and the mode is chosen automatically based on the shape of the input tensor:
    • the first mode triggers when all non-batch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may result in small accuracy errors. Also, batch sizes greater than 1 are supported only for DLA 3.9.0 and later. Refer to the DLA Supported Layers section in the TensorRT Developer Guide for details.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as “Unable to load library: nvinfer_builder_resource.dll” if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • The builder may require up to 60% more memory to build an engine.
  • CUDA graph capture will capture inputConsumed and profiler events only when using a CUDA 11.x build with a CUDA 11.1 or later driver (r455 or later).
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in NVIDIA JetPack 4.5 for ResNet-like networks on NVIDIA DLA on Xavier platforms when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues. NVIDIA Orin platforms are not affected by this.

Fixed Issues

  • When parsing networks with the ONNX Expand operator applied to a scalar input, TensorRT would error out. This issue has been fixed in this release.
  • The custom ClipPlugin used in the uff_custom_plugin sample had an issue with a plugin parameter not being serialized, leading to a failure when the plugin needed to be deserialized. This issue has been fixed with proper serialization/deserialization.
  • When working with transformer-based networks with multiple dynamic dimensions, if the network had shuffle operations that coalesced multiple dynamic dimensions into one, and that shuffle was further used in a reduction operation such as a MatrixMultiply layer, the results could potentially be corrupted. This issue has been fixed in this release.
  • When working with recurrent networks containing Loops and Fill layers, the engine could fail to build. This issue has been fixed in this release.
  • In some rare cases, when converting a MatrixMultiply layer to a Convolution layer for optimization purposes, shape inference could fail. This issue has been fixed in this release.
  • In some cases, tensor memory was not zero-initialized for vectorized dimensions, which resulted in NaNs in the output tensor during engine execution. This issue has been fixed in this release.
  • For the HuggingFace demos, the T5-3B model had only been verified on A100, and was not expected to work on A10, T4, and so on. This issue has been fixed in this release.
  • Certain spatial dimensions may have caused crashes during DLA optimization for models using single-channel inputs. This issue has been fixed in this release.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may have created an internal error causing the application to crash while building the engine. This issue has been fixed in this release.
  • For some networks using sparsity, TensorRT may have produced inaccurate results. This issue has been fixed in this release.

Announcements

  • CUDA 11.7 added a feature called lazy loading; however, this feature is not supported by TensorRT 8.4 because the CUDA 11.x binaries were built with CUDA Toolkit 11.6.

Known Issues

Functional
  • When performing an L2_Normalization in float16 precision, there is undefined behavior occurring from a fusion. This fusion can be disabled by marking the input to the L2_Normalization as a network output.
  • When performing PTQ with TensorRT on tensors of rank > 4, some layers may cause an assertion about invalid Region Dims. This can be worked around by fusing the index layers into the 4th dimension so that the tensor has rank 4.
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a “could not find any implementation” error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.
  • When the TensorRT static library is used to build engines and the NVPTXCompiler static library is also used outside of the TensorRT core library at the same time, it is possible to trigger a crash of the process in rare cases.
  • TensorRT should only allow up to a total of 16 I/O tensors for a single subnetwork offloaded to DLA. However, there is a flaw in the logic that incorrectly allows more than 16 I/O tensors. You may need to manually specify the per-layer device placement to avoid creating subnetworks with more than 16 I/O tensors so that engine construction succeeds (see the sketch following this list). This restriction will be properly reinstated in a future release.
  • One of the deconvolution algorithms sourced from cuDNN exhibits non-deterministic execution. Disabling cuDNN tactics will prevent this algorithm from being chosen (refer to IBuilderConfig::setTacticSources).
  • Due to ABI compatibility issues, static builds are not supported on SBSA platforms.
  • TensorRT in FP16 mode does not perform cast operations correctly when only the output types are set, but not the layer precisions.
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • There is a known issue where, when ProfilingVerbosity is set to kDETAILED, the enqueueV2() call may take up to 2 ms longer compared to ProfilingVerbosity=kNONE or kLAYER_NAMES_ONLY.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9 or newer. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9 or newer.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • You may see the following error after installing TensorRT from the network repo:
    "Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared object file: No such file or directory"
    cuDNN depends on the RPM dependency libcublas.so.11()(64bit); however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
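The following is a minimal Python sketch of the cuDNN tactic workaround mentioned above, assuming the TensorRT Python bindings (the C++ IBuilderConfig::setTacticSources API is analogous); the builder setup is abbreviated and the variable names are illustrative:
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    # Start from the currently enabled tactic sources, then clear the cuDNN
    # bit so the non-deterministic deconvolution algorithm cannot be chosen.
    sources = config.get_tactic_sources()
    sources &= ~(1 << int(trt.TacticSource.CUDNN))
    config.set_tactic_sources(sources)
Note that disabling cuDNN tactics may affect the performance of other layers, so verify engine performance after applying this workaround.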
Performance
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is an up to 22% performance drop for Jasper networks compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta or NVIDIA Turing GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 5% performance drop for the InceptionV4 network compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 27% performance drop for BART compared to TensorRT 8.2 when running with both FP16 and INT8 precisions enabled on T4. This performance drop can be fixed by disabling the INT8 precision flag.
  • There is an up to 5% performance drop for the ShuffleNet network compared to TensorRT 8.2 when running in INT8 precision on NVIDIA Ampere Architecture GPUs. This will be fixed in a future TensorRT release.
  • There is an up to 10% performance drop for the SegResNet network compared to TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere Architecture GPUs due to a cuDNN regression in the InstanceNormalization plug-in. This will be fixed in a future TensorRT release. You can work around the regression by reverting the cuDNN version to cuDNN 8.2.1.
  • There is an up to 10% performance difference for the WaveRNN network between different operating systems when running in FP16 precision on NVIDIA Ampere Architecture GPUs. This will be fixed in a future TensorRT release.
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows a 32% drop when the network runs on DLA as compared to when the last SoftMax layer runs on a GPU.
  • There is an up to 7% performance regression for the 3D-UNet networks compared to TensorRT 8.4 EA when running in INT8 precision on NVIDIA Orin due to a functionality fix.
  • There is an up to 20% performance variation between different engines built from the same network for some LSTM networks when running on Windows due to unstable tactic selections.
  • Some networks may see a small increase in deserialization time.
  • Due to the difference in DLA hardware specification between NVIDIA Orin and Xavier, a relative increase in latency is expected when running DLA FP16 operations involving convolution (which includes deconvolution, fully-connected, and concat) on NVIDIA Orin as compared to running on Xavier. At the same DLA clocks and memory bandwidth, INT8 convolution operations on NVIDIA Orin are expected to be about 4x faster than on Xavier, whereas FP16 convolution operations on NVIDIA Orin are expected to be about 40% slower than on Xavier.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • For transformer-based networks such as BERT and GPT, TensorRT can consume CPU memory up to 10 times the model size during compilation.
  • There is an up to 17% performance regression for DeepASR networks at BS=1 on NVIDIA Turing GPUs.
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and NVIDIA Pascal GPUs.
  • There is an up to 10-11% performance regression on Xavier compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • On Xavier, DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA (a sketch follows this list).
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
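As a minimal sketch of the LeakyReLU mitigation above, the following Python snippet pins all activation layers to the GPU; in practice you may want to match only LeakyRelu layers, for example by layer name. It assumes network and config are an already-populated INetworkDefinition and IBuilderConfig:
    import tensorrt as trt

    # Assumes config.default_device_type = trt.DeviceType.DLA and
    # config.set_flag(trt.BuilderFlag.GPU_FALLBACK) have been set.
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.type == trt.LayerType.ACTIVATION:
            # Keep activations (including LeakyRelu) off DLA.
            config.set_device_type(layer, trt.DeviceType.GPU)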

TensorRT Release 8.4.2

These are the TensorRT 8.4.2 Release Notes and are applicable to x86 Linux and Windows users. This release incorporates Arm® based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Key Features and Enhancements

This TensorRT release includes the following key features and enhancements.
  • Added samples:
    • tensorflow_object_detection_api, which demonstrates the conversion and execution of the Tensorflow Object Detection API Model Zoo models with TensorRT. For information about how this sample works, sample code, and step-by-step instructions on how to run and verify its output, refer to the GitHub: tensorflow_object_detection_api/README.md file.
    • detectron2, which demonstrates the conversion and execution of the Detectron 2 Model Zoo Mask R-CNN R50-FPN 3x model with TensorRT. For information about how this sample works, sample code, and step-by-step instructions on how to run and verify its output, refer to the GitHub: detectron2/README.md file.

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
  • APIs deprecated in TensorRT 8.4 will be retained until at least 2/2023.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax where the mode is chosen automatically based on the shape of the input tensor, where:
    • the first mode triggers when all non-batch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may result in errors of a small degree. Also, batch size greater than 1 is supported only for DLA 3.9.0 and later. Refer to the DLA Supported Layers section in the TensorRT Developer Guide for details.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as, “Unable to load library: nvinfer_builder_resource.dll”, if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • The builder may require up to 60% more memory to build an engine.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or later).
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in NVIDIA JetPack 4.5 for ResNet-like networks on NVIDIA DLA on Xavier platforms when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues. NVIDIA Orin platforms are not affected by this.

Fixed Issues

  • The standalone Python wheel files for TensorRT 8.4.1 were much larger than necessary. We have removed some duplication within the Python wheel files, which has resulted in a file size reduction.
  • When parsing networks with random fill nodes defined within conditionals, TensorRT would error out. The issue has been fixed and these networks can now successfully compile.
  • When multiple Convolution layers used the same input and were wrapped with Q/DQ layers, TensorRT could have produced inaccurate results. This issue has been fixed in this release.
  • Calling a Max Reduction on a shape tensor whose index dimensions have a non-power-of-two volume could produce undefined results. This issue has been fixed in this release.
  • There was a known regression with the encoder model: it could be built successfully with TensorRT 8.2 but would fail to build with TensorRT 8.4. This issue has been fixed in this release.
  • An assertion error occurred because constant folding of Boolean types was not enabled for the Slice operation. Constant folding of Boolean types for the Slice operation is now enabled.
  • Parsing ONNX models with conditional nodes that contained the same initializer names would sometimes produce incorrect results. This issue has been fixed in this release.
  • An assertion in TensorRT occurred when horizontally fusing Convolution or Matrix Multiplication operations that have weights in different precisions. This issue has been fixed in this release.
  • Certain models including but not limited to those with loops or conditionals were susceptible to an allocation-related assertion failure due to a race condition. This issue has been fixed in this release.
  • When using the IAlgorithmSelector interface, if BuildFlag::kREJECT_EMPTY_ALGORITHMS was not set, an assertion could occur when the number of algorithms was zero. This issue has been fixed in this release.

Announcements

  • CUDA 11.7 added a feature called Lazy loading; however, this feature is not supported by TensorRT 8.4 because the CUDA 11.x binaries were built with CUDA Toolkit 11.6.

Known Issues

Functional
  • When performing an L2_Normalization in float16 precision, a fusion can produce undefined behavior. This fusion can be disabled by marking the input to the L2_Normalization as a network output.
  • When performing PTQ in TensorRT on tensors with rank greater than 4, some layers may trigger an assertion about invalid Region Dims. This can be worked around by fusing the index layers into the 4th dimension so that the tensor has rank 4.
  • SM75 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a "could not find any implementation" error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.
  • For some networks using sparsity, TensorRT may produce inaccurate results.
  • When the TensorRT static library is used to build engines and the NVPTXCompiler static library is also used outside of the TensorRT core library at the same time, it is possible to trigger a crash of the process in rare cases.
  • TensorRT should only allow up to a total of 16 I/O tensors for a single subnetwork offloaded to DLA. However, a gap in the logic incorrectly allows more than 16 I/O tensors. For successful engine construction, you may need to manually specify per-layer device assignments to avoid creating subnetworks with more than 16 I/O tensors. This restriction will be properly reinstated in a future release.
  • One of the deconvolution algorithms sourced from cuDNN exhibits non-deterministic execution. Disabling cuDNN tactics will prevent this algorithm from being chosen (refer to IBuilderConfig::setTacticSources).
  • Some models may fail on SBSA platforms when using statically linked binaries.
  • For the HuggingFace demos, the T5-3B model has only been verified on A100, and is not expected to work on A10, T4, and so on.
  • TensorRT in FP16 mode does not perform cast operations correctly when only the output types are set, but not the layer precisions; a sketch of setting both follows this list.
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • There is a known issue where, when ProfilingVerbosity is set to kDETAILED, the enqueueV2() call may take up to 2 ms longer than with ProfilingVerbosity=kNONE or kLAYER_NAMES_ONLY.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9 or newer. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9 or newer.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • You may see the following error:
    "Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory"
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit), however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
  • For some networks, using batch sizes larger than 32 may cause accuracy degradation on DLA.
  • Certain spatial dimensions may cause crashes during DLA optimization for models using single-channel inputs.
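For the FP16 cast issue above, the workaround is to set the layer precision together with the output type. A minimal Python sketch, assuming network and config already exist and using the first layer purely for illustration:
    import tensorrt as trt

    config.set_flag(trt.BuilderFlag.FP16)
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

    layer = network.get_layer(0)           # illustrative layer choice
    layer.precision = trt.float16          # set the layer precision...
    layer.set_output_type(0, trt.float16)  # ...and the output type together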
Performance
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is an up to 22% performance drop for Jasper networks compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta or NVIDIA Turing GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 5% performance drop for the InceptionV4 network compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 27% performance drop for BART compared to TensorRT 8.2 when running with both FP16 and INT8 precisions enabled on T4. This performance drop can be fixed by disabling the INT8 precision flag.
  • There is an up to 5% performance drop for the ShuffleNet network compared to TensorRT 8.2 when running in INT8 precision on NVIDIA Ampere Architecture GPUs. This will be fixed in a future TensorRT release.
  • There is an up to 10% performance drop for the SegResNet network compared to TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere Architecture GPUs due to a cuDNN regression in the InstanceNormalization plug-in. This will be fixed in a future TensorRT release. You can work around the regression by reverting the cuDNN version to cuDNN 8.2.1.
  • There is an up to 10% performance difference for the WaveRNN network between different operating systems when running in FP16 precision on NVIDIA Ampere Architecture GPUs. This will be fixed in a future TensorRT release.
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows a 32% drop when the network runs on DLA as compared to when the last SoftMax layer runs on a GPU.
  • There is an up to 7% performance regression for the 3D-UNet networks compared to TensorRT 8.4 EA when running in INT8 precision on NVIDIA Orin due to a functionality fix.
  • There is an up to 20% performance variation between different engines built from the same network for some LSTM networks when running on Windows due to unstable tactic selections.
  • Some networks may see a small increase in deserialization time.
  • Due to the difference in DLA hardware specification between NVIDIA Orin and Xavier, a relative increase in latency is expected when running DLA FP16 operations involving convolution (which includes deconvolution, fully-connected, and concat) on NVIDIA Orin as compared to running on Xavier. At the same DLA clocks and memory bandwidth, INT8 convolution operations on NVIDIA Orin are expected to be about 4x faster than on Xavier, whereas FP16 convolution operations on NVIDIA Orin are expected to be about 40% slower than on Xavier.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • For transformer-based networks such as BERT and GPT, TensorRT can consume CPU memory up to 10 times the model size during compilation.
  • There is an up to 17% performance regression for DeepASR networks at BS=1 on NVIDIA Turing GPUs.
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and NVIDIA Pascal GPUs.
  • There is an up to 10-11% performance regression on Xavier compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • On Xavier, DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.

TensorRT Release 8.4.1

These are the TensorRT 8.4.1 Release Notes and are applicable to x86 Linux and Windows users. This release incorporates Arm® based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Key Features and Enhancements

This TensorRT release includes the following key features and enhancements.
  • Added sampleOnnxMnistCoordConvAC, which contains custom CoordConv layers. It converts a model trained on the MNIST dataset in ONNX format to a TensorRT network and runs inference on the network. Scripts for generating the ONNX model are also provided.
  • The DLA slice layer is supported for DLA 3.9.0 and later. The DLA SoftMax layer is supported for DLA 1.3.8.0 and later. Refer to the DLA Supported Layers section in the TensorRT Developer Guide for details.
  • Added a Revision History section to the TensorRT Developer Guide to help identify content that’s been added or updated since the previous release.
  • Reduced engine file size and runtime memory use on some networks with large spatial dimension convolution or deconvolution layers.
  • The following C++ API functions and enums were added:
  • The following Python API functions and enums were added:
  • Improved the performance of some convolutional neural networks trained in TensorFlow and exported using the tf2onnx tool or running in TF-TRT.
  • Added the --layerPrecisions and --layerOutputTypes flags to the trtexec tool to allow you to specify layer-wise precision constraints and layer-wise output type constraints.
  • Added the --memPoolSize flag to the trtexec tool to allow you to specify the size of the workspace as well as the DLA memory pools using a unified interface.
  • Added a new interface to customize and query the sizes of the three DLA memory pools: managed SRAM, local DRAM, and global DRAM. For consistency with past behavior, the pool sizes apply per-subgraph (that is, per-loadable). Upon loadable compilation success, the builder reports the actual amount of memory used per pool by each loadable, thus allowing for fine-tuning; upon failure due to insufficient memory, a message will be emitted.

    There are also changes outside the scope of DLA: the existing API to specify and query the workspace size (setMaxWorkspaceSize, getMaxWorkspaceSize) has been deprecated and integrated into the new API. Also, the default workspace size has been updated to the device-global memory size, and the TensorRT samples have had their specific workspace sizes removed in favor of the new default value. Refer to the Customizing DLA Memory Pools section in the NVIDIA TensorRT Developer Guide for more details. A sketch of the new interface follows this list.

  • Added support for NVIDIA BlueField®-2 data processing units (DPUs), both A100X and A30X variants when using the Arm Server Base System Architecture (SBSA) packages.
  • Added support for NVIDIA JetPack 5.0 users. NVIDIA Xavier and NVIDIA Orin™ based devices are supported.
  • Added support for the dimensions labeled with the same subscript in IEinsumLayer to be broadcastable.
  • Added asymmetric padding support for 3D or dilated deconvolution layers on sm70+ GPUs, when the accumulation of kernel size is equal to or less than 32.
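A minimal Python sketch of the new memory-pool interface, with illustrative pool sizes (the C++ setMemoryPoolLimit API is analogous):
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    # Replaces the deprecated setMaxWorkspaceSize API.
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

    # DLA pool sizes apply per-loadable; the values here are illustrative.
    config.set_memory_pool_limit(trt.MemoryPoolType.DLA_MANAGED_SRAM, 1 << 20)
    config.set_memory_pool_limit(trt.MemoryPoolType.DLA_LOCAL_DRAM, 1 << 30)
    config.set_memory_pool_limit(trt.MemoryPoolType.DLA_GLOBAL_DRAM, 1 << 29)
The same limits can be expressed on the trtexec command line through the --memPoolSize flag described above.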

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
  • APIs deprecated in TensorRT 8.4 will be retained until at least 2/2023.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax where the mode is chosen automatically based on the shape of the input tensor, where:
    • the first mode triggers when all non-batch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may result in errors of a small degree. Also, batch size greater than 1 is supported only for DLA 3.9.0 and later. Refer to the DLA Supported Layers section in the TensorRT Developer Guide for details.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as, “Unable to load library: nvinfer_builder_resource.dll”, if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • The builder may require up to 60% more memory to build an engine.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or later).
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in NVIDIA JetPack 4.5 for ResNet-like networks on NVIDIA DLA on Xavier platforms when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues. NVIDIA Orin platforms are not affected by this.

Deprecated and Removed Features

The following features are deprecated in TensorRT 8.4.1:
  • Removed sampleNMT.
  • The End-To-End Host Latency metric in trtexec output has been removed to avoid confusion. Use the “Host Latency” metric instead for performance measurements. For more information, refer to the Benchmarking Network section in the TensorRT Developer Guide.
  • CentOS Linux 8 has reached End-of-Life on Dec 31, 2021. Support for this OS will be deprecated in the next TensorRT release. CentOS Linux 8 support will be completely removed in a future release.
  • In previous TensorRT releases, PDF documentation was included inside the TensorRT package. The PDF documentation has been removed from the package in favor of online documentation, which is updated regularly. Online documentation can be found at https://docs.nvidia.com/deeplearning/tensorrt/index.html.
  • The TensorRT shared library files no longer have RUNPATH set to $ORIGIN. This setting was causing unintended behavior for some users. If you relied on this setting before you may have trouble with missing library dependencies when loading TensorRT. It is preferred that you manage your own library search path using LD_LIBRARY_PATH or a similar method.

Fixed Issues

  • If the TensorRT Python bindings were used without a GPU present, such as when the NVIDIA Container Toolkit is not installed or enabled before running Docker, then you may have encountered an infinite loop that required the process to be killed in order to terminate the application. This issue has been fixed in this release.
  • The EngineInspector detailed layer information always showed batch size = 1 when the engine was built with implicit batch dimensions. This issue has been fixed in this release.
  • The IElementWiseLayer and IUnaryLayer layers can accept different input datatypes depending on the operation that is used. The documentation was updated to explicitly show which datatypes are supported. For more information, refer to the IElementWiseLayer and IUnaryLayer sections in the TensorRT Developer Guide.
  • When running ONNX models with dynamic shapes, there was a potential accuracy issue if the dimension names of the inputs that were expected to be the same were not. For example, if a model had two 2D inputs of which the dimension semantics were both batch and seqlen, and in the ONNX model, the dimension name of the two inputs were different, there was a potential accuracy issue when running with dynamic shapes. This issue has been fixed in this release.
  • There was an up to 15% performance regression for networks with a Pooling layer located before or after a Concatenate layer. This regression has been fixed in this release.
  • The engine building time for networks using 3D convolution, like 3d_unet, is up to 500% longer compared to TensorRT 8.0 due to many fast kernels being added, which enlarges the profiling time.
  • TensorRT bundled a version of libnvptxcompiler_static.a inside libnvinfer_static.a. If an application linked with a different version of PTXJIT than the version used to build TensorRT, this could lead to symbol conflicts or undesired behavior. This issue has been fixed in this release: TensorRT no longer archives the public libnvptxcompiler_static.a and libnvrtc_static.a into libnvinfer_static.a.
  • There was an up to 10% performance regression for ResNeXt networks with small batch (1 or 2) in FP32 compared to TensorRT 6 on Xavier. This regression has been fixed in this release.
  • TensorRT could have experienced some instability when running networks containing TopK layers on T4 under Azure VM. This issue has been fixed in this release.
  • There was a potential memory leak while running models containing the Einsum op. This issue has been fixed in this release.
  • On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was artificially restricting the amount of available memory has been fixed. A side effect was that the TensorRT optimizer was able to choose layer implementations that use more memory, which could cause the OOM Killer to trigger for networks where it previously did not. This issue has been fixed in this release.
  • TensorRT had limited support for fusing IConstantLayer and IShuffleLayer. In explicit-quantization mode, the weights of Convolutions and FullyConnected layers had to be fused. Therefore, if a weights shuffle pattern was not supported, it could lead to a failure to quantize the layer. This issue has been fixed in this release.
  • Networks that used certain pointwise operations not preceded by convolutions or deconvolutions and followed by slicing on spatial dimensions could crash in the optimizer. This issue has been fixed in this release.
  • When running the Python engine_refit_mnist, network_api_pytorch_mnist, or onnx_packnet samples, you may have encountered Illegal instruction (core dumped) when using the CPU version of PyTorch on Jetson TX2. The README for these samples have been updated with instructions on how to install a GPU enabled version of PyTorch.
  • Intermittent accuracy issues were observed in sample_mnist with INT8 precision on WSL2. This issue has been fixed in this release.
  • The TensorRT plug-ins library used a logger that was not thread-safe and that could cause data races. This issue has been fixed in this release.
  • For a quantized (QAT) network with ConvTranspose followed by BN, ConvTranspose would be quantized first and then BN would be fused to ConvTranspose. This fusion was wrong and caused incorrect outputs. This issue has been fixed in this release.
  • During the graph optimization, new nodes were added but there was no mechanism preventing the duplication of node names. This issue has been fixed in this release.
  • For some networks with large amounts of weights and activation data, TensorRT failed to compile a subgraph, and that subgraph would fall back to the GPU. Now, rather than the whole subgraph falling back to the GPU, only the single node that cannot run on DLA falls back to the GPU.
  • There was a known functional issue when running networks containing 3D deconvolution layers on L4T. This issue has been fixed in this release.
  • There was a known functional issue when running networks containing convolution layers on K80. This issue has been fixed in this release.
  • For LSTM graphs of a specific pattern, a small portion of the inference results was occasionally non-deterministic. This issue has been fixed in this release.
  • LSTM graphs in which multiple MatMul layers with opA/opB==kTRANSPOSE consumed the same input tensor could fail to build the engine. This issue has been fixed in this release.
  • For certain networks built for the Xavier GPU, the deserialized engine may have allocated more GPU memory than necessary. This issue has been fixed in this release.
  • If a network had a Gather layer with both indices and input dynamic, and the optimization profile had a large dynamic range (difference between max and min), TensorRT could request a very large workspace. This issue has been fixed in this release.

Announcements

  • Python support for Windows included in the zip package is ready for production use.
  • CUDA 11.7 added a feature called Lazy loading; however, this feature is not supported by TensorRT 8.4 because the CUDA 11.x binaries were built with CUDA Toolkit 11.6.

Known Issues

Functional
  • Calling a Max Reduction on a shape tensor whose index dimensions have a non-power-of-two volume can produce undefined results. This can be fixed by padding the index dimensions so that their volume equals a power of two.
  • When performing an L2_Normalization in float16 precision, a fusion can produce undefined behavior. This fusion can be disabled by marking the input to the L2_Normalization as a network output.
  • When performing PTQ in TensorRT on tensors with rank greater than 4, some layers may trigger an assertion about invalid Region Dims. This can be worked around by fusing the index layers into the 4th dimension so that the tensor has rank 4.
  • SM75 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a "could not find any implementation" error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.
  • For some networks using sparsity, TensorRT may produce inaccurate results.
  • When the TensorRT static library is used to build engines and the NVPTXCompiler static library is also used outside of the TensorRT core library at the same time, it is possible to trigger a crash of the process in rare cases.
  • TensorRT should only allow up to a total of 16 I/O tensors for a single subnetwork offloaded to DLA. However, a gap in the logic incorrectly allows more than 16 I/O tensors. For successful engine construction, you may need to manually specify per-layer device assignments (a sketch follows this list) to avoid creating subnetworks with more than 16 I/O tensors. This restriction will be properly reinstated in a future release.
  • When multiple Convolution layers use the same input and are wrapped with Q/DQ layers, TensorRT may produce inaccurate results.
  • One of the deconvolution algorithms sourced from cuDNN exhibits non-deterministic execution. Disabling cuDNN tactics will prevent this algorithm from being chosen (refer to IBuilderConfig::setTacticSources).
  • Some models may fail on SBSA platforms when using statically linked binaries.
  • For the HuggingFace demos, the T5-3B model has only been verified on A100, and is not expected to work on A10, T4, and so on.
  • TensorRT in FP16 mode does not perform cast operations correctly when only the output types are set, but not the layer precisions.
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • There is a known issue where, when ProfilingVerbosity is set to kDETAILED, the enqueueV2() call may take up to 2 ms longer than with ProfilingVerbosity=kNONE or kLAYER_NAMES_ONLY.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9 or newer. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9 or newer.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • You may see the following error:
    "Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory"
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit), however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
  • For some networks, using batch sizes larger than 32 may cause accuracy degradation on DLA.
  • Certain spatial dimensions may cause crashes during DLA optimization for models using single-channel inputs.
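For the DLA I/O tensor limit noted above, the following is a minimal Python sketch of specifying a per-layer device; the layer index is hypothetical and would be chosen based on where the DLA subnetwork needs to be split:
    import tensorrt as trt

    # Assumes `network` and `config` already exist.
    config.default_device_type = trt.DeviceType.DLA
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)

    # Hypothetical: run one layer on the GPU so that neither resulting DLA
    # subnetwork exceeds 16 I/O tensors.
    boundary = network.get_layer(42)
    config.set_device_type(boundary, trt.DeviceType.GPU)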
Performance
  • There is a known regression with the encoder model. The encoder model can be built successfully with TensorRT 8.2 but fails with TensorRT 8.4.
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is an up to 22% performance drop for Jasper networks compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta or NVIDIA Turing GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 5% performance drop for the InceptionV4 network compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 27% performance drop for BART compared to TensorRT 8.2 when running with both FP16 and INT8 precisions enabled on T4. This performance drop can be fixed by disabling the INT8 precision flag.
  • There is an up to 5% performance drop for the ShuffleNet network compared to TensorRT 8.2 when running in INT8 precision on NVIDIA Ampere Architecture GPUs. This will be fixed in a future TensorRT release.
  • There is an up to 10% performance drop for the SegResNet network compared to TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere Architecture GPUs due to a cuDNN regression in the InstanceNormalization plug-in. This will be fixed in a future TensorRT release. You can work around the regression by reverting the cuDNN version to cuDNN 8.2.1.
  • There is an up to 10% performance difference for the WaveRNN network between different operating systems when running in FP16 precision on NVIDIA Ampere Architecture GPUs. This will be fixed in a future TensorRT release.
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows a 32% drop when the network runs on DLA as compared to when the last SoftMax layer runs on a GPU.
  • There is an up to 7% performance regression for the 3D-UNet networks compared to TensorRT 8.4 EA when running in INT8 precision on NVIDIA Orin due to a functionality fix.
  • There is an up to 20% performance variation between different engines built from the same network for some LSTM networks when running on Windows due to unstable tactic selections.
  • Some networks may see a small increase in deserialization time.
  • Due to the difference in DLA hardware specification between NVIDIA Orin and Xavier, a relative increase in latency is expected when running DLA FP16 operations involving convolution (which includes deconvolution, fully-connected, and concat) on NVIDIA Orin as compared to running on Xavier. At the same DLA clocks and memory bandwidth, INT8 convolution operations on NVIDIA Orin are expected to be about 4x faster than on Xavier, whereas FP16 convolution operations on NVIDIA Orin are expected to be about 40% slower than on Xavier.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • For transformer-based networks such as BERT and GPT, TensorRT can consume CPU memory up to 10 times the model size during compilation.
  • There is an up to 17% performance regression for DeepASR networks at BS=1 on NVIDIA Turing GPUs.
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and NVIDIA Pascal GPUs.
  • There is an up to 10-11% performance regression on Xavier compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • On Xavier, DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.

TensorRT Release 8.4.0 Early Access (EA)

These are the TensorRT 8.4.0 Early Access (EA) Release Notes and are applicable to x86 Linux and Windows users. This release incorporates ARM® based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes.

These Release Notes are also applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

This EA release is for early testing and feedback. For production use of TensorRT, continue to use TensorRT 8.2.3 or a later TensorRT 8.2.x patch release.
Note: TensorRT 8.4 EA does not include updates to the CUDA network repository. You should use the local repo installer package instead.

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.
  • Reduced engine file size and runtime memory usage on some networks with large spatial dimension convolution or deconvolution layers.
  • The following C++ API functions and enums were added:
  • The following Python API functions and enums were added:
  • Improved the performance of some convolutional neural networks trained in TensorFlow and exported using the tf2onnx tool or running in TF-TRT.
  • Added the --layerPrecisions and --layerOutputTypes flags to the trtexec tool to allow you to specify layer-wise precision constraints and layer-wise output type constraints.
  • Added the --memPoolSize flag to the trtexec tool to allow you to specify the size of the workspace as well as the DLA memory pools via a unified interface.
  • Added a new interface to customize and query the sizes of the three DLA memory pools: managed SRAM, local DRAM, and global DRAM. For consistency with past behavior, the pool sizes apply per-subgraph (i.e. per-loadable). Upon loadable compilation success, the builder reports the actual amount of memory used per pool by each loadable, thus allowing for fine-tuning; upon failure due to insufficient memory a message will be emitted.

    There are also changes outside the scope of DLA: the existing API to specify and query the workspace size (setMaxWorkspaceSize, getMaxWorkspaceSize) has been deprecated and integrated into the new API. Also, the default workspace size has been updated to the device global memory size, and the TensorRT samples have had their specific workspace sizes removed in favor of the new default value. Refer to Customizing DLA Memory Pools section in the NVIDIA TensorRT Developer Guide for more details.

  • Added support for NVIDIA BlueField®-2 data processing units (DPUs), both A100X and A30X variants when using the ARM Server Base System Architecture (SBSA) packages.
  • Added support for NVIDIA JetPack 5.0 users. NVIDIA Xavier and NVIDIA Orin™ based devices are supported.
  • Added support for the dimensions labeled with the same subscript in IEinsumLayer to be broadcastable.
  • Added asymmetric padding support for 3D or dilated deconvolution layers on sm70+ GPUs, when the accumulation of kernel size is equal to or less than 32.

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
  • APIs deprecated in TensorRT 8.4 will be retained until at least 2/2023.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • When static linking with cuDNN, cuBLAS, and cuBLASLt libraries, TensorRT requires CUDA >=11.3.
  • TensorRT attempts to catch GPU memory allocation failures and avoid profiling tactics whose memory requirements would trigger an out-of-memory condition. However, GPU memory allocation failure cannot be handled gracefully by CUDA on some platforms and can lead to an unrecoverable application state. If this happens, consider lowering the specified workspace size if a large size is set, or using the IAlgorithmSelector interface to avoid tactics that require a lot of GPU memory (a sketch follows this list).
  • 3D Asymmetric Padding is not supported on GPUs older than the NVIDIA Volta GPU architecture (compute capability 7.0).
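A minimal sketch of an IAlgorithmSelector that skips memory-hungry tactics, using the Python bindings; the class name and threshold are illustrative:
    import tensorrt as trt

    class MemoryCappedSelector(trt.IAlgorithmSelector):
        def __init__(self, max_workspace):
            trt.IAlgorithmSelector.__init__(self)
            self.max_workspace = max_workspace

        def select_algorithms(self, context, choices):
            # Keep only tactics whose workspace fits under the cap. Returning
            # an empty list lets TensorRT fall back to its default choice.
            return [i for i, algo in enumerate(choices)
                    if algo.workspace_size <= self.max_workspace]

        def report_algorithms(self, contexts, choices):
            pass  # Reporting is optional for this sketch.

    # Attach to the builder config before building:
    # config.algorithm_selector = MemoryCappedSelector(1 << 28)  # 256 MiB cap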

Deprecated And Removed Features

The following features are deprecated in TensorRT 8.4.0 EA:
  • The following C++ API functions and classes were deprecated:
    • IFullyConnectedLayer
    • getMaxWorkspaceSize
    • setMaxWorkspaceSize
  • The following Python API functions and classes were deprecated:
    • IFullyConnectedLayer
    • get_max_workspace_size
    • set_max_workspace_size
  • The --workspace flag in trtexec has been deprecated. When the --workspace/--memPoolSize flags are not specified, TensorRT now allocates as much workspace as available GPU memory by default, instead of the 16 MB default workspace size limit used by trtexec in TensorRT 8.2. To limit the workspace size, use the --memPoolSize=workspace:<size> flag instead.
  • The IFullyConnectedLayer operation is deprecated. Typically, you should replace it with IMatrixMultiplyLayer; a migration sketch follows this list. The MatrixMultiply layer does not currently support all data layouts supported by the FullyConnected layer, so additional work may be required when using BuilderFlag::kDIRECT_IO if the input of the MatrixMultiply layer is a network I/O tensor:
    • If the MatrixMultiply layer is forced to INT8 precision via a combination of
      • ILayer::setPrecision(DataType::kINT8)
      • IBuilderConfig::setFlag(BuilderFlag::kOBEY_PRECISION_CONSTRAINTS)
      the engine will fail to build.
    • If the MatrixMultiply layer is preferred to run on DLA and GPU fallback is allowed via a combination of
      • IBuilderConfig->setDeviceType(matrixMultiplyLayer, DeviceType::kDLA)
      • IBuilderConfig->setFlag(BuilderFlag::kGPU_FALLBACK)
      the layer will fall back to run on the GPU.
    • If the MatrixMultiply layer is required to run on DLA and GPU fallback is not allowed via
      • IBuilderConfig->setDeviceType(matrixMultiplyLayer, DeviceType::kDLA)
      the engine will fail to build.

      To resolve these issues, either relax one of the constraints, or use IConvolutionLayer to create a Convolution 1x1 layer to replace IFullyConnectedLayer.

      Refer to the MNIST API samples (C++, Python) for examples of migrating from IFullyConnectedLayer to IMatrixMultiplyLayer.
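A minimal Python sketch of the migration, assuming network is an INetworkDefinition, inp is the 4D input ITensor, and weights/bias are float32 NumPy arrays of shapes (K, C*H*W) and (K,); all names are illustrative:
    import numpy as np
    import tensorrt as trt

    # Flatten (N, C, H, W) to (N, C*H*W); MatrixMultiply has no implicit
    # reshape, unlike FullyConnected.
    flatten = network.add_shuffle(inp)
    flatten.reshape_dims = (0, -1)

    w = network.add_constant(weights.shape, trt.Weights(weights))
    mm = network.add_matrix_multiply(
        flatten.get_output(0), trt.MatrixOperation.NONE,
        w.get_output(0), trt.MatrixOperation.TRANSPOSE)

    # Add the bias as an elementwise sum, broadcast over the batch.
    b = network.add_constant((1, bias.shape[0]), trt.Weights(bias))
    out = network.add_elementwise(mm.get_output(0), b.get_output(0),
                                  trt.ElementWiseOperation.SUM)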

Fixed Issues

  • The EngineInspector detailed layer information always showed batch size = 1 when the engine was built with implicit batch dimension. This issue has been fixed in this release.
  • The IElementWiseLayer and IUnaryLayer layers can accept different input datatypes depending on the operation that is used. The documentation was updated to explicitly show which datatypes are supported. For more information, refer to the IElementWiseLayer and IUnaryLayer sections in the TensorRT Developer Guide.
  • When running ONNX models with dynamic shapes, there was a potential accuracy issue if the dimension names of the inputs that were expected to be the same were not. For example, if a model had two 2D inputs of which the dimension semantics were both batch and seqlen, and in the ONNX model, the dimension name of the two inputs were different, there was a potential accuracy issue when running with dynamic shapes. This issue has been fixed in this release.
  • There was an up to 15% performance regression for networks with a Pooling layer located before or after a Concatenate layer. This regression has been fixed in this release.
  • The engine building time for networks using 3D convolution, like 3d_unet, is up to 500% longer compared to TensorRT 8.0 due to many fast kernels being added, which enlarges the profiling time.

Known Issues

Functional
  • There is a known functional issue when running networks containing 3D deconvolution layers on L4T.
  • There is a known functional issue when running networks containing convolution layers on K80.
  • For LSTM graphs of a specific pattern, a small portion of the inference results is occasionally non-deterministic.
  • If a network has a Gather layer with both indices and input dynamic and the optimization profile has a large dynamic range (difference between max and min), TensorRT could request a very large workspace.
  • For the HuggingFace demos, the T5-3B model has only been verified on A100, and is not expected to work on A10, T4, etc.
  • For a quantized (QAT) network with ConvTranspose followed by BN, ConvTranspose will be quantized first and then BN will be fused to ConvTranspose. This fusion is wrong and causes incorrect outputs.
  • LSTM graphs in which multiple MatMul layers with opA/opB==kTRANSPOSE consume the same input tensor may fail to build the engine.
  • During the graph optimization, new nodes are added but there is no mechanism preventing the duplication of node names.
  • TensorRT in FP16 mode does not perform cast operations correctly when only the output types are set, but not the layer precisions.
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • The TensorRT plugins library uses a logger that is not thread-safe, which can cause data races.
  • There is a potential memory leak while running models containing the Einsum op.
  • There is a known issue where, when ProfilingVerbosity is set to kDETAILED, the enqueueV2() call may take up to 2 ms longer than with ProfilingVerbosity=kNONE or kLAYER_NAMES_ONLY (a sketch follows this list).
  • TensorRT may experience some instabilities when running networks containing TopK layers on T4 under Azure VM.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • For some networks with large amounts of weights and activation data, DLA may fail to compile a subgraph, and that subgraph will fall back to the GPU.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or above).
  • On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was artificially restricting the amount of available memory has been fixed. A side effect is that the TensorRT optimizer is able to choose layer implementations that use more memory, which can cause the OOM Killer to trigger for networks where it previously did not. To work around this problem, use the IAlgorithmSelector interface to avoid layer implementations that require a lot of memory, use the layer precision API to reduce the precision of large tensors together with STRICT_TYPES, or reduce the size of the input tensors to the builder by reducing the batch or other higher dimensions.
  • TensorRT bundles a version of libnvptxcompiler_static.a inside libnvinfer_static.a. If an application links with a different version of PTXJIT than the version used to build TensorRT, it may lead to symbol conflicts or undesired behavior.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9.
  • TensorRT has limited support for fusing IConstantLayer and IShuffleLayer. In explicit-quantization mode, the weights of Convolutions and Fully-Connected layers must be fused. Therefore, if a weights-shuffle is not supported, it may lead to failure to quantize the layer.
  • For DLA networks where a convolution layer consumes an NHWC network input, the compute precision of the convolution layer must match the data type of the input tensor.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • When running the Python engine_refit_mnist, network_api_pytorch_mnist, or onnx_packnet samples, you may encounter Illegal instruction (core dumped) when using the CPU version of PyTorch on Jetson TX2. The workaround is to install a GPU enabled version of PyTorch as per the instructions in the sample READMEs.
  • Intermittent accuracy issues are observed in sample_mnist with INT8 precision on WSL2.
  • You may see the following error:
    Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open
    shared object file: No such file or directory
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit), however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
  • For some networks, using batch sizes larger than 32 may cause accuracy degradation on DLA.
  • Certain spatial dimensions may cause crashes during DLA optimization for models using single-channel inputs.
  • Networks that use certain pointwise operations not preceded by convolutions or deconvolutions and followed by slicing on spatial dimensions may crash in the optimizer.
  • The builder may require up to 60% more memory to build an engine.
  • If the TensorRT Python bindings are used without a GPU present, such as when the NVIDIA Container Toolkit is not installed or enabled before running Docker, you may encounter an infinite loop, and the process must be killed to terminate the application.
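The following is a minimal sketch of the IAlgorithmSelector approach mentioned in the integrated-GPU item above. The selector class, its name, and the 256 MiB threshold are illustrative assumptions, not part of TensorRT; only the IAlgorithmSelector interface itself is.
    #include "NvInfer.h"
    #include <cstdint>

    // Hypothetical selector that skips tactics needing large workspaces.
    class LowMemorySelector : public nvinfer1::IAlgorithmSelector
    {
    public:
        int32_t selectAlgorithms(nvinfer1::IAlgorithmContext const& context,
                                 nvinfer1::IAlgorithm const* const* choices,
                                 int32_t nbChoices, int32_t* selection) noexcept override
        {
            int32_t nbSelected = 0;
            for (int32_t i = 0; i < nbChoices; ++i)
            {
                // Keep only tactics that need less than 256 MiB of workspace
                // (an illustrative budget; tune it for your platform).
                if (choices[i]->getWorkspaceSize() < (1ULL << 28))
                {
                    selection[nbSelected++] = i;
                }
            }
            // Returning 0 lets TensorRT fall back to its default choice.
            return nbSelected;
        }

        void reportAlgorithms(nvinfer1::IAlgorithmContext const* const* contexts,
                              nvinfer1::IAlgorithm const* const* choices,
                              int32_t nbAlgorithms) noexcept override {}
    };

    // Usage: LowMemorySelector selector; config->setAlgorithmSelector(&selector);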
Performance
  • For certain networks built for the Xavier GPU, the deserialized engine may allocate more GPU memory than necessary.
  • Some networks may see a small increase in deserialization time.
  • Due to the difference in DLA hardware specification between Orin and Xavier, a relative increase in latency is expected when running DLA FP16 operations involving convolution (which includes deconvolution, fully-connected, and concat) on Orin as compared to running on Xavier. At the same DLA clocks and memory bandwidth, INT8 convolution operations on Orin are expected to be about 4x faster than on Xavier, whereas FP16 convolution operations on Orin are expected to be about 40% slower than on Xavier.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package release notes for details.
  • For transformer based networks such as BERT and GPT, TensorRT can consume CPU memory up to 10 times the model size during compilation.
  • There is an up to 17% performance regression for DeepASR networks at BS=1 on Turing GPUs.
  • If a Pointwise operation has two inputs, a fusion may not be possible, leading to lower performance. For example, MatMul and Sigmoid can typically be fused into ConvActFusion, but not in this scenario.
  • There is an up to 15% performance regression for MaskRCNN-ResNet-101 on Turing GPUs in INT8 precision.
  • There is an up to 23% performance regression for Jasper networks on Volta and Turing GPUs in FP32 precision.
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and Pascal GPUs.
  • There is an up to 10-11% performance regression on Xavier:
    • compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
    • compared to TensorRT 6 for ResNeXt networks with small batch (1 or 2) in FP32.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues.
  • DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA (see the sketch after this list).
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 21% performance drop compared to TensorRT 8.0 for SSD-Inception2 networks on NVIDIA Volta GPUs.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
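As a sketch of the LeakyReLU mitigation mentioned in the list above, the loop below pins every LeakyReLU activation to the GPU. The network and config variables are assumed to be an existing INetworkDefinition* and IBuilderConfig*.
    // Keep LeakyReLU activations off DLA so they are not upgraded to FP16.
    for (int32_t i = 0; i < network->getNbLayers(); ++i)
    {
        nvinfer1::ILayer* layer = network->getLayer(i);
        if (layer->getType() == nvinfer1::LayerType::kACTIVATION)
        {
            auto* act = static_cast<nvinfer1::IActivationLayer*>(layer);
            if (act->getActivationType() == nvinfer1::ActivationType::kLEAKY_RELU)
            {
                config->setDeviceType(layer, nvinfer1::DeviceType::kGPU);
            }
        }
    }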

TensorRT Release 8.2.5

These are the TensorRT 8.2.5 Release Notes and are applicable to x86 Linux and Windows users. This release incorporates ARM® based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT release as well as the following additional changes.

These Release Notes are also applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Fixed Issues

  • A fast configuration for the SoftMax kernel was not enabled when the kernel was ported from cuDNN. This performance regression has been fixed in this release.
  • The Scale kernel incorrectly handled strides that arose from concat and slice elision. This issue has been fixed in this release.

Known Issues

Functional
  • TensorRT attempts to catch GPU memory allocation failure and avoid profiling tactics whose memory requirements would trigger Out of Memory. However, GPU memory allocation failure cannot be handled by CUDA gracefully on some platforms and would lead to an unrecoverable application status. If this happens, consider lowering the specified workspace size if a large size is set (a sketch follows this list), or using the IAlgorithmSelector interface to avoid tactics that require a lot of GPU memory.
  • TensorRT may experience some instabilities when running networks containing TopK layers on T4 under Azure VM.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • When running ONNX models with dynamic shapes, there is a potential accuracy issue if dimension names that are expected to be the same across inputs do not match. For example, if a model has two 2D inputs whose dimension semantics are both batch and seqlen, but the dimension names of the two inputs differ in the ONNX model, there is a potential accuracy issue when running with dynamic shapes. Ensure that the dimension semantics match when exporting ONNX models from frameworks.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • For some networks with large amounts of weights and activation data, DLA may fail to compile a subgraph, and that subgraph will fall back to the GPU.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or above).
  • On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was artificially restricting the amount of available memory has been fixed. A side effect is that the TensorRT optimizer is able to choose layer implementations that use more memory, which can cause the OOM Killer to trigger for networks where it previously did not. To work around this problem, use the IAlgorithmSelector interface to avoid layer implementations that require a lot of memory, use the layer precision API to reduce the precision of large tensors together with STRICT_TYPES, or reduce the size of the input tensors to the builder by reducing the batch or other higher dimensions.
  • TensorRT bundles a version of libnvptxcompiler_static.a inside libnvinfer_static.a. If an application links with a different version of PTXJIT than the version used to build TensorRT, it may lead to symbol conflicts or undesired behavior.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9.
  • TensorRT has limited support for fusing IConstantLayer and IShuffleLayer. In explicit-quantization mode, the weights of Convolutions and Fully-Connected layers must be fused. Therefore, if a weights-shuffle is not supported, it may lead to failure to quantize the layer.
  • For DLA networks where a convolution layer consumes an NHWC network input, the compute precision of the convolution layer must match the data type of the input tensor.
  • Hybrid precision is not supported with the Pooling layer. Data type of input and output tensors should be the same as the layer precision.
  • When running the Python engine_refit_mnist, network_api_pytorch_mnist, or onnx_packnet samples, you may encounter Illegal instruction (core dumped) when using the CPU version of PyTorch on Jetson TX2. The workaround is to install a GPU enabled version of PyTorch as per the instructions in the sample READMEs.
  • Intermittent accuracy issues are observed in sample_mnist with INT8 precision on WSL2.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • You may see the following error:
    "Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory"
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit); however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
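The workspace-size mitigation mentioned at the top of this list can be expressed in one line; the 1 GiB budget below is an illustrative value, not a recommendation.
    // Assumes an existing nvinfer1::IBuilderConfig* config (TensorRT 8.2 API).
    config->setMaxWorkspaceSize(1ULL << 30); // cap tactics at 1 GiB of scratch memory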
Performance
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and Pascal GPUs.
  • There is an up to 8% performance regression compared to TensorRT 7.1 for some networks with heavy FullyConnected operation like VGG16 on NVIDIA Jetson Nano™.
  • There is an up to 10-11% performance regression on Xavier:
    • compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
    • compared to TensorRT 6 for ResNeXt networks with small batch (1 or 2) in FP32.
  • For networks that use deconvolution with a large kernel size, the engine build time can increase significantly for this layer on Xavier. It can also lead to a launch timed out and was terminated error message on Jetson Nano/TX1.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues. (A sketch of aligning the input dynamic ranges follows this list.)
  • DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • The builder may require up to 60% more memory to build an engine.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 21% performance drop compared to TensorRT 8.0 for SSD-Inception2 networks on NVIDIA Volta GPUs.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • There is an up to 25% performance drop for networks using the InstanceNorm plugin. This issue is being investigated.
  • The engine build time for networks using 3D convolution, like 3d_unet, is up to 500% longer compared to TensorRT 8.0 because many new fast kernels were added, which increases the profiling time.
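As a sketch of avoiding the ElementWise ADD regression described above, the snippet below gives both inputs of an ADD layer the same dynamic range so that DLA does not need to rescale the second input. Here addLayer is a hypothetical IElementWiseLayer* and the range value is illustrative; derive real ranges from calibration data.
    // Align the INT8 dynamic ranges of both ElementWise ADD inputs.
    float const range = 4.0F; // illustrative; use calibration data in practice
    addLayer->getInput(0)->setDynamicRange(-range, range);
    addLayer->getInput(1)->setDynamicRange(-range, range);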

TensorRT Release 8.2.4

These are the TensorRT 8.2.4 Release Notes and are applicable to x86 Linux and Windows users. This release incorporates ARM® based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT release as well as the following additional changes.

These Release Notes are also applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Fixed Issues

  • UBSan issues had not been discussed in the documentation. We’ve added a new section that discusses Issues With Undefined Behavior Sanitizer in the TensorRT Developer Guide.
  • There was a functional issue when horizontal merge was followed by a concat layer with its axis on a non-channel dimension. The issue is fixed in this release.
  • For a network with floating-point output, when the configuration allows using INT8 in the engine, TensorRT has a heuristic for avoiding excess quantization noise in the output. Previously, the heuristic assumed that plugins were capable of floating-point output if needed, and otherwise the engine failed to build. Now, the engine will build, although without trying to avoid quantization noise from an INT8 output from a plugin. Furthermore, a plugin with an INT8 output that is connected to a network output of type INT8 now works.
  • TensorRT incorrectly computed tensor sizes when calculating memory allocations. This occurred in cases where dynamic shapes triggered an integer overflow on the max opt dimension while accumulating the volumes of all the network I/O tensors.
  • TensorRT incorrectly performed horizontal fusion of batched MatMuls along the batch dimension. The issue is fixed in this release.
  • In some cases TensorRT failed to find a tactic for Pointwise. The issue is fixed in this release.
  • There was a functional issue in fused reduction kernels which would lead to an accuracy drop in WeNet transformer encoder layers. The issue is fixed in this release.
  • There were functional issues when two layers' inputs (or outputs) shared the same IQuantizeLayer/IDequantizeLayer. The issue is fixed in this release.
  • When fusing Convolution+Quantization or Pointwise+Quantization with the output type constrained to INT8, you had to specify the precision of the Convolution and Pointwise operations for other fusions to work correctly. If the Convolution and Pointwise precision had not been configured, it had to be float, because INT8 precision requires explicit fusion with Dequantization. The issue is fixed in this release (a sketch of the former explicit-precision workaround follows this list).
  • There was a known crash when building certain large GPT2-XL model variants. The issue is fixed in this release.
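For reference, the explicit-precision workaround that the Convolution+Quantization fix above makes unnecessary looked roughly like the following. This is a sketch of one plausible form of that workaround, not the exact prior requirement; conv is a hypothetical IConvolutionLayer*.
    // Pin the convolution to INT8 so the Q/DQ fusion behaves as expected.
    conv->setPrecision(nvinfer1::DataType::kINT8);
    conv->setOutputType(0, nvinfer1::DataType::kINT8);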

Known Issues

Functional
  • TensorRT attempts to catch GPU memory allocation failure and avoid profiling tactics whose memory requirements would trigger Out of Memory. However, GPU memory allocation failure cannot be handled by CUDA gracefully on some platforms and would lead to an unrecoverable application status. If this happens, consider lowering the specified workspace size if a large size is set, or using the IAlgorithmSelector interface to avoid tactics that require a lot of GPU memory.
  • TensorRT may experience some instabilities when running networks containing TopK layers on T4 under Azure VM.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • When running ONNX models with dynamic shapes, there is a potential accuracy issue if dimension names that are expected to be the same across inputs do not match. For example, if a model has two 2D inputs whose dimension semantics are both batch and seqlen, but the dimension names of the two inputs differ in the ONNX model, there is a potential accuracy issue when running with dynamic shapes. Ensure that the dimension semantics match when exporting ONNX models from frameworks.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • For some networks with large amounts of weights and activation data, DLA may fail to compile a subgraph, and that subgraph will fall back to the GPU.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or above).
  • On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was artificially restricting the amount of available memory has been fixed. A side effect is that the TensorRT optimizer is able to choose layer implementations that use more memory, which can cause the OOM Killer to trigger for networks where it previously did not. To work around this problem, use the IAlgorithmSelector interface to avoid layer implementations that require a lot of memory, use the layer precision API to reduce the precision of large tensors together with STRICT_TYPES, or reduce the size of the input tensors to the builder by reducing the batch or other higher dimensions.
  • TensorRT bundles a version of libnvptxcompiler_static.a inside libnvinfer_static.a. If an application links with a different version of PTXJIT than the version used to build TensorRT, it may lead to symbol conflicts or undesired behavior.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9.
  • TensorRT has limited support for fusing IConstantLayer and IShuffleLayer. In explicit-quantization mode, the weights of Convolutions and Fully-Connected layers must be fused. Therefore, if a weights-shuffle is not supported, it may lead to failure to quantize the layer.
  • For DLA networks where a convolution layer consumes an NHWC network input, the compute precision of the convolution layer must match the data type of the input tensor.
  • Hybrid precision is not supported with the Pooling layer. Data type of input and output tensors should be the same as the layer precision.
  • When running the Python engine_refit_mnist, network_api_pytorch_mnist, or onnx_packnet samples, you may encounter Illegal instruction (core dumped) when using the CPU version of PyTorch on Jetson TX2. The workaround is to install a GPU enabled version of PyTorch as per the instructions in the sample READMEs.
  • Intermittent accuracy issues are observed in sample_mnist with INT8 precision on WSL2.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • You may see the following error:
    "Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory"
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit); however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
Performance
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and Pascal GPUs.
  • There is an up to 8% performance regression compared to TensorRT 7.1 for some networks with heavy FullyConnected operation like VGG16 on NVIDIA Jetson Nano™.
  • There is an up to 10-11% performance regression on Xavier:
    • compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
    • compared to TensorRT 6 for ResNeXt networks with small batch (1 or 2) in FP32.
  • For networks that use deconvolution with a large kernel size, the engine build time can increase significantly for this layer on Xavier. It can also lead to a launch timed out and was terminated error message on Jetson Nano/TX1.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues.
  • DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • The builder may require up to 60% more memory to build an engine.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 21% performance drop compared to TensorRT 8.0 for SSD-Inception2 networks on NVIDIA Volta GPUs.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • There is an up to 25% performance drop for networks using the InstanceNorm plugin. This issue is being investigated.
  • The engine build time for networks using 3D convolution, like 3d_unet, is up to 500% longer compared to TensorRT 8.0 because many new fast kernels were added, which increases the profiling time.

TensorRT Release 8.2.3

These are the TensorRT 8.2.3 Release Notes and are applicable to x86 Linux and Windows users. This release incorporates ARM® based CPU cores for Server Base System Architecture (SBSA) users on Linux only.

These release notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, see the NVIDIA TensorRT Archived Documentation.

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Fixed Issues

  • There was a known issue where using a custom allocator with the allocation resizing introduced in TensorRT 8 would trigger an assert about a p.second failure. This was caused by the application passing back to TensorRT the exact same pointer from the re-allocation routine. The assertion has been fixed so that this is now treated as a valid use case.
  • There was an up to 15% performance regression for networks with a Pooling layer located before or after a Concatenate layer. This issue has been fixed in this release.
  • There was an up to 20% performance regression for INT8 QAT networks with a Padding layer located before the Q/DQ and Convolution layer. This issue has been fixed in this release.
  • TensorRT did not support explicit quantization (i.e. Q/DQ) for batched matrix-multiplication. This fix introduces support for the special case of quantized batched matrix-multiplication when matrix B is constant and can be squeezed to a 2D matrix. Specifically, in the supported configuration, matrix A (the data) can have shape (BS, M, K), where BS is the batch size, and matrix B (the weights) can have shape (1, K, N). The output has shape (BS, M, N), which is computed by broadcasting the weights across the batch dimension. Quantized batched matrix-multiplication has two pairs of Q/DQ nodes that quantize the input data and the weights (see the sketch after this list).
  • An incorrect fusion of two transpose operations caused an assertion to trigger while building the model. This issue has been fixed in this release.
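The supported quantized batched matrix-multiplication pattern described above can be sketched with the network definition API as follows. The helper name, its parameters, and the caller-provided Weights are assumptions for illustration, not an official recipe.
    #include "NvInfer.h"
    using namespace nvinfer1;

    // Builds A(BS, M, K) x B(1, K, N) -> (BS, M, N) with one Q/DQ pair on the
    // data and one on the constant weights. wB, scaleA, and scaleB are FP32
    // Weights supplied by the caller (each scale holds a single element).
    ITensor* addQuantizedBatchedMatMul(INetworkDefinition& net, ITensor& a,
                                       Weights wB, int32_t k, int32_t n,
                                       Weights scaleA, Weights scaleB)
    {
        ITensor* b = net.addConstant(Dims3{1, k, n}, wB)->getOutput(0);
        ITensor* sA = net.addConstant(Dims{1, {1}}, scaleA)->getOutput(0);
        ITensor* sB = net.addConstant(Dims{1, {1}}, scaleB)->getOutput(0);

        // Q/DQ pair on the data and on the weights.
        ITensor* aDq = net.addDequantize(*net.addQuantize(a, *sA)->getOutput(0), *sA)->getOutput(0);
        ITensor* bDq = net.addDequantize(*net.addQuantize(*b, *sB)->getOutput(0), *sB)->getOutput(0);

        // The (1, K, N) weights broadcast across the batch dimension.
        return net.addMatrixMultiply(*aDq, MatrixOperation::kNONE,
                                     *bDq, MatrixOperation::kNONE)->getOutput(0);
    }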

Known Issues

Functional
  • TensorRT attempts to catch GPU memory allocation failure and avoid profiling tactics whose memory requirements would trigger Out of Memory. However, GPU memory allocation failure cannot be handled by CUDA gracefully on some platforms and would lead to an unrecoverable application status. If this happens, consider lowering the specified workspace size if a large size is set, or using the IAlgorithmSelector interface to avoid tactics that require a lot of GPU memory.
  • TensorRT may experience some instabilities when running networks containing TopK layers on T4 under Azure VM.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • When running ONNX models with dynamic shapes, there is a potential accuracy issue if dimension names that are expected to be the same across inputs do not match. For example, if a model has two 2D inputs whose dimension semantics are both batch and seqlen, but the dimension names of the two inputs differ in the ONNX model, there is a potential accuracy issue when running with dynamic shapes. Ensure that the dimension semantics match when exporting ONNX models from frameworks.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • For some networks with large amounts of weights and activation data, DLA may fail to compile a subgraph, and that subgraph will fall back to the GPU.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or above).
  • On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was artificially restricting the amount of available memory has been fixed. A side effect is that the TensorRT optimizer is able to choose layer implementations that use more memory, which can cause the OOM Killer to trigger for networks where it previously did not. To work around this problem, use the IAlgorithmSelector interface to avoid layer implementations that require a lot of memory, use the layer precision API to reduce the precision of large tensors together with STRICT_TYPES, or reduce the size of the input tensors to the builder by reducing the batch or other higher dimensions.
  • TensorRT bundles a version of libnvptxcompiler_static.a inside libnvinfer_static.a. If an application links with a different version of PTXJIT than the version used to build TensorRT, it may lead to symbol conflicts or undesired behavior.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9.
  • TensorRT has limited support for fusing IConstantLayer and IShuffleLayer. In explicit-quantization mode, the weights of Convolutions and Fully-Connected layers must be fused. Therefore, if a weights-shuffle is not supported, it may lead to failure to quantize the layer.
  • For DLA networks where a convolution layer consumes an NHWC network input, the compute precision of the convolution layer must match the data type of the input tensor.
  • Hybrid precision is not supported with the Pooling layer. Data type of input and output tensors should be the same as the layer precision.
  • When running the Python engine_refit_mnist, network_api_pytorch_mnist, or onnx_packnet samples, you may encounter Illegal instruction (core dumped) when using the CPU version of PyTorch on Jetson TX2. The workaround is to install a GPU enabled version of PyTorch as per the instructions in the sample READMEs.
  • Intermittent accuracy issues are observed in sample_mnist with INT8 precision on WSL2.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • You may see the following error:
    Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit); however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
Performance
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and Pascal GPUs.
  • There is an up to 8% performance regression compared to TensorRT 7.1 for some networks with heavy FullyConnected operation like VGG16 on NVIDIA Jetson Nano™.
  • There is an up to 10-11% performance regression on Xavier:
    • compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
    • compared to TensorRT 6 for ResNeXt networks with small batch (1 or 2) in FP32.
  • For networks that use deconvolution with a large kernel size, the engine build time can increase significantly for this layer on Xavier. It can also lead to a launch timed out and was terminated error message on Jetson Nano/TX1.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues.
  • DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • The builder may require up to 60% more memory to build an engine.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 21% performance drop compared to TensorRT 8.0 for SSD-Inception2 networks on NVIDIA Volta GPUs.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • There is an up to 25% performance drop for networks using the InstanceNorm plugin. This issue is being investigated.
  • The engine build time for networks using 3D convolution, like 3d_unet, is up to 500% longer compared to TensorRT 8.0 because many new fast kernels were added, which increases the profiling time.

TensorRT Release 8.2.2

These are the TensorRT 8.2.2 Release Notes and are applicable to x86 Linux and Windows users. This release incorporates ARM® based CPU cores for Server Base System Architecture (SBSA) users on Linux only.

These release notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, see the NVIDIA TensorRT Archived Documentation.

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Fixed Issues

  • In order to install TensorRT using the pip wheel file, you had to ensure that the pip version was less than 20. An older version of pip was required to work around an issue with the CUDA 11.4 wheel meta packages that TensorRT depends on. This issue has been fixed in this release.
  • An empty directory named deserializeTimer under the samples directory was left in the package by accident. This issue has been fixed in this release.
  • For some transformer based networks built with PyTorch Multi-head Attention API, the performance could have been up to 45% slower than similar networks built with other APIs due to different graph patterns. This issue has been fixed in this release.
  • IShuffleLayer applied to the output of IConstantLayer was incorrectly transformed when the constant did not have type kFLOAT, sometimes causing build failures. This issue has been fixed in this release.
  • ONNX models with MatMul operations that used QuantizeLinear/DequantizeLinear operations to quantize the weights, and pre-transpose the weights (i.e. do not use a separate Transpose operation) would suffer from accuracy errors due to a bug in the quantization process. This issue has been fixed in this release.

Known Issues

Functional
  • TensorRT attempts to catch GPU memory allocation failure and avoid profiling tactics whose memory requirements would trigger Out of Memory. However, GPU memory allocation failure cannot be handled by CUDA gracefully on some platforms and would lead to an unrecoverable application status. If this happens, consider lowering the specified workspace size if a large size is set, or using the IAlgorithmSelector interface to avoid tactics that require a lot of GPU memory.
  • TensorRT may experience some instabilities when running networks containing TopK layers on T4 under Azure VM. To work around this issue, disable CUBLAS_LT kernels with --tacticSources=-CUBLAS_LT (setTacticSources); see the sketch after this list.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • When running ONNX models with dynamic shapes, there is a potential accuracy issue if dimension names that are expected to be the same across inputs do not match. For example, if a model has two 2D inputs whose dimension semantics are both batch and seqlen, but the dimension names of the two inputs differ in the ONNX model, there is a potential accuracy issue when running with dynamic shapes. Ensure that the dimension semantics match when exporting ONNX models from frameworks.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • For some networks with large amounts of weights and activation data, DLA may fail to compile a subgraph, and that subgraph will fall back to the GPU.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or above).
  • On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was artificially restricting the amount of available memory has been fixed. A side effect is that the TensorRT optimizer is able to choose layer implementations that use more memory, which can cause the OOM Killer to trigger for networks where it previously did not. To work around this problem, use the IAlgorithmSelector interface to avoid layer implementations that require a lot of memory, use the layer precision API to reduce the precision of large tensors together with STRICT_TYPES, or reduce the size of the input tensors to the builder by reducing the batch or other higher dimensions.
  • TensorRT bundles a version of libnvptxcompiler_static.a inside libnvinfer_static.a. If an application links with a different version of PTXJIT than the version used to build TensorRT, it may lead to symbol conflicts or undesired behavior.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9.
  • TensorRT has limited support for fusing IConstantLayer and IShuffleLayer. In explicit-quantization mode, the weights of Convolutions and Fully-Connected layers must be fused. Therefore, if a weights-shuffle is not supported, it may lead to failure to quantize the layer.
  • For DLA networks where a convolution layer consumes an NHWC network input, the compute precision of the convolution layer must match the data type of the input tensor.
  • Hybrid precision is not supported with the Pooling layer. Data type of input and output tensors should be the same as the layer precision.
  • When running the Python engine_refit_mnist, network_api_pytorch_mnist, or onnx_packnet samples, you may encounter Illegal instruction (core dumped) when using the CPU version of PyTorch on Jetson TX2. The workaround is to install a GPU enabled version of PyTorch as per the instructions in the sample READMEs.
  • Intermittent accuracy issues are observed in sample_mnist with INT8 precision on WSL2.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • You may see the following error:
    Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit); however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
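The setTacticSources workaround referenced in the TopK item above looks roughly like this in C++; config is assumed to be an existing IBuilderConfig*.
    // Drop cuBLASLt from the current tactic sources
    // (the API equivalent of trtexec --tacticSources=-CUBLAS_LT).
    nvinfer1::TacticSources sources = config->getTacticSources();
    sources &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUBLAS_LT));
    config->setTacticSources(sources);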
Performance
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is an up to 15% performance regression for networks with a Pooling layer located before or after a Concatenate layer.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and Pascal GPUs.
  • There is an up to 8% performance regression compared to TensorRT 7.1 for some networks with heavy FullyConnected operation like VGG16 on NVIDIA Jetson Nano™.
  • There is an up to 10-11% performance regression on Xavier:
    • compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
    • compared to TensorRT 6 for ResNeXt networks with small batch (1 or 2) in FP32.
  • For networks that use deconvolution with a large kernel size, the engine build time can increase significantly for this layer on Xavier. It can also lead to a launch timed out and was terminated error message on Jetson Nano/TX1.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues.
  • DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • The builder may require up to 60% more memory to build an engine.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 21% performance drop compared to TensorRT 8.0 for SSD-Inception2 networks on NVIDIA Volta GPUs.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • There is an up to 25% performance drop for networks using the InstanceNorm plugin. This issue is being investigated.
  • The engine build time for networks using 3D convolution, like 3d_unet, is up to 500% longer compared to TensorRT 8.0 because many new fast kernels were added, which increases the profiling time.

TensorRT Release 8.2.1

These are the TensorRT 8.2.1 Release Notes and are applicable to x86 Linux and Windows users. This release incorporates ARM® based CPU cores for Server Base System Architecture (SBSA) users on Linux only.

These release notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, see the NVIDIA TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.
  • WSL (Windows Subsystem for Linux) 2 is released as a preview feature in this TensorRT 8.2.1 GA release.

Deprecated API Lifetime

  • APIs deprecated prior to TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • DLA does not support hybrid precision for the pooling layer: the data type of the input and output tensors must match the layer precision, i.e. either all INT8 or all FP16 (see the sketch below).
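A minimal sketch of keeping a DLA pooling layer non-hybrid, assuming pool is a hypothetical IPoolingLayer* and config an IBuilderConfig* with DLA already selected.
    // Force compute precision and I/O types to the same type (FP16 here).
    config->setFlag(nvinfer1::BuilderFlag::kFP16);
    pool->setPrecision(nvinfer1::DataType::kHALF);
    pool->setOutputType(0, nvinfer1::DataType::kHALF);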

Deprecated And Removed Features

The following features are deprecated in TensorRT 8.2.1:
  • BuilderFlag::kSTRICT_TYPES is deprecated. Its functionality has been split into separate controls for precision constraints, reformat-free I/O, and failure of IAlgorithmSelector::selectAlgorithms. This change enables users who need only one of the subfeatures to build engines without encumbering the optimizer with the other subfeatures. In particular, precision constraints are sometimes necessary for engine accuracy, however, reformat-free I/O risks slowing down an engine with no benefit. For more information, refer to BuilderFlags (C++, Python). A sketch of the replacement flags follows this list.
  • The LSTM plugin has been removed. In addition, the Persistent LSTM Plugin section has also been removed from the TensorRT Developer Guide.
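As a sketch of the split described above, these are the TensorRT 8.2 flags that individually replace kSTRICT_TYPES; enable only the subfeature you need. config is assumed to be an existing IBuilderConfig*.
    using namespace nvinfer1;
    // Precision constraints only (typically what accuracy-sensitive engines need):
    config->setFlag(BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);
    // Reformat-free I/O only (can slow the engine; enable deliberately):
    // config->setFlag(BuilderFlag::kDIRECT_IO);
    // Fail the build when IAlgorithmSelector::selectAlgorithms selects nothing:
    // config->setFlag(BuilderFlag::kREJECT_EMPTY_ALGORITHMS);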

Fixed Issues

  • Closed the performance gap between linking with TensorRT static libraries and linking with TensorRT dynamic libraries on x86_64 Linux CUDA-11.x platforms.
  • When building a DLA engine with:
    • networks containing tensors with fewer than 4 dimensions, some DLA subgraph I/O tensors lacked shuffle layers, which caused engine compilation to fail. This issue has been fixed in this release.
    • a kSUB ElementWise operation whose input has fewer than 4 dimensions, a scale node was inserted. If the scale could not run on DLA, engine compilation failed. This issue has been fixed in this release.
  • There was a known issue with the engine_refit_mnist sample on Windows. The fix was to edit the engine_refit_mnist/sample.py source file and move the import model line before the sys.path.insert() line. This issue has been fixed in this release and no edit is needed.
  • There was an up to 22% performance regression compared to TensorRT 8.0 for WaveRNN networks on NVIDIA Volta and NVIDIA Turing GPUs. This issue has been fixed in this release.
  • There was an up to 21% performance regression compared to TensorRT 8.0 for BERT-like networks on NVIDIA Jetson Xavier platforms. This issue has been fixed in this release.
  • There was an up to 25% performance drop for networks using the InstanceNorm plugin. This issue has been fixed in this release.
  • There was an accuracy bug resulting in low mAP score with YOLO-like QAT networks where a QuantizeLinear operator was immediately followed by a Concat operator. This accuracy issue has been fixed in this release.
  • There was a bug in TensorRT 8.2.0 EA where if a shape tensor is used by two different nodes, it can sometimes lead to a functional or an accuracy issue. This issue has been fixed in this release.
  • There was a known 2% accuracy regression with NasNet Mobile network with NVIDIA Turing GPUs. This issue has been fixed in this release. (not applicable for Jetson platforms)
  • There could have been build failures with IEinsumLayer when an input subscript label corresponded to a static dimension in one tensor but a dynamic dimension in another. This issue has been fixed in this release.
  • There was an up to 8% performance regression compared to TensorRT 8.0.3 for Cortana ASR on NVIDIA Ampere Architecture GPUs with CUDA graphs and a single stream of execution. This issue has been fixed in this release.
  • Boolean input/output tensors were only supported when using explicit batch dimensions. This has been fixed in this release.
  • There was a possibility of a CUBLAS_STATUS_EXECUTION_FAILED error when running with cuBLAS/cuBLASLt libraries from CUDA 11.4 update 1 and CUDA 11.4 update 2 on Linux-based platforms. This happened only for the use cases where cuBLAS is loaded and unloaded multiple times. The workaround was to add the following environment variable before launching your application:
    LD_PRELOAD=libcublasLt.so:libcublasLt.so your_application

    This issue has been fixed in this release.

  • There was a known ~6% - ~29% performance regression on Google® BERT compared to version 8.0.1.6 on NVIDIA A100 GPUs. This issue has been fixed in this release.
  • There was an up to 6% performance regression compared to TensorRT 8.0.3 for Deep Recommender on Tesla V100, NVIDIA Quadro® GV100, and NVIDIA TITAN V. This issue has been fixed in this release.

Announcements

  • The sample sample_reformat_free_io has been renamed to sample_io_formats, and revised to remove the deprecated flag BuilderFlag::kSTRICT_TYPES. Reformat-free I/O is still available with BuilderFlag::kDIRECT_IO, but generally should be avoided since it can result in a slower than necessary engine, and can cause a build to fail if the target platform lacks the kernels to enable building an engine with reformat-free I/O.
  • The NVIDIA TensorRT Release Notes PDF will no longer be available in the product package after this release. The release notes will still remain available online here.

Known Issues

Functional
  • TensorRT attempts to catch GPU memory allocation failure and avoid profiling tactics whose memory requirements would trigger Out of Memory. However, GPU memory allocation failure cannot be handled by CUDA gracefully on some platforms and would lead to an unrecoverable application status. If this happens, consider lowering the specified workspace size if a large size is set, or using the IAlgorithmSelector interface to avoid tactics that require a lot of GPU memory.
  • TensorRT may experience some instabilities when running networks containing TopK layers on T4 under Azure VM. To work around this issue, disable CUBLAS_LT kernels with --tacticSources=-CUBLAS_LT (setTacticSources).
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • When running ONNX models with dynamic shapes, there is a potential accuracy issue if dimension names that are expected to be the same across inputs do not match. For example, if a model has two 2D inputs whose dimension semantics are both batch and seqlen, but the dimension names of the two inputs differ in the ONNX model, there is a potential accuracy issue when running with dynamic shapes. Ensure that the dimension semantics match when exporting ONNX models from frameworks.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • For some networks with large amounts of weights and activation data, DLA may fail to compile a subgraph, and that subgraph will fall back to the GPU.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or above).
  • On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was artificially restricting the amount of available memory has been fixed. A side effect is that the TensorRT optimizer is able to choose layer implementations that use more memory, which can cause the OOM Killer to trigger for networks where it previously did not. To work around this problem, use the IAlgorithmSelector interface to avoid layer implementations that require a lot of memory, use the layer precision API to reduce the precision of large tensors together with STRICT_TYPES, or reduce the size of the input tensors to the builder by reducing the batch or other higher dimensions.
  • For some transformer based networks built with PyTorch Multi-head Attention API, the performance may be up to 45% slower than similar networks built with other APIs due to different graph patterns.
  • TensorRT bundles a version of libnvptxcompiler_static.a inside libnvinfer_static.a. If an application links with a different version of PTXJIT than the version used to build TensorRT, it may lead to symbol conflicts or undesired behavior.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9.
  • TensorRT has limited support for fusing IConstantLayer and IShuffleLayer. In explicit-quantization mode, the weights of Convolutions and Fully-Connected layers must be fused. Therefore, if a weights-shuffle is not supported, it may lead to failure to quantize the layer.
  • For DLA networks where a convolution layer consumes an NHWC network input, the compute precision of the convolution layer must match the data type of the input tensor.
  • Hybrid precision is not supported with the Pooling layer. The data type of the input and output tensors must match the layer precision.
  • When installing PyCUDA, NumPy must be installed first and as a separate step:
    python3 -m pip install numpy 
    python3 -m pip install pycuda
    For more information, refer to the NVIDIA TensorRT Installation Guide.
  • When running the Python engine_refit_mnist, network_api_pytorch_mnist, or onnx_packnet samples, you may encounter Illegal instruction (core dumped) when using the CPU version of PyTorch on Jetson TX2. The workaround is to install a GPU enabled version of PyTorch as per the instructions in the sample READMEs.
  • If an IPluginV2 layer produces kINT8 outputs that are output tensors of an INetworkDefinition declared with a floating-point type, an explicit cast is required to convert the network outputs back to a floating-point format. For example:
    // out_tensor is of type nvinfer1::DataType::kINT8
    auto* cast_layer = network->addIdentity(*out_tensor);  // identity layer acts as the cast
    cast_layer->setOutputType(0, nvinfer1::DataType::kFLOAT);
    auto* new_out_tensor = cast_layer->getOutput(0);       // mark this as the network output
  • Intermittent accuracy issues are observed in sample_mnist with INT8 precision on WSL2.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically. When installing them, install the dependencies manually using pip, or install the wheels instead.
  • You may see the following error:
    Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared object file: No such file or directory
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit); however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. As a result, the library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
  • ONNX models with MatMul operations that use QuantizeLinear/DequantizeLinear operations to quantize the weights, and pre-transpose the weights (i.e. do not use a separate Transpose operation) will suffer from accuracy errors due to a bug in the quantization process.
Performance
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is an up to 15% performance regression for networks with a Pooling layer located before or after a Concatenate layer.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and Pascal GPUs.
  • There is an up to 8% performance regression compared to TensorRT 7.1 for some networks with heavy FullyConnected operation like VGG16 on NVIDIA Jetson Nano™.
  • There is an up to 10-11% performance regression on Xavier:
    • compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
    • compared to TensorRT 6 for ResNeXt networks with small batch (1 or 2) in FP32.
  • For networks that use deconvolution with a large kernel size, the engine build time for this layer can increase significantly on Xavier. It can also lead to the "launch timed out and was terminated" error message on Jetson Nano/TX1.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues.
  • DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • The builder may require up to 60% more memory to build an engine.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 21% performance drop compared to TensorRT 8.0 for SSD-Inception2 networks on NVIDIA Volta GPUs.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • There is an up to 25% performance drop for networks using the InstanceNorm plugin. This issue is being investigated.
  • The engine build time for networks using 3D convolution, like 3d_unet, is up to 500% longer compared to TensorRT 8.0 because many additional fast kernels were added, which lengthens the tactic profiling time.

TensorRT Release 8.2.0 Early Access (EA)

These are the TensorRT 8.2.0 Early Access (EA) release notes and are applicable to x86 Linux and Windows users, as well as ARM Server Base System Architecture (SBSA) users on Linux only.

These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, see the TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.

Breaking API Changes

  • Between TensorRT 8.0 EA and TensorRT 8.0 GA the function prototype for getLogger() has been moved from NvInferRuntimeCommon.h to NvInferRuntime.h. You may need to update your application source code if you’re using getLogger() and were previously only including NvInferRuntimeCommon.h. Since the logger is no longer global, calling the method on IRuntime, IRefitter, or IBuilder is recommended instead.

Compatibility

Deprecated And Removed Features

The following features are deprecated in TensorRT 8.2.0 EA:
  • Removed sampleMLP.
  • The enums of ProfilingVerbosity have been updated to show their functionality more explicitly:
    • ProfilingVerbosity::kDEFAULT has been deprecated in favor of ProfilingVerbosity::kLAYER_NAMES_ONLY.
    • ProfilingVerbosity::kVERBOSE has been deprecated in favor of ProfilingVerbosity::kDETAILED.
  • Several flags of trtexec have been deprecated:
    • --explicitBatch flag has been deprecated and has no effect. When the input model is in UFF or in Caffe prototxt format, the implicit batch dimension mode is used automatically; when the input model is in ONNX format, the explicit batch mode is used automatically.
    • --explicitPrecision flag has been deprecated and has no effect. When the input ONNX model contains Quantization/Dequantization nodes, TensorRT automatically uses explicit precision mode.
    • --nvtxMode=[verbose|default|none] has been deprecated in favor of --profilingVerbosity=[detailed|layer_names_only|none] to show its functionality more explicitly.
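      For example, the deprecated and replacement spellings of the verbosity flag (model.onnx is a placeholder):
      trtexec --onnx=model.onnx --nvtxMode=verbose              # deprecated
      trtexec --onnx=model.onnx --profilingVerbosity=detailed   # replacement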
  • Relocated the content from the Best Practices For TensorRT Performance document to the TensorRT Developer Guide. Removed redundancy between the two documents and updated the reference information. Refer to Performance Best Practices in the TensorRT Developer Guide for more information.
  • IPaddingLayer is deprecated in TensorRT 8.2 and will be removed in TensorRT 10.0. Use ISliceLayer to pad tensors instead; it supports new padding modes, including non-constant fill, reflect, and clamp, and it supports padding outputs with dynamic shapes, as sketched below.
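    A minimal sketch of padding with ISliceLayer, assuming network is an INetworkDefinition*, input is an ITensor* with a static (3, 32, 32) shape, and one element of padding on each side of H and W (all names and sizes are illustrative):
    using namespace nvinfer1;
    Dims3 start{0, -1, -1};            // a negative start begins "before" the tensor
    Dims3 size{3, 32 + 2, 32 + 2};     // the enlarged extent produces the padded output
    Dims3 stride{1, 1, 1};
    ISliceLayer* pad = network->addSlice(*input, start, size, stride);
    pad->setMode(SliceMode::kFILL);    // out-of-bounds elements take the fill value,
                                       // supplied via an optional extra layer input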

Fixed Issues

  • Closed the performance gap between linking with TensorRT static libraries and linking with TensorRT dynamic libraries.
  • In the previous release, the TensorRT ARM SBSA cross packages in the CUDA network repository could not be installed because cuDNN ARM SBSA cross packages were not available, which is a dependency of the TensorRT cross packages. The cuDNN ARM SBSA cross packages have been made available, which resolves this dependency issue.
  • There was an up to 6% performance regression compared to TensorRT 7.2.3 for WaveRNN in FP16 on Volta and Turing platforms. This issue has been fixed in this release.
  • There was a known accuracy issue of GoogLeNet variants with NVIDIA Ampere GPUs where TF32 mode was enabled by default on Windows. This issue has been fixed in this release. (not applicable for Jetson platforms)
  • There was an up to 10% performance regression when TensorRT was used with cuDNN 8.1 or 8.2. When cuDNN 8.0 was used, the performance was restored. This issue has been fixed in this release. (not applicable for Jetson platforms)
  • The new Python sample efficientdet was only available in the OSS release. The sample has been added to the core package in this release.
  • The performance of IReduceLayer has been improved significantly when the output size of the IReduceLayer is small.
  • There was an up to 15% performance regression compared to TensorRT 7.2.3 for path perception network (Pathnet) in FP32. This issue has been fixed in this release.

Announcements

  • Python support for Windows included in the zip package is considered a preview release and not ready for production use.

Known Issues

Functional
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it falls back to using cuBLAS. (not applicable for Jetson platforms)
  • For some networks with large amounts of weights and activation data, DLA may fail to compile a subgraph, and that subgraph will fall back to the GPU.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or above).
  • On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was artificially restricting the amount of available memory has been fixed. A side effect is that the TensorRT optimizer is able to choose layer implementations that use more memory, which can cause the OOM Killer to trigger for networks where it previously didn't. To work around this problem, use the IAlgorithmSelector interface to avoid layer implementations that require a lot of memory, or use the layer precision API to reduce precision of large tensors and use STRICT_TYPES, or reduce the size of the input tensors to the builder by reducing batch or other higher dimensions.
  • For some transformer-based networks built with the PyTorch MultiheadAttention API, the performance may be up to 45% slower than similar networks built with other APIs due to different graph patterns.
  • When building a DLA engine with:
    • networks containing tensors with fewer than four dimensions, some DLA subgraph I/O tensors may lack the required shuffle layers, causing engine compilation to fail.
    • a kSUB ElementWise operation whose inputs have fewer than four dimensions, a scale node is inserted; if that scale cannot run on DLA, engine compilation fails.
  • There is a known 2% accuracy regression with NasNet Mobile network with NVIDIA Turing GPUs. (not applicable for Jetson platforms)
  • TensorRT bundles a version of libnvptxcompiler_static.a inside libnvinfer_static.a. If an application links with a different version of PTXJIT than the version used to build TensorRT, it may lead to symbol conflicts or undesired behavior.
  • Boolean input/output tensors are supported only when using explicit batch dimensions.
  • There is a possibility of an CUBLAS_STATUS_EXECUTION_FAILED error when running with cuBLAS/cuBLASLt libraries from CUDA 11.4 update 1 and CUDA 11.4 update 2 on Linux-based platforms. This happens only for the use cases where cuBLAS is loaded and unloaded multiple times. The workaround is to add the following environment variable before launching your application:
    LD_PRELOAD=libcublasLt.so:libcublasLt.so your_application

    This issue will be resolved in a future CUDA 11.4 update.

  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • There may be build failures with IEinsumLayer when an input subscript label corresponds to a static dimension in one tensor but dynamic dimension in another.
  • TensorFlow 1.x is not supported for Python 3.9. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9.
  • TensorRT has limited support for fusing ConstantLayer and ShuffleLayer. In explicit-quantization mode, the weights of Convolutions and Fully-Connected layers must be fused. Therefore, if a weights-shuffle is not supported, it may lead to failure to quantize the layer.
  • There is a known issue with the engine_refit_mnist sample on Windows. To fix the issue, edit the engine_refit_mnist/sample.py source file and move the import model line before the sys.path.insert() line.
  • An empty directory named deserializeTimer under the samples directory was left in the package by accident. This empty directory can be ignored and does not indicate that the package is corrupt or that files are missing. This issue will be corrected in the next release.
  • For DLA networks where a convolution layer consumes an NHWC network input, the compute precision of the convolution layer must match the data type of the input tensor.
  • In order to install TensorRT using the pip wheel file, ensure that the pip version is less than 20. An older version of pip is required to workaround an issue with the CUDA 11.4 wheel meta packages that TensorRT depends on. This issue is being worked on and should be resolved in a future CUDA release.
Performance
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on Maxwell and Pascal GPUs.
  • There is an up to 8% performance regression compared to TensorRT 7.1 for some networks with heavy FullyConnected operation like VGG16 on Nano.
  • There is an up to 10-11% performance regression on Xavier:
    • compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
    • compared to TensorRT 6 for ResNeXt networks with small batch (1 or 2) in FP32.
  • For networks that use deconvolution with a large kernel size, the engine build time for this layer can increase significantly on Xavier. It can also lead to the "launch timed out and was terminated" error message on Jetson Nano/TX1.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues.
  • DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • The builder may require up to 60% more memory to build an engine.
  • There is a known ~6% - ~29% performance regression on Google BERT compared to version 8.0.1.6 on NVIDIA A100 GPUs.
  • There is an up to 6% performance regression compared to TensorRT 8.0.3 for Deep Recommender on Tesla V100, Quadro GV100, and Titan V.
  • There is an up to 22% performance regression compared to TensorRT 8.0 for WaveRNN networks on Volta and Turing GPUs. This issue is being investigated.
  • There is up to 8% performance regression compared to TensorRT 8.0.3 for Cortana ASR on NVIDIA Ampere GPUs with CUDA graphs and a single stream of execution.
  • There is an up to 21% performance regression compared to TensorRT 8.0 for BERT-like networks on Xavier platforms. This issue is being investigated.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 21% performance drop compared to TensorRT 8.0 for SSD-Inception2 networks on Volta GPUs.
  • There is an up to 25% performance drop for networks using the InstanceNorm plugin. This issue is being investigated.

TensorRT Release 8.0.3

These are the TensorRT 8.0.3 release notes. This is a bug-fix release supporting Linux x86 and Windows users.

These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, see the TensorRT Archived Documentation.

Fixed Issues

  • Fixed an invalid fusion assertion problem in the fusion optimization pass.
  • Fixed other miscellaneous issues seen in proprietary networks.
  • Fixed a CUDA 11.4 NVRTC issue during kernel generation on Windows.

TensorRT Release 8.0.2

These are the TensorRT 8.0.2 release notes. This is the initial release supporting A100 for ARM server users. Only a subset of networks have been validated on ARM with A100. This is a network repository release only.

These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, see the TensorRT Archived Documentation.

TensorRT Release 8.0.1

These are the TensorRT 8.0.1 release notes and are applicable to x86 Linux and Windows users, as well as PowerPC and ARM Server Base System Architecture (SBSA) users on Linux only.

These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, see the TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.

Breaking API Changes

  • Support for Python 2 has been dropped. This means that TensorRT will no longer include wheels for Python 2, and Python samples will not work with Python 2.
  • All APIs have been marked as noexcept where appropriate. The IErrorRecorder interface has been fully integrated into the API for error reporting. The Logger is used only as a fallback when the ErrorRecorder is not provided by the user.
  • Callbacks are now marked noexcept; therefore, implementations must also be marked noexcept. TensorRT has never catered to exceptions thrown by callbacks, but this is now captured in the API.
  • Methods that take parameters of type void** where the array of pointers is unmodifiable are now changed to take type void*const*.
  • Dims is now a type alias for class Dims32. Code that forward-declares Dims should forward-declare class Dims32; using Dims = Dims32;.
  • Between TensorRT 8.0 EA and TensorRT 8.0 GA the function prototype for getLogger() has been moved from NvInferRuntimeCommon.h to NvInferRuntime.h. You may need to update your application source code if you’re using getLogger() and were previously only including NvInferRuntimeCommon.h.

Compatibility

Limitations

  • For QAT networks, TensorRT 8.0 supports per-tensor and per-axis quantization scales for weights. For activations, only per-tensor quantization is supported. Only symmetric quantization is supported; zero-points may be omitted, and if they are provided, all of their coefficients must be zero.
  • Loops and DataType::kBOOL are not supported when the static TensorRT library is used. Performance improvements for transformer based architectures such as BERT will also not be available when using the static TensorRT library.
  • When using reformat-free I/O, the extent of a tensor in a vectorized dimension might not be a multiple of the vector length. Elements in a partially occupied vector that are not within the tensor are referred to here as vector-padding. For example:
    • On GPU
      • for input tensors, the application shall set vector-padding elements to zero.
      • for output tensors, the value of vector-padding elements is undefined. In a future release, TensorRT will support setting them to zero.
    • On DLA
      • for input tensors, vector-padding elements are ignored.
      • for output tensors, vector-padding elements are unmodified.
  • When running INT8 networks on DLA using TensorRT, operations must be added to the same subgraph; this reduces quantization error across the portion of the network that runs on DLA by allowing operations to fuse and retain higher precision for intermediate results. Breaking the subgraph apart to inspect intermediate results, by marking tensors as network output tensors, disables these optimizations and can result in different levels of quantization error.
  • If both kSPARSE_WEIGHTS and kREFIT flags are set in IBuilderConfig, the convolution layers having structured sparse kernel weights cannot be refitted with new kernel weights which do not have structured sparsity. The IRefitter::setWeights() will print an error and return false in that case.
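    A minimal check for this case (a sketch, assuming refitter is an IRefitter* for such an engine and newKernel holds replacement weights without structured sparsity; the layer name is illustrative):
    if (!refitter->setWeights("conv1", nvinfer1::WeightsRole::kKERNEL, newKernel))
    {
        // Rejected: the engine was built with kSPARSE_WEIGHTS and kREFIT, and the
        // replacement kernel weights do not have structured sparsity.
    }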
  • Samples which require TensorFlow in order to run, which typically also use UFF models, are not supported on ARM SBSA releases of TensorRT 8.0. There is no good source for TensorFlow 1.15.x for AArch64 that also supports Python 3.8 which can be used to run these samples.
  • Using CUDA graph capture on TensorRT execution contexts with CUDA 10.2 on NVIDIA K80 GPUs may lead to graph capture failures. Upgrading to CUDA 11.0 or above will solve the issue. (not applicable for Jetson platforms)
  • On RHEL and CentOS, the TensorRT RPM packages for CUDA 11.3 cannot be installed alongside any CUDA 11.x Toolkit packages, like the Debian packages, due to RPM packaging limitations. The TensorRT runtime library packages can only be installed alongside CUDA 11.2 and CUDA 11.3 Toolkit packages and the TensorRT development packages can only be installed alongside CUDA 11.3 Toolkit packages. When using the TAR package, the TensorRT CUDA 11.3 build can be used with any CUDA 11.x Toolkit.

Deprecated And Removed Features

The following features are deprecated in TensorRT 8.0.1:
  • Deprecation is used to inform developers that some APIs and tools are no longer recommended for use. TensorRT has the following deprecation policy:
    • This policy comes into effect beginning with TensorRT 8.0.
    • Deprecation notices are communicated in the release notes. Deprecated API elements are marked with the TRT_DEPRECATED macro where possible.
    • TensorRT provides a 12-month migration period after the deprecation. For any APIs and tools deprecated in TensorRT 7.x, the 12-month migration period starts from the TensorRT 8.0 GA release date.
    • APIs and tools will continue to work during the migration period.
    • After the migration period ends, we reserve the right to remove the APIs and tools in a future release.
  • IRNNLayer was deprecated in TensorRT 4.0 and has been removed in TensorRT 8.0. IRNNv2Layer was deprecated in TensorRT 7.2.1 in favor of the loop API; however, it is still available for backwards compatibility. For more information about the loop API, refer to the sampleCharRNN sample with the --Iloop option as well as the Working With Loops chapter.
  • IPlugin and IPluginFactory interfaces were deprecated in TensorRT 6.0 and have been removed in TensorRT 8.0. We recommend that you write new plugins or refactor existing ones to target the IPluginV2DynamicExt and IPluginV2IOExt interfaces. For more information, refer to the Migrating Plugins From TensorRT 6.x Or 7.x To TensorRT 8.x.x section.
  • We removed samplePlugin since it was meant to demonstrate the IPluginExt interface, which is no longer supported in TensorRT 8.0.
  • We have deprecated the Caffe Parser and UFF Parser in TensorRT 7.0. They are still tested and functional in TensorRT 8.0, however, we plan to remove the support in the future. Ensure you migrate your workflow to use tf2onnx, keras2onnx or TensorFlow-TensorRT (TF-TRT) for deployment.

    If using UFF, ensure you migrate to the ONNX workflow; the ONNX workflow itself does not depend on plugin enablement. For enabling a plugin in an ONNX workflow, refer to Estimating Depth with ONNX Models and Custom Layers Using NVIDIA TensorRT.

    Caffe and UFF-specific topics in the Developer Guide have been moved to the Appendix section until removal in the subsequent major release.

  • Interface functions that provided a destroy function are deprecated in TensorRT 8.0. The destructors will be exposed publicly in order for the delete operator to work as expected on these classes.
  • nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_PRECISION is deprecated. Networks that have QuantizeLayer and DequantizeLayer layers will be automatically processed using Q/DQ-processing, which includes explicit-precision semantics. Explicit precision is a network-optimizer constraint that prevents the optimizer from performing precision-conversions that are not dictated by the semantics of the network. For more information, refer to the Working With QAT Networks section in the TensorRT Developer Guide.
  • nvinfer1::IResizeLayer::setAlignCorners and nvinfer1::IResizeLayer::getAlignCorners are deprecated. Use nvinfer1::IResizeLayer::setCoordinateTransformation, nvinfer1::IResizeLayer::setSelectorForSinglePixel and nvinfer1::IResizeLayer::setNearestRounding instead.
  • Destructors for classes with destroy() methods were previously protected. They are now public, enabling use of smart pointers for these classes. The destroy() methods are deprecated.
  • The CgPersistentLSTMPlugin_TRT plugin is deprecated.
  • sampleMovieLens and sampleMovieLensMPS have been removed from the TensorRT package.
  • The following C++ API functions, types, and a field, which were previously deprecated, were removed:
    Core Library:
    • DimensionType
    • Dims::Type
    • class DimsCHW
    • class DimsNCHW
    • class IOutputDimensionFormula
    • class IPlugin
    • class IPluginFactory
    • class IPluginLayer
    • class IRNNLayer
    • IBuilder::getEngineCapability()
    • IBuilder::allowGPUFallback()
    • IBuilder::buildCudaEngine()
    • IBuilder::canRunOnDLA()
    • IBuilder::createNetwork()
    • IBuilder::getAverageFindIterations()
    • IBuilder::getDebugSync()
    • IBuilder::getDefaultDeviceType()
    • IBuilder::getDeviceType()
    • IBuilder::getDLACore()
    • IBuilder::getFp16Mode()
    • IBuilder::getHalf2Mode()
    • IBuilder::getInt8Mode()
    • IBuilder::getMaxWorkspaceSize()
    • IBuilder::getMinFindIterations()
    • IBuilder::getRefittable()
    • IBuilder::getStrictTypeConstraints()
    • IBuilder::isDeviceTypeSet()
    • IBuilder::reset()
    • IBuilder::resetDeviceType()
    • IBuilder::setAverageFindIterations()
    • IBuilder::setDebugSync()
    • IBuilder::setDefaultDeviceType()
    • IBuilder::setDeviceType()
    • IBuilder::setDLACore()
    • IBuilder::setEngineCapability()
    • IBuilder::setFp16Mode()
    • IBuilder::setHalf2Mode()
    • IBuilder::setInt8Calibrator()
    • IBuilder::setInt8Mode()
    • IBuilder::setMaxWorkspaceSize()
    • IBuilder::setMinFindIterations()
    • IBuilder::setRefittable()
    • IBuilder::setStrictTypeConstraints()
    • ICudaEngine::getWorkspaceSize()
    • IMatrixMultiplyLayer::getTranspose()
    • IMatrixMultiplyLayer::setTranspose()
    • INetworkDefinition::addMatrixMultiply()
    • INetworkDefinition::addPlugin()
    • INetworkDefinition::addPluginExt()
    • INetworkDefinition::addRNN()
    • INetworkDefinition::getConvolutionOutputDimensionsFormula()
    • INetworkDefinition::getDeconvolutionOutputDimensionsFormula()
    • INetworkDefinition::getPoolingOutputDimensionsFormula()
    • INetworkDefinition::setConvolutionOutputDimensionsFormula()
    • INetworkDefinition::setDeconvolutionOutputDimensionsFormula()
    • INetworkDefinition::setPoolingOutputDimensionsFormula()
    • ITensor::getDynamicRange()
    • TensorFormat::kNHWC8
    • TensorFormat::NCHW
    • TensorFormat::kNC2HW2
    Plugins: The following plugin classes were removed:
    • class INvPlugin
    • createLReLUPlugin()
    • createClipPlugin()
    • PluginType
    • struct SoftmaxTree

    Plugin interface methods: For plugins based on IPluginV2DynamicExt and IPluginV2IOExt, certain methods with legacy function signatures (derived from IPluginV2 and IPluginV2Ext base classes) which were deprecated and marked for removal in TensorRT 8.0 will no longer be available. Plugins using these interface methods must stop using them or implement the versions with updated signatures, as applicable.

    Unsupported plugin methods removed in TensorRT 8.0:
    • IPluginV2DynamicExt::canBroadcastInputAcrossBatch()
    • IPluginV2DynamicExt::isOutputBroadcastAcrossBatch()
    • IPluginV2DynamicExt::getTensorRTVersion()
    • IPluginV2IOExt::configureWithFormat()
    • IPluginV2IOExt::getTensorRTVersion()
    Use updated versions for supported plugin methods:
    • IPluginV2DynamicExt::configurePlugin()
    • IPluginV2DynamicExt::enqueue()
    • IPluginV2DynamicExt::getOutputDimensions()
    • IPluginV2DynamicExt::getWorkspaceSize()
    • IPluginV2IOExt::configurePlugin()
    Use newer methods for the following:
    • IPluginV2DynamicExt::supportsFormat() has been removed; use IPluginV2DynamicExt::supportsFormatCombination() instead.
    • IPluginV2IOExt::supportsFormat() has been removed; use IPluginV2IOExt::supportsFormatCombination() instead.
    Caffe Parser:
    • class IPluginFactory
    • class IPluginFactoryExt
    • setPluginFactory()
    • setPluginFactoryExt()
    UFF Parser:
    • class IPluginFactory
    • class IPluginFactoryExt
    • setPluginFactory()
    • setPluginFactoryExt()
  • The following Python API functions, which were previously deprecated, were removed:
    Core library:
    • class DimsCHW
    • class DimsNCHW
    • class IPlugin
    • class IPluginFactory
    • class IPluginLayer
    • class IRNNLayer
    • Builder.build_cuda_engine()
    • Builder.average_find_iterations
    • Builder.debug_sync
    • Builder.fp16_mode
    • IBuilder.int8_mode
    • Builder.max_workspace_size
    • Builder.min_find_iterations
    • Builder.refittable
    • Builder.strict_type_constraints
    • ICudaEngine.max_workspace_size
    • IMatrixMultiplyLayer.transpose0
    • IMatrixMultiplyLayer.transpose1
    • INetworkDefinition.add_matrix_multiply_deprecated()
    • INetworkDefinition.add_plugin()
    • INetworkDefinition.add_plugin_ext()
    • INetworkDefinition.add_rnn()
    • INetworkDefinition.convolution_output_dimensions_formula
    • INetworkDefinition.deconvolution_output_dimensions_formula
    • INetworkDefinition.pooling_output_dimensions_formula
    • ITensor.get_dynamic_range()
    • Dims.get_type()
    • TensorFormat.HWC8
    • TensorFormat.NCHW
    • TensorFormat.NCHW2
    Caffe Parser:
    • class IPluginFactory
    • class IPluginFactoryExt
    • CaffeParser.plugin_factory
    • CaffeParser.plugin_factory_ext
    UFF Parser:
    • class IPluginFactory
    • class IPluginFactoryExt
    • UffParser.plugin_factory
    • UffParser.plugin_factory_ext
    Plugins:
    • class INvPlugin
    • createLReLUPlugin()
    • createClipPlugin()
    • PluginType
    • struct SoftmaxTree

Fixed Issues

  • The diagram in IRNNv2Layer was incorrect. The diagram has been updated and fixed.
  • Improved build times for convolution layers with dynamic shapes and a large range of leading dimensions.
  • TensorRT 8.0 no longer requires libcublas.so.* to be present on your system when running an application which was linked with the TensorRT static library. The TensorRT static library now requires cuBLAS and other dependencies to be linked at link time and will no longer open these libraries using dlopen().
  • TensorRT 8.0 no longer requires an extra Identity layer between an ElementWise layer and a Constant whose rank is > 4. In TensorRT 7.x, for cases like Convolution and FullyConnected with bias, where ONNX decomposes the bias into an ElementWise operation, the relevant fusion did not support per-element scale, so an Identity layer was inserted as a workaround.
  • There was a known performance regression compared to TensorRT 7.1 for Convolution layers with kernel size greater than 5x5. For example, it could lead up to 35% performance regression of the VGG16 UFF model compared to TensorRT 7.1. This issue has been fixed in this release.
  • When running networks such as Cortana, LSTM Peephole, MLP, and Faster RCNN, there was a 5% to 16% performance regression on GA102 devices and a 7% to 36% performance regression on GA104 devices. This issue has been fixed in this release. (not applicable for Jetson platforms)
  • Some RNN networks such as Cortana with FP32 precision and batch size of 8 or higher had a 20% performance loss with CUDA 11.0 or higher compared to CUDA 10.2. This issue has been fixed in this release.
  • There was an issue when compiling the TensorRT samples with a GCC version less than 5.x and using the static libraries which resulted in the error message munmap_chunk(): invalid pointer. RHEL/CentOS 7.x users were most likely to have observed this issue. This issue has been fixed in this release.
  • cuTENSOR, used by TensorRT 8.0 EA, was known to have significant performance regressions with the CUDA 11.3 compiler. This regression has been fixed by the CUDA 11.3 update 1 compiler.
  • The installation of PyTorch Quantization Toolkit requires Python version >=3.7, GCC version >=5.4. The specific version of Python may be missing from some operating systems and will need to be separately installed. Refer to the README instructions for the workaround.
  • On platforms with Python >= 3.8, TensorFlow 1.x must be installed from the NVIDIA Python package index. For example:
    pip install --extra-index-url https://pypi.ngc.nvidia.com "nvidia-tensorflow; python_version==3.8"
  • There was an up to 15% performance regression compared to TensorRT 7.2.3 for QuartzNet variants on Volta GPUs. This issue has been fixed in this release.
  • MNIST images used by the samples previously had to be downloaded manually. These images are now shipped with the samples.
  • You may observe relocation issues during linking if the resulting binary exceeds 2 GB. This can occur if you are linking TensorRT and all of its dependencies into your application statically. A workaround for this linking issue has been documented in the TensorRT Sample Support Guide under Limitations.
  • IProfiler would not correctly call user-implemented methods when used from the Python API. This issue has been fixed in this release.
  • TensorRT memory usage has improved and can be better managed via IGpuAllocator::reallocate when more memory is required.
  • TensorRT refitting performance has been improved, especially for large weights and when multiple weights are refitted at the same time. Refitting performance will continue to be optimized in later releases.
  • The interfaces that took an argument of type void** (for example, enqueueV2) now declare it as void*const*.
  • There was an up to 24% performance regression in TensorRT 8.0.0 compared to TensorRT 7.2.3 for networks containing Slice layers on Turing GPUs. This issue has been fixed.
  • There was an up to 8% performance regression in TensorRT 8.0.0 compared to TensorRT 7.2.3 for DenseNet variants on Volta GPUs. This issue has been fixed in this release.
  • If input tensors with dynamic shapes are found to be inconsistent with the selected optimization profile during engine building or inference, an error message is now issued and the program exits gracefully, instead of failing an assertion and exiting abnormally.
  • When running TensorRT 8.0.0 with cuDNN 8.2.0, there is a known performance regression for the deconvolution layer compared to running with previous cuDNN releases. For example, some deconvolution layers can have up to 7x performance regression on Turing GPUs compared to running with cuDNN 8.0.4. This has been fixed in the latest cuDNN 8.2.1 release.

Announcements

  • TensorRT 8.0 will be the last TensorRT release that will provide support for Ubuntu 16.04. This also means TensorRT 8.0 will be the last TensorRT release that will support Python 3.5.
  • Python samples use a unified data downloading workflow. Each sample has a YAML file (download.yml) describing the data files, if any, that must be downloaded via a link before running the sample. The download tool parses the YAML and downloads the data files. All other sample code assumes that the data has been downloaded before the code is invoked. An error is raised if the data has not been correctly downloaded. Refer to the Python sample documentation for more information.

Known Issues

  • The TensorRT ARM SBSA cross packages in the CUDA network repository cannot be installed because cuDNN ARM SBSA cross packages are not available, which is a dependency of the TensorRT cross packages. The TensorRT ARM SBSA cross packages may be removed in the near future. You should use the native TensorRT ARM SBSA packages instead.
  • There is a known issue that graph capture may fail in some cases for IExecutionContext::enqueue() and IExecutionContext::enqueueV2(). For more information, refer to the documentation for IExecutionContext::enqueueV2(), including how to work around this issue.
  • On PowerPC, some RNN networks have up to a 15% performance regression compared to TensorRT 7.0. (not applicable for Jetson platforms)
  • Some fusions are not enabled when the TensorRT static library is used. This means there is a performance loss of around 10% for networks like BERT and YOLO3 when linking with the static library compared to the dynamic library. The performance loss depends on the precision and batch size, and it can be up to 60% in some cases.
  • The UFF parser generates unused IConstantLayer objects that are visible via method NetworkDefinition::getLayer but optimized away by TensorRT, so any attempt to refit those weights with IRefitter::setWeights will be rejected. Given an IConstantLayer* layer, you can detect whether it is used for execution by checking: layer->getOutput(0)->isExecutionTensor().
  • The ONNX parser does not support RNN, LSTM, and GRU nodes when the activation type of the forward pass does not match the activation type of the reverse pass in bidirectional cases.
  • There is a known performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on Maxwell and Pascal GPUs.
  • There is an up to 8% performance regression compared to TensorRT 7.1 for some networks with heavy FullyConnected operation like VGG16 on Nano.
  • There is a known issue that TensorRT selects kLINEAR format when the user uses reformat-free I/O with vectorized formats and with input/output tensors which have only 3 dimensions. The workaround is to add an additional dimension to the tensors with size 1 to make them 4 dimensional tensors.
  • Because DLA Deconvolution layers with square kernels and strides between 23 and 32 significantly slow down compilation, TensorRT disables them from running on DLA.
  • There are some known false alarms reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. The recommendation to suppress the false alarms is to provide a Valgrind suppression file with the following contents when running the Valgrind memory leak check tool.
    {
       Memory leak errors with dlopen
       Memcheck:Leak
       match-leak-kinds: definite
       ...
       fun:*dlopen*
       ...
    }
    {
       Tegra ioctl false alarm
       Memcheck:Param
       ioctl(TCGETA)
       fun:ioctl
       ...
       obj:*libnvrm_gpu.so*
       ...
       obj:*libcuda.so*
    }

    The suppression file resolves the false alarms about definite losses related to dlopen() and ioctl() on the Tegra platform. The remaining false alarm, a sole malloc() call without any call stack, cannot be added to the suppression file.

  • There is an up to 150% performance regression compared to TensorRT 7.2.3 for 3D U-Net variants on NVIDIA Ampere GPUs, if the optimal algorithm choice is constrained by the available workspace. To work around this issue, enlarge the workspace size. (not applicable for Jetson platforms)
  • PluginFieldCollection in the Python API may prematurely deallocate PluginFields. To work around this, assign the list of plugin fields to a named variable:
    plugin_fields = [trt.PluginField(...), ...]
    plugin_field_collection = trt.PluginFieldCollection(plugin_fields)
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it falls back to using cuBLAS. (not applicable for Jetson platforms)
  • On Jetson devices, the power consumption may increase for the sake of performance improvement when compared against TensorRT 7.1. No significant drop in the performance per watt has been observed.
  • There is an up to 15% performance regression compared to TensorRT 7.2.3 for path perception network (Pathnet) in FP32.
  • There is an up to 10-11% performance regression on Xavier:
    • compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
    • compared to TensorRT 6 for ResNeXt networks with small batch (1 or 2) in FP32.
  • For networks that use deconvolution with a large kernel size, the engine build time for this layer can increase significantly on Xavier. It can also lead to the "launch timed out and was terminated" error message on Jetson Nano/TX1.
  • For some networks with large amounts of weights and activation data, DLA may fail to compile the subgraph, and that subgraph will fall back to the GPU.
  • There is an up to 10% performance regression when TensorRT is used with cuDNN 8.1 or 8.2. When cuDNN 8.0 is used, the performance is recovered. (not applicable for Jetson platforms)
  • There is an up to 6% performance regression compared to TensorRT 7.2.3 for WaveRNN in FP16 on Volta and Turing platforms.
  • On embedded devices, TensorRT attempts to avoid testing kernel candidates whose memory requirements would trigger the Out of Memory (OOM) killer. If it does trigger, consider reducing the memory requirement for the model by reducing index dimensions, or maximize the available memory by closing other applications.
  • There is a known accuracy issue of GoogLeNet variants with NVIDIA Ampere GPUs where TF32 mode is enabled by default on Windows. (not applicable for Jetson platforms)
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. When CUDA 11.0 is used, the regression is recovered. (not applicable for Jetson platforms)
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues.
  • There is a known 4% accuracy regression with Faster R-CNN NasNet network with NVIDIA Ampere and Turing GPUs. (not applicable for Jetson platforms)
  • Under some conditions, RNNv2Layer can require a larger workspace size than previous versions of TensorRT in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • Engine build times for TensorRT 8.0 may be slower than TensorRT 7.2 due to the engine optimizer being more aggressive.
  • There is an up to 30% performance regression with QAT (quantization-aware-training) EfficientNet networks on V100 compared to TensorRT 7.2. (not applicable for Jetson platforms)
  • The new Python sample efficientdet is only available in the OSS release and will be added in the core package in the next release.

TensorRT Release 8.0.0 Early Access (EA)

These are the TensorRT 8.0.0 Early Access (EA) release notes and are applicable to Linux x86 users.

These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 7.x.x release as well as the following additional changes. For previous TensorRT documentation, see the TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.
  • Added support for RedHat/CentOS 8.3, Ubuntu 20.04, and SUSE Linux Enterprise Server 15 Linux distributions. Only a tar file installation is supported on SLES 15 at this time. For more information, refer to the TensorRT Installation Guide.
  • Added Python 3.9 support. Use a tar file installation to obtain the new Python wheel files. For more information, refer to the TensorRT Installation Guide.
  • Added ResizeCoordinateTransformation, ResizeSelector, and ResizeRoundMode; three new enumerations to IResizeLayer, and enhanced IResizeLayer to support more resize modes from TensorFlow, PyTorch, and ONNX. For more information, refer to the IResizeLayer section in the TensorRT Developer Guide.
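    A minimal sketch using the new enumerations, assuming network is an INetworkDefinition* and input is an ITensor* (output dimensions or scales must still be set separately):
    auto* resize = network->addResize(*input);
    resize->setResizeMode(nvinfer1::ResizeMode::kLINEAR);
    resize->setCoordinateTransformation(nvinfer1::ResizeCoordinateTransformation::kHALF_PIXEL);
    resize->setSelectorForSinglePixel(nvinfer1::ResizeSelector::kFORMULA);
    resize->setNearestRounding(nvinfer1::ResizeRoundMode::kFLOOR); // used only by kNEAREST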
  • The builder timing cache can be serialized and reused across builder instances. For more information, refer to the Builder Layer Timing Cache and trtexec sections in the TensorRT Developer Guide.
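    A minimal sketch of serializing and reusing the timing cache, assuming config is an IBuilderConfig* (error checking and cleanup omitted):
    // First build: start from an empty cache, then persist it after building.
    nvinfer1::ITimingCache* cache = config->createTimingCache(nullptr, 0);
    config->setTimingCache(*cache, /*ignoreMismatch=*/false);
    // ... build the engine, then write cache->serialize() (an IHostMemory*) to disk ...
    // Later builds: recreate the cache from the saved blob.
    // nvinfer1::ITimingCache* reloaded = config->createTimingCache(blobData, blobSize);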
  • Added convolution and fully-connected tactics which support and make use of structured sparsity in kernel weights. This feature can be enabled by setting the kSPARSE_WEIGHTS flag in IBuilderConfig. This feature is only available on NVIDIA Ampere GPUs. For more information, refer to the Structured Sparsity section in the Best Practices For TensorRT Performance guide. (not applicable for Jetson platforms)
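    Opting in is a single builder flag (a sketch, assuming config is an IBuilderConfig*):
    config->setFlag(nvinfer1::BuilderFlag::kSPARSE_WEIGHTS);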
  • Added two new layers to the API: IQuantizeLayer and IDequantizeLayer, which can be used to explicitly specify the precision of operations and data buffers. ONNX’s QuantizeLinear and DequantizeLinear operators are mapped to these new layers, enabling support for networks trained using the Quantization-Aware Training (QAT) methodology. For more information, refer to the Explicit-Quantization, IQuantizeLayer, and IDequantizeLayer sections in the TensorRT Developer Guide and Q/DQ Fusion in the Best Practices For TensorRT Performance guide.
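    A minimal sketch of an explicit Q/DQ pair, assuming network is an INetworkDefinition* and input is an FP32 ITensor*; the scale value is illustrative, and the per-tensor scale is expressed here as a 0-D constant (adjust the dims if your build expects a one-element 1-D scale):
    static float const scaleVal = 0.05f;                  // illustrative per-tensor scale
    nvinfer1::Dims scalarDims{};
    scalarDims.nbDims = 0;                                // scalar (0-D) constant
    nvinfer1::Weights scaleW{nvinfer1::DataType::kFLOAT, &scaleVal, 1};
    auto* scale = network->addConstant(scalarDims, scaleW)->getOutput(0);
    auto* q = network->addQuantize(*input, *scale);       // FP32 -> INT8
    auto* dq = network->addDequantize(*q->getOutput(0), *scale); // INT8 -> FP32
    // dq->getOutput(0) carries the simulated-quantized FP32 values.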
  • Optimized QuartzNet with support for a fused 1D depthwise + pointwise convolution kernel, achieving up to 1.8x end-to-end performance improvement on A100. (not applicable for Jetson platforms)
  • Added support for the following ONNX operators: Celu, CumSum, EyeLike, GatherElements, GlobalLpPool, GreaterOrEqual, LessOrEqual, LpNormalization, LpPool, ReverseSequence, and SoftmaxCrossEntropyLoss. For more information, refer to the Supported Ops section in the TensorRT Support Matrix.
  • Added Sigmoid/Tanh INT8 support for DLA. This allows DLA subgraphs containing Sigmoid/Tanh to compile with INT8 by automatically upgrading those layers to FP16 internally. For more information, refer to the DLA Supported Layers section in the TensorRT Developer Guide.
  • Added DLA native planar format and DLA native gray-scale format support.
  • Allowed generating reformat-free engines with DLA when EngineCapability is EngineCapability::kDEFAULT.
  • TensorRT now declares APIs with the noexcept keyword to clarify that exceptions must not cross the library boundary. All TensorRT classes that an application inherits from (such as IGpuAllocator or IPluginV2) must guarantee that methods called by TensorRT do not throw uncaught exceptions; otherwise, the behavior is undefined.
  • TensorRT reports errors, along with an associated ErrorCode, via the ErrorRecorder API for all errors. The ErrorRecorder falls back to legacy logger reporting, with Severity::kERROR or Severity::kINTERNAL_ERROR, if no error recorder is registered. The ErrorCodes allow recovery in cases where TensorRT previously reported non-recoverable situations.
  • Improved performance of the GlobalAveragePooling operation, which is used in some CNNs like EfficientNet. For transformer-based networks with INT8 precision, it is recommended to use a network trained using Quantization-Aware Training (QAT) that has IQuantizeLayer and IDequantizeLayer layers in the network definition.
  • TensorRT now supports refit weights via names. For more information, refer to Refitting An Engine in the TensorRT Developer Guide.
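    A minimal sketch of refitting by name, assuming refitter is an IRefitter* created for the engine, and newData/weightCount describe the replacement weights (the weight name is illustrative and comes from the source model):
    nvinfer1::Weights newW{nvinfer1::DataType::kFLOAT, newData, weightCount};
    refitter->setNamedWeights("conv1.weight", newW);  // locate the weights by name
    refitter->refitCudaEngine();                      // returns false if weights are missing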
  • Refitting performance has been improved. The performance boost can be evident when the weights are large or a large number of weights or layers are updated at the same time.
  • Added a new sample. This sample, engine_refit_onnx_bidaf, builds an engine from the ONNX BiDAF model and refits the TensorRT engine with weights from the model. The new refit APIs allow users to locate the weights via names from ONNX models instead of layer names and weights roles. For more information, refer to Refitting An Engine Built From An ONNX Model In Python in the TensorRT Sample Support Guide.
  • Improved performance for the transformer based networks such as BERT and other networks that use Multi-Head Self-Attention.
  • Added cuDNN to the IBuilderConfig::setTacticSources enum. Use of cuDNN as a source of operator implementations can be enabled or disabled using the IBuilderConfig::setTacticSources API function.
  • The following C++ API functions were added:
    • class IDequantizeLayer
    • class IQuantizeLayer
    • class ITimingCache
    • IBuilder::buildSerializedNetwork()
    • IBuilderConfig::getTimingCache()
    • IBuilderConfig::setTimingCache()
    • IGpuAllocator::reallocate()
    • INetworkDefinition::addDequantize()
    • INetworkDefinition::addQuantize()
    • INetworkDefinition::setWeightsName()
    • IPluginRegistry::deregisterCreator()
    • IRefitter::getMissingWeights()
    • IRefitter::getAllWeights()
    • IRefitter::setNamedWeights()
    • IResizeLayer::getCoordinateTransformation()
    • IResizeLayer::getNearestRounding()
    • IResizeLayer::getSelectorForSinglePixel()
    • IResizeLayer::setCoordinateTransformation()
    • IResizeLayer::setNearestRounding()
    • IResizeLayer::setSelectorForSinglePixel()
    • IScaleLayer::setChannelAxis()
    • enum ResizeCoordinateTransformation
    • enum ResizeMode
    • BuilderFlag::kSPARSE_WEIGHTS
    • TacticSource::kCUDNN
    • TensorFormat::kDLA_HWC4
    • TensorFormat::kDLA_LINEAR
    • TensorFormat::kHWC16
  • The following Python API functions were added:
    • class IDequantizeLayer
    • class IQuantizeLayer
    • class ITimingCache
    • Builder.build_serialized_network()
    • IBuilderConfig.get_timing_cache()
    • IBuilderConfig.set_timing_cache()
    • IGpuAllocator.reallocate()
    • INetworkDefinition.add_dequantize()
    • INetworkDefinition.add_quantize()
    • INetworkDefinition.set_weights_name()
    • IPluginRegistry.deregister_creator()
    • IRefitter.get_missing_weights()
    • IRefitter.get_all_weights()
    • IRefitter.set_named_weights()
    • IResizeLayer.coordinate_transformation
    • IResizeLayer.nearest_rounding
    • IResizeLayer.selector_for_single_pixel
    • IScaleLayer.channel_axis
    • enum ResizeCoordinateTransformation
    • enum ResizeMode
    • BuilderFlag.SPARSE_WEIGHTS
    • TacticSource.CUDNN
    • TensorFormat.DLA_HWC4
    • TensorFormat.DLA_LINEAR
    • TensorFormat.HWC16

Breaking API Changes

  • Support for Python 2 has been dropped. This means that TensorRT will no longer include wheels for Python 2, and Python samples will not work with Python 2.
  • All APIs have been marked as noexcept where appropriate. The IErrorRecorder interface has been fully integrated into the API for error reporting. The Logger is used only as a fallback when the ErrorRecorder is not provided by the user.
  • Callbacks are now marked noexcept; therefore, implementations must also be marked noexcept. TensorRT has never catered to exceptions thrown by callbacks, but this is now captured in the API.
  • Methods that take parameters of type void** where the array of pointers is unmodifiable are now changed to take type void*const*.
  • Dims is now a type alias for class Dims32. Code that forward-declares Dims should forward-declare class Dims32; using Dims = Dims32;.

Compatibility

  • TensorRT 8.0.0 EA has been tested with the following:
  • This TensorRT release supports CUDA:
    Note: There are two TensorRT binary builds for CUDA 11.0 and CUDA 11.3. The build for CUDA 11.3 is compatible with CUDA 11.1 and CUDA 11.2 libraries. For both builds, CUDA driver compatible with the runtime CUDA version is required (see Table 2 here). For the CUDA 11.3 build, driver version 465 or above is suggested for best performance.
  • It is suggested that you use TensorRT with a software stack that has been tested; including cuDNN and cuBLAS versions as documented in the Features For Platforms And Software section. Other semantically compatible releases of cuDNN and cuBLAS can be used, however, other versions may have performance improvements as well as regressions. In rare cases, functional regressions might also be observed.

Limitations

  • For QAT networks, TensorRT 8.0 supports per-tensor and per-axis quantization scales for weights. For activations, only per-tensor quantization is supported. Only symmetric quantization is supported; zero-points may be omitted, and if zero-points are provided, all of their coefficients must be zero.
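    For illustration, a minimal sketch (hypothetical network, input tensor, and scale value) of adding a per-tensor, symmetric Q/DQ pair with the addQuantize()/addDequantize() APIs introduced in this release:

      // The scale is a rank-0 (scalar) constant for per-tensor quantization.
      // Note: the scale data must remain valid until the engine is built.
      static float scaleValue = 0.0125F;  // hypothetical quantization scale
      nvinfer1::Weights scaleWeights{nvinfer1::DataType::kFLOAT, &scaleValue, 1};
      nvinfer1::ITensor* scale =
          network->addConstant(nvinfer1::Dims{0, {}}, scaleWeights)->getOutput(0);
      nvinfer1::IQuantizeLayer* q = network->addQuantize(*input, *scale);
      nvinfer1::IDequantizeLayer* dq = network->addDequantize(*q->getOutput(0), *scale);
      // Zero-points are implicitly zero; only symmetric quantization is supported.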
  • Loops and DataType::kBOOL are not supported when the static TensorRT library is used. Performance improvements for transformer-based architectures such as BERT will also not be available when using the static TensorRT library.
  • When using reformat-free I/O, the extent of a tensor in a vectorized dimension might not be a multiple of the vector length. Elements in a partially occupied vector that are not within the tensor are referred to here as vector-padding. For example:
    • On GPU
      • for input tensors, the application shall set vector-padding elements to zero (a sketch follows this list).
      • for output tensors, the value of vector-padding elements is undefined. In a future release, TensorRT will support setting them to zero.
    • On DLA
      • for input tensors, vector-padding elements are ignored.
      • for output tensors, vector-padding elements are unmodified.
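    Below is a minimal sketch (hypothetical shapes and buffer name) of zeroing vector-padding for a GPU input tensor bound with TensorFormat::kHWC16, where C is vectorized with length 16:

      #include <cuda_fp16.h>
      #include <cuda_runtime.h>

      void prepareVectorizedInput()
      {
          // Logical NHWC extents; C=3 is not a multiple of the vector length.
          int const n = 1, h = 224, w = 224, c = 3;
          int const vecLen = 16;                                   // vector length of kHWC16
          int const cPadded = (c + vecLen - 1) / vecLen * vecLen;  // rounds up to 16
          size_t const bytes = size_t(n) * h * w * cPadded * sizeof(__half);
          void* deviceInput = nullptr;
          cudaMalloc(&deviceInput, bytes);
          // Zero the whole buffer first so vector-padding elements are zero,
          // then copy the real data into the occupied elements.
          cudaMemset(deviceInput, 0, bytes);
      }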
  • When running INT8 networks on DLA using TensorRT, add operations to the same subgraph so that they can fuse and keep higher precision for intermediate results; this reduces quantization error across the part of the network that runs on the DLA. Breaking the subgraph apart to inspect intermediate results, by marking tensors as network output tensors, disables these optimizations and can result in different levels of quantization error.
  • If both the kSPARSE_WEIGHTS and kREFIT flags are set in IBuilderConfig, convolution layers whose kernel weights are structured-sparse cannot be refitted with new kernel weights that lack structured sparsity. IRefitter::setWeights() will print an error and return false in that case.
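    For example, a minimal sketch (hypothetical layer name and weights) of detecting the rejection at refit time:

      // setWeights() returns false when the new kernel weights are rejected
      // because they lack structured sparsity.
      if (!refitter->setWeights("conv1", nvinfer1::WeightsRole::kKERNEL, newKernelWeights))
      {
          // Rebuild the engine without kSPARSE_WEIGHTS, or supply kernel
          // weights that have 2:4 structured sparsity.
      }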

Deprecated And Removed Features

The following features are deprecated in TensorRT 8.0.0:
  • Deprecation is used to inform developers that some APIs and tools are no longer recommended for use. TensorRT has the following deprecation policy:
    • This policy comes into effect beginning with TensorRT 8.0.
    • Deprecation notices are communicated in the release notes. Deprecated API elements are marked with the TRT_DEPRECATED macro where possible.
    • TensorRT provides a 12-month migration period after the deprecation. For any APIs and tools deprecated in TensorRT 7.x, the 12-month migration period starts from the TensorRT 8.0 GA release date.
    • APIs and tools will continue to work during the migration period.
    • After the migration period ends, we reserve the right to remove the APIs and tools in a future release.
  • IRNNLayer was deprecated in TensorRT 4.0 and has been removed in TensorRT 8.0. IRNNv2Layer was deprecated in TensorRT 7.2.1 in favor of the loop API; however, it is still available for backwards compatibility. For more information about the loop API, refer to the sampleCharRNN sample with the --Iloop option as well as the Working With Loops chapter.
  • IPlugin and IPluginFactory interfaces were deprecated in TensorRT 6.0 and have been removed in TensorRT 8.0. We recommend that you write new plugins or refactor existing ones to target the IPluginV2DynamicExt and IPluginV2IOExt interfaces. For more information, refer to the Migrating Plugins From TensorRT 6.x Or 7.x To TensorRT 8.x.x section.
  • We removed samplePlugin since it was meant to demonstrate the IPluginExt interface, which is no longer supported in TensorRT 8.0.
  • We have deprecated the Caffe Parser and UFF Parser in TensorRT 7.0. They are still tested and functional in TensorRT 8.0, however, we plan to remove the support in the future. Ensure you migrate your workflow to use tf2onnx, keras2onnx or TensorFlow-TensorRT (TF-TRT) for deployment.

    If you are using UFF, ensure you migrate to the ONNX workflow. The ONNX workflow itself does not depend on plugins, but custom layers can be enabled through them; for an example of enabling a plugin with an ONNX model, refer to Estimating Depth with ONNX Models and Custom Layers Using NVIDIA TensorRT.

    Caffe and UFF-specific topics in the Developer Guide have been moved to the Appendix section until removal in the subsequent major release.

  • Interfaces that provided a destroy() function are deprecated in TensorRT 8.0. The destructors are exposed publicly so that the delete operator works as expected on these classes.
  • nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_PRECISION is deprecated. Networks that have QuantizeLayer and DequantizeLayer layers will be automatically processed using Q/DQ-processing, which includes explicit-precision semantics. Explicit precision is a network-optimizer constraint that prevents the optimizer from performing precision-conversions that are not dictated by the semantics of the network. For more information, refer to the Working With QAT Networks section in the TensorRT Developer Guide.
  • nvinfer1::IResizeLayer::setAlignCorners and nvinfer1::IResizeLayer::getAlignCorners are deprecated. Use nvinfer1::IResizeLayer::setCoordinateTransformation, nvinfer1::IResizeLayer::setSelectorForSinglePixel and nvinfer1::IResizeLayer::setNearestRounding instead.
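    For example, a call such as resize->setAlignCorners(true) maps onto the new API as in this sketch:

      // Replacement for the deprecated setAlignCorners(true):
      resize->setCoordinateTransformation(nvinfer1::ResizeCoordinateTransformation::kALIGN_CORNERS);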
  • Destructors for classes with destroy() methods were previously protected. They are now public, enabling use of smart pointers for these classes. The destroy() methods are deprecated.
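    For example, a minimal sketch (assuming an existing ILogger instance named logger) of relying on the now-public destructors instead of destroy():

      #include <memory>
      #include "NvInfer.h"

      // The unique_ptr deleter invokes the now-public destructor directly;
      // calling builder->destroy() is deprecated.
      std::unique_ptr<nvinfer1::IBuilder> builder{nvinfer1::createInferBuilder(logger)};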
  • The following C++ API functions, types, and a field, which were previously deprecated, were removed:
    Core Library:
    • DimensionType
    • Dims::Type
    • class DimsCHW
    • class DimsNCHW
    • class IOutputDimensionFormula
    • class IPlugin
    • class IPluginFactory
    • class IPluginLayer
    • class IRNNLayer
    • IBuilder::getEngineCapability()
    • IBuilder::allowGPUFallback()
    • IBuilder::buildCudaEngine()
    • IBuilder::canRunOnDLA()
    • IBuilder::createNetwork()
    • IBuilder::getAverageFindIterations()
    • IBuilder::getDebugSync()
    • IBuilder::getDefaultDeviceType()
    • IBuilder::getDeviceType()
    • IBuilder::getDLACore()
    • IBuilder::getFp16Mode()
    • IBuilder::getHalf2Mode()
    • IBuilder::getInt8Mode()
    • IBuilder::getMaxWorkspaceSize()
    • IBuilder::getMinFindIterations()
    • IBuilder::getRefittable()
    • IBuilder::getStrictTypeConstraints()
    • IBuilder::isDeviceTypeSet()
    • IBuilder::reset()
    • IBuilder::resetDeviceType()
    • IBuilder::setAverageFindIterations()
    • IBuilder::setDebugSync()
    • IBuilder::setDefaultDeviceType()
    • IBuilder::setDeviceType()
    • IBuilder::setDLACore()
    • IBuilder::setEngineCapability()
    • IBuilder::setFp16Mode()
    • IBuilder::setHalf2Mode()
    • IBuilder::setInt8Calibrator()
    • IBuilder::setInt8Mode()
    • IBuilder::setMaxWorkspaceSize()
    • IBuilder::setMinFindIterations()
    • IBuilder::setRefittable()
    • IBuilder::setStrictTypeConstraints()
    • ICudaEngine::getWorkspaceSize()
    • IMatrixMultiplyLayer::getTranspose()
    • IMatrixMultiplyLayer::setTranspose()
    • INetworkDefinition::addMatrixMultiply()
    • INetworkDefinition::addPlugin()
    • INetworkDefinition::addPluginExt()
    • INetworkDefinition::addRNN()
    • INetworkDefinition::getConvolutionOutputDimensionsFormula()
    • INetworkDefinition::getDeconvolutionOutputDimensionsFormula()
    • INetworkDefinition::getPoolingOutputDimensionsFormula()
    • INetworkDefinition::setConvolutionOutputDimensionsFormula()
    • INetworkDefinition::setDeconvolutionOutputDimensionsFormula()
    • INetworkDefinition::setPoolingOutputDimensionsFormula()
    • ITensor::getDynamicRange()
    • TensorFormat::kNHWC8
    • TensorFormat::kNCHW
    • TensorFormat::kNC2HW2
    Plugins: The following plugin classes, functions, and types were removed:
    • class INvPlugin
    • createLReLUPlugin()
    • createClipPlugin()
    • PluginType
    • struct SoftmaxTree

    Plugin interface methods: For plugins based on IPluginV2DynamicExt and IPluginV2IOExt, certain methods with legacy function signatures (derived from IPluginV2 and IPluginV2Ext base classes) which were deprecated and marked for removal in TensorRT 8.0 will no longer be available. Plugins using these interface methods must stop using them or implement the versions with updated signatures, as applicable.

    Unsupported plugin methods removed in TensorRT 8.0:
    • IPluginV2DynamicExt::canBroadcastInputAcrossBatch()
    • IPluginV2DynamicExt::isOutputBroadcastAcrossBatch()
    • IPluginV2DynamicExt::getTensorRTVersion()
    • IPluginV2IOExt::configureWithFormat()
    • IPluginV2IOExt::getTensorRTVersion()
    Use updated versions for supported plugin methods:
    • IPluginV2DynamicExt::configurePlugin()
    • IPluginV2DynamicExt::enqueue()
    • IPluginV2DynamicExt::getOutputDimensions()
    • IPluginV2DynamicExt::getWorkspaceSize()
    • IPluginV2IOExt::configurePlugin()
    Use newer methods for the following:
    • IPluginV2DynamicExt::supportsFormat() has been removed; use IPluginV2DynamicExt::supportsFormatCombination() instead.
    • IPluginV2IOExt::supportsFormat() has been removed; use IPluginV2IOExt::supportsFormatCombination() instead (a sketch of the replacement follows).
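    A minimal sketch (not taken from a shipped plugin) of the replacement method for an IPluginV2DynamicExt-based plugin:

      // Example policy: accept FP32 linear tensors for every input and output.
      bool supportsFormatCombination(int32_t pos, nvinfer1::PluginTensorDesc const* inOut,
                                     int32_t nbInputs, int32_t nbOutputs) noexcept override
      {
          return inOut[pos].type == nvinfer1::DataType::kFLOAT
              && inOut[pos].format == nvinfer1::TensorFormat::kLINEAR;
      }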
    Caffe Parser:
    • class IPluginFactory
    • class IPluginFactoryExt
    • setPluginFactory()
    • setPluginFactoryExt()
    UFF Parser:
    • class IPluginFactory
    • class IPluginFactoryExt
    • setPluginFactory()
    • setPluginFactoryExt()
  • The following Python API functions, which were previously deprecated, were removed:
    Core library:
    • class DimsCHW
    • class DimsNCHW
    • class IPlugin
    • class IPluginFactory
    • class IPluginLayer
    • class IRNNLayer
    • Builder.build_cuda_engine()
    • Builder.average_find_iterations
    • Builder.debug_sync
    • Builder.fp16_mode
    • Builder.int8_mode
    • Builder.max_workspace_size
    • Builder.min_find_iterations
    • Builder.refittable
    • Builder.strict_type_constraints
    • ICudaEngine.max_workspace_size
    • IMatrixMultiplyLayer.transpose0
    • IMatrixMultiplyLayer.transpose1
    • INetworkDefinition.add_matrix_multiply_deprecated()
    • INetworkDefinition.add_plugin()
    • INetworkDefinition.add_plugin_ext()
    • INetworkDefinition.add_rnn()
    • INetworkDefinition.convolution_output_dimensions_formula
    • INetworkDefinition.deconvolution_output_dimensions_formula
    • INetworkDefinition.pooling_output_dimensions_formula
    • ITensor.get_dynamic_range()
    • Dims.get_type()
    • TensorFormat.HWC8
    • TensorFormat.NCHW
    • TensorFormat.NCHW2
    Caffe Parser:
    • class IPluginFactory
    • class IPluginFactoryExt
    • CaffeParser.plugin_factory
    • CaffeParser.plugin_factory_ext
    UFF Parser:
    • class IPluginFactory
    • class IPluginFactoryExt
    • UffParser.plugin_factory
    • UffParser.plugin_factory_ext
    Plugins:
    • class INvPlugin
    • createLReLUPlugin()
    • createClipPlugin()
    • PluginType
    • struct SoftmaxTree

Fixed Issues

  • Improved build times for convolution layers with dynamic shapes and a large range of leading dimensions.
  • TensorRT 8.0 no longer requires libcublas.so.* to be present on your system when running an application which was linked with the TensorRT static library. The TensorRT static library now requires cuBLAS and other dependencies to be linked at link time and will no longer open these libraries using dlopen().
  • TensorRT 8.0 no longer requires an extra Identity layer between an ElementWise layer and a Constant whose rank is > 4. In TensorRT 7.x, for cases such as Convolution and FullyConnected with bias, where the ONNX parser decomposes the bias into an ElementWise layer, the relevant fusion did not support per-element scale, so an Identity layer was previously inserted as a workaround.
  • There was a known performance regression compared to TensorRT 7.1 for Convolution layers with kernel sizes greater than 5x5. For example, it could cause up to a 35% performance regression for the VGG16 UFF model compared to TensorRT 7.1. This issue has been fixed in this release.
  • When running networks such as Cortana, LSTM Peephole, MLP, and Faster RCNN, there was a 5% to 16% performance regression on GA102 devices and a 7% to 36% performance regression on GA104 devices. This issue has been fixed in this release. (not applicable for Jetson platforms)
  • Some RNN networks such as Cortana with FP32 precision and batch size of 8 or higher had a 20% performance loss with CUDA 11.0 or higher compared to CUDA 10.2. This issue has been fixed in this release.

Announcements

  • TensorRT 8.0 will be the last TensorRT release that will provide support for Ubuntu 16.04. This also means TensorRT 8.0 will be the last TensorRT release that will support Python 3.5.
  • Python samples use a unified data downloading workflow. Each sample has a YAML file (download.yml) describing the data files, if any, that must be downloaded via the listed links before running the sample. The download tool parses the YAML file and downloads the data files. All other sample code assumes that the data has been downloaded before the code is invoked, and an error is raised if the data is not correctly downloaded. Refer to the Python sample documentation for more information.

Known Issues

  • The diagram in IRNNv2Layer is incorrect. This will be fixed in a future release.
  • There is a known issue that graph capture may fail in some cases for IExecutionContext::enqueue() and IExecutionContext::enqueueV2(). For more information, refer to the documentation for IExecutionContext::enqueueV2(), including how to work around this issue.
  • Some fusions are not enabled when the TensorRT static library is used. This means there is a performance loss of around 10% for networks like BERT and YOLO3 when linking with the static library compared to the dynamic library. The performance loss depends on the precision used and the batch size, and it can be up to 60% in some cases.
  • The UFF parser generates unused IConstantLayer objects that are visible via INetworkDefinition::getLayer but are optimized away by TensorRT, so any attempt to refit their weights with IRefitter::setWeights will be rejected. Given an IConstantLayer* layer, you can detect whether it is used for execution by checking layer->getOutput(0)->isExecutionTensor().
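    For example, a sketch (hypothetical refit loop) of filtering out such constants:

      // Skip constant layers whose outputs were optimized away; refitting
      // their weights would be rejected.
      for (int32_t i = 0; i < network->getNbLayers(); ++i)
      {
          nvinfer1::ILayer* layer = network->getLayer(i);
          if (layer->getType() == nvinfer1::LayerType::kCONSTANT
              && !layer->getOutput(0)->isExecutionTensor())
          {
              continue;
          }
          // ... refit this layer's weights here ...
      }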
  • The ONNX parser does not support RNN, LSTM, and GRU nodes when the activation type of the forward pass does not match the activation type of the reverse pass in bidirectional cases.
  • There is a known performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on Maxwell and Pascal GPUs.
  • There is an up to 8% performance regression compared to TensorRT 7.1 for some networks with heavy FullyConnected operations on Nano.
  • There is an up to 15% performance regression compared to TensorRT 7.2.3 for QuartzNet variants on Volta GPUs.
  • There is an up to 150% performance regression compared to TensorRT 7.2.3 for 3D U-Net variants on NVIDIA Ampere GPUs if the workspace size is limited to 1 GB. Enlarging the workspace size (for example, to 2 GB) can work around this issue.
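    For example (a sketch, assuming an existing IBuilderConfig named config):

      config->setMaxWorkspaceSize(2ULL << 30);  // allow up to 2 GB of builder workspace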
  • There is a known issue that TensorRT selects the kLINEAR format when reformat-free I/O is used with vectorized formats and input/output tensors that have only three dimensions. The workaround is to add an additional dimension of size 1 to make them four-dimensional tensors.
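    A minimal sketch of the workaround (hypothetical tensor name and extents):

      // Declare the I/O tensor as 4-D with a leading size-1 dimension so a
      // vectorized format can be selected; previously it was Dims3{3, 224, 224}.
      nvinfer1::ITensor* input = network->addInput(
          "input", nvinfer1::DataType::kHALF, nvinfer1::Dims4{1, 3, 224, 224});
      input->setAllowedFormats(1U << static_cast<int32_t>(nvinfer1::TensorFormat::kHWC8));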
  • cuTENSOR-based algorithms in TensorRT 8.0 EA are known to have significant performance regressions caused by a CUDA 11.3 compiler issue (5x-10x slower than with CUDA 11.0 builds). The performance should be recovered with a future CUDA release.
  • When running TensorRT 8.0.0 with cuDNN 8.2.0, there is a known performance regression for the deconvolution layer compared to running with previous cuDNN releases. For example, some deconvolution layers can have up to 7x performance regression on Turing GPUs compared to running with cuDNN 8.0.4. This will be fixed in a future cuDNN release.
  • There is a known false alarm reported by the Valgrind memory leak check tool when checking TensorRT applications for potential memory leaks. The recommended way to suppress it is to provide Valgrind with a suppression file containing the following contents.
    {
       Ignore the dlopen false alarm.
       Memcheck:Leak
       ...
       fun:_dl_open
       ...
    }
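    For example, if the contents above are saved as trt.supp (a file name chosen here for illustration), run the check as: valgrind --leak-check=full --suppressions=trt.supp <application>.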
    
  • There is an up to 8% performance regression compared to TensorRT 7.2.3 for DenseNet variants on Volta GPUs.
  • There is an up to 24% performance regression compared to TensorRT 7.2.3 for networks containing Slice layers on Turing GPUs.
  • While using the TensorRT static library, users are still required to have the cuDNN/cuBLAS dynamic libraries installed at runtime. This issue will be resolved in the GA release so that cuDNN/cuBLAS static libraries will always be used instead.
  • An issue was discovered while compiling the TensorRT samples using the TensorRT static libraries with a GCC version older than 5.x. When using RHEL/CentOS 7.x, you may observe a crash with the error message munmap_chunk(): invalid pointer if the patch below is not applied. More details regarding this issue with a workaround for your own application can be found in the TensorRT Sample Support Guide.
    --- a/samples/Makefile.config
    +++ b/samples/Makefile.config
    @@ -331,13 +331,13 @@ $(OUTDIR)/$(OUTNAME_DEBUG) : $(DOBJS) $(CUDOBJS)
     else
     $(OUTDIR)/$(OUTNAME_RELEASE) : $(OBJS) $(CUOBJS)
     	$(ECHO) Linking: $@
    -	$(AT)$(CC) -o $@ $^ $(LFLAGS) -Wl,--start-group $(LIBS) -Wl,--end-group
    +	$(AT)$(CC) -o $@ $(LFLAGS) -Wl,--start-group $(LIBS) $^ -Wl,--end-group
     	# Copy every EXTRA_FILE of this sample to bin dir
     	$(foreach EXTRA_FILE,$(EXTRA_FILES), cp -f $(EXTRA_FILE) $(OUTDIR)/$(EXTRA_FILE); )
     
     $(OUTDIR)/$(OUTNAME_DEBUG) : $(DOBJS) $(CUDOBJS)
     	$(ECHO) Linking: $@
    -	$(AT)$(CC) -o $@ $^ $(LFLAGSD) -Wl,--start-group $(DLIBS) -Wl,--end-group
    +	$(AT)$(CC) -o $@ $(LFLAGSD) -Wl,--start-group $(DLIBS) $^ -Wl,--end-group
     endif
     
     $(OBJDIR)/%.o: %.cpp
    
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS.