TensorRT Release Notes

NVIDIA TensorRT is a C++ library that facilitates high-performance inference on NVIDIA GPUs. It is designed to work in conjunction with deep learning frameworks that are commonly used for training. TensorRT focuses specifically on running an already trained network quickly and efficiently on a GPU for the purpose of generating a result, a process also known as inferencing. These release notes describe the key features, software enhancements and improvements, and known issues for the TensorRT 9.2.0 product package.

For previously released TensorRT documentation, refer to the TensorRT Archives.

1. TensorRT Release 9.x.x

1.1. TensorRT Release 9.2.0

These are the TensorRT 9.2.0 Release Notes and are applicable to x86 Linux users and Arm®-based CPU cores for Server Base System Architecture (SBSA) users on Linux. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

This GA release is for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only. For applications other than LLMs, or other GPU platforms, continue to use TensorRT 8.6.1 for production use.

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Announcements

  • Added support for NVIDIA GH200 Grace Hopper Superchip.

Key Features and Enhancements

This TensorRT release includes the following key features and enhancements.
  • TensorRT 9.2 provides enhanced support for Large Language Models (LLMs). Refer to TensorRT-LLM for a list of supported LLMs.
  • The following C++ API classes were added:
    • ISerializationConfig
  • The following C++ API methods were added:
    • ICudaEngine::createSerializationConfig()
    • ICudaEngine::serializeWithConfig(ISerializationConfig& config)
  • The following C++ API enums were added:
    • SerializationFlag::kEXCLUDE_WEIGHTS
    • SerializationFlag::kEXCLUDE_LEAN_RUNTIME
  • The following Python API classes were added:
    • ISerializationConfig
  • The following Python API methods were added:
    • ICudaEngine::create_serialization_config()
    • ICudaEngine::serialize_with_config(config)
  • The following Python API enums were added:
    • SerializationFlag.EXCLUDE_WEIGHTS
    • SerializationFlag.EXCLUDE_LEAN_RUNTIME
  • Added a new kWEIGHTLESS builder flag to build an engine without saving its weights, with no impact on runtime performance. You must use the refit API to restore the weights before inference.
  • Added a new serializeWithConfig API to serialize the engine with the optional new flags kEXCLUDE_WEIGHTS and kEXCLUDE_LEAN_RUNTIME (see the sketch after this list).
  • Added support for FP32 accumulation (single precision) for inputs/outputs in FP16 (half precision) format.
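
The following is a minimal sketch (Python API) of the weight-stripped serialization path referenced above. It assumes an already built engine object and that ISerializationConfig exposes a set_flag() method analogous to IBuilderConfig; exact signatures may differ slightly in your TensorRT build.

    import tensorrt as trt

    def serialize_without_weights(engine: trt.ICudaEngine) -> trt.IHostMemory:
        # Create a serialization config from the built engine (new in TensorRT 9.2).
        config = engine.create_serialization_config()
        # Request a plan without weights; restore them with the refit API before inference.
        config.set_flag(trt.SerializationFlag.EXCLUDE_WEIGHTS)
        return engine.serialize_with_config(config)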

Breaking API Changes

  • Attention: In TensorRT 9.0, due to the introduction of INT64 as a supported data type, ONNX models with INT64 I/O require INT64 bindings. Note that prior to this release, such models required INT32 bindings (a short example follows this list).
  • In TensorRT 9.0, we removed ICaffeParser, IUffParser, and related classes and functions. The following APIs are removed:
    • nvcaffeparser1::IBlobNameToTensor
    • nvcaffeparser1::IBinaryProtoBlob
    • nvcaffeparser1::IPluginFactoryV2
    • nvcaffeparser1::ICaffeParser
    • nvcaffeparser1::createCaffeParser
    • nvcaffeparser1::shutdownProtobufLibrary
    • createNvCaffeParser_INTERNAL
    • nvinfer1::utils::reshapeWeights
    • nvinfer1::utils::reorderSubBuffers
    • nvinfer1::utils::transposeSubBuffers
    • nvuffparser::UffInputOrder
    • nvuffparser::FieldType
    • nvuffparser::FieldMap
    • nvuffparser::FieldCollection
    • nvuffparser::IUffParser
    • nvuffparser::createUffParser
    • nvuffparser::shutdownProtobufLibrary
    • createNvUffParser_INTERNAL
  • With the removal of ICaffeParser and IUffParser, the libnvparsers library is removed.
  • uff, graphsurgeon, and related networks are removed from TensorRT packages.
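
As a short illustration of the INT64 binding change called out above, the sketch below allocates a host buffer whose dtype follows the engine's reported tensor type; the tensor name and helper are hypothetical, and the standard TensorRT Python bindings are assumed.

    import numpy as np
    import tensorrt as trt

    def allocate_host_input(engine: trt.ICudaEngine, name: str) -> np.ndarray:
        # For ONNX models with INT64 I/O, get_tensor_dtype now reports INT64,
        # so the host buffer must be int64 rather than int32.
        dtype = trt.nptype(engine.get_tensor_dtype(name))
        shape = tuple(engine.get_tensor_shape(name))
        return np.zeros(shape, dtype=dtype)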

Deprecated API Lifetime

  • APIs deprecated in TensorRT 9.2 will be retained until at least 11/2024.
  • APIs deprecated in TensorRT 9.1 will be retained until at least 10/2024.
  • APIs deprecated in TensorRT 9.0 will be retained until at least 8/2024.
  • APIs deprecated in TensorRT 8.6 will be retained until at least 2/2024.
  • APIs deprecated in TensorRT 8.5 will be retained until at least 9/2023.
  • APIs deprecated in TensorRT 8.4 or before will be removed in TensorRT 10.0.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax, where the mode is chosen automatically based on the shape of the input tensor:
    • the first mode triggers when all nonbatch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may result in errors of a small degree. Also, batch size greater than 1 is supported only for DLA 3.9.0 and later. Refer to DLA Supported Layers for more information.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as, "Unable to load library: nvinfer_builder_resource.dll", if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • The DLA compiler is capable of removing identity transposes, but it cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshape). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes and an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless you merge the transposes manually in the model definition in advance.
  • In explicitly quantized networks, a group convolution that has a Q/DQ pair before but no Q/DQ pair after is expected to run with INT8-IN-FP32-OUT mixed precision. However, on NVIDIA Hopper™ it may fall back to FP32-IN-FP32-OUT if the input channel count is small. This will be fixed in a future release.
  • On PowerPC platforms, samples that depend on TensorFlow, ONNX Runtime, and PyTorch are unable to run due to missing Python module dependencies. These frameworks have not been built for PowerPC and/or published to standard repositories.
  • nvinfer1::UnaryOperation::kROUND or nvinfer1::UnaryOperation::kSIGN operations of IUnaryLayer are not supported in the implicit batch mode.
  • For networks containing normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset that contains the corresponding function ops, for example, opset 17 for LayerNormalization or opset 18 for GroupNormalization. Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers.
  • QuantizeLayer and DequantizeLayer only support FP32 scale and data, even when using ONNX opset 19. If the input is not FP32, you must add a Cast to FP32 on the input to QuantizeLayer, and a Cast from FP32 at the output of DequantizeLayer (see the sketch after this list).
  • EngineInspector::getLayerInformation may return incomplete JSON data for some engines produced by TensorRT 9.0. When this happens, TensorRT Engine Explorer cannot be used to analyze the engine or generate a graph of the engine layers.
  • The released Windows DLLs are built with the MT_Static flag. TensorRT will switch back to the MT_Dynamic flag in the next major release.
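
Below is a minimal sketch of the Cast workaround for the Q/DQ limitation above. It assumes an existing INetworkDefinition named network with a non-FP32 tensor x and an FP32 scale tensor scale, and that the cast layer (add_cast) is available in this release; the names are illustrative only.

    # Cast a non-FP32 input to FP32 before QuantizeLayer, and cast the
    # DequantizeLayer output back to the original type if needed downstream.
    cast_in = network.add_cast(x, trt.DataType.FLOAT)
    q = network.add_quantize(cast_in.get_output(0), scale)
    dq = network.add_dequantize(q.get_output(0), scale)
    cast_out = network.add_cast(dq.get_output(0), trt.DataType.HALF)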

Deprecated and Removed Features

The following features have been deprecated or removed in TensorRT 9.0. Some features that were planned for removal in 9.0 but have not yet been removed may be removed in TensorRT 10.0.
  • TensorRT samples and open source demos are no longer supported on Python < 3.8.
  • Ubuntu 18.04 has reached end of life and is no longer supported by TensorRT starting with TensorRT 9.0.
  • The following plugins were deprecated:
    • BatchedNMS_TRT
    • BatchedNMSDynamic_TRT
    • BatchTilePlugin_TRT
    • Clip_TRT
    • CoordConvAC
    • CropAndResize
    • EfficientNMS_ONNX_TRT
    • CustomGeluPluginDynamic
    • LReLU_TRT
    • NMSDynamic_TRT
    • NMS_TRT
    • Normalize_TRT
    • Proposal
    • SingleStepLSTMPlugin
    • SpecialSlice_TRT
    • Split
  • The following C++ API classes were deprecated:
    • NvUtils
  • The following C++ API methods were deprecated:
    • nvinfer1::INetworkDefinition::addFill(nvinfer1::Dims dimensions, nvinfer1::FillOperation
              op)
      - Only the 2-parameter version of this function is deprecated.
    • nvinfer1::INetworkDefinition::addDequantize(nvinfer1::ITensor &input, nvinfer1::ITensor
              &scale)
      - Only the 2-parameter version of this function is deprecated.
    • nvinfer1::INetworkDefinition::addQuantize(nvinfer1::ITensor &input, nvinfer1::ITensor
              &scale)
      - Only the 2-parameter version of this function is deprecated.
  • The following C++ API enums were deprecated:
    • nvinfer1::TacticSource::kCUBLAS_LT
    • nvonnxparser::OnnxParserFlag::kNATIVE_INSTANCENORM
  • The following Python API methods were deprecated (a sketch of the replacement 3-parameter overloads follows this list):
    • INetworkDefinition.add_fill(shape, op)
    • INetworkDefinition.add_dequantize(input, scale)
  • The following Python API enums were deprecated:
    • TacticSource.CUBLAS_LT
    • OnnxParserFlag.NATIVE_INSTANCENORM
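
As a reference for migrating off the deprecated 2-parameter forms, the following sketch uses the 3-parameter overloads that accept an explicit output type. It assumes an existing network, an input tensor x, and an FP32 scale tensor scale, all of which are illustrative.

    # The explicit output_type argument replaces the deprecated 2-parameter overloads.
    q = network.add_quantize(x, scale, trt.DataType.INT8)
    dq = network.add_dequantize(q.get_output(0), scale, trt.DataType.FLOAT)
    # add_fill also gains an output_type; alpha/beta are still set on the returned layer.
    fill = network.add_fill((1, 8), trt.FillOperation.LINSPACE, trt.DataType.FLOAT)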

Fixed Issues

  • There was a 1x FP32 model CPU memory leak from onnx.export, tracked in issue 106976, for the TensorRT open source software HuggingFace demo. You may have encountered higher peak CPU memory usage for fresh runs, which dropped for cached-ONNX runs. This issue has been fixed.
  • The compute sanitizer synccheck tool may have reported a false alarm when running conv and GEMM kernels with CGA size >= 4 on NVIDIA Hopper GPUs. This issue has been fixed.
  • When using cuDNN 8.9.2.26 and the TensorRT RNNv2 API, the compute sanitizer from CUDA Toolkit 12.2 may have reported race conditions in CUTLASS kernels. This issue has been fixed.
  • There was a known floating point exception when running BERT networks on H100 multi-instance GPU (MIG). This issue has been fixed.
  • TensorRT did not preserve precision for operations imported from ONNX models when using weakly typed networks. The fix is to import networks as strongly typed, which preserves precision for operations.

Known Issues

Functional
  • CUDA compute sanitizer may report racecheck hazards for some legacy kernels written directly in SASS; however, the related kernels do not have functional issues at runtime.
  • There is a known issue that the compute sanitizer in CUDA Toolkit 12.3 might cause the target application to crash.
  • The compute sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored and will be fixed in an upcoming CUDA release.
  • Multihead attention fusion might not happen and affect performance if the number of heads is small.
  • If an ONNX model contains a Range operator and its limit input is a data-dependent tensor, engine building will likely fail.
  • Hardware forward compatibility (HFC) is broken on L4T Concord for ViT, Swin-Transformers, and BERT networks in FP16 mode. A workaround is to only use FP32 mode on L4T Concord or turn off HFC.
  • Compute Sanitizer from CUDA Toolkit 12.0/12.1 may report a false alarm about invalid memory access in generatedNativePointwise kernels.
  • If a network has a tensor of type bool with an implicitly data-dependent shape, engine building will likely fail.
  • There is an occurrence of a use-after-free in NVRTC that has been fixed in CUDA 12.1. When using NVRTC from CUDA 12.0 together with the TensorRT static library, you may encounter a crash in certain scenarios. Linking with the NVRTC and PTXJIT compiler from CUDA 12.1 or newer will resolve this issue.
  • Although the version compatible runtime is optimized for efficiency, it may result in slower performance than the full runtime in certain use cases. Most networks can expect no more than a 10% slowdown when using a version-compatible engine compared to a version-locked engine. However, in some cases, a larger performance drop may occur. For example:
    • When running ResNet50_v2 with QAT, there may be up to a 22% decrease in performance.
    • When running DynUNet in FP16 precision, there may be up to a 32% decrease in performance.
  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. To suppress these issues, provide a Valgrind suppression file with the following contents when running the tool, and add the option --keep-debuginfo=yes to the Valgrind command line.
    {
      Memory leak errors with dlopen.
       Memcheck:Leak
       match-leak-kinds: definite
       ...
       fun:*dlopen*
       ...
    }
    {
        Memory leak errors with nvrtc
        Memcheck:Leak
        match-leak-kinds: definite
        fun:malloc
        obj:*libnvrtc.so*
        ...
    }
    
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
  • Hardware compatible engines built with CUDA versions older than 11.5 may crash during inference when run on a GPU with a compute capability lower than that of the GPU where the engine was built. A workaround is to build on the GPU with the lowest compute capability.
  • For broadcasting elementwise layers running on DLA with GPU fallback enabled with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. It is recommended to explicitly set the input formats of such elementwise layers to different tensor formats.
Performance
  • There is an up to 9% performance drop for BERT networks with gelu_erf activation in BF16 precision compared to TensorRT 9.1 on NVIDIA Ampere GPUs.
  • There is an up to 40 second increase in engine building for BART networks on NVIDIA Hopper GPUs.
  • There is an accuracy drop when running the OSS HuggingFace demo gptj-6b model with batch size > 1.
  • There are up to 14% context memory usage fluctuations compared to TensorRT 9.1 when building the engine for 3DUnet networks due to different tactics being selected.
  • There is an up to 20 second increase in engine building for some large language models (LLMs) on NVIDIA Ampere GPUs.
  • There is an up to 6% performance drop for BERT networks in FP32 precision compared to TensorRT 9.0 on NVIDIA Volta GPUs.
  • There are up to 21% peak GPU memory usage fluctuations when building the engine for the same network back to back due to different tactics being selected.
  • There is an up to 11% performance drop for ViT networks in TF32 precision compared to TensorRT 9.0 on NVIDIA Ampere GPUs.
  • There is an up to 12% performance drop for BERT networks in FP16 precision compared to TensorRT 9.0 on NVIDIA Ada Lovelace GPUs.
  • There is an up to 2.5x build time increase compared to TensorRT 9.0 for certain BERT-like models due to additional tactics available for evaluation.
  • There is an up to 13% performance drop for the CortanaASR model on NVIDIA Ampere GPUs compared to TensorRT 8.5.
  • There is a known performance regression in the grouped deconvolution layer due to disabling cuDNN tactics. In TensorRT 8.6, performance can be recovered by unsetting nvinfer1::PreviewFeature::kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805. We will close the performance gap in a future release.
  • There is an up to 27% performance drop for the SegResNet model on Ampere GPUs compared to TensorRT 8.6 EA. This drop can be avoided by enabling the kVERSION_COMPATIBLE flag in the ONNX parser.
  • There is an up to 18% performance drop for the ShuffleNet model on A30/A40 compared to TensorRT 8.5.1.
  • Convolution on a tensor with an implicitly data-dependent shape may run significantly slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.
  • For some Transformer models, including ViT, Swin-Transformer, and DETR, there is a performance drop in INT8 precision (including both explicit and implicit quantization) compared to FP16 precision.
  • There is an up to 30% performance regression for LSTM variants with dynamic shapes. This issue can be resolved by disabling the kFASTER_DYNAMIC_SHAPES_0805 preview feature in TensorRT 8.6.
  • There is a known issue on H100 that may lead to GPU hang when running TensorRT with high persistentCache usage. Limit the usage to 40% of L2 cache size as a workaround.
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to when running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows a 32% drop when the network runs on DLA compared to when the last SoftMax layer runs on a GPU.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • H100 performance for some LSTMs in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.
  • There is an up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.
  • There is an up to 23% performance regression compared to TensorRT 8.5 on LSTMs in FP32 precision when dynamic shapes are used on NVIDIA Turing GPUs. Set the kFASTER_DYNAMIC_SHAPES_0805 preview flag to false as a workaround.
  • There is an up to 23% performance regression compared to TensorRT 8.5 on Temporal Fusion Transformers in FP32 precision on NVIDIA Turing and NVIDIA Ampere GPUs.
  • A higher builder optimization level does not always give better performance than a lower one; this can happen on all platforms and the regression can be up to 27%. The workaround is to build the engine using a lower builder optimization level (see the sketch after this list).
  • There is an up to 70% performance regression compared to TensorRT 8.6 on BERT networks in INT8 precision with FP16 disabled on L4 GPUs. To work around this, enable FP16 and disable INT8 in the builder config.
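
Several of the workarounds above amount to builder-config settings. The sketch below, assuming an existing builder and parsed network, shows how those settings might look in the Python API; treat it as illustrative rather than a prescribed configuration.

    config = builder.create_builder_config()
    # Workaround for the optimization-level regression: build with a lower level (default is 3).
    config.builder_optimization_level = 2
    # Workaround for the dynamic-shape LSTM regressions: disable the preview feature.
    config.set_preview_feature(trt.PreviewFeature.FASTER_DYNAMIC_SHAPES_0805, False)
    # Workaround for the INT8 BERT regression on L4: prefer FP16 over INT8.
    config.set_flag(trt.BuilderFlag.FP16)
    engine_bytes = builder.build_serialized_network(network, config)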

1.2. TensorRT Release 9.1.0

These are the TensorRT 9.1.0 Release Notes and are applicable to x86 Linux users and Arm®-based CPU cores for Server Base System Architecture (SBSA) users on Linux. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

This GA release is for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only. For applications other than LLMs, or other GPU platforms, continue to use TensorRT 8.6.1 for production use.

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Announcements

  • Added support for NVIDIA GH200 Grace Hopper Superchip.

Key Features and Enhancements

This TensorRT release includes the following key features and enhancements.
  • TensorRT 9.1 provides enhanced support for Large Language Models (LLMs), including the following networks:
    • GPT
    • GPT-J
    • GPT-Neo
    • GPT-NeoX
    • BART
    • Bloom
    • Bloomz
    • OPT
    • T5
    • FLAN-T5
  • Added support for bfloat16 data types on NVIDIA Ampere GPUs and newer architectures.
  • Added support for E4M3 FP8 data type on NVIDIA Hopper GPUs using explicit quantization. This allows utilizing TensorRT with TransformerEngine based FP8 models.
  • Added the NeMo demo into the TensorRT OSS repository to demonstrate the performance benefit of using E4M3 FP8 data type with the GPT models trained with the NVIDIA NeMo Toolkit and TransformerEngine.
  • Added bfloat16 and FP8 I/O datatypes in plugins.
  • Added support for networks running in INT8 precision where Smooth Quantization is used. Smooth Quantization is an approach that improves the accuracy of INT8 compute for LLMs.
  • Added support for INT64 data type. The ONNX parser no longer automatically casts INT64 to INT32.
  • Added support for ONNX local functions when parsing ONNX models with the ONNX parser.
  • Added support for caching JIT-compiled code. It can be disabled by setting BuilderFlag::kDISABLE_COMPILATION_CACHE. The compilation cache is part of the timing cache; it caches JIT-compiled code and is serialized as part of the timing cache by default, which may significantly increase the cache size.
  • Added NetworkDefinitionCreationFlag::kSTRONGLY_TYPED. A strongly typed network faithfully translates the type specification of the network definition to the built engine (see the sketch after this list).
  • Added new refit APIs to accept weights from GPU and refit engines asynchronously. Execution contexts become reusable after refitting.
  • Added two new TensorRT samples, sampleProgressMonitor (C++) and simple_progress_reporter (Python), which demonstrate using the Progress Monitor during engine build. For more information, refer to the NVIDIA TensorRT Samples Support Guide.
  • Added support for Python-based TensorRT plugin definitions. The TensorRT sample python_plugin has been added with a few examples demonstrating Python-based plugins. For more information, refer to the Adding Custom Layers using the Python API section in the NVIDIA TensorRT Developer Guide.
  • The following C++ API classes were added:
    • IProgressMonitor
  • The following C++ API methods were added:
    • INetworkDefinition::IFillLayer* addFill(Dims dimensions, FillOperation op, DataType outputType)
    • INetworkDefinition::IQuantizeLayer* addQuantize(ITensor& input, ITensor& scale, DataType
              outputType)
    • INetworkDefinition::IDequantizeLayer* addDequantize(ITensor& input, ITensor& scale,
              DataType outputType)
    • IBuilder::setProgressMonitor(IProgressMonitor* monitor)
    • IBuilder::getProgressMonitor()
    • IFillLayer::setAlphaInt64(int64_t alpha)
    • IFillLayer::getAlphaInt64()
    • IFillLayer::setBetaInt64(int64_t beta)
    • IFillLayer::getBetaInt64()
    • IFillLayer::isAlphaBetaInt64()
    • IFillLayer::getToType()
    • IFillLayer::setToType(DataType toType)
    • IQuantizeLayer::getToType()
    • IQuantizeLayer::setToType(DataType toType)
    • IDequantizeLayer::getToType()
    • IDequantizeLayer::setToType(DataType toType)
    • INetworkDefinition::getFlags()
    • INetworkDefinition::getFlag(NetworkDefinitionCreationFlag
              networkDefinitionCreationFlag)
    • IRefitter::setNamedWeights(char const* name, Weights weights, TensorLocation
              location)
    • IRefitter::getNamedWeights(char const* weightsName)
    • IRefitter::getWeightsLocation(char const* weightsName)
    • IRefitter::unsetNamedWeights(char const* weightsName)
    • IRefitter::setWeightsValidation(bool weightsValidation)
    • IRefitter::getWeightsValidation()
    • IRefitter::refitCudaEngineAsync(cudaStream_t stream)
    • nvonnxparser::IParserError::nodeName()
    • nvonnxparser::IParserError::nodeOperator()
  • The following C++ API enums were added:
    • IBuilderFlag::kFP8
    • IBuilderFlag::kERROR_ON_TIMING_CACHE_MISS
    • IBuilderFlag::kBF16
    • IBuilderFlag::kDISABLE_COMPILATION_CACHE
    • DataType::kBF16
    • DataType::kINT64
    • NetworkDefinitionCreationFlag::kSTRONGLY_TYPED
    • nvonnxparser::ErrorCode::kUNSUPPORTED_NODE_ATTR
    • nvonnxparser::ErrorCode::kUNSUPPORTED_NODE_INPUT
    • nvonnxparser::ErrorCode::kUNSUPPORTED_NODE_DATATYPE
    • nvonnxparser::ErrorCode::kUNSUPPORTED_NODE_DYNAMIC
    • nvonnxparser::ErrorCode::kUNSUPPORTED_NODE_SHAPE
  • The following Python API classes were added:
    • IProgressMonitor
  • The following Python API methods were added:
    • INetworkDefinition.add_fill(shape, op, output_type)
    • INetworkDefinition.add_dequantize(input, scale, output_type)
    • INetworkDefinition.add_quantize(input, scale, output_type)
    • INetworkDefinition.get_flag
    • IRefitter.set_named_weights(name, weights, location)
    • IRefitter.get_named_weights(weights_name)
    • IRefitter.get_weights_location(weights_name)
    • IRefitter.unset_named_weights(weights_name)
    • IRefitter.refit_cuda_engine_async(stream_handle)
  • The following Python API attributes were added:
    • IBuilder.progress_monitor
    • IFillLayer.is_alpha_beta_int64
    • IFillLayer.to_type
    • IQuantizeLayer.to_type
    • IDequantizeLayer.to_type
    • INetworkDefinition.flags
    • IRefitter.weights_validation
    • IParserError.node_name
    • IParserError.node_operator
  • The following Python API enums were added:
    • IBuilderFlag.FP8
    • IBuilderFlag.ERROR_ON_TIMING_CACHE_MISS
    • IBuilderFlag.BF16
    • IBuilderFlag.DISABLE_COMPILATION_CACHE
    • DataType.BF16
    • DataType.INT64
    • NetworkDefinitionCreationFlag.STRONGLY_TYPED
    • ErrorCode.UNSUPPORTED_NODE_ATTR
    • ErrorCode.UNSUPPORTED_NODE_INPUT
    • ErrorCode.UNSUPPORTED_NODE_DATATYPE
    • ErrorCode.UNSUPPORTED_NODE_DYNAMIC
    • ErrorCode.UNSUPPORTED_NODE_SHAPE
  • The IEngineInspector now prints more detailed layer information for LSTM and Transformer networks.
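
The snippet below is a minimal sketch of creating a strongly typed network with the new creation flag; it assumes only a logger and the standard Python bindings.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    # In a strongly typed network, tensor types in the definition are translated
    # faithfully to the built engine instead of being chosen by the builder.
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
    )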

Breaking API Changes

  • Attention: In TensorRT 9.0, due to the introduction of INT64 as a supported data type, ONNX models with INT64 I/O require INT64 bindings. Note that prior to this release, such models required INT32 bindings.
  • In TensorRT 9.0, we removed ICaffeParser, IUffParser, and related classes and functions. The following APIs are removed:
    • nvcaffeparser1::IBlobNameToTensor
    • nvcaffeparser1::IBinaryProtoBlob
    • nvcaffeparser1::IPluginFactoryV2
    • nvcaffeparser1::ICaffeParser
    • nvcaffeparser1::createCaffeParser
    • nvcaffeparser1::shutdownProtobufLibrary
    • createNvCaffeParser_INTERNAL
    • nvinfer1::utils::reshapeWeights
    • nvinfer1::utils::reorderSubBuffers
    • nvinfer1::utils::transposeSubBuffers
    • nvuffparser::UffInputOrder
    • nvuffparser::FieldType
    • nvuffparser::FieldMap
    • nvuffparser::FieldCollection
    • nvuffparser::IUffParser
    • nvuffparser::createUffParser
    • nvuffparser::shutdownProtobufLibrary
    • createNvUffParser_INTERNAL
  • With the removal of ICaffeParser and IUffParser, the libnvparsers library is removed.
  • uff, graphsurgeon, and related networks are removed from TensorRT packages.

Deprecated API Lifetime

  • APIs deprecated in TensorRT 9.1 will be retained until at least 10/2024.
  • APIs deprecated in TensorRT 9.0 will be retained until at least 8/2024.
  • APIs deprecated in TensorRT 8.6 will be retained until at least 2/2024.
  • APIs deprecated in TensorRT 8.5 will be retained until at least 9/2023.
  • APIs deprecated in TensorRT 8.4 or before will be removed in TensorRT 10.0.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax, where the mode is chosen automatically based on the shape of the input tensor:
    • the first mode triggers when all nonbatch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may result in errors of a small degree. Also, batch size greater than 1 is supported only for DLA 3.9.0 and later. Refer to DLA Supported Layers for more information.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as, "Unable to load library: nvinfer_builder_resource.dll", if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • The DLA compiler is capable of removing identity transposes, but it cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshape). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes and an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless you merge the transposes manually in the model definition in advance.
  • In explicitly quantized networks, a group convolution that has a Q/DQ pair before but no Q/DQ pair after is expected to run with INT8-IN-FP32-OUT mixed precision. However, on NVIDIA Hopper™ it may fall back to FP32-IN-FP32-OUT if the input channel count is small. This will be fixed in a future release.
  • On PowerPC platforms, samples that depend on TensorFlow, ONNX Runtime, and PyTorch are unable to run due to missing Python module dependencies. These frameworks have not been built for PowerPC and/or published to standard repositories.
  • nvinfer1::UnaryOperation::kROUND or nvinfer1::UnaryOperation::kSIGN operations of IUnaryLayer are not supported in the implicit batch mode.
  • For networks containing normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset that contains the corresponding function ops, for example, opset 17 for LayerNormalization or opset 18 for GroupNormalization. Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers (see the export sketch after this list).
  • QuantizeLayer and DequantizeLayer only support FP32 scale and data, even when using ONNX opset 19. If the input is not FP32, you must add a Cast to FP32 on the input to QuantizeLayer, and a Cast from FP32 at the output of DequantizeLayer.
  • EngineInspector::getLayerInformation may return incomplete JSON data for some engines produced by TensorRT 9.0. When this happens, TensorRT Engine Explorer cannot be used to analyze the engine or generate a graph of the engine layers.
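
As a companion to the normalization-layer guidance above, the following sketch exports a PyTorch model with opset 17 so that LayerNormalization is emitted as a function op; the model and input names are placeholders.

    import torch

    # opset 17 carries LayerNormalization as a function op; use opset 18 for GroupNormalization.
    torch.onnx.export(model, example_input, "model_opset17.onnx", opset_version=17)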

Deprecated and Removed Features

The following features have been deprecated or removed in TensorRT 9.0. Some features that were planned for removal in 9.0 but have not yet been removed may be removed in TensorRT 10.0.
  • TensorRT samples and open source demos are no longer supported on Python < 3.8.
  • Ubuntu 18.04 has reached end of life and is no longer supported by TensorRT starting with TensorRT 9.0.
  • The following plugins were deprecated:
    • BatchedNMS_TRT
    • BatchedNMSDynamic_TRT
    • BatchTilePlugin_TRT
    • Clip_TRT
    • CoordConvAC
    • CropAndResize
    • EfficientNMS_ONNX_TRT
    • CustomGeluPluginDynamic
    • LReLU_TRT
    • NMSDynamic_TRT
    • NMS_TRT
    • Normalize_TRT
    • Proposal
    • SingleStepLSTMPlugin
    • SpecialSlice_TRT
    • Split
  • The following C++ API classes were deprecated:
    • NvUtils
  • The following C++ API methods were deprecated:
    • nvinfer1::INetworkDefinition::addFill(nvinfer1::Dims dimensions, nvinfer1::FillOperation
              op)
      - Only the 2-parameter version of this function is deprecated.
    • nvinfer1::INetworkDefinition::addDequantize(nvinfer1::ITensor &input, nvinfer1::ITensor
              &scale)
      - Only the 2-parameter version of this function is deprecated.
    • nvinfer1::INetworkDefinition::addQuantize(nvinfer1::ITensor &input, nvinfer1::ITensor
              &scale)
      - Only the 2-parameter version of this function is deprecated.
  • The following C++ API enums were deprecated:
    • nvinfer1::TacticSource::kCUBLAS_LT
    • nvonnxparser::OnnxParserFlag::kNATIVE_INSTANCENORM
  • The following Python API methods were deprecated:
    • INetworkDefinition.add_fill(shape, op)
    • INetworkDefinition.add_dequantize(input, scale)
  • The following Python API enums were deprecated:
    • TacticSource.CUBLAS_LT
    • OnnxParserFlag.NATIVE_INSTANCENORM

Fixed Issues

  • There was an up to 9% performance regression compared to TensorRT 8.5 on Yolov3 batch size 1 in FP32 precision on NVIDIA Ada Lovelace GPUs. This issue has been fixed.
  • There was an up to 13% performance regression compared to TensorRT 8.5 on GPT2 without kv-cache in FP16 precision when dynamic shapes are used on NVIDIA Volta and NVIDIA Ampere GPUs. The workaround was to set the kFASTER_DYNAMIC_SHAPES_0805 preview flag to false. This issue has been fixed.
  • For transformer-based networks such as BERT and GPT, TensorRT could consume CPU memory up to 2 times the model size during compilation plus any additional formats timed. This issue has been fixed.
  • In some cases, when using the OBEY_PRECISION_CONSTRAINTS builder flag and the required type was set to FP32, the network could fail with a missing tactic due to an incorrect optimization converting the output of an operation to FP16. This could be resolved by removing the OBEY_PRECISION_CONSTRAINTS option. This issue has been fixed.
  • TensorRT could fail on devices with small RAM and swap space. This could be resolved by ensuring the RAM and swap space is at least 7 times the size of the network. For example, at least 21 GB of combined CPU memory for a 3 GB network. This issue has been fixed.
  • If an IShapeLayer was used to get the output shape of an INonZeroLayer, engine building would likely fail. This issue has been fixed.
  • If a one-dimension INT8 input was used for a Unary or ElementWise operation, engine building would throw an internal error “Could not find any implementation for node”. This issue has been fixed.
  • Using the compute sanitizer racecheck tool could cause the process to be terminated unexpectedly. The root cause was a wrong false alarm. The issue could be bypassed with --kernel-regex-exclude kns=scudnn_winograd. This issue has been fixed.
  • TensorRT in FP16 mode did not perform cast operations correctly when only the output types were set, but not the layer precisions. This issue has been fixed.
  • There was a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform. This issue has been fixed.
  • When using DLA, INT8 convolutions followed by FP16 layers could cause accuracy degradation. In such cases, you could either change the convolution to FP16 or the subsequent layer to INT8. This issue has been fixed.
  • Using the compute sanitizer tool from CUDA 12.0 could report a cudaErrorIllegalInstruction error on Hopper GPUs in unusual scenarios. This could be ignored. This issue has been fixed.
  • There was an up to 53% engine build time increase for ViT networks in FP16 precision compared to TensorRT 8.6 on NVIDIA Ampere GPUs. This issue has been fixed.
  • There was an up to 93% engine build time increase for BERT networks in FP16 precision compared to TensorRT 8.6 on NVIDIA Ampere GPUs. This issue has been fixed.
  • There was an up to 22% performance drop for GPT2 networks in FP16 precision with kv-cache stacked together compared to TensorRT 8.6 on NVIDIA Ampere GPUs. This issue has been fixed.
  • There was an up to 28% performance regression compared to TensorRT 8.5 on Transformer networks in FP16 precision on NVIDIA Volta GPUs, and up to 85% performance regression on NVIDIA Pascal GPUs. Disabling the kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 preview flag was a workaround. This issue has been fixed.
  • There was a known hang on NVIDIA A2 GPUs that could impact multiple networks. This issue has been fixed.
  • When using hardware compatibility features, TensorRT would potentially fail while compiling transformer based networks such as BERT. This issue has been fixed.
  • The ONNX parser incorrectly used the InstanceNormalization plugin when processing ONNX InstanceNormalization layers with FP16 scale and bias inputs, which led to unexpected NaN and infinite outputs. This issue has been fixed.
  • There was an up to 12% performance drop for WaveRNN networks in TF32 precision compared to TensorRT 8.6 on NVIDIA Ampere GPUs.
  • There was an up to 6% performance regression compared to TensorRT 8.6 on BART networks with kv-cache stacked together in FP16 precision on H100 GPUs.
  • A RHEL/CentOS 7 RPM package built with CUDA 12.2 for cuDNN 8.9.4 was not published to the CUDA network repo at the time of the TensorRT 9.0 release. When installing TensorRT on RHEL/CentOS 7 using the CUDA network repo, the cuDNN 8.9.4 RPM package built with CUDA 11.8 was installed instead. This has been resolved and the missing cuDNN 8.9.4 RPM package for RHEL/CentOS 7 and CUDA 12.2 has been published.
  • There was an up to 25% CPU memory usage increase for ViT networks when building the engine in FP16 precision compared to TensorRT 8.6 on NVIDIA Ampere GPUs. This issue has been fixed.
  • For some Transformer models, including ViT, Swin-Transformer, and DETR, there was a performance drop in INT8 precision (including both explicit and implicit quantization) compared to FP16 precision. This issue has been fixed.
  • There was an up to 10% performance drop for the SegResNet network compared to TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere Architecture GPUs due to a cuDNN regression in the InstanceNormalization plug-in. This issue has been fixed.
  • There was an up to 7% performance regression compared to TensorRT 8.5 on CortanaASR networks in FP16 precision on NVIDIA Volta GPUs. This issue has been fixed.
  • For enqueue-bound workloads, the latency of the workloads could have been longer in Windows than in Linux operating systems. This issue has been fixed.
  • Compared to TensorRT 8.6, TensorRT 9.0 had more aggressive multi-head attention (MHA) fusions. While this is beneficial in most cases, it caused up to 7% performance regression when the workload was too small. This issue has been fixed.
  • There may have been minor performance regressions when running ONNX models with InstanceNormalization operators in version compatible mode. This issue has been fixed.
  • There was an up to 8% performance regression compared to TensorRT 8.6 on PilotNet4 network on H100 GPUs. This issue has been fixed.
  • There was an up to 6% performance regression compared to TensorRT 8.6 on the SqueezeNet network in TF32 precision on H100 GPUs. This issue has been fixed.
  • TensorRT now preserves precision for operations that are imported from ONNX models when using strongly typed networks.
  • When running ONNX networks with InstanceNormalization operations, there could have been up to a 50% decrease in performance. This issue has been fixed.

Known Issues

Functional
  • There is a known floating point exception when running BERT networks on H100 multi-instance GPU (MIG).
  • TensorRT does not preserve precision for operations that are imported from ONNX models if using weakly typed networks.
  • The compute sanitizer synccheck tool may report a false alarm when running conv and GEMM kernels with CGA size >= 4 on NVIDIA Hopper GPUs.
  • The compute sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored and will be fixed in an upcoming CUDA release.
  • When using cuDNN 8.9.2.26 and the TensorRT RNNv2 API, the compute sanitizer from CUDA Toolkit 12.2 may report race conditions in CUTLASS kernels.
  • Multihead attention fusion might not happen and affect performance if the number of heads is small.
  • If an ONNX model contains a Range operator and its limit input is a data-dependent tensor, engine building will likely fail.
  • Hardware forward compatibility (HFC) is broken on L4T Concord for ViT, Swin-Transformers, and BERT networks in FP16 mode. A workaround is to only use FP32 mode on L4T Concord or turn off HFC.
  • Compute Sanitizer from CUDA Toolkit 12.0/12.1 may report a false alarm about invalid memory access in generatedNativePointwise kernels.
  • If a network has a tensor of type bool with an implicitly data-dependent shape, engine building will likely fail.
  • There is an occurrence of a use-after-free in NVRTC that has been fixed in CUDA 12.1. When using NVRTC from CUDA 12.0 together with the TensorRT static library, you may encounter a crash in certain scenarios. Linking with the NVRTC and PTXJIT compiler from CUDA 12.1 or newer will resolve this issue.
  • Although the version compatible runtime is optimized for efficiency, it may result in slower performance than the full runtime in certain use cases. Most networks can expect no more than a 10% slowdown when using a version-compatible engine compared to a version-locked engine. However, in some cases, a larger performance drop may occur. For example:
    • When running ResNet50_v2 with QAT, there may be up to a 22% decrease in performance.
    • When running DynUNet in FP16 precision, there may be up to a 32% decrease in performance.
  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. To suppress these issues, provide a Valgrind suppression file with the following contents when running the tool, and add the option --keep-debuginfo=yes to the Valgrind command line.
    {
      Memory leak errors with dlopen.
       Memcheck:Leak
       match-leak-kinds: definite
       ...
       fun:*dlopen*
       ...
    }
    {
        Memory leak errors with nvrtc
        Memcheck:Leak
        match-leak-kinds: definite
        fun:malloc
        obj:*libnvrtc.so*
        ...
    }
    
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
  • Hardware compatible engines built with CUDA versions older than 11.5 may crash during inference when run on a GPU with a compute capability lower than that of the GPU where the engine was built. A workaround is to build on the GPU with the lowest compute capability.
Performance
  • There is an up to 20 second increase in engine building for some large language models (LLMs) on NVIDIA Ampere GPUs.
  • There is an up to 6% performance drop for BERT networks in FP32 precision compared to TensorRT 9.0 on NVIDIA Volta GPUs.
  • There are up to 21% peak GPU memory usage fluctuations when building the engine for the same network back to back due to different tactics being selected.
  • There is an up to 11% performance drop for ViT networks in TF32 precision compared to TensorRT 9.0 on NVIDIA Ampere GPUs.
  • There is an up to 12% performance drop for BERT networks in FP16 precision compared to TensorRT 9.0 on NVIDIA Ada Lovelace GPUs.
  • There is an up to 2.5x build time increase compared to TensorRT 9.0 for certain BERT-like models due to additional tactics available for evaluation.
  • There is an up to 13% performance drop for the CortanaASR model on NVIDIA Ampere GPUs compared to TensorRT 8.5.
  • There is a known performance regression in the grouped deconvolution layer due to disabling cuDNN tactics. In TensorRT 8.6, performance can be recovered by unsetting nvinfer1::PreviewFeature::kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805. We will close the performance gap in a future release.
  • There is an up to 27% performance drop for the SegResNet model on Ampere GPUs compared to TensorRT 8.6 EA. This drop can be avoided by enabling the kVERSION_COMPATIBLE flag in the ONNX parser.
  • There is an up to 18% performance drop for the ShuffleNet model on A30/A40 compared to TensorRT 8.5.1.
  • Convolution on a tensor with an implicitly data-dependent shape may run significantly slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.
  • For some Transformer models, including ViT, Swin-Transformer, and DETR, there is a performance drop in INT8 precision (including both explicit and implicit quantization) compared to FP16 precision.
  • There is an up to 30% performance regression for LSTM variants with dynamic shapes. This issue can be resolved by disabling the kFASTER_DYNAMIC_SHAPES_0805 preview feature in TensorRT 8.6.
  • There is a known issue on H100 that may lead to GPU hang when running TensorRT with high persistentCache usage. Limit the usage to 40% of L2 cache size as a workaround.
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to when running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows a 32% drop when the network runs on DLA compared to when the last SoftMax layer runs on a GPU.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • H100 performance for some LSTMs in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.
  • There is an up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.
  • There is an up to 23% performance regression compared to TensorRT 8.5 on LSTMs in FP32 precision when dynamic shapes are used on NVIDIA Turing GPUs. Set the kFASTER_DYNAMIC_SHAPES_0805 preview flag to false as a workaround.
  • There is an up to 23% performance regression compared to TensorRT 8.5 on Temporal Fusion Transformers in FP32 precision on NVIDIA Turing and NVIDIA Ampere GPUs.
  • A higher builder optimization level does not always give better performance than a lower one; this can happen on all platforms and the regression can be up to 27%. The workaround is to build the engine using a lower builder optimization level.
  • There is an up to 8% performance regression compared to TensorRT 8.6 on PilotNet4 network on H100 GPUs. The regression will be fixed in a future TensorRT release.
  • There is a 1x FP32 model CPU memory leak from onnx.export, tracked in issue 106976, for the TensorRT open source software HuggingFace demo. You may encounter higher peak CPU memory usage for fresh runs, but it will drop for cached-ONNX runs.
  • There is an up to 6% performance regression compared to TensorRT 8.6 on the SqueezeNet network in TF32 precision on H100 GPUs.
  • There is an up to 70% performance regression compared to TensorRT 8.6 on BERT networks in INT8 precision with FP16 disabled on L4 GPUs. To work around this, enable FP16 and disable INT8 in the builder config.
  • Compared to TensorRT 8.6, TensorRT 9.0 has more aggressive multi-head attention (MHA) fusions. While this is beneficial in most cases, it causes up to 7% performance regression when the workload is too small. Increasing batch size would help improve the performance.

1.3. TensorRT Release 9.0.1

These are the TensorRT 9.0.1 Release Notes and are applicable to x86 Linux users. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

This GA release is for Large Language Models (LLMs) on A100, A10G, L4, L40, and H100 GPUs only. For applications other than LLMs, or other GPU platforms, continue to use TensorRT 8.6.1 for production use.

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Key Features and Enhancements

This TensorRT release includes the following key features and enhancements.
  • TensorRT 9.0 provides enhanced support for Large Language Models (LLMs), including the following networks:
    • GPT
    • GPT-J
    • GPT-Neo
    • GPT-NeoX
    • BART
    • Bloom
    • Bloomz
    • OPT
    • T5
    • FLAN-T5
  • Added support for bfloat16 data types on NVIDIA Ampere GPUs and newer architectures.
  • Added support for E4M3 FP8 data type on NVIDIA Hopper GPUs using explicit quantization. This allows utilizing TensorRT with TransformerEngine based FP8 models.
  • Added the NeMo demo into the TensorRT OSS repository to demonstrate the performance benefit of using E4M3 FP8 data type with the GPT models trained with the NVIDIA NeMo Toolkit and TransformerEngine.
  • Added bfloat16 and FP8 I/O datatypes in plugins.
  • Added support for networks running in INT8 precision where Smooth Quantization is used. Smooth Quantization is an approach that improves the accuracy of INT8 compute for LLMs.
  • Added support for INT64 data type. The ONNX parser no longer automatically casts INT64 to INT32.
  • Added support for ONNX local functions when parsing ONNX models with the ONNX parser.
  • Added support for caching JIT-compiled code. It can be disabled by setting BuilderFlag::kDISABLE_COMPILATION_CACHE. The compilation cache is part of the timing cache; it caches JIT-compiled code and is serialized as part of the timing cache by default, which may significantly increase the cache size.
  • Added NetworkDefinitionCreationFlag::kSTRONGLY_TYPED. A strongly typed network faithfully translates the type specification of the network definition to the built engine.
  • Added two new TensorRT samples, sampleProgressMonitor (C++) and simple_progress_reporter (Python), which demonstrate using the Progress Monitor during engine build (a minimal sketch follows this list).
  • The following C++ API classes were added:
    • IProgressMonitor
  • The following C++ API methods were added:
    • INetworkDefinition::IFillLayer* addFill(Dims dimensions, FillOperation op, DataType outputType)
    • INetworkDefinition::IQuantizeLayer* addQuantize(ITensor& input, ITensor& scale, DataType
              outputType)
    • INetworkDefinition::IDequantizeLayer* addDequantize(ITensor& input, ITensor& scale,
              DataType outputType)
    • IBuilder::setProgressMonitor(IProgressMonitor* monitor)
    • IBuilder::getProgressMonitor()
    • IFillLayer::setAlphaInt64(int64_t alpha)
    • IFillLayer::getAlphaInt64()
    • IFillLayer::setBetaInt64(int64_t beta)
    • IFillLayer::getBetaInt64()
    • IFillLayer::isAlphaBetaInt64()
    • IFillLayer::getToType()
    • IFillLayer::setToType(DataType toType)
    • IQuantizeLayer::getToType()
    • IQuantizeLayer::setToType(DataType toType)
    • IDequantizeLayer::getToType()
    • IDequantizeLayer::setToType(DataType toType)
    • INetworkDefinition::getFlags()
    • INetworkDefinition::getFlag(NetworkDefinitionCreationFlag
              networkDefinitionCreationFlag)
  • The following C++ API enums were added:
    • IBuilderFlag::kFP8
    • IBuilderFlag::kERROR_ON_TIMING_CACHE_MISS
    • IBuilderFlag::kBF16
    • IBuilderFlag::kDISABLE_COMPILATION_CACHE
    • DataType::kBF16
    • DataType::kINT64
    • NetworkDefinitionCreationFlag::kSTRONGLY_TYPED
  • The following Python API classes were added:
    • IProgressMonitor
  • The following Python API methods were added:
    • INetworkDefinition.add_fill(shape, op, output_type)
    • INetworkDefinition.add_dequantize(input, scale, output_type)
    • INetworkDefinition.add_quantize(input, scale, output_type)
    • INetworkDefinition.get_flag
  • The following Python API attributes were added:
    • IBuilder.progress_monitor
    • IFillLayer.is_alpha_beta_int64
    • IFillLayer.to_type
    • IQuantizeLayer.to_type
    • IDequantizeLayer.to_type
    • INetworkDefinition.flags
  • The following Python API enums were added:
    • IBuilderFlag.FP8
    • IBuilderFlag.ERROR_ON_TIMING_CACHE_MISS
    • IBuilderFlag.BF16
    • IBuilderFlag.DISABLE_COMPILATION_CACHE
    • DataType.BF16
    • DataType.INT64
    • NetworkDefinitionCreationFlag.STRONGLY_TYPED
  • The IEngineInspector now prints more detailed layer information for LSTM and Transformer networks.
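
The sketch below outlines a minimal progress monitor attached through the new IBuilder.progress_monitor attribute. It assumes the Python interface mirrors the C++ phaseStart/stepComplete/phaseFinish callbacks, so method names may differ slightly from the shipped bindings.

    import tensorrt as trt

    class PrintingMonitor(trt.IProgressMonitor):
        def __init__(self):
            trt.IProgressMonitor.__init__(self)

        def phase_start(self, phase_name, parent_phase, num_steps):
            print(f"start {phase_name} ({num_steps} steps)")

        def step_complete(self, phase_name, step):
            print(f"{phase_name}: step {step}")
            return True  # returning False cancels the build

        def phase_finish(self, phase_name):
            print(f"finish {phase_name}")

    # builder.progress_monitor = PrintingMonitor()  # attach before building the engine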

Breaking API Changes

  • Attention: In TensorRT 9.0, due to the introduction of INT64 as a supported data type, ONNX models with INT64 I/O require INT64 bindings. Note that prior to this release, such models required INT32 bindings.
  • In TensorRT 9.0, we removed ICaffeParser, IUffParser, and related classes and functions. The following APIs are removed:
    • nvcaffeparser1::IBlobNameToTensor
    • nvcaffeparser1::IBinaryProtoBlob
    • nvcaffeparser1::IPluginFactoryV2
    • nvcaffeparser1::ICaffeParser
    • nvcaffeparser1::createCaffeParser
    • nvcaffeparser1::shutdownProtobufLibrary
    • createNvCaffeParser_INTERNAL
    • nvinfer1::utils::reshapeWeights
    • nvinfer1::utils::reorderSubBuffers
    • nvinfer1::utils::transposeSubBuffers
    • nvuffparser::UffInputOrder
    • nvuffparser::FieldType
    • nvuffparser::FieldMap
    • nvuffparser::FieldCollection
    • nvuffparser::IUffParser
    • nvuffparser::createUffParser
    • nvuffparser::shutdownProtobufLibrary
    • createNvUffParser_INTERNAL
  • With the removal of ICaffeParser and IUffParser, the libnvparsers library is removed.
  • uff, graphsurgeon, and related networks are removed from TensorRT packages.

Deprecated API Lifetime

  • APIs deprecated in TensorRT 9.0 will be retained until at least 8/2024.
  • APIs deprecated in TensorRT 8.6 will be retained until at least 2/2024.
  • APIs deprecated in TensorRT 8.5 will be retained until at least 9/2023.
  • APIs deprecated in TensorRT 8.4 or before will be removed in TensorRT 10.0.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax, where the mode is chosen automatically based on the shape of the input tensor:
    • the first mode triggers when all nonbatch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may result in errors of a small degree. Also, batch size greater than 1 is supported only for DLA 3.9.0 and later. Refer to DLA Supported Layers for more information.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as, "Unable to load library: nvinfer_builder_resource.dll", if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • The DLA compiler is capable of removing identity transposes, but it cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshape). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes and an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless you merge the transposes manually in the model definition in advance.
  • In explicitly quantized networks, a group convolution that has a Q/DQ pair before but no Q/DQ pair after is expected to run with INT8-IN-FP32-OUT mixed precision. However, on NVIDIA Hopper™ it may fall back to FP32-IN-FP32-OUT if the input channel count is small. This will be fixed in a future release.
  • On PowerPC platforms, samples that depend on TensorFlow, ONNX Runtime, and PyTorch are unable to run due to missing Python module dependencies. These frameworks have not been built for PowerPC and/or published to standard repositories.
  • nvinfer1::UnaryOperation::kROUND or nvinfer1::UnaryOperation::kSIGN operations of IUnaryLayer are not supported in the implicit batch mode.
  • For networks containing normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset that contains the corresponding function ops, for example, opset 17 for LayerNormalization or opset 18 for GroupNormalization. Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers.
  • QuantizeLayer and DequantizeLayer only support FP32 scale and data, even when using ONNX opset 19. If the input is not FP32, you must add a Cast to FP32 on the input to QuantizeLayer and a Cast from FP32 at the output of DequantizeLayer; a minimal sketch follows this list.
  • EngineInspector::getLayerInformation may return incomplete JSON data for some engines produced by TensorRT 9.0. When this happens, TensorRT Engine Explorer cannot be used to analyze the engine or generate a graph of the engine layers.
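
As a minimal sketch of the Q/DQ cast requirement above (assuming an existing network definition, an FP16 input tensor, and an FP32 scale tensor; the helper name is hypothetical), the explicit casts can be inserted with INetworkDefinition::addCast:

    #include "NvInfer.h"
    using namespace nvinfer1;

    // Sketch: wrap a non-FP32 tensor with explicit FP32 casts so that
    // IQuantizeLayer and IDequantizeLayer see FP32 data and scale.
    ITensor* addQdqWithCasts(INetworkDefinition& network, ITensor& fp16Input, ITensor& fp32Scale)
    {
        // Cast the FP16 input to FP32 before quantization.
        ICastLayer* castIn = network.addCast(fp16Input, DataType::kFLOAT);
        IQuantizeLayer* q = network.addQuantize(*castIn->getOutput(0), fp32Scale, DataType::kINT8);
        IDequantizeLayer* dq = network.addDequantize(*q->getOutput(0), fp32Scale, DataType::kFLOAT);
        // Cast the dequantized FP32 output back to FP16 for downstream layers.
        ICastLayer* castOut = network.addCast(*dq->getOutput(0), DataType::kHALF);
        return castOut->getOutput(0);
    }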

Deprecated and Removed Features

The following features have been deprecated or removed in TensorRT 9.0. Some deprecations that were planned to be removed in 9.0, but have not yet been removed, may be removed in TensorRT 10.0.
  • Ubuntu 18.04 has reached end of life and is no longer supported by TensorRT starting with 9.0.
  • The following plugins were deprecated:
    • BatchedNMS_TRT
    • BatchedNMSDynamic_TRT
    • BatchTilePlugin_TRT
    • Clip_TRT
    • CoordConvAC
    • CropAndResize
    • EfficientNMS_ONNX_TRT
    • CustomGeluPluginDynamic
    • LReLU_TRT
    • NMSDynamic_TRT
    • NMS_TRT
    • Normalize_TRT
    • Proposal
    • SingleStepLSTMPlugin
    • SpecialSlice_TRT
    • Split
  • The following C++ API classes were deprecated:
    • NvUtils
  • The following C++ API methods were deprecated:
    • nvinfer1::INetworkDefinition::addFill(nvinfer1::Dims dimensions, nvinfer1::FillOperation op)
      - Only the 2-parameter version of this function is deprecated.
    • nvinfer1::INetworkDefinition::addDequantize(nvinfer1::ITensor &input, nvinfer1::ITensor &scale)
      - Only the 2-parameter version of this function is deprecated.
    • nvinfer1::INetworkDefinition::addQuantize(nvinfer1::ITensor &input, nvinfer1::ITensor &scale)
      - Only the 2-parameter version of this function is deprecated; a migration sketch follows this list.
  • The following C++ API enums were deprecated:
    • nvinfer1::TacticSource::kCUBLAS_LT
    • nvonnxparser::OnnxParserFlag::kNATIVE_INSTANCENORM
  • The following Python API methods were deprecated:
    • INetworkDefinition.add_fill(shape, op)
    • INetworkDefinition.add_dequantize(input, scale)
  • The following Python API enums were deprecated:
    • TacticSource.CUBLAS_LT
    • OnnxParserFlag.NATIVE_INSTANCENORM
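
As an illustrative migration sketch for the 2-parameter deprecations above (the dimensions, tensors, and helper name are hypothetical, and the 3-parameter addQuantize overload is implied by the deprecation note), the overloads that take an explicit output DataType can be used instead:

    #include "NvInfer.h"
    using namespace nvinfer1;

    // Sketch: replace the deprecated 2-parameter calls with the overloads
    // that take an explicit output DataType.
    void migrateToExplicitOutputTypes(INetworkDefinition& network, ITensor& input, ITensor& scale)
    {
        // Was: network.addFill(Dims{1, {16}}, FillOperation::kLINSPACE);
        // Alpha/beta (start and delta) still need to be set on the fill layer as before.
        IFillLayer* fill = network.addFill(Dims{1, {16}}, FillOperation::kLINSPACE, DataType::kFLOAT);

        // Was: network.addQuantize(input, scale);
        IQuantizeLayer* q = network.addQuantize(input, scale, DataType::kINT8);

        // Was: network.addDequantize(*q->getOutput(0), scale);
        IDequantizeLayer* dq = network.addDequantize(*q->getOutput(0), scale, DataType::kFLOAT);

        (void)fill; (void)dq; // Connect these layers to the rest of the network as needed.
    }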

Fixed Issues

  • There was an up to 9% performance regression compared to TensorRT 8.5 on Yolov3 batch size 1 in FP32 precision on NVIDIA Ada Lovelace GPUs. This issue has been fixed.
  • There was an up to 13% performance regression compared to TensorRT 8.5 on GPT2 without kv-cache in FP16 precision when dynamic shapes are used on NVIDIA Volta and NVIDIA Ampere GPUs. The workaround was to set the kFASTER_DYNAMIC_SHAPES_0805 preview flag to false. This issue has been fixed.
  • For transformer-based networks such as BERT and GPT, TensorRT could consume CPU memory of up to 2 times the model size during compilation, plus memory for any additional formats timed. This issue has been fixed.
  • In some cases, when using the OBEY_PRECISION_CONSTRAINTS builder flag and the required type was set to FP32, the network could fail with a missing tactic due to an incorrect optimization converting the output of an operation to FP16. This could be resolved by removing the OBEY_PRECISION_CONSTRAINTS option. This issue has been fixed.
  • TensorRT could fail on devices with small RAM and swap space. This could be resolved by ensuring the RAM and swap space is at least 7 times the size of the network. For example, at least 21 GB of combined CPU memory for a 3 GB network. This issue has been fixed.
  • If an IShapeLayer was used to get the output shape of an INonZeroLayer, engine building would likely fail. This issue has been fixed.
  • If a one-dimension INT8 input was used for a Unary or ElementWise operation, engine building would throw an internal error “Could not find any implementation for node”. This issue has been fixed.
  • Using the compute sanitizer racecheck tool could cause the process to be terminated unexpectedly. The root cause was a false alarm from the tool. The issue could be bypassed with --kernel-regex-exclude kns=scudnn_winograd. This issue has been fixed.
  • TensorRT in FP16 mode did not perform cast operations correctly when only the output types were set, but not the layer precisions. This issue has been fixed.
  • There was a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform. This issue has been fixed.
  • When using DLA, INT8 convolutions followed by FP16 layers could cause accuracy degradation. In such cases, you could either change the convolution to FP16 or the subsequent layer to INT8. This issue has been fixed.
  • Using the compute sanitizer tool from CUDA 12.0 could report a cudaErrorIllegalInstruction error on Hopper GPUs in unusual scenarios. This could be ignored. This issue has been fixed.
  • There was an up to 53% engine build time increase for ViT networks in FP16 precision compared to TensorRT 8.6 on NVIDIA Ampere GPUs. This issue has been fixed.
  • There was an up to 93% engine build time increase for BERT networks in FP16 precision compared to TensorRT 8.6 on NVIDIA Ampere GPUs. This issue has been fixed.
  • There was an up to 22% performance drop for GPT2 networks in FP16 precision with kv-cache stacked together compared to TensorRT 8.6 on NVIDIA Ampere GPUs. This issue has been fixed.
  • There was an up to 28% performance regression compared to TensorRT 8.5 on Transformer networks in FP16 precision on NVIDIA Volta GPUs, and up to 85% performance regression on NVIDIA Pascal GPUs. Disabling the kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 preview flag was a workaround. This issue has been fixed.
  • There was an issue on NVIDIA A2 GPUs due to a known hang that can impact multiple networks. This issue has been fixed.
  • When using hardware compatibility features, TensorRT would potentially fail while compiling transformer based networks such as BERT. This issue has been fixed.
  • The ONNX parser incorrectly used the InstanceNormalization plugin when processing ONNX InstanceNormalization layers with FP16 scale and bias inputs, which led to unexpected NaN and infinite outputs. This issue has been fixed.

Known Issues

Functional
  • A RHEL/CentOS 7 RPM package built with CUDA 12.2 for cuDNN 8.9.4 has not been published to the CUDA network repo. When installing TensorRT 9.0 on RHEL/CentOS 7 using the CUDA network repo, the cuDNN 8.9.4 RPM package built with CUDA 11.8 will be installed instead. Often, this scenario remains functional, but it’s undefined behavior to mix libraries with different CUDA major versions. This will be resolved soon once the missing cuDNN 8.9.4 RPM package for RHEL/CentOS 7 and CUDA 12.2 has been published.
  • The compute sanitizer synccheck tool may report a false alarm when running conv and GEMM kernels with CGA size >= 4 on NVIDIA Hopper GPUs.
  • The compute sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored and will be fixed in an upcoming CUDA release.
  • When using cuDNN 8.9.2.26 and the TensorRT RNNv2 API, the compute sanitizer from CUDA Toolkit 12.2 may report race conditions in CUTLASS kernels.
  • Multihead attention fusion might not happen and affect performance if the number of heads is small.
  • If an ONNX model contains a Range operator and its limit input is a data-dependent tensor, engine building will likely fail.
  • Hardware forward compatibility (HFC) is broken on L4T Concord for ViT, Swin-Transformers, and BERT networks in FP16 mode. A workaround is to only use FP32 mode on L4T Concord or turn off HFC.
  • Compute Sanitizer from CUDA Toolkit 12.0/12.1 may report a false alarm about invalid memory access in generatedNativePointwise kernels.
  • If a network has a tensor of type bool with an implicitly data-dependent shape, engine building will likely fail.
  • There is an occurrence of use-after-free in NVRTC that has been fixed in CUDA 12.1. When using NVRTC from CUDA 12.0 together with the TensorRT static library, you may encounter a crash in certain scenarios. Linking with the NVRTC and PTXJIT compiler from CUDA 12.1 or newer will resolve this issue.
  • Although the version compatible runtime is optimized for efficiency, it may result in slower performance than the full runtime in certain use cases. Most networks can expect no more than a 10% slowdown when using a version-compatible engine compared to a version-locked engine. However, in some cases, a larger performance drop may occur. For example:
    • When running ResNet50_v2 with QAT, there may be up to a 22% decrease in performance.
    • When running DynUNet in FP16 precision, there may be up to a 32% decrease in performance.
    • When running ONNX networks with InstanceNormalization operations, there may be up to a 50% decrease in performance.
  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. To suppress these issues, provide a Valgrind suppression file with the following contents and add the option --keep-debuginfo=yes to the Valgrind command line.
    {
       Memory leak errors with dlopen
       Memcheck:Leak
       match-leak-kinds: definite
       ...
       fun:*dlopen*
       ...
    }
    {
       Memory leak errors with nvrtc
       Memcheck:Leak
       match-leak-kinds: definite
       fun:malloc
       obj:*libnvrtc.so*
       ...
    }
    
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a "could not find any implementation" error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
  • Hardware compatible engines built with CUDA versions older than 11.5 may crash during inference when run on a GPU with a compute capability lower than that of the GPU where the engine was built. A workaround is to build on the GPU with the lowest compute capability.
  • When enabling the cuDNN tactic source manually, there is a potential memory leak from the cuDNN library. This issue will be fixed in a future cuDNN release.
Performance
  • There is an up to 12% performance drop for WaveRNN networks in TF32 precision compared to TensorRT 8.6 on NVIDIA Ampere GPUs.
  • There is an up to 25% CPU memory usage increase for ViT networks when building the engine in FP16 precision compared to TensorRT 8.6 on NVIDIA Ampere GPUs.
  • There is an up to 13% performance drop for the CortanaASR model on NVIDIA Ampere GPUs compared to TensorRT 8.5.
  • There is a known performance regression in the grouped deconvolution layer due to disabling cuDNN tactics. In TensorRT 8.6, performance can be recovered by unsetting nvinfer1::PreviewFeature::kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805. We will close the performance gap in a future release.
  • There is an up to 27% performance drop for the SegResNet model on Ampere GPUs compared to TensorRT 8.6 EA. This drop can be avoided by enabling the kVERSION_COMPATIBLE flag in the ONNX parser.
  • There is an up to 18% performance drop for the ShuffleNet model on A30/A40 compared to TensorRT 8.5.1.
  • There may be minor performance regressions when running ONNX models with InstanceNormalization operators in version compatible mode. Refer to the NVIDIA TensorRT Developer Guide for more information.
  • Convolution on a tensor with an implicitly data-dependent shape may run significantly slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.
  • For some Transformer models, including ViT, Swin-Transformer, and DETR, there is a performance drop in INT8 precision (including both explicit and implicit quantization) compared to FP16 precision.
  • There is an up to 30% performance regression for LSTM variants with dynamic shapes. This issue can be resolved by disabling the kFASTER_DYNAMIC_SHAPES_0805 preview feature in TensorRT 8.6.
  • There is a known issue on H100 that may lead to a GPU hang when running TensorRT with high persistentCache usage. Limit the usage to 40% of the L2 cache size as a workaround (see the sketch after this list).
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is an up to 10% performance drop for the SegResNet network compared to TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere architecture GPUs due to a cuDNN regression in the InstanceNormalization plug-in. This will be fixed in a future TensorRT release. You can work around the regression by reverting the cuDNN version to cuDNN 8.2.1.
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to when running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows a 32% drop when the network runs on DLA as compared to when the last SoftMax layer runs on a GPU.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • H100 performance for some LSTMs in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.
  • There is an up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.
  • There is an up to 7% performance regression compared to TensorRT 8.5 on CortanaASR networks in FP16 precision on NVIDIA Volta GPUs. Disable the kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 preview flag as a workaround.
  • There is an up to 23% performance regression compared to TensorRT 8.5 on LSTMs in FP32 precision when dynamic shapes are used on NVIDIA Turing GPUs. Set the kFASTER_DYNAMIC_SHAPES_0805 preview flag to false as a workaround.
  • There is an up to 23% performance regression compared to TensorRT 8.5 on Temporal Fusion Transformers in FP32 precision on NVIDIA Turing and NVIDIA Ampere GPUs.
  • A higher builder optimization level does not always give better performance than a lower builder optimization level; this can happen on all platforms, with regressions of up to 27%. The workaround is to build an engine using a lower builder optimization level.
  • There is an up to 8% performance regression compared to TensorRT 8.6 on PilotNet4 network on H100 GPUs. The regression will be fixed in a future TensorRT release.
  • There is a CPU memory leak of approximately one FP32 copy of the model from onnx.export, tracked in issue 106976, affecting the TensorRT open source software HuggingFace demo. You may encounter higher peak CPU memory usage for fresh runs, but the usage drops for cached-ONNX runs.
  • For enqueue-bound workloads, the latency of the workloads may be longer in Windows than in Linux operating systems. Upgrade the driver or use CUDA graph to work around the issue. Refer to the documentation about enqueue-bound workloads for more detailed information.
  • There is an up to 6% performance regression compared to TensorRT 8.6 on the SqueezeNet network in TF32 precision on H100 GPUs.
  • There is an up to 6% performance regression compared to TensorRT 8.6 on BART networks with kv-cache stacked together in FP16 precision on H100 GPUs.
  • There is an up to 70% performance regression compared to TensorRT 8.6 on BERT networks in INT8 precision with FP16 disabled on L4 GPUs. To work around this, enable FP16 and disable INT8 in the builder config.
  • Compared to TensorRT 8.6, TensorRT 9.0 has more aggressive multi-head attention (MHA) fusions. While this is beneficial in most cases, it causes up to a 7% performance regression when the workload is too small. Increasing the batch size helps improve performance.
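
For the H100 persistentCache item above, the following is a minimal sketch of the suggested workaround, assuming IExecutionContext::setPersistentCacheLimit is available in your TensorRT version and the engine runs on device 0; the 40% figure comes from the item above:

    #include <cuda_runtime_api.h>
    #include "NvInfer.h"
    using namespace nvinfer1;

    // Sketch: cap TensorRT's persistent cache usage at 40% of the device's
    // L2 cache size, per the workaround described above.
    void limitPersistentCache(IExecutionContext& context)
    {
        int l2CacheSize = 0;
        cudaDeviceGetAttribute(&l2CacheSize, cudaDevAttrL2CacheSize, /*device=*/0);
        context.setPersistentCacheLimit(static_cast<size_t>(l2CacheSize) * 40 / 100);
    }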

1.4. TensorRT Release 9.0.0 Early Access (EA)

These are the TensorRT 9.0.0 Early Access (EA) Release Notes and are applicable to x86 Linux users. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

This EA release is for early testing and feedback for Large Language Models (LLMs) on A100, A10G, L4, L40, and H100 GPUs only. For production use of TensorRT, applications other than LLMs, or other GPU platforms, continue to use TensorRT 8.6.1.

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Key Features and Enhancements

This TensorRT release includes the following key features and enhancements.
  • TensorRT 9.0 provides enhanced support for Large Language Models (LLMs), including the following networks:
    • GPT
    • GPT-J
    • GPT-Neo
    • GPT-NeoX
    • BART
    • Bloom
    • Bloomz
    • OPT
    • T5
    • FLAN-T5
  • Added support for bfloat16 data types on NVIDIA Ampere GPUs and newer architectures.
  • Added support for E4M3 FP8 data type on NVIDIA Hopper GPUs using explicit quantization. This allows utilizing TensorRT with TransformerEngine based FP8 models.
  • Added the NeMo demo into the TensorRT OSS repository to demonstrate the performance benefit of using E4M3 FP8 data type with the GPT models trained with the NVIDIA NeMo Toolkit and TransformerEngine.
  • Added bfloat16 and FP8 I/O datatypes in plugins.
  • Added support for networks running in INT8 precision where Smooth Quantization is used. Smooth Quantization is an approach that improves the accuracy of INT8 compute for LLMs.
  • Added support for INT64 data type. The ONNX parser no longer automatically casts INT64 to INT32.
  • Added support for ONNX local functions when parsing ONNX models with the ONNX parser.
  • Added support for caching JIT-compiled code. It can be disabled by setting BuilderFlag::kDISABLE_COMPILATION_CACHE. The compilation cache is part of the timing cache; it caches JIT-compiled code and is serialized as part of the timing cache by default, which may significantly increase the cache size.
  • Added NetworkDefinitionCreationFlag::kSTRONGLY_TYPED. A strongly typed network faithfully translates the type specification of the network definition to the built engine; a configuration sketch follows this list.
  • Added two new TensorRT samples: sampleProgressMonitor (C++) and simple_progress_reporter (Python), which demonstrate using the Progress Monitor during engine build.
  • The following C++ API classes were added:
    • IProgressMonitor
  • The following C++ API methods were added:
    • INetworkDefinition::addFill(Dims dimensions, FillOperation op, DataType outputType)
    • INetworkDefinition::addDequantize(ITensor& input, ITensor& scale, DataType outputType)
    • IBuilder::setProgressMonitor(IProgressMonitor* monitor)
    • IBuilder::getProgressMonitor()
    • IFillLayer::setAlphaInt64(int64_t alpha)
    • IFillLayer::getAlphaInt64()
    • IFillLayer::setBetaInt64(int64_t beta)
    • IFillLayer::getBetaInt64()
    • IFillLayer::isAlphaBetaInt64()
    • IFillLayer::getToType()
    • IFillLayer::setToType(DataType toType)
    • IQuantizeLayer::getToType()
    • IQuantizeLayer::setToType(DataType toType)
    • IDequantizeLayer::getToType()
    • IDequantizeLayer::setToType(DataType toType)
    • INetworkDefinition::getFlags()
    • INetworkDefinition::getFlag(NetworkDefinitionCreationFlag networkDefinitionCreationFlag)
  • The following C++ API enums were added:
    • BuilderFlag::kFP8
    • BuilderFlag::kERROR_ON_TIMING_CACHE_MISS
    • BuilderFlag::kBF16
    • BuilderFlag::kDISABLE_COMPILATION_CACHE
    • DataType::kBF16
    • DataType::kINT64
    • NetworkDefinitionCreationFlag::kSTRONGLY_TYPED
  • The following Python API classes were added:
    • IProgressMonitor
  • The following Python API methods were added:
    • INetworkDefinition.add_fill(shape, op, output_type)
    • INetworkDefinition.add_dequantize(input, scale, output_type)
    • INetworkDefinition.add_quantize(input, scale, output_type)
    • INetworkDefinition.get_flag
  • The following Python API attributes were added:
    • IBuilder.progress_monitor
    • IFillLayer.is_alpha_beta_int64
    • IFillLayer.to_type
    • IQuantizeLayer.to_type
    • IDequantizeLayer.to_type
    • INetworkDefinition.flags
  • The following Python API enums were added:
    • BuilderFlag.FP8
    • BuilderFlag.ERROR_ON_TIMING_CACHE_MISS
    • BuilderFlag.BF16
    • BuilderFlag.DISABLE_COMPILATION_CACHE
    • DataType.BF16
    • DataType.INT64
    • NetworkDefinitionCreationFlag.STRONGLY_TYPED
  • The IEngineInspector now prints more detailed layer information for LSTM and Transformer networks.
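
The following is a minimal sketch combining two of the items above: creating a strongly typed network with NetworkDefinitionCreationFlag::kSTRONGLY_TYPED and opting out of the JIT compilation cache. The builder is assumed to exist, and the helper name is hypothetical:

    #include "NvInfer.h"
    using namespace nvinfer1;

    // Sketch: build with a strongly typed network definition and keep
    // JIT-compiled code out of the timing cache.
    void configureStronglyTypedBuild(IBuilder& builder)
    {
        // Strongly typed: the built engine faithfully follows the type
        // specification of the network definition.
        auto const flags = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kSTRONGLY_TYPED);
        INetworkDefinition* network = builder.createNetworkV2(flags);

        IBuilderConfig* config = builder.createBuilderConfig();
        // Disable the compilation cache so it is not serialized with the timing cache.
        config->setFlag(BuilderFlag::kDISABLE_COMPILATION_CACHE);

        // ... populate the network, then call builder.buildSerializedNetwork(*network, *config) ...
        (void)network;
    }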

Breaking API Changes

  • Attention: In TensorRT 9.0, due to the introduction of INT64 as a supported data type, ONNX models with INT64 I/O require INT64 bindings. Note that prior to this release, such models required INT32 bindings.
  • In TensorRT 9.0 we are removing ICaffeParser, IUffParser, and related classes and functions. The following APIs are removed:
    • nvcaffeparser1::IBlobNameToTensor
    • nvcaffeparser1::IBinaryProtoBlob
    • nvcaffeparser1::IPluginFactoryV2
    • nvcaffeparser1::ICaffeParser
    • nvcaffeparser1::createCaffeParser
    • nvcaffeparser1::shutdownProtobufLibrary
    • createNvCaffeParser_INTERNAL
    • nvinfer1::utils::reshapeWeights
    • nvinfer1::utils::reorderSubBuffers
    • nvinfer1::utils::transposeSubBuffers
    • nvuffparser::UffInputOrder
    • nvuffparser::FieldType
    • nvuffparser::FieldMap
    • nvuffparser::FieldCollection
    • nvuffparser::IUffParser
    • nvuffparser::createUffParser
    • nvuffparser::shutdownProtobufLibrary
    • createNvUffParser_INTERNAL
  • With the removal of ICaffeParser and IUffParser, the libnvparsers library is removed.
  • uff, graphsurgeon, and related networks are removed from TensorRT packages.

Deprecated API Lifetime

  • APIs deprecated in TensorRT 9.0 will be retained until at least 8/2024.
  • APIs deprecated in TensorRT 8.6 will be retained until at least 2/2024.
  • APIs deprecated in TensorRT 8.5 will be retained until at least 9/2023.
  • APIs deprecated in TensorRT 8.4 or before will be removed in TensorRT 10.0.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax; the mode is chosen automatically based on the shape of the input tensor:
    • the first mode triggers when all nonbatch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may result in small errors. Also, batch size greater than 1 is supported only for DLA 3.9.0 and later. Refer to DLA Supported Layers for more information.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as "Unable to load library: nvinfer_builder_resource.dll" if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • The DLA compiler is capable of removing identity transposes, but it cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshape). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes with an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless you merge the transposes together manually in the model definition in advance.
  • In explicitly quantized networks, a group convolution that has a Q/DQ pair before but no Q/DQ pair after is expected to run with INT8-IN-FP32-OUT mixed precision. However, on NVIDIA Hopper™ it may fall back to FP32-IN-FP32-OUT if the input channel count is small. This will be fixed in a future release.
  • On PowerPC platforms, samples that depend on TensorFlow, ONNX Runtime, and PyTorch are unable to run due to missing Python module dependencies. These frameworks have not been built for PowerPC and/or published to standard repositories.
  • nvinfer1::UnaryOperation::kROUND or nvinfer1::UnaryOperation::kSIGN operations of IUnaryLayer are not supported in the implicit batch mode.
  • For networks containing normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset that contains the corresponding function ops, for example: opset 17 for LayerNormalization or opset 18 for GroupNormalization. Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers.

Deprecated and Removed Features

The following features have been deprecated or removed in TensorRT 9.0.0. Some deprecations that were planned to be removed in 9.0.0, but have not yet been removed, may be removed in TensorRT 10.0.
  • Ubuntu 18.04 has reached end of life and is no longer supported by TensorRT starting with 9.0.
  • The following plugins were deprecated:
    • BatchedNMS_TRT
    • BatchedNMSDynamic_TRT
    • BatchTilePlugin_TRT
    • Clip_TRT
    • CoordConvAC
    • CropAndResize
    • EfficientNMS_ONNX_TRT
    • CustomGeluPluginDynamic
    • LReLU_TRT
    • NMSDynamic_TRT
    • NMS_TRT
    • Normalize_TRT
    • Proposal
    • SingleStepLSTMPlugin
    • SpecialSlice_TRT
    • Split
  • The following C++ API classes were deprecated:
    • NvUtils
  • The following C++ API methods were deprecated:
    • nvinfer1::INetworkDefinition::addFill(nvinfer1::Dims dimensions, nvinfer1::FillOperation op)
      - Only the 2-parameter version of this function is deprecated.
    • nvinfer1::INetworkDefinition::addDequantize(nvinfer1::ITensor &input, nvinfer1::ITensor &scale)
      - Only the 2-parameter version of this function is deprecated.
    • nvinfer1::INetworkDefinition::addQuantize(nvinfer1::ITensor &input, nvinfer1::ITensor &scale)
      - Only the 2-parameter version of this function is deprecated.
  • The following C++ API enums were deprecated:
    • nvinfer1::TacticSource::kCUBLAS_LT
    • nvonnxparser::OnnxParserFlag::kNATIVE_INSTANCENORM
  • The following Python API methods were deprecated:
    • INetworkDefinition.add_fill(shape, op)
    • INetworkDefinition.add_dequantize(input, scale)
  • The following Python API enums were deprecated:
    • TacticSource.CUBLAS_LT
    • OnnxParserFlag.NATIVE_INSTANCENORM

Fixed Issues

  • There was an up to 9% performance regression compared to TensorRT 8.5 on Yolov3 batch size 1 in FP32 precision on NVIDIA Ada Lovelace GPUs. This issue has been fixed.
  • There was an up to 13% performance regression compared to TensorRT 8.5 on GPT2 without kv-cache in FP16 precision when dynamic shapes are used on NVIDIA Volta and NVIDIA Ampere GPUs. The workaround was to set the kFASTER_DYNAMIC_SHAPES_0805 preview flag to false. This issue has been fixed.
  • For transformer-based networks such as BERT and GPT, TensorRT could consume CPU memory of up to 2 times the model size during compilation, plus memory for any additional formats timed. This issue has been fixed.
  • In some cases, when using the OBEY_PRECISION_CONSTRAINTS builder flag and the required type was set to FP32, the network could fail with a missing tactic due to an incorrect optimization converting the output of an operation to FP16. This could be resolved by removing the OBEY_PRECISION_CONSTRAINTS option. This issue has been fixed.
  • TensorRT could fail on devices with small RAM and swap space. This could be resolved by ensuring the RAM and swap space is at least 7 times the size of the network. For example, at least 21 GB of combined CPU memory for a 3 GB network. This issue has been fixed.
  • If an IShapeLayer was used to get the output shape of an INonZeroLayer, engine building would likely fail. This issue has been fixed.
  • If a one-dimension INT8 input was used for a Unary or ElementWise operation, engine building would throw an internal error “Could not find any implementation for node”. This issue has been fixed.
  • Using the compute sanitizer racecheck tool could cause the process to be terminated unexpectedly. The root cause was a false alarm from the tool. The issue could be bypassed with --kernel-regex-exclude kns=scudnn_winograd. This issue has been fixed.
  • TensorRT in FP16 mode did not perform cast operations correctly when only the output types were set, but not the layer precisions. This issue has been fixed.
  • There was a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform. This issue has been fixed.
  • When using DLA, INT8 convolutions followed by FP16 layers could cause accuracy degradation. In such cases, you could either change the convolution to FP16 or the subsequent layer to INT8. This issue has been fixed.
  • Using the compute sanitizer tool from CUDA 12.0 could report a cudaErrorIllegalInstruction error on Hopper GPUs in unusual scenarios. This could be ignored. This issue has been fixed.

Known Issues

Functional
  • ONNX models containing a large number of "Pad" layers have a potential for error accumulation. This can result in accuracy differences in inference between TensorRT and ONNX runtime.
  • Multihead attention fusion might not happen and affect performance if the number of heads is small.
  • If an ONNX model contains a Range operator and its limit input is a data-dependent tensor, engine building will likely fail.
  • Hardware forward compatibility (HFC) is broken on L4T Concord for ViT, Swin-Transformers, and BERT networks in FP16 mode. A workaround is to only use FP32 mode on L4T Concord or turn off HFC.
  • Compute Sanitizer from CUDA Toolkit 12.0/12.1 may report a false alarm about invalid memory access in generatedNativePointwise kernels.
  • If a network has a tensor of type bool with an implicitly data-dependent shape, engine building will likely fail.
  • When using hardware compatibility features, TensorRT can potentially fail while compiling transformer based networks such as BERT. This issue will be fixed in the 8.6.1 release.
  • There is an occurrence of use-after-free in NVRTC that has been fixed in CUDA 12.1. When using NVRTC from CUDA 12.0 together with the TensorRT static library, you may encounter a crash in certain scenarios. Linking with the NVRTC and PTXJIT compiler from CUDA 12.1 or newer will resolve this issue.
  • Although the version compatible runtime is optimized for efficiency, it may result in slower performance than the full runtime in certain use cases. Most networks can expect no more than a 10% slowdown when using a version-compatible engine compared to a version-locked engine. However, in some cases, a larger performance drop may occur. For example:
    • When running ResNet50_v2 with QAT, there may be up to a 22% decrease in performance.
    • When running DynUNet in FP16 precision, there may be up to a 32% decrease in performance.
    • When running ONNX networks with InstanceNormalization operations, there may be up to a 50% decrease in performance.
  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. To suppress these issues, provide a Valgrind suppression file with the following contents and add the option --keep-debuginfo=yes to the Valgrind command line.
    {
       Memory leak errors with dlopen
       Memcheck:Leak
       match-leak-kinds: definite
       ...
       fun:*dlopen*
       ...
    }
    {
       Memory leak errors with nvrtc
       Memcheck:Leak
       match-leak-kinds: definite
       fun:malloc
       obj:*libnvrtc.so*
       ...
    }
    
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a "could not find any implementation" error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
  • Hardware compatible engines built with CUDA versions older than 11.5 may crash during inference when run on a GPU with a compute capability lower than that of the GPU where the engine was built. A workaround is to build on the GPU with the lowest compute capability.
  • When enabling the cuDNN tactic source manually, there is a potential memory leak from the cuDNN library. This issue will be fixed in a future cuDNN release.
  • It is not recommended to use this specific version of TensorRT on NVIDIA A2 GPUs due to a known hang that can impact multiple networks. A fix for this issue is planned and will be included in a future release of TensorRT.
Performance
  • There is an up to 12% performance drop for WaveRNN networks in TF32 precision compared to TensorRT 8.6 on NVIDIA Ampere GPUs.
  • There is an up to 25% CPU memory usage increase for ViT networks when building the engine in FP16 precision compared to TensorRT 8.6 on NVIDIA Ampere GPUs.
  • There is an up to 53% engine build time increase for ViT networks in FP16 precision compared to TensorRT 8.6 on NVIDIA Ampere GPUs.
  • There is an up to 93% engine build time increase for BERT networks in FP16 precision compared to TensorRT 8.6 on NVIDIA Ampere GPUs.
  • There is an up to 22% performance drop for GPT2 networks in FP16 precision with kv-cache stacked together compared to TensorRT 8.6 on NVIDIA Ampere GPUs.
  • There is an up to 28% performance regression compared to TensorRT 8.5 on Transformer networks in FP16 precision on NVIDIA Volta GPUs, and up to 85% performance regression on NVIDIA Pascal GPUs. Disable the kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 preview flag as a workaround.
  • There is an up to 13% performance drop for the CortanaASR model on NVIDIA Ampere GPUs compared to TensorRT 8.5.
  • There is a known performance regression in the grouped deconvolution layer due to disabling cuDNN tactics. In TensorRT 8.6, performance can be recovered by unsetting nvinfer1::PreviewFeature::kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805. We will close the performance gap in a future release.
  • There is an up to 27% performance drop for the SegResNet model on Ampere GPUs compared to TensorRT 8.6 EA. This drop can be avoided by enabling the kVERSION_COMPATIBLE flag in the ONNX parser.
  • There is an up to 18% performance drop for the ShuffleNet model on A30/A40 compared to TensorRT 8.5.1.
  • There may be minor performance regressions when running ONNX models with InstanceNormalization operators in version compatible mode. Refer to the NVIDIA TensorRT Developer Guide for more information.
  • Convolution on a tensor with an implicitly data-dependent shape may run significantly slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.
  • For some Transformer models, including ViT, Swin-Transformer, and DETR, there is a performance drop in INT8 precision (including both explicit and implicit quantization) compared to FP16 precision.
  • There is an up to 30% performance regression for LSTM variants with dynamic shapes. This issue can be resolved by disabling the kFASTER_DYNAMIC_SHAPES_0805 preview feature in TensorRT 8.6.
  • There is a known issue on H100 that may lead to GPU hang when running TensorRT with high persistentCache usage. Limit the usage to 40% of L2 cache size as a workaround.
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is an up to 10% performance drop for the SegResNet network compared to TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere architecture GPUs due to a cuDNN regression in the InstanceNormalization plug-in. This will be fixed in a future TensorRT release. You can work around the regression by reverting the cuDNN version to cuDNN 8.2.1.
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to when running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows a 32% drop when the network runs on DLA as compared to when the last SoftMax layer runs on a GPU.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • H100 performance for some LSTMs in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.
  • There is an up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.
  • There is an up to 7% performance regression compared to TensorRT 8.5 on CortanaASR networks in FP16 precision on NVIDIA Volta GPUs. Disable the kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 preview flag as a workaround.
  • There is an up to 23% performance regression compared to TensorRT 8.5 on LSTMs in FP32 precision when dynamic shapes are used on NVIDIA Turing GPUs. Set the kFASTER_DYNAMIC_SHAPES_0805 preview flag to false as a workaround.
  • There is an up to 23% performance regression compared to TensorRT 8.5 on Temporal Fusion Transformers in FP32 precision on NVIDIA Turing and NVIDIA Ampere GPUs.

TensorRT Release 8.x.x

2.1. TensorRT Release 8.6.1

These are the TensorRT 8.6.1 Release Notes and are applicable to x86 Linux and Windows users. This release incorporates Arm®-based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Announcements

  • In TensorRT 8.6, cuDNN, cuBLAS, and cuBLASLt tactic sources are turned off by default in builder profiling. TensorRT plans to remove the cuDNN, cuBLAS, and cuBLASLt dependency in future releases. If there are critical regressions, set the PreviewFeature flag kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 to false to re-enable cuDNN, cuBLAS, and cuBLASLt (see the sketch after this list).
  • Stubbed static libraries for cuDNN, cuBLAS, and cuBLASLt are provided in $(TRT_LIB_DIR)/stubs. When statically linking TensorRT with no requirement for cuDNN, cuBLAS, or cuBLASLt, the stubbed library can be used to reduce CPU and GPU memory usage.
  • Debian and RPM packages are now using 4-component versioning instead of 3 to better identify differences between TensorRT builds.
  • The tar and zip filenames no longer contain the cuDNN version due to cuDNN no longer being a primary tactic source for TensorRT. cuDNN has a lesser impact on TensorRT’s performance than in previous releases where the cuDNN version was prominent in the package filename.
  • The TensorRT Python Package Index installation has been split into multiple modules:
    • TensorRT libraries (tensorrt_libs)
    • Python bindings matching the Python version in use (tensorrt_bindings)
    • Frontend source package, which pulls in the correct version of dependent TensorRT modules from pypi.nvidia.com (tensorrt)
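
For the first announcement above, the following is a minimal sketch of re-enabling the cuDNN, cuBLAS, and cuBLASLt tactic sources through the preview feature; the builder config is assumed to exist and the helper name is hypothetical:

    #include "NvInfer.h"
    using namespace nvinfer1;

    // Sketch: turn the preview feature off to restore the external tactic
    // sources that are disabled by default in TensorRT 8.6.
    void reenableExternalTacticSources(IBuilderConfig& config)
    {
        config.setPreviewFeature(
            PreviewFeature::kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805, false);
    }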

Key Features and Enhancements

This TensorRT release includes the following key features and enhancements.
  • Python 3.11 is now supported starting with TensorRT 8.6 GA.
  • CUDA 12.x is now supported starting with the TensorRT 8.6 release. CUDA 11.x builds from this release or a previous release are not compatible with libraries or applications compiled with CUDA 12.x and may lead to unexpected behavior.
  • Added support for hardware compatibility. This allows an engine built on one GPU architecture to work on GPUs of other architectures. This feature is only supported for NVIDIA Ampere and newer architectures. Note that enabling hardware compatibility may result in a degradation in latency/throughput, and is therefore intended to be used primarily to ease the process of upgrading to new hardware. This feature is not supported on JetPack.
  • Added the following layers:
    • IReverseSequence layer has been added to support the ReverseSequence operator in ONNX.
    • INormalization layer has been added to support InstanceNormalization, GroupNormalization, and LayerNormalization operations in ONNX.
  • A new nvinfer1::ICastLayer interface was introduced, which converts the data type of the input tensor among FP32, FP16, INT32, INT8, UINT8, and BOOL. The ONNX parser was updated to use ICastLayer instead of IIdentityLayer to implement cast.
  • Introduced restricted runtime installation options when memory consumption is a high priority for deployment: lean or dispatch runtime mode.
    • Lean runtime installation: This installation is significantly smaller than the full installation and allows you to load and run engines that were built with a version compatible builder flag. This installation will not provide the functionality to build a TensorRT plan file.
    • Dispatch runtime installation: This installation allows for deployments with the minimum memory consumption and allows you to load and run engines that were built with a version compatible builder flag and include the lean runtime. This installation does not provide the functionality to build a TensorRT plan file.
    • Added version compatibility support to trtexec. Refer to the NVIDIA TensorRT Developer Guide for more information on the trtexec flags.

    For more information, refer to the NVIDIA TensorRT Installation Guide.

  • By default, TensorRT engines are compatible only with the version of TensorRT with which they are built. Starting in TensorRT 8.6, with the appropriate build-time configuration, engines can be built that are compatible with other TensorRT minor versions within a major version. For more information, refer to Version Compatibility.
    Note: Use of certain version compatibility features requires explicit configuration of the TensorRT runtime to allow host executable code. For more information, refer to Runtime Options.
  • Implemented the following performance enhancements:
    • Improved the Multi-Head Attention (MHA) fusions, speeding up Transformer-like networks.
    • Improved the engine building time and the performance of Transformer-like networks with dynamic shapes.
    • Avoided unnecessary cuStreamSynchronize() calls in enqueueV2() and enqueueV3() when running LSTMs or Transformer-like networks.
    • Improved the performance of various networks on NVIDIA Hopper GPUs.
  • Added an optimization level builder flag, which allows TensorRT to spend more engine building time searching for better tactics, or to build the engine much faster by reducing the searching scope. For more information, refer to Builder Optimization Level; a combined configuration sketch follows the API change lists below.
  • Added the multi-stream APIs, which allow users to control how many streams TensorRT can use to run different parts of the network in parallel, potentially resulting in better performance. For more information, refer to Within-Inference Multi-Streaming.
  • Experimental. Extended DLA so that IElementWiseLayer now supports equal operation (ElementWiseOperation::kEQUAL). This is the first ElementWise logical operation that DLA supports. There are several restrictions and requirements imposed when adopting this operation in DLA. Refer to DLA Supported Layers for more information.
    • One such requirement is that you must explicitly set the device type of the ElementWise equal layer to DLA. To enable this feature on trtexec, it now supports a flag --layerDeviceTypes to let you explicitly specify the device type for individual layers. Refer to Commonly Used Command-line Flags for more information on the new flag.
  • Added a new sample called onnx_custom_plugin, which demonstrates how to use plugins written in C++ to run TensorRT on ONNX models with custom or unsupported layers. Refer to the GitHub: /onnx_custom_plugin/README.md file for detailed information about how this sample works, sample code, and step-by-step instructions on how to run and verify its output.
  • Made the following C++ API changes:
    • Added classes:
      • IReverseSequenceLayer
      • INormalizationLayer
      • ILoggerFinder
    • Added macros:
      • NV_TENSORRT_RELEASE_TYPE
      • NV_TENSORRT_RELEASE_TYPE_EARLY_ACCESS
      • NV_TENSORRT_RELEASE_TYPE_RELEASE_CANDIDATE
      • NV_TENSORRT_RELEASE_TYPE_GENERAL_AVAILABILITY
    • Added functions:
      • getBuilderSafePluginRegistry()
      • IAlgorithmIOInfo::getVectorizedDim()
      • IAlgorithmIOInfo::getComponentsPerElement()
      • IBuilderConfig::setBuilderOptimizationLevel()
      • IBuilderConfig::getBuilderOptimizationLevel()
      • IBuilderConfig::setHardwareCompatibilityLevel()
      • IBuilderConfig::getHardwareCompatibilityLevel()
      • IBuilderConfig::setPluginsToSerialize()
      • IBuilderConfig::getPluginToSerialize()
      • IBuilderConfig::getNbPluginsToSerialize()
      • IBuilderConfig::getMaxAuxStreams()
      • IBuilderConfig::setMaxAuxStreams()
      • IBuilder::getPluginRegistry()
      • ICudaEngine::getHardwareCompatibilityLevel()
      • ICudaEngine::getNbAuxStreams()
      • IExecutionContext::setAuxStreams()
      • ILayer::setMetadata()
      • ILayer::getMetadata()
      • INetworkDefinition::addCast()
      • INetworkDefinition::addNormalization()
      • INetworkDefinition::addReverseSequence()
      • INetworkDefinition::getBuilder()
      • IPluginRegistry::isParentSearchEnabled()
      • IPluginRegistry::setParentSearchEnabled()
      • IPluginRegistry::loadLibrary()
      • IPluginRegistry::deregisterLibrary()
      • IRuntime::setTemporaryDirectory()
      • IRuntime::getTemporaryDirectory()
      • IRuntime::setTempfileControlFlags()
      • IRuntime::getTempfileControlFlags()
      • IRuntime::getPluginRegistry()
      • IRuntime::setPluginRegistryParent()
      • IRuntime::loadRuntime()
      • IRuntime::setEngineHostCodeAllowed()
      • IRuntime::getEngineHostCodeAllowed()
      • ITopKLayer::setInput()
    • Added enums:
      • HardwareCompatibilityLevel
      • TempfileControlFlag
    • Added enum values:
      • BuilderFlag::kVERSION_COMPATIBLE
      • BuilderFlag::kEXCLUDE_LEAN_RUNTIME
      • DataType::kFP8
      • LayerType::kREVERSE_SEQUENCE
      • LayerType::kNORMALIZATION
      • LayerType::kCAST
      • MemoryPoolType::kTACTIC_DRAM
      • PreviewFeature::kPROFILE_SHARING_0806
      • TensorFormat::kDHWC
      • UnaryOperation::kISINF
    • Deprecated enums:
      • PreviewFeature::kFASTER_DYNAMIC_SHAPES
    • Deprecated macros:
      • NV_TENSORRT_SONAME_MAJOR
      • NV_TENSORRT_SONAME_MINOR
      • NV_TENSORRT_SONAME_PATCH
    • Deprecated functions:
      • FieldMap::FieldMap()
      • IAlgorithmIOInfo::getTensorFormat()
  • Made the following Python API changes:
    • Added classes:
      • ICastLayer
      • IReverseSequenceLayer
      • INormalizationLayer
    • Added functions:
      • IExecutionContext.set_aux_streams()
      • INetworkDefinition.add_cast()
      • INetworkDefinition.add_normalization()
      • INetworkDefinition.add_reverse_sequence()
      • IPluginRegistry.load_library()
      • IPluginRegistry.deregister_library()
      • IRuntime.load_runtime()
      • ITopKLayer.set_input()
    • Added properties:
      • BuilderFlag.EXCLUDE_LEAN_RUNTIME
      • BuilderFlag.FP8
      • BuilderFlag.VERSION_COMPATIBLE
      • DataType.FP8
      • HardwareCompatibilityLevel.AMPERE_PLUS
      • HardwareCompatibilityLevel.NONE
      • IAlgorithmIOInfo.components_per_element
      • IAlgorithmIOInfo.vectorized_dim
      • IBuilder.get_plugin_registry
      • IBuilderConfig.builder_optimization_level
      • IBuilderConfig.hardware_compatibility_level
      • IBuilderConfig.max_aux_streams
      • IBuilderConfig.plugins_to_serialize
      • ICudaEngine.hardware_compatibility_level
      • ICudaEngine.num_aux_streams
      • ILayer.metadata
      • INetworkDefinition.builder
      • INormalizationLayer.axes
      • INormalizationLayer.compute_precision
      • INormalizationLayer.epsilon
      • INormalizationLayer.num_groups
      • IPluginRegistry.parent_search_enabled
      • IReverseSequenceLayer.batch_axis
      • IReverseSequenceLayer.sequence_axis
      • IRuntime.engine_host_code_allowed
      • IRuntime.load_runtime
      • IRuntime.tempfile_control_flags
      • IRuntime.temporary_directory
      • LayerType.CAST
      • LayerType.NORMALIZATION
      • LayerType.REVERSE_SEQUENCE
      • MemoryPoolType.TACTIC_DRAM
      • PreviewFeature.PROFILE_SHARING_0806
      • TempfileControlFlag.ALLOW_IN_MEMORY_FILES
      • TempfileControlFlag.ALLOW_TEMPORARY_FILES
      • TensorFormat.DHWC
    • Added enums:
      • HardwareCompatibilityLevel
      • TempfileControlFlag
    • Added enum values:
      • UnaryOperation.ISINF
    • Deprecated enums:
      • PreviewFeature.FASTER_DYNAMIC_SHAPES
    • Deprecated properties:
      • IAlgorithmIOInfo.tensor_format
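
As referenced in the feature list above, the following is a minimal sketch that applies several of the new builder-config options together: optimization level, hardware compatibility, version compatibility, and auxiliary streams. The C++ HardwareCompatibilityLevel::kAMPERE_PLUS enumerator is inferred from the Python HardwareCompatibilityLevel.AMPERE_PLUS property listed above, the helper name is hypothetical, and the numeric values are illustrative:

    #include "NvInfer.h"
    using namespace nvinfer1;

    // Sketch: exercise the TensorRT 8.6 builder-config additions described above.
    void configureBuilder(IBuilderConfig& config)
    {
        // Higher levels search more tactics at the cost of longer build times.
        config.setBuilderOptimizationLevel(4);

        // Allow the engine to run on any NVIDIA Ampere or newer GPU.
        config.setHardwareCompatibilityLevel(HardwareCompatibilityLevel::kAMPERE_PLUS);

        // Make the engine loadable by other TensorRT minor versions within the same major version.
        config.setFlag(BuilderFlag::kVERSION_COMPATIBLE);

        // Let TensorRT run independent parts of the network on up to two extra streams.
        config.setMaxAuxStreams(2);
    }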

Deprecated API Lifetime

  • APIs deprecated in TensorRT 8.6 will be retained until at least 2/2024.
  • APIs deprecated in TensorRT 8.5 will be retained until at least 9/2023.
  • APIs deprecated in TensorRT 8.4 will be retained until at least 2/2023.
  • APIs deprecated before TensorRT 8.4 will be removed in TensorRT 9.0.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax; the mode is chosen automatically based on the shape of the input tensor:
    • the first mode triggers when all nonbatch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may result in small errors. Also, batch size greater than 1 is supported only for DLA 3.9.0 and later. Refer to DLA Supported Layers for more information.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as "Unable to load library: nvinfer_builder_resource.dll" if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • The DLA compiler is capable of removing identity transposes, but it cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshape). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes with an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless you merge the transposes together manually in the model definition in advance.
  • In explicitly quantized networks, a group convolution that has a Q/DQ pair before but no Q/DQ pair after is expected to run with INT8-IN-FP32-OUT mixed precision. However, on NVIDIA Hopper™ it may fall back to FP32-IN-FP32-OUT if the input channel count is small. This will be fixed in a future release.
  • On PowerPC platforms, samples that depend on TensorFlow, ONNX Runtime, and PyTorch are unable to run due to missing Python module dependencies. These frameworks have not been built for PowerPC and/or published to standard repositories.
  • TensorRT 8.6 adds nvinfer1::BuilderFlag::kFP8 and nvinfer1::DataType::kFP8 to the public API as preparation for the introduction of FP8 support in future TensorRT releases. Despite these additions, FP8 (8-bit floating point) is not supported by TensorRT, and attempting to use FP8 will result in an error or undefined behavior.
  • nvinfer1::UnaryOperation::kROUND or nvinfer1::UnaryOperation::kSIGN operations of IUnaryLayer are not supported in the implicit batch mode.
  • For networks containing normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset that contains the corresponding function ops, for example: opset 17 for LayerNormalization or opset 18 for GroupNormalization. Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers.

Deprecated and Removed Features

The following features are deprecated in TensorRT 8.6.1:
  • Support for CUDA Toolkit 10.2 has been dropped.
  • TensorRT 8.5.3 was the last release supporting NVIDIA Kepler (SM 3.x) and NVIDIA Maxwell (SM 5.x) devices. These devices are no longer supported in TensorRT 8.6. NVIDIA Pascal (SM 6.x) devices are deprecated in TensorRT 8.6.
  • The NvInferRuntimeCommon.h file is being deprecated starting in TensorRT 8.6.0, and will be dropped in TensorRT 9.0. Instead, automotive safety users should include NvInferSafeRuntime.h. All other users should include NvInferRuntime.h.
  • Removed the following samples:
    • C++ samples:
      • sampleSSD
      • sampleUffFasterRCNN
      • sampleUffMNIST
      • sampleUffMaskRCNN
      • sampleUffPluginV2Ext
      • sampleUffSSD
      • sampleMNIST
      • sampleFasterRCNN
      • sampleGoogleNet
      • sampleINT8
      • sampleMNISTAPI
    • Python samples:
      • uff_ssd
      • uff_custom_plugin
      • end_to_end_tensorflow_mnist
      • engine_refit_mnist
      • int8_caffe_mnist
    • introductory_parser_samples - the UFF and Caffe options were removed
  • The trtexec argument --buildOnly has been deprecated and was replaced by the argument --skipInference. The argument was renamed to better clarify the intention behind the argument.

Fixed Issues

  • Fixed an up to 20% performance variation between different engines built from the same network for some LSTM networks due to unstable tactic selections. The tactic selection stability has been improved in this release.
  • The r525 or later drivers contain the fix for an up to 11% performance variation for some LSTM networks during inference, depending on the order of CUDA stream creation on NVIDIA Turing GPUs.
  • Fixed the issue that TensorRT might output wrong results when there are GEMM, Conv, and MatMul ops followed by a Reshape op.
  • Improved the H100 performance for some ConvNets in TF32 precision.
  • Improved the H100 performance for some Transformers in FP16 precision.
  • Improved the H100 performance for some 3DUNets.
  • Fixed an up to 6% performance drop for OpenRoadNet networks in TF32 precision compared to TensorRT 8.4 on NVIDIA Ampere architecture GPUs.
  • Fixed an up to 5% performance drop for UNet networks in INT8 precision with explicit quantization on CUDA 11.x compared to CUDA 10.2 on NVIDIA Turing GPUs.
  • Fixed an up to 16% performance drop for LSTM networks in FP32 precision compared to TensorRT 8.4 on NVIDIA Pascal GPUs.
  • Fixed an up to 6% performance drop for ResNeXt-50 QAT networks in INT8, FP16, and FP32 precision at batch-size = 1 compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • H100 performance for some ConvNets containing depthwise convolutions (like QuartzNets and EfficientDet-D0) in INT8 precision was not fully optimized. This issue has been fixed in this release.
  • There was a ~12% performance drop on NVIDIA Ampere architecture GPUs for the BERT network on Windows systems. This issue has been fixed in this release.
  • There was an up to 17% performance regression for DeepASR networks at BS=1 on NVIDIA Turing GPUs. This issue has been fixed in this release.
  • There was an up to 6% performance drop for WaveRNN networks in FP16 precision compared to TensorRT 8.4 on CUDA 11.8 on NVIDIA Volta GPUs. This issue has been fixed in this release.
  • There was an up to 13% performance drop for Megatron networks in FP16 precision on V100 and T4 GPUs when disableExternalTacticSourcesForCore0805 was enabled. This issue has been fixed in this release.
  • Fixed the issue that for some QAT models, when FP16 is enabled and a foreign node is created, if a tensor is the output of the foreign node and also serves as input to another node inside the subgraph of the foreign node, TensorRT may report an error with the following message for the node:
    [W] [TRT] Skipping tactic 0x0000000000000000 due to Myelin error: Concat operation "XXX" has different types of operands.
  • Resolved an issue with printing Unicode escape sequences while writing JSON files. This fix addresses previous parsing issues with JSON files in the TREx tool.
  • The DqQFusion would sometimes generate a broken fused node with an empty scale parameter, which violated the subsequent PointWiseFusion’s assertion. This issue has been fixed in this release.
  • Explicit quantization on convolution with dynamic weights would fail to build on some platforms like Windows. This issue has been fixed in this release.
  • There was an up to 10% performance regression for WaveRNN on Turing GPUs in FP16 precision. This issue has been fixed in this release.
  • There was an up to 72% increase in GPU memory usage on H100 for various networks. This issue has been fixed in this release.
  • There was an up to 31% performance drop for DeepASR on Turing GPUs in FP16 precision compared to TensorRT 8.5.1. This issue has been fixed in this release.
  • There was an up to 12% performance drop for BERT on H100 in FP16 precision compared to TensorRT 8.5.0 EA. This issue has been fixed in this release.
  • There was an up to 11% performance regression compared to TensorRT 8.5 on LSTM networks in FP16 precision on NVIDIA Ampere GPUs. This issue has been fixed in this release.
  • The TensorRT engine might fail to build under QAT mode if paired Quantization and Dequantization layers were not adjacent to each other. This issue has been fixed in this release.
  • The computation that determines the required number of resize dimensions has been improved, resolving an issue with the missing quantized resize implementation, whose support is limited to 2 resize dimensions.
  • When using the algorithm selector API, the HWC1 and HWC4 DLA formats were both reported as TensorFormat::kDLA_HWC4. This issue has been fixed in this release.
  • For some networks, containing matrix multiplication operations on A100, using TF32 could cause accuracy degradation. Disabling TF32 was the workaround. This issue has been fixed in this release.
  • TensorRT engine building could fail if the users added an unnamed layer. This is because TensorRT could generate duplicated layer/tensor names for the unnamed layer. This issue has been fixed in this release.
  • HuggingFace demos could fail if the system CUDA version is older than the CUDA version used to build PyTorch. This is because the environment was using the cublasLT.so from the system CUDA installation instead of the one distributed by PyTorch. This issue has been fixed in this release.
  • There was a small accuracy drop for CortanaASR networks in TF32 precision. This issue has been fixed in this release.
  • Using the compute sanitizer racecheck tool from CUDA 12.0 could report read-after-write hazards in unusual scenarios. This issue has been fixed in this release.
  • There was a known issue with huge graphs that caused out-of-memory errors with specific input shapes even though a larger input shape could be run. This issue has been fixed in this release.
  • The Python sample yolov3_onnx had a known issue when installing the requirements with Python 3.10. The recommendation was to use a Python version < 3.10 when running the sample. This issue has been fixed in this release.
  • There was a known functionality issue when cross compiling TensorRT samples. The workaround was to set the TRT_LIB_DIR environment variable manually. This issue has been fixed in this release.
  • For networks such as Tacotron 2 decoder, where there was a Convolution operation within a loop body, TensorRT could potentially fail during compilation. This issue has been fixed in this release.
  • There was a known functional issue with thread safety when using multiple TensorRT builders concurrently. This issue has been fixed in this release.
  • TensorRT compiled for CUDA 11.4 could fail to compile a graph when there were GEMM ops followed by a gelu_erf op. This issue has been fixed in this release.
  • For transformer decoder-based models (such as GPT2) with sequence length as dynamic, TensorRT 8.5 required additional workspace (up to 2x) as compared to previous releases. This issue has been fixed in this release.
  • There was an up to 27% performance drop for BART compared to TensorRT 8.2 when running with both FP16 and INT8 precisions enabled on T4. This issue has been fixed in this release.
  • With the kFASTER_DYNAMIC_SHAPES_0805 preview feature enabled on the GPT style decoder models, there could be an up to 20% performance regression for odd sequence lengths only compared to TensorRT without the use of the preview feature. This issue has been fixed in this release.
  • The auto-tuner assumed that the number of indices returned by INonZeroLayer was half of the number of input elements. Thus, networks that depended on tighter assumptions for correctness could fail to build. This issue has been fixed in this release.
  • There was an up to 15% performance drop for the CTFM model on V100 compared to TensorRT 8.5.1. This issue has been fixed in this release.
  • Based on timing results, the tactic optimizer would sometimes choose a builder that cannot stride even when a builder that can stride was available for the layer. If the adjacent reformat node was eliminated in this case, the layer would produce outputs with incorrect strides. This issue has been fixed in this release.
  • For some QAT models, if convolution and pointwise fusion resulted in a multi-output layer with some output tensors quantized and others not, the building of the engine could fail with the following error message:
    [E] Error[2]: [optimizer.cpp::filterQDQFormats::4422] Error Code 2: Internal Error (Assertion !n->candidateRequirements.empty() failed. All of the candidates were removed, which points to the node being incorrectly marked as an int8 node.
    One workaround was to disable the kJIT_CONVOLUTIONS tactic source. This issue has been fixed in this release.

Known Issues

Functional
  • In some cases, when using the OBEY_PRECISION_CONSTRAINTS builder flag and the required type is set to FP32, the network can fail with a missing tactic due to an incorrect optimization converting the output of an operation to FP16. This can be resolved by removing the OBEY_PRECISION_CONSTRAINTS flag (see the sketch after this list).
  • TensorRT may fail on devices with small RAM and swap space. This can be resolved by ensuring the RAM and swap space is at least 7 times the size of the network. For example, at least 21 GB of combined CPU memory for a 3 GB network.
  • If an IShapeLayer is used to get the output shape of an INonZeroLayer, engine building will likely fail.
  • ONNX models containing a large number of "Pad" layers have a potential for error accumulation. This can result in accuracy differences in inference between TensorRT and ONNX runtime.
  • In rare cases, when using optimization level 5, you may see build failures with the error message “Skipping tactic 0x* due to exception Assertion index <= signedSize(shaders) failed”. A workaround is to use optimization level 4 instead.
  • If a one-dimensional INT8 input is used for a Unary or ElementWise operation, engine building may throw an internal error “Could not find any implementation for node”.
  • Multihead attention fusion might not happen and affect performance if the number of heads is small.
  • If an ONNX model contains a Range operator and its limit input is a data-dependent tensor, engine building will likely fail.
  • Hardware forward compatibility (HFC) is broken on L4T Concord for ViT, Swin-Transformers, and BERT networks in FP16 mode. A workaround is to only use FP32 mode on L4T Concord or turn off HFC.
  • Compute Sanitizer from CUDA Toolkit 12.0/12.1 may report a false alarm about invalid memory access in generatedNativePointwise kernels.
  • Using the compute sanitizer racecheck tool may cause the process to be terminated unexpectedly. The root cause is a wrong false alarm. The issue can be bypassed with --kernel-regex-exclude kns=scudnn_winograd.
  • If a network has a tensor of type bool with an implicitly data-dependent shape, engine building will likely fail.
  • When using hardware compatibility features, TensorRT can potentially fail while compiling transformer based networks such as BERT. This issue will be fixed in the 8.6.1 release.
  • There is a use-after-free issue in NVRTC that has been fixed in CUDA 12.1. When using NVRTC from CUDA 12.0 together with the TensorRT static library, you may encounter a crash in certain scenarios. Linking with the NVRTC and PTXJIT compiler from CUDA 12.1 or newer will resolve this issue.
  • Although the version compatible runtime is optimized for efficiency, it may result in slower performance than the full runtime in certain use cases. Most networks can expect no more than a 10% slowdown when using a version-compatible engine compared to a version-locked engine. However, in some cases, a larger performance drop may occur. For example:
    • When running ResNet50_v2 with QAT, there may be up to a 22% decrease in performance.
    • When running DynUNet in FP16 precision, there may be up to a 32% decrease in performance.
    • When running ONNX networks with InstanceNormalization operations, there may be up to a 50% decrease in performance.
  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. The recommendation to suppress the issues is to provide a Valgrind suppression file with the following contents when running the Valgrind memory leak check tool. Add the option --keep-debuginfo=yes to the Valgrind command line to suppress these errors.
    {
      Memory leak errors with dlopen.
       Memcheck:Leak
       match-leak-kinds: definite
       ...
       fun:*dlopen*
       ...
    }
    {
        Memory leak errors with nvrtc
        Memcheck:Leak
        match-leak-kinds: definite
        fun:malloc
        obj:*libnvrtc.so*
        ...
    }
    
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes, which quantize the failing layers.
  • TensorRT in FP16 mode does not perform cast operations correctly when only the output types are set, but not the layer precisions.
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9 or newer. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9 or newer.
  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
  • When using DLA, INT8 convolutions followed by FP16 layers may cause accuracy degradation. In such cases, either change the convolution to FP16 or the subsequent layer to INT8.
  • Using the compute sanitizer tool from CUDA 12.0 may report a cudaErrorIllegalInstruction error on Hopper GPUs in unusual scenarios. This can be ignored, and will be fixed in a future CUDA release.
  • Hardware compatible engines built with CUDA versions older than 11.5 may crash during inference when run on a GPU with a compute capability lower than that of the GPU where the engine was built. A workaround is to build on the GPU with the lowest compute capability.
  • When enabling the cuDNN tactic source manually, there is a potential memory leak from the cuDNN library. This issue will be fixed in a future cuDNN release.
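A minimal, untested sketch of the OBEY_PRECISION_CONSTRAINTS workaround described in the first item of this list; network construction and the rest of the build flow are omitted, and the variable names are placeholders.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    # Request strict precision first.
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
    # ... if the build fails with a missing-tactic error, drop the constraint and rebuild:
    config.clear_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)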
Performance
  • There is an up to 28% performance regression compared to TensorRT 8.5 on Transformer networks in FP16 precision on NVIDIA Volta GPUs, and up to 85% performance regression on NVIDIA Pascal GPUs. Disable the kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 preview flag as a workaround.
  • There may be higher peak GPU memory usage when building the engine on NVIDIA Ampere GPUs compared to TensorRT 8.5.
  • There is an up to 13% performance drop for the CortanaASR model on NVIDIA Ampere GPUs compared to TensorRT 8.5.
  • There is a known performance regression in the grouped deconvolution layer due to disabling cuDNN tactics. In TensorRT 8.6, performance can be recovered by unsetting nvinfer1::PreviewFeature::kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805. We will close the performance gap in a future release.
  • There is an up to 27% performance drop for the SegResNet model on Ampere GPUs compared to TensorRT 8.6 EA. This drop can be avoided by enabling the kVERSION_COMPATIBLE flag in the ONNX parser.
  • There is an up to 18% performance drop for the ShuffleNet model on A30/A40 compared to TensorRT 8.5.1.
  • There may be minor performance regressions when running ONNX models with InstanceNormalization operators in version compatible mode. Refer to the NVIDIA TensorRT Developer Guide for more information.
  • Convolution on a tensor with an implicitly data-dependent shape may run significantly slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.
  • For some Transformer models, including ViT, Swin-Transformer, and DETR, there is a performance drop in INT8 precision (including both explicit and implicit quantization) compared to FP16 precision.
  • There is an up to 30% performance regression for LSTM variants with dynamic shapes. This issue can be resolved by disabling the kFASTER_DYNAMIC_SHAPES_0805 preview feature in TensorRT 8.6.
  • There is a known issue on H100 that may lead to GPU hang when running TensorRT with high persistentCache usage. Limit the usage to 40% of L2 cache size as a workaround.
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is an up to 10% performance drop for the SegResNet network compared to TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere architecture GPUs due to a cuDNN regression in the InstanceNormalization plug-in. This will be fixed in a future TensorRT release. You can work around the regression by reverting the cuDNN version to cuDNN 8.2.1.
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to when running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows 32% drop when the network runs on DLA as compared to when the last SoftMax layer runs on a GPU.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • For transformer-based networks such as BERT and GPT, TensorRT can consume CPU memory up to 7 times the model size during compilation.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • H100 performance for some LSTMs in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.
  • There is an up to 6% performance drop for T5 networks in FP32 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs due to a functionality fix.
  • There is an up to 17% performance drop for LSTM on Windows in FP16 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • There is an up to 7% performance drop for Artifact Reduction networks involving Deconvolution ops in INT8 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • There is an up to 16% performance regression on GPT2, T5, and Temporal-Fusion Transformers on NVIDIA Turing GPUs in FP32 precision due to a necessary accuracy fix. To recover the performance, enable FP16 precision.
  • There is an up to 9% performance regression compared to TensorRT 8.5 on Yolov3 batch size 1 in FP32 precision on NVIDIA Ada Lovelace GPUs.
  • There is an up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.
  • There is an up to 13% performance regression compared to TensorRT 8.5 on GPT2 without kv-cache in FP16 precision when dynamic shapes are used on NVIDIA Volta and NVIDIA Ampere GPUs. Set the kFASTER_DYNAMIC_SHAPES_0805 preview flag to false as a workaround.
  • There is an up to 7% performance regression compared to TensorRT 8.5 on CortanaASR networks in FP16 precision on NVIDIA Volta GPUs. Disable the kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 preview flag as a workaround.
  • There is an up to 23% performance regression compared to TensorRT 8.5 on LSTMs in FP32 precision when dynamic shapes are used on NVIDIA Turing GPUs. Set the kFASTER_DYNAMIC_SHAPES_0805 preview flag to false as a workaround.
  • There is an up to 23% performance regression compared to TensorRT 8.5 on Temporal Fusion Transformers in FP32 precision on NVIDIA Turing and NVIDIA Ampere GPUs.
  • There is an up to 13% performance regression compared to TensorRT 8.5 on Multi-Layer Perceptron networks in FP16 precision on NVIDIA Ampere GPUs. Set the kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 preview flag to false as a workaround (see the sketch below).
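A minimal, untested sketch of toggling the preview features referenced in the workarounds above; network construction and the rest of the build flow are omitted.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    # Fall back to the TensorRT 8.5 dynamic-shape code paths.
    config.set_preview_feature(trt.PreviewFeature.FASTER_DYNAMIC_SHAPES_0805, False)
    # Re-enable cuDNN/cuBLAS/cuBLASLt as tactic sources.
    config.set_preview_feature(
        trt.PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805, False)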

2.2. TensorRT Release 8.6.0 Early Access (EA)

These are the TensorRT 8.6.0 Early Access (EA) Release Notes and are applicable to x86 Linux and Windows users. This release incorporates Arm®-based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

This EA release is for early testing and feedback. For production use of TensorRT, continue to use TensorRT 8.5.3.

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Announcements

  • In TensorRT 8.6, cuDNN, cuBLAS, and cuBLASLt tactic sources are turned off by default in builder profiling. TensorRT plans to remove the cuDNN, cuBLAS, and cuBLASLt dependency in future releases. If there are critical regressions, set the PreviewFeature flag kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 to false to re-enable cuDNN, cuBLAS, and cuBLASLt.
  • Stubbed static libraries for cuDNN, cuBLAS, and cuBLASLt are provided in $(TRT_LIB_DIR)/stubs. When statically linking TensorRT with no requirement for cuDNN, cuBLAS, or cuBLASLt, the stubbed library can be used to reduce CPU and GPU memory usage.
  • Debian and RPM packages are now using 4-component versioning instead of 3 to better identify differences between TensorRT builds.
  • The tar and zip filenames no longer contain the cuDNN version due to cuDNN no longer being a primary tactic source for TensorRT. cuDNN has a lesser impact on TensorRT’s performance than in previous releases where the cuDNN version was prominent in the package filename.

Key Features and Enhancements

This TensorRT release includes the following key features and enhancements.
  • CUDA 12.x is now supported starting with the TensorRT 8.6 release. CUDA 11.x builds from this release or a previous release are not compatible with libraries or applications compiled with CUDA 12.x and may lead to unexpected behavior.
  • Added support for hardware compatibility. This allows an engine built on one GPU architecture to work on GPUs of other architectures. This feature is only supported for NVIDIA Ampere and newer architectures. Note that enabling hardware compatibility may result in a degradation in latency/throughput, and is therefore intended to be used primarily to ease the process of upgrading to new hardware. A builder configuration sketch covering this and related 8.6 options follows the API change lists below.
  • Added the following layers:
    • IReverseSequence layer has been added to support the ReverseSequence operator in ONNX.
    • INormalization layer has been added to support InstanceNormalization, GroupNormalization, and LayerNormalization operations in ONNX.
  • A new nvinfer1::ICastLayer interface was introduced, which provides a conversion of the data type of the input tensor between FP32, FP16, INT32, INT8, UINT8, and BOOL. The ONNX parser was updated to use ICastLayer instead of IIdentityLayer to implement cast.
  • Introduced restricted runtime installation options when memory consumption is a high priority for deployment: lean or dispatch runtime mode.
    • Lean runtime installation: This installation is significantly smaller than the full installation and allows you to load and run engines that were built with a version compatible builder flag. This installation will not provide the functionality to build a TensorRT plan file.
    • Dispatch runtime installation: This installation allows for deployments with the minimum memory consumption and allows you to load and run engines that were built with a version compatible builder flag and include the lean runtime. This installation does not provide the functionality to build a TensorRT plan file.
    • Added version compatibility support to trtexec. Refer to the NVIDIA TensorRT Developer Guide for more information on the trtexec flags.

    For more information, refer to the NVIDIA TensorRT Installation Guide.

  • By default, TensorRT engines are compatible only with the version of TensorRT with which they are built. Starting in TensorRT 8.6, with the appropriate build-time configuration, engines can be built that are compatible with other TensorRT minor versions within a major version. For more information, refer to Version Compatibility.
    Note: Use of certain version compatibility features requires explicit configuration of the TensorRT runtime to allow host executable code. For more information, refer to Runtime Options.
  • Implemented the following performance enhancements:
    • Improved the Multi-Head Attention (MHA) fusions, speeding up Transformer-like networks.
    • Improved the engine building time and the performance of Transformer-like networks with dynamic shapes.
    • Avoided unnecessary cuStreamSynchronize() calls in enqueueV2() and enqueueV3() when running LSTMs or Transformer-like networks.
    • Improved the performance of various networks on NVIDIA Hopper GPUs.
  • Added an optimization level builder flag, which allows TensorRT to spend more engine building time searching for better tactics, or to build the engine much faster by reducing the searching scope. For more information, refer to Builder Optimization Level.
  • Added the multi-stream APIs, which allows users to control how many streams TensorRT can use to run different parts of the network in parallel, potentially resulting in better performance. For more information, refer to Within-Inference Multi-Streaming.
  • Experimental. Extended DLA so that IElementWiseLayer now supports the equal operation (ElementWiseOperation::kEQUAL). This is the first ElementWise logical operation that DLA supports. There are several restrictions and requirements imposed when adopting this operation in DLA. Refer to DLA Supported Layers for more information.
    • One such requirement is that you must explicitly set the device type of the ElementWise equal layer to DLA. To enable this with trtexec, use the new --layerDeviceTypes flag to explicitly specify the device type for individual layers. Refer to Commonly Used Command-line Flags for more information on the new flag.
  • Added a new sample called onnx_custom_plugin, which demonstrates how to use plugins written in C++ to run TensorRT on ONNX models with custom or unsupported layers. Refer to the GitHub: /onnx_custom_plugin/README.md file for detailed information about how this sample works, sample code, and step-by-step instructions on how to run and verify its output.
  • Made the following C++ API changes:
    • Added classes:
      • IReverseSequenceLayer
      • INormalizationLayer
      • ILoggerFinder
    • Added macros:
      • NV_TENSORRT_RELEASE_TYPE
      • NV_TENSORRT_RELEASE_TYPE_EARLY_ACCESS
      • NV_TENSORRT_RELEASE_TYPE_RELEASE_CANDIDATE
      • NV_TENSORRT_RELEASE_TYPE_GENERAL_AVAILABILITY
    • Added functions:
      • getBuilderSafePluginRegistry()
      • IAlgorithmIOInfo::getVectorizedDim()
      • IAlgorithmIOInfo::getComponentsPerElement()
      • IBuilderConfig::setBuilderOptimizationLevel()
      • IBuilderConfig::getBuilderOptimizationLevel()
      • IBuilderConfig::setHardwareCompatibilityLevel()
      • IBuilderConfig::getHardwareCompatibilityLevel()
      • IBuilderConfig::setPluginsToSerialize()
      • IBuilderConfig::getPluginToSerialize()
      • IBuilderConfig::getNbPluginsToSerialize()
      • IBuilderConfig::getMaxAuxStreams()
      • IBuilderConfig::setMaxAuxStreams()
      • IBuilder::getPluginRegistry()
      • ICudaEngine::getHardwareCompatibilityLevel()
      • ICudaEngine::getNbAuxStreams()
      • IExecutionContext::setAuxStreams()
      • ILayer::setMetadata()
      • ILayer::getMetadata()
      • INetworkDefinition::addCast()
      • INetworkDefinition::addNormalization()
      • INetworkDefinition::addReverseSequence()
      • INetworkDefinition::getBuilder()
      • IPluginRegistry::isParentSearchEnabled()
      • IPluginRegistry::setParentSearchEnabled()
      • IPluginRegistry::loadLibrary()
      • IPluginRegistry::deregisterLibrary()
      • IRuntime::setTemporaryDirectory()
      • IRuntime::getTemporaryDirectory()
      • IRuntime::setTempfileControlFlags()
      • IRuntime::getTempfileControlFlags()
      • IRuntime::getPluginRegistry()
      • IRuntime::setPluginRegistryParent()
      • IRuntime::loadRuntime()
      • IRuntime::setEngineHostCodeAllowed()
      • IRuntime::getEngineHostCodeAllowed()
      • ITopKLayer::setInput()
    • Added enums:
      • HardwareCompatibilityLevel
      • TempfileControlFlag
    • Added enum values:
      • BuilderFlag::kVERSION_COMPATIBLE
      • BuilderFlag::kEXCLUDE_LEAN_RUNTIME
      • DataType::kFP8
      • LayerType::kREVERSE_SEQUENCE
      • LayerType::kNORMALIZATION
      • LayerType::kCAST
      • MemoryPoolType::kTACTIC_DRAM
      • PreviewFeature::kPROFILE_SHARING_0806
      • TensorFormat::kDHWC
      • UnaryOperation::kISINF
    • Deprecated enums:
      • PreviewFeature::kFASTER_DYNAMIC_SHAPES
    • Deprecated macros:
      • NV_TENSORRT_SONAME_MAJOR
      • NV_TENSORRT_SONAME_MINOR
      • NV_TENSORRT_SONAME_PATCH
    • Deprecated functions:
      • FieldMap::FieldMap()
      • IAlgorithmIOInfo::getTensorFormat()
  • Made the following Python API changes:
    • Added classes:
      • ICastLayer
      • IReverseSequenceLayer
      • INormalizationLayer
    • Added functions:
      • IExecutionContext.set_aux_streams()
      • INetworkDefinition.add_cast()
      • INetworkDefinition.add_normalization()
      • INetworkDefinition.add_reverse_sequence()
      • IPluginRegistry.load_library()
      • IPluginRegistry.deregister_library()
      • IRuntime.load_runtime()
      • ITopKLayer.set_input()
    • Added properties:
      • BuilderFlag.EXCLUDE_LEAN_RUNTIME
      • BuilderFlag.FP8
      • BuilderFlag.VERSION_COMPATIBLE
      • DataType.FP8
      • HardwareCompatibilityLevel.AMPERE_PLUS
      • HardwareCompatibilityLevel.NONE
      • IAlgorithmIOInfo.components_per_element
      • IAlgorithmIOInfo.vectorized_dim
      • IBuilder.get_plugin_registry
      • IBuilderConfig.builder_optimization_level
      • IBuilderConfig.hardware_compatibility_level
      • IBuilderConfig.max_aux_streams
      • IBuilderConfig.plugins_to_serialize
      • ICudaEngine.hardware_compatibility_level
      • ICudaEngine.num_aux_streams
      • ILayer.metadata
      • INetworkDefinition.builder
      • INormalizationLayer.axes
      • INormalizationLayer.compute_precision
      • INormalizationLayer.epsilon
      • INormalizationLayer.num_groups
      • IPluginRegistry.parent_search_enabled
      • IReverseSequenceLayer.batch_axis
      • IReverseSequenceLayer.sequence_axis
      • IRuntime.engine_host_code_allowed
      • IRuntime.load_runtime
      • IRuntime.tempfile_control_flags
      • IRuntime.temporary_directory
      • LayerType.CAST
      • LayerType.NORMALIZATION
      • LayerType.REVERSE_SEQUENCE
      • MemoryPoolType.TACTIC_DRAM
      • PreviewFeature.PROFILE_SHARING_0806
      • TempfileControlFlag.ALLOW_IN_MEMORY_FILES
      • TempfileControlFlag.ALLOW_TEMPORARY_FILES
      • TensorFormat.DHWC
    • Added enums:
      • HardwareCompatibilityLevel
      • TempfileControlFlag
    • Added enum values:
      • UnaryOperation.ISINF
    • Deprecated enums:
      • PreviewFeature.FASTER_DYNAMIC_SHAPES
    • Deprecated properties:
      • IAlgorithmIOInfo.tensor_format
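The following is a minimal, untested sketch combining several of the 8.6 builder options described above (hardware compatibility, version compatibility, optimization level, and auxiliary streams) through the Python API; the values are illustrative only and network construction is omitted.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    # Build an engine that can run on Ampere and newer GPUs
    # (may cost some latency/throughput, as noted above).
    config.hardware_compatibility_level = trt.HardwareCompatibilityLevel.AMPERE_PLUS
    # Allow the engine to be loaded by later TensorRT minor versions.
    config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)
    # Optimization levels range from 0 (fastest build) to 5 (widest tactic search).
    config.builder_optimization_level = 4
    # Let TensorRT use up to two auxiliary streams per inference stream.
    config.max_aux_streams = 2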

Deprecated API Lifetime

  • APIs deprecated in TensorRT 8.6 will be retained until at least 2/2024.
  • APIs deprecated in TensorRT 8.5 will be retained until at least 9/2023.
  • APIs deprecated in TensorRT 8.4 will be retained until at least 2/2023.
  • APIs deprecated before TensorRT 8.4 will be removed in TensorRT 9.0.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax; the mode is chosen automatically based on the shape of the input tensor, where:
    • the first mode triggers when all nonbatch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may introduce small errors. Also, batch size greater than 1 is supported only for DLA 3.9.0 and later. Refer to DLA Supported Layers for more information.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as, "Unable to load library: nvinfer_builder_resource.dll", if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • The DLA compiler is capable of removing identity transposes, but it cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshape). For example, a TensorRT IShuffleLayer consisting of two non-trivial transposes with an identity reshape in between is translated into two consecutive DLA transpose layers, unless you merge the transposes together manually in the model definition in advance.
  • In explicitly quantized networks, a group convolution that has a Q/DQ pair before but no Q/DQ pair after is expected to run with INT8-IN-FP32-OUT mixed precision. However, on NVIDIA Hopper™ it may fall back to FP32-IN-FP32-OUT if the input channel count is small. This will be fixed in a future release.
  • On PowerPC platforms, samples that depend on TensorFlow, ONNX Runtime, and PyTorch are unable to run due to missing Python module dependencies. These frameworks have not been built for PowerPC and/or published to standard repositories.
  • TensorRT 8.6 adds nvinfer1::BuilderFlag::kFP8 and nvinfer1::DataType::kFP8 to the public API as preparation for the introduction of FP8 support in future TensorRT releases. Despite these additions, FP8 (8-bit floating point) is not supported by TensorRT and attempting to use FP8 will result in an error or undefined behavior.
  • nvinfer1::UnaryOperation::kROUND or nvinfer1::UnaryOperation::kSIGN operations of IUnaryLayer are not supported in the implicit batch mode.

Deprecated and Removed Features

The following features are deprecated in TensorRT 8.6.0:
  • Support for CUDA Toolkit 10.2 has been dropped.
  • TensorRT 8.5.3 was the last release supporting NVIDIA Kepler (SM 3.x) and NVIDIA Maxwell (SM 5.x) devices. These devices are no longer supported in TensorRT 8.6. NVIDIA Pascal (SM 6.x) devices are deprecated in TensorRT 8.6.
  • The NvInferRuntimeCommon.h file is being deprecated starting in TensorRT 8.6.0, and will be dropped in TensorRT 9.0. Instead, automotive safety users should include NvInferSafeRuntime.h. All other users should include NvInferRuntime.h.
  • Removed the following samples:
    • C++ samples:
      • sampleSSD
      • sampleUffFasterRCNN
      • sampleUffMNIST
      • sampleUffMaskRCNN
      • sampleUffPluginV2Ext
      • sampleUffSSD
      • sampleMNIST
      • sampleFasterRCNN
      • sampleGoogleNet
      • sampleINT8
      • sampleMNISTAPI
    • Python samples:
      • uff_ssd
      • uff_custom_plugin
      • end_to_end_tensorflow_mnist
      • engine_refit_mnist
      • int8_caffe_mnist
    • introductory_parser_samples - the UFF and Caffe options have been removed
  • The trtexec argument --buildOnly has been deprecated and replaced by --skipInference; the new name better reflects the argument’s intent.

Fixed Issues

  • Fixed an up to 20% performance variation between different engines built from the same network for some LSTM networks due to unstable tactic selections. The tactic selection stability has been improved in this release.
  • The r525 or later drivers contain the fix for an up to 11% performance variation for some LSTM networks during inference, depending on the order of CUDA stream creation on NVIDIA Turing GPUs.
  • Fixed the issue that TensorRT might output wrong results when there are GEMM, Conv, and MatMul ops followed by a Reshape op.
  • Improved the H100 performance for some ConvNets in TF32 precision.
  • Improved the H100 performance for some Transformers in FP16 precision.
  • Improved the H100 performance for some 3DUNets.
  • Fixed an up to 6% performance drop for OpenRoadNet networks in TF32 precision compared to TensorRT 8.4 on NVIDIA Ampere architecture GPUs.
  • Fixed an up to 5% performance drop for UNet networks in INT8 precision with explicit quantization on CUDA 11.x compared to CUDA 10.2 on NVIDIA Turing GPUs.
  • Fixed an up to 16% performance drop for LSTM networks in FP32 precision compared to TensorRT 8.4 on NVIDIA Pascal GPUs.
  • Fixed an up to 6% performance drop for ResNeXt-50 QAT networks in INT8, FP16, and FP32 precision at batch-size = 1 compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • H100 performance for some ConvNets containing depthwise convolutions (like QuartzNets and EfficientDet-D0) in INT8 precision was not fully optimized. This issue has been fixed in this release.
  • There was a ~12% performance drop on NVIDIA Ampere architecture GPUs for the BERT network on Windows systems. This issue has been fixed in this release.
  • There was an up to 17% performance regression for DeepASR networks at BS=1 on NVIDIA Turing GPUs. This issue has been fixed in this release.
  • There was an up to 6% performance drop for WaveRNN networks in FP16 precision compared to TensorRT 8.4 on CUDA 11.8 on NVIDIA Volta GPUs. This issue has been fixed in this release.
  • There was an up to 13% performance drop for Megatron networks in FP16 precision on V100 and T4 GPUs when disableExternalTacticSourcesForCore0805 was enabled. This issue has been fixed in this release.
  • There was an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone. This issue has been fixed in this release.
  • Fixed the issue that for some QAT models, when FP16 is enabled and a foreign node is created, if a tensor is the output of the foreign node and also serves as input to another node inside the subgraph of the foreign node, TensorRT may report an error with the following message for the node:
    [W] [TRT] Skipping tactic 0x0000000000000000 due to Myelin error: Concat operation "XXX" has different types of operands.

Known Issues

Functional
  • There is an issue with printing Unicode escape sequences while writing JSON files, which can cause parsing issues with those JSON files in the TREx tool.
  • The TensorRT engine might fail to build under QAT mode if paired Quantization and Dequantization layers are not adjacent to each other.
  • TensorRT engine building may fail if the users add an unnamed layer. This is because TensorRT may generate duplicated layer/tensor names for the unnamed layer.
  • The DqQFusion can generate a broken fused node with an empty scale parameter, which violates the subsequent PointWiseFusion’s assertion.
  • Based on timing results, the tactic optimizer may choose a builder that cannot stride even when a builder that can stride is available for the layer. If the adjacent reformat node is eliminated in this case, the layer will produce outputs with incorrect strides.
  • HuggingFace demos can fail if the system CUDA version is older than the CUDA version used to build PyTorch. This is because the environment is using the cublasLT.so from the system CUDA installation instead of the one distributed by PyTorch.
  • There is a small accuracy drop for CortanaASR networks in TF32 precision. This will be fixed in the TensorRT 8.6.1 release.
  • The computation that determines the required number of resize dimensions has been improved, potentially resolving an issue with the missing quantized resize implementation, whose support is limited to 2 resize dimensions.
  • Using the compute sanitizer racecheck tool from CUDA 12.0 may report read-after-write hazard in unusual scenarios. This can be ignored and will be fixed in a future release.
  • Using the compute sanitizer racecheck tool may cause the process to be terminated unexpectedly. The root cause is a wrong false alarm. The issue can be bypassed with --kernel-regex-exclude kns=scudnn_winograd.
  • If a network has a tensor of type bool with an implicitly data-dependent shape, engine building will likely fail.
  • For networks such as Tacotron 2 decoder, where there is a Convolution operation within a loop body, TensorRT can potentially fail during compilation. This issue will be fixed in the 8.6.1 release.
  • There is a known functional issue with thread safety when using multiple TensorRT builders concurrently. This issue will be fixed in the 8.6.1 release.
  • When using hardware compatibility features, TensorRT can potentially fail while compiling transformer based networks such as BERT. This issue will be fixed in the 8.6.1 release.
  • There is a use-after-free issue when using the NVRTC included with CUDA 12.0. You may encounter a crash in unusual scenarios. This issue will be fixed in the 8.6.1 release.
  • Although the version compatible runtime is optimized for efficiency, it may result in slower performance than the full runtime in certain use cases. Most networks can expect no more than a 10% slowdown when using a version-compatible engine compared to a version-locked engine. However, in some cases, a larger performance drop may occur. For example:
    • When running ResNet50_v2 with QAT, there may be up to a 22% decrease in performance.
    • When running DynUNet in FP16 precision, there may be up to a 32% decrease in performance.
    • When running ONNX networks with InstanceNormalization operations, there may be up to a 50% decrease in performance.
  • TensorRT compiled for CUDA 11.4 may fail to compile a graph when there are GEMM ops followed by a gelu_erf op.
  • There is a known issue with huge graphs that cause out of memory errors with specific input shapes even though a larger input shape can be run.
  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. The recommendation to suppress the issues is to provide a Valgrind suppression file with the following contents when running the Valgrind memory leak check tool. Add the option --keep-debuginfo=yes to the Valgrind command line to suppress these errors.
    {
      Memory leak errors with dlopen.
       Memcheck:Leak
       match-leak-kinds: definite
       ...
       fun:*dlopen*
       ...
    }
    {
        Memory leak errors with nvrtc
        Memcheck:Leak
        match-leak-kinds: definite
        fun:malloc
        obj:*libnvrtc.so*
        ...
    }
    
  • The Python sample yolov3_onnx has a known issue when installing the requirements with Python 3.10. The recommendation is to use a Python version < 3.10 when running the sample.
  • The auto-tuner assumes that the number of indices returned by INonZeroLayer is half of the number of input elements. Thus, networks that depend on tighter assumptions for correctness may fail to build.
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes, which quantize the failing layers.
  • TensorRT in FP16 mode does not perform cast operations correctly when only the output types are set, but not the layer precisions.
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9 or newer. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9 or newer.
  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
  • When using DLA, INT8 convolutions followed by FP16 layers may cause accuracy degradation. In such cases, either change the convolution to FP16 or the subsequent layer to INT8.
  • When using the algorithm selector API, the HWC1 and HWC4 DLA formats are both reported as TensorFormat::kDLA_HWC4.
  • For transformer decoder-based models (such as GPT2) with sequence length as dynamic, TensorRT 8.5 requires additional workspace (up to 2x) as compared to previous releases.
  • For some QAT models, if convolution and pointwise fusion results in a multi-output layer with some output tensors quantized and others not, the building of the engine may fail with the following error message:
    [E] Error[2]: [optimizer.cpp::filterQDQFormats::4422] Error Code 2: Internal Error (Assertion !n->candidateRequirements.empty() failed. All of the candidates were removed, which points to the node being incorrectly marked as an int8 node.
    One workaround is to disable the kJIT_CONVOLUTIONS tactic source (see the sketch after this list).
  • For some networks containing matrix multiplication operations on A100, using TF32 may cause accuracy degradation. Try disabling TF32 as a workaround.
  • Using the compute sanitizer tool from CUDA 12.0 may report a cudaErrorIllegalInstruction error on Hopper GPUs in unusual scenarios. This can be ignored, and will be fixed in a future CUDA release.
  • There is a known functionality issue when cross compiling TensorRT samples. Try setting the TRT_LIB_DIR environment variable manually as a workaround.
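A minimal, untested sketch of the kJIT_CONVOLUTIONS workaround referenced above: rebuild the tactic-source bitmask with JIT_CONVOLUTIONS cleared. Network construction and the rest of the build flow are omitted.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    sources = config.get_tactic_sources()
    sources &= ~(1 << int(trt.TacticSource.JIT_CONVOLUTIONS))  # clear the JIT convolutions bit
    config.set_tactic_sources(sources)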
Performance
  • Explicit quantization on convolution with dynamic weights will fail to build on some platforms like Windows. This issue will be addressed in a future release.
  • There is an up to 10% performance regression for WaveRNN on Turing GPUs in FP16 precision.
  • There may be minor performance regressions when running ONNX models with InstanceNormalization operators in version compatible mode. Refer to the NVIDIA TensorRT Developer Guide for more information.
  • Convolution on a tensor with an implicitly data-dependent shape may run significantly slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.
  • For some Transformer models, including ViT, Swin-Transformer, and DETR, there is a performance drop in INT8 precision (including both explicit and implicit quantization) compared to FP16 precision.
  • There is an up to 15% performance drop for the CTFM model on V100 compared to TensorRT 8.5.1.
  • There is an up to 72% increase in GPU memory usage on H100 for various networks.
  • There is an up to 30% performance regression for LSTM variants with dynamic shapes. This issue can be resolved by disabling the kFASTER_DYNAMIC_SHAPES_0805 preview feature in TensorRT 8.6.
  • There is a known issue on H100 that may lead to GPU hang when running TensorRT with high persistentCache usage. Limit the usage to 40% of L2 cache size as a workaround.
  • There is an up to 31% performance drop for DeepASR on Turing GPUs in FP16 precision compared to TensorRT 8.5.1.
  • There is an up to 12% performance drop for BERT on H100 in FP16 precision compared to TensorRT 8.5.0 EA.
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is an up to 27% performance drop for BART compared to TensorRT 8.2 when running with both FP16 and INT8 precisions enabled on T4. This performance drop can be fixed by disabling the INT8 precision flag.
  • There is an up to 10% performance drop for the SegResNet network compared to TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere architecture GPUs due to a cuDNN regression in the InstanceNormalization plug-in. This will be fixed in a future TensorRT release. You can work around the regression by reverting the cuDNN version to cuDNN 8.2.1.
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to when running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows 32% drop when the network runs on DLA as compared to when the last SoftMax layer runs on a GPU.
  • Due to the difference in DLA hardware specification between NVIDIA Orin and Xavier, a relative increase in latency is expected when running DLA FP16 operations involving convolution (which includes deconvolution, fully-connected, and concat) on NVIDIA Orin as compared to running on Xavier. At the same DLA clocks and memory bandwidth, INT8 convolution operations on NVIDIA Orin are expected to be about 4x faster than on Xavier, whereas FP16 convolution operations on NVIDIA Orin are expected to be about 40% slower than on Xavier.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • For transformer-based networks such as BERT and GPT, TensorRT can consume CPU memory up to 10 times the model size during compilation.
  • On Xavier, DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • H100 performance for some LSTMs in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.
  • There is an up to 6% performance drop for T5 networks in FP32 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs due to a functionality fix.
  • There is an up to 17% performance drop for LSTM on Windows in FP16 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • There is an up to 7% performance drop for Artifact Reduction networks involving Deconvolution ops in INT8 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • With the kFASTER_DYNAMIC_SHAPES_0805 preview feature enabled on the GPT style decoder models, there can be an up to 20% performance regression for odd sequence lengths only compared to TensorRT without the use of the preview feature.
  • There is an up to 16% performance regression on GPT2, T5, and Temporal-Fusion Transformers on NVIDIA Turing GPUs in FP32 precision due to a necessary accuracy fix. To recover the performance, enable FP16 precision.
  • There is an up to 9% performance regression compared to TensorRT 8.5 on Yolov3 batch size 1 in FP32 precision on NVIDIA Ada Lovelace GPUs.
  • There is an up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.

2.3. TensorRT Release 8.5.3

These are the TensorRT 8.5.3 Release Notes and are applicable to x86 Linux, Windows, and JetPack users. This release incorporates Arm® based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
  • APIs deprecated in TensorRT 8.4 will be retained until at least 2/2023.
  • APIs deprecated in TensorRT 8.5 will be retained until at least 9/2023.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax; the mode is chosen automatically based on the shape of the input tensor, where:
    • the first mode triggers when all nonbatch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may introduce small errors. Also, batch size greater than 1 is supported only for DLA 3.9.0 and later. Refer to DLA Supported Layers for more information.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as, "Unable to load library: nvinfer_builder_resource.dll", if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the CUDA 11.x build with a CUDA 11.1 or later driver (r455 or later).
  • The DLA compiler is capable of removing identity transposes, but it cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshape). For example, a TensorRT IShuffleLayer consisting of two non-trivial transposes with an identity reshape in between is translated into two consecutive DLA transpose layers, unless you merge the transposes together manually in the model definition in advance.
  • In QAT networks, a group convolution that has a Q/DQ pair before but no Q/DQ pair after is expected to run with INT8-IN-FP32-OUT mixed precision. However, on NVIDIA Hopper™ architecture GPUs, the required GPU kernels may be missing and the operation may fall back to FP32-IN-FP32-OUT if the input channel count is small. This will be fixed in a future release.
  • On PowerPC platforms, samples that depend on TensorFlow, ONNX Runtime, and PyTorch are unable to run due to missing Python module dependencies. These frameworks have not been built for PowerPC and/or published to standard repositories.

Deprecated and Removed Features

The following features are deprecated in TensorRT 8.5.3:
  • TensorRT 8.5.3 will be the last release supporting NVIDIA Kepler (SM 3.x) and NVIDIA Maxwell (SM 5.x) devices. These devices will no longer be supported in TensorRT 8.6. NVIDIA Pascal (SM 6.x) devices will be deprecated in TensorRT 8.6.
  • In the next TensorRT release, CUDA Toolkit 10.2 support will be dropped.

Fixed Issues

  • For INT8 fused MHA plugins, support for sequence length 64 and 96 has been added.
  • There was an issue on the PyTorch container where some ONNX models would fail with the error message SSA validation FAIL. This issue has now been fixed.
  • When using IExecutionContext::enqueueV3, a non-null address must be set for every input tensor, using IExecutionContext::setInputTensorAddress or IExecutionContext::setTensorAddress, even if nothing is computed from the input tensor (see the sketch after this list).
  • ISliceLayer with SampleMode::kWRAP sometimes caused engine build failures when one of the strides was 0. This issue has been fixed.
  • There was an issue when a network used implicit batch and was captured using cudaGraph. This issue has now been fixed.
  • When the PreviewFeature kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 was used, some plugins that use cuBLAS reported a CUBLAS_STATUS_NOT_INITIALIZED error (with CUDA version 11.8). This issue has now been fixed.
  • For some encoder-based transformer networks, if there was a forced precision on some layers, TensorRT reported the error (Mismatched type for tensor) during compilation. This issue has now been fixed.
  • For some networks with branches, if there was a forced precision on some layers on one side, TensorRT reported the error Mismatched type for tensor during compilation. This issue has now been fixed.
  • An assertion was triggered when multiple profiles were utilized and the profiles caused different optimizations to occur, resulting in an error about the number of slots being insufficient. This has been fixed by properly initializing the slots used internally.
  • When a Matrix Multiply horizontal fusion pass fused two layers with bias, the bias was not handled correctly, which led to accuracy issues. This issue has now been fixed.
  • There was an accuracy issue when a network contains an Exp operator and the Exp operator has a constant input. This issue has now been fixed.
  • There was an up to 16% performance drop for LSTM networks in FP32 precision compared to TensorRT 8.4 on Pascal GPUs. This issue has now been fixed.
  • TensorRT might output wrong results when there were GEMM/Conv/MatMul ops followed by a Reshape op. This issue has now been fixed.
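The following is a minimal, untested sketch of the enqueueV3 requirement noted above, using the Python equivalents (set_tensor_address and execute_async_v3). Engine creation, device buffer allocation, and the CUDA stream are assumed to exist elsewhere; device_buffers is a hypothetical mapping from tensor names to device pointers.

    import tensorrt as trt

    def run(engine: trt.ICudaEngine, device_buffers: dict, stream_handle: int):
        context = engine.create_execution_context()
        for i in range(engine.num_io_tensors):
            name = engine.get_tensor_name(i)
            # Every I/O tensor needs a non-null address, even inputs that
            # contribute nothing to the computed outputs.
            context.set_tensor_address(name, device_buffers[name])
        context.execute_async_v3(stream_handle)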

Announcements

  • In the next TensorRT release, cuDNN, cuBLAS, and cuBLASLt tactic sources will be turned off by default in builder profiling. TensorRT plans to remove the cuDNN, cuBLAS, and cuBLASLt dependency in future releases. Use the PreviewFeature flag kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 to evaluate the functional and performance impact of disabling cuBLAS and cuDNN and report back to TensorRT if there are critical regressions in your use cases.
  • TensorRT Python wheel files before TensorRT 8.5, such as TensorRT 8.4, were published to the NGC PyPI repo. Starting with TensorRT 8.5, Python wheels will instead be published to upstream PyPI. This will make it easier to install TensorRT because it requires no prerequisite steps. Also, the name of the Python package for installation has changed from nvidia-tensorrt to just tensorrt.
  • The C++ and Python API documentation in previous releases was included inside the tar file packaging. This release no longer bundles the documentation inside the tar file since the online documentation can be updated after the release, which avoids the mistakes found in stale documentation inside the packages.

Known Issues

Functional
  • TensorRT compiled for CUDA 11.4 may fail to compile a graph when there are GEMM ops followed by a gelu_erf op.
  • There is a known issue with huge graphs that cause out of memory errors with specific input shapes even though a larger input shape can be run.
  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. The recommendation to suppress the issues is to provide a Valgrind suppression file with the following contents when running the Valgrind memory leak check tool. Add the option --keep-debuginfo=yes to the Valgrind command line to suppress these errors.
    {
      Memory leak errors with dlopen.
       Memcheck:Leak
       match-leak-kinds: definite
       ...
       fun:*dlopen*
       ...
    }
    {
        Memory leak errors with nvrtc
        Memcheck:Leak
        match-leak-kinds: definite
        fun:malloc
        obj:*libnvrtc.so*
        ...
    }
    
  • The Python sample yolov3_onnx has a known issue when installing the requirements with Python 3.10. The recommendation is to use a Python version < 3.10 when running the sample.
  • The auto-tuner assumes that the number of indices returned by INonZeroLayer is half of the number of input elements. Thus, networks that depend on tighter assumptions for correctness may fail to build.
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.
  • One of the deconvolution algorithms sourced from cuDNN exhibits non-deterministic execution. Disabling cuDNN tactics will prevent this algorithm from being chosen (refer to IBuilderConfig::setTacticSources; a short sketch appears after this list).
  • TensorRT in FP16 mode does not perform cast operations correctly when only the output types are set, but not the layer precisions.
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9 or newer. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9 or newer.
  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
  • When using DLA, an elementwise, unary, or activation layer immediately followed by a scale layer may lead to accuracy degradation in INT8 mode. Note that this is a pre-existing issue also found in previous releases rather than a regression.
  • When using DLA, INT8 convolutions followed by FP16 layers may cause accuracy degradation. In such cases, either change the convolution to FP16 or the subsequent layer to INT8.
  • When using the algorithm selector API, the HWC1 and HWC4 DLA formats are both reported as TensorFormat::kDLA_HWC4.
  • For transformer decoder based models (such as GPT2) with sequence length as dynamic, TensorRT 8.5 requires additional workspace (up to 2x) as compared to previous releases.
  • For some QAT models, if convolution and pointwise fusion results in a multi-output layer with some output tensors quantized and others not, the building of the engine may fail with the following error message:
    [E] Error[2]: [optimizer.cpp::filterQDQFormats::4422] Error Code 2: Internal Error (Assertion !n->candidateRequirements.empty() failed. All of the candidates were removed, which points to the node being incorrectly marked as an int8 node.
    One workaround is to disable the kJIT_CONVOLUTIONS tactic source.
  • For some QAT models, when FP16 is enabled and a foreign node is created, if a tensor is the output of the foreign node and also serves as input to another node inside the subgraph of the foreign node, TensorRT may report an error with the following message for the node:
    [W] [TRT] Skipping tactic 0x0000000000000000 due to Myelin error: Concat operation "XXX" has different types of operands.
    One workaround is to insert a cast node between the tensor and the node inside the foreign node.
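For the non-deterministic cuDNN deconvolution issue noted above, the following sketch (the builder configuration is an assumption) shows one way to mask cuDNN out of the allowed tactic sources so that the algorithm cannot be selected.
    // Sketch: remove cuDNN from the allowed tactic sources so the non-deterministic
    // deconvolution algorithm sourced from cuDNN is never chosen.
    // "config" is assumed to be an existing IBuilderConfig.
    #include "NvInfer.h"

    void disableCudnnTactics(nvinfer1::IBuilderConfig& config)
    {
        nvinfer1::TacticSources sources = config.getTacticSources();
        sources &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUDNN));
        config.setTacticSources(sources);
    }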
Performance
  • There is a ~12% performance drop on NVIDIA Ampere architecture GPUs for the BERT network on Windows systems.
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is an up to 22% performance drop for Jasper networks compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta or NVIDIA Turing GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 5% performance drop for the InceptionV4 network compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 27% performance drop for BART compared to TensorRT 8.2 when running with both FP16 and INT8 precisions enabled on T4. This performance drop can be fixed by disabling the INT8 precision flag.
  • There is an up to 10% performance drop for the SegResNet network compared to TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere architecture GPUs due to a cuDNN regression in the InstanceNormalization plug-in. This will be fixed in a future TensorRT release. You can work around the regression by reverting the cuDNN version to cuDNN 8.2.1.
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to when running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows a 32% drop when the network runs on DLA as compared to when the last SoftMax layer runs on a GPU. A sketch of pinning a layer to the GPU appears after this list.
  • There is an up to 20% performance variation between different engines built from the same network for some LSTM networks due to unstable tactic selections.
  • Due to the difference in DLA hardware specification between NVIDIA Orin and Xavier, a relative increase in latency is expected when running DLA FP16 operations involving convolution (which includes deconvolution, fully-connected, and concat) on NVIDIA Orin as compared to running on Xavier. At the same DLA clocks and memory bandwidth, INT8 convolution operations on NVIDIA Orin are expected to be about 4x faster than on Xavier, whereas FP16 convolution operations on NVIDIA Orin are expected to be about 40% slower than on Xavier.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • For transformer-based networks such as BERT and GPT, TensorRT can consume CPU memory up to 10 times the model size during compilation.
  • There is an up to 17% performance regression for DeepASR networks at BS=1 on NVIDIA Turing GPUs.
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is an up to 10-11% performance regression on Xavier compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • On Xavier, DLA automatically upgrades INT8 LeakyReLU layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can prevent LeakyReLU layers from running on DLA.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • There is an up to 5% performance drop for Megatron networks in FP32 precision at batch-size = 1 between CUDA 11.8 and CUDA 10.2 on Volta GPUs. This performance drop does not happen on Turing or later GPUs.
  • H100 performance for some ConvNets in TF32 precision is not fully optimized. This will be improved in future TensorRT versions.
  • There is an up to 6% performance drop for ResNeXt-50 QAT networks in INT8, FP16, and FP32 precision at batch-size = 1 compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • H100 performance for some Transformers in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.
  • H100 performance for some ConvNets containing depthwise convolutions (like QuartzNets and EfficientDet-D0) in INT8 precision is not fully optimized. This will be improved in future TensorRT versions.
  • H100 performance for some LSTMs in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.
  • H100 performance for some 3DUnets is not fully optimized. This will be improved in future TensorRT versions.
  • There is an up to 6% performance drop for OpenRoadNet networks in TF32 precision compared to TensorRT 8.4 on NVIDIA Ampere architecture GPUs.
  • There is an up to 6% performance drop for T5 networks in FP32 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs due to a functionality fix.
  • There is an up to 5% performance drop for UNet networks in INT8 precision with explicit quantization on CUDA 11.x compared to CUDA 10.2 on Turing GPUs.
  • There is an up to 6% performance drop for WaveRNN networks in FP16 precision compared to TensorRT 8.4 on CUDA 11.8 on Volta GPUs. Downgrading CUDA to CUDA 11.6 fixes the issue.
  • There is an up to 13% performance drop for Megatron networks in FP16 precision on Tesla T4 GPUs when disableExternalTacticSourcesForCore0805 is enabled.
  • There is an up to 17% performance drop for LSTM on Windows in FP16 precision compared to TensorRT 8.4 on Volta GPUs.
  • There is an up to 7% performance drop for Artifact Reduction networks involving Deconvolution ops in INT8 precision compared to TensorRT 8.4 on Volta GPUs.
  • With the kFASTER_DYNAMIC_SHAPES_0805 preview feature enabled on the GPT style decoder models, there can be an up to 20% performance regression for odd sequence lengths only compared to TensorRT without the use of the preview feature.
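For DLA performance issues such as the SoftMax case above, keeping an individual layer on the GPU can help. The following sketch (the config, the network, and the choice to pin all SoftMax layers are illustrative assumptions) runs the network on DLA with GPU fallback while forcing SoftMax layers onto the GPU.
    // Sketch: default the network to DLA but pin SoftMax layers to the GPU.
    // "config" and "network" are assumed to exist; selecting layers by type is
    // illustrative only.
    #include "NvInfer.h"

    void pinSoftMaxToGpu(nvinfer1::IBuilderConfig& config,
                         nvinfer1::INetworkDefinition& network)
    {
        using namespace nvinfer1;
        config.setDefaultDeviceType(DeviceType::kDLA);
        config.setFlag(BuilderFlag::kGPU_FALLBACK);
        for (int32_t i = 0; i < network.getNbLayers(); ++i)
        {
            ILayer* layer = network.getLayer(i);
            if (layer->getType() == LayerType::kSOFTMAX)
            {
                config.setDeviceType(layer, DeviceType::kGPU);
            }
        }
    }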

2.4. TensorRT Release 8.5.2

These are the TensorRT 8.5.2 Release Notes and are applicable to x86 Linux, Windows, JetPack, and PowerPC Linux users. This release incorporates Arm® based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Key Features and Enhancements

This TensorRT release includes the following key features and enhancements.
  • Added new plugins: fused Multihead Self-Attention, fused Multihead Cross-Attention, Layer Normalization, Group Normalization, and targeted fusions (such as Split+GeLU) to support the Stable Diffusion demo.

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
  • APIs deprecated in TensorRT 8.4 will be retained until at least 2/2023.
  • APIs deprecated in TensorRT 8.5 will be retained until at least 9/2023.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax where the mode is chosen automatically based on the shape of the input tensor, where:
    • the first mode triggers when all non-batch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may result in errors of a small degree. Also, batch size greater than 1 is supported only for DLA 3.9.0 and later. Refer to DLA Supported Layers for more information.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as, "Unable to load library: nvinfer_builder_resource.dll", if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue; a short sketch appears after this list.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the CUDA 11.x build with a driver that supports CUDA 11.1 or later (r455 or later).
  • The DLA compiler is capable of removing identity transposes, but it cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshape). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes with an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless you merge the transposes together manually in the model definition in advance.
  • In QAT networks, group convolutions that have a Q/DQ pair before but no Q/DQ pair after could previously run in INT8-IN-FP32-OUT mixed precision. However, on NVIDIA Hopper™ architecture GPUs, the required GPU kernels may be missing and such convolutions fall back to FP32-IN-FP32-OUT when the input channel count is small. This will be fixed in a future release.
  • On PowerPC platforms, samples that depend on TensorFlow, ONNX Runtime, and PyTorch are unable to run due to missing Python module dependencies. These frameworks have not been built for PowerPC and/or published to standard repositories.
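As a sketch of the workspace workaround mentioned for RNNv2Layer above (the config object and the 1 GiB value are illustrative assumptions), the workspace memory pool limit can be raised before building:
    // Sketch: raise the workspace limit so that all supported tactics can be timed.
    // "config" is assumed to be an existing IBuilderConfig; 1 GiB is an arbitrary value.
    #include "NvInfer.h"

    void increaseWorkspace(nvinfer1::IBuilderConfig& config)
    {
        config.setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 1ULL << 30);
    }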

Deprecated and Removed Features

The following features are deprecated in TensorRT 8.5.2:
  • TensorRT 8.5 will be the last release supporting NVIDIA Kepler (SM 3.x) devices. Support for Maxwell (SM 5.x) devices will be dropped in TensorRT 9.0.

Fixed Issues

  • Fixed the accuracy issue of BART with batch size > 1.
  • Memory estimation for dynamic shapes has been improved, which allows some models to run that previously did not.
  • The ONNX parser recognizes the allowzero attribute on Reshape operations for opset 5 and higher, even though ONNX spec requires it only for opset 14 and higher. Setting this attribute to 1 can correct networks that are incorrect for empty tensors, and let TensorRT analyze the memory requirements for dynamic shapes more accurately.
  • Documentation for getProfileShapeValues() incorrectly cited getProfileShape() as the preferred replacement. Now, the documentation has been corrected to cite getShapeValues() as the replacement.
  • TensorRT header files now compile with gcc compilers when both the -Wnon-virtual-dtor and -Werror compilation options are specified.
  • The ONNX parser was getting incorrect min and max values when they were FP16 numbers. This has been fixed in this release.
  • Improved TensorRT error handling to skip tactics instead of crashing the builder.
  • Fixed the segmentation fault issue for the EfficientDet network with batch size > 1.
  • Fixed the could not find any implementation error for 3D convolutions whose kernel depth equals 1.
  • Added an error message in the ONNX parser when training_mode=1 is set in the BatchNorm operator.
  • Replaced the Python sample for creating custom plugins with a new sample that uses the ONNX parser instead of the UFF parser.
  • Fixed an issue which prohibited some graph fusions and caused performance degradation for StableDiffusion in FP16 precision.
  • Fixed an issue that caused some nodes within conditional subgraphs to be unsupported in INT8 calibration mode.

Announcements

  • In the next TensorRT release, CUDA toolkit 10.2 support will be dropped.
  • TensorRT 8.5 will be the last release supporting NVIDIA Kepler (SM 3.x) devices. Support for Maxwell (SM 5.x) devices will be dropped in TensorRT 9.0.
  • In the next TensorRT release, cuDNN, cuBLAS, and cuBLASLt tactic sources will be turned off by default in builder profiling. TensorRT plans to remove the cuDNN, cuBLAS, and cuBLASLt dependency in future releases. Use the PreviewFeature flag kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 to evaluate the functional and performance impact of disabling cuBLAS and cuDNN and report back to TensorRT if there are critical regressions in your use cases.
  • TensorRT Python wheel files before TensorRT 8.5, such as TensorRT 8.4, were published to the NGC PyPI repo. Starting with TensorRT 8.5, Python wheels will instead be published to upstream PyPI. This will make it easier to install TensorRT because it requires no prerequisite steps. Also, the name of the Python package for installation has changed from nvidia-tensorrt to just tensorrt.
  • The C++ and Python API documentation in previous releases was included inside the tar file packaging. This release no longer bundles the documentation inside the tar file since the online documentation can be updated post release and avoids encountering mistakes found in stale documentation inside the packages.

Known Issues

Functional
  • There is a known issue with huge graphs that cause out of memory errors with specific input shapes even though a larger input shape can be run.
  • TensorRT might output wrong results when there are GEMM/Conv/MatMul ops followed by a Reshape op.
  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. The recommendation to suppress the issues is to provide a Valgrind suppression file with the following contents when running the Valgrind memory leak check tool. Add the option --keep-debuginfo=yes to the Valgrind command line to suppress these errors.
    {
      Memory leak errors with dlopen.
       Memcheck:Leak
       match-leak-kinds: definite
       ...
       fun:*dlopen*
       ...
    }
    {
        Memory leak errors with nvrtc
        Memcheck:Leak
        match-leak-kinds: definite
        fun:malloc
        obj:*libnvrtc.so*
        ...
    }
    
  • The Python sample yolov3_onnx has a known issue when installing the requirements with Python 3.10. The recommendation is to use a Python version < 3.10 when running the sample.
  • The auto-tuner assumes that the number of indices returned by INonZeroLayer is half of the number of input elements. Thus, networks that depend on tighter assumptions for correctness may fail to build.
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.
  • One of the deconvolution algorithms sourced from cuDNN exhibits non-deterministic execution. Disabling cuDNN tactics will prevent this algorithm from being chosen (refer to IBuilderConfig::setTacticSources).
  • TensorRT in FP16 mode does not perform cast operations correctly when only the output types are set, but not the layer precisions.
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9 or newer. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9 or newer.
  • You may see the following error:
    "Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory"
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit), however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
  • When using DLA, an elementwise, unary, or activation layer immediately followed by a scale layer may lead to accuracy degradation in INT8 mode. Note that this is a pre-existing issue also found in previous releases rather than a regression.
  • When using DLA, INT8 convolutions followed by FP16 layers may cause accuracy degradation. In such cases, either change the convolution to FP16 or the subsequent layer to INT8.
  • When using the algorithm selector API, the HWC1 and HWC4 DLA formats are both reported as TensorFormat::kDLA_HWC4.
  • For transformer decoder based models (such as GPT2) with sequence length as dynamic, TensorRT 8.5 requires additional workspace (up to 2x) as compared to previous releases.
Performance
  • There is a ~12% performance drop on NVIDIA Ampere architecture GPUs for the BERT network on Windows systems.
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is an up to 22% performance drop for Jasper networks compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta or NVIDIA Turing GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 5% performance drop for the InceptionV4 network compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 27% performance drop for BART compared to TensorRT 8.2 when running with both FP16 and INT8 precisions enabled on T4. This performance drop can be fixed by disabling the INT8 precision flag.
  • There is an up to 10% performance drop for the SegResNet network compared to TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere architecture GPUs due to a cuDNN regression in the InstanceNormalization plug-in. This will be fixed in a future TensorRT release. You can work around the regression by reverting the cuDNN version to cuDNN 8.2.1.
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to when running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows 32% drop when the network runs on DLA as compared to when the last SoftMax layer runs on a GPU.
  • There is an up to 20% performance variation between different engines built from the same network for some LSTM networks due to unstable tactic selections.
  • There is an up to 11% performance variation for some LSTM networks during inference depending on the order of CUDA stream creation on NVIDIA Turing GPUs. This will be fixed in r525 drivers.
  • Due to the difference in DLA hardware specification between NVIDIA Orin and Xavier, a relative increase in latency is expected when running DLA FP16 operations involving convolution (which includes deconvolution, fully-connected, and concat) on NVIDIA Orin as compared to running on Xavier. At the same DLA clocks and memory bandwidth, INT8 convolution operations on NVIDIA Orin are expected to be about 4x faster than on Xavier, whereas FP16 convolution operations on NVIDIA Orin are expected to be about 40% slower than on Xavier.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • For transformer-based networks such as BERT and GPT, TensorRT can consume CPU memory up to 10 times the model size during compilation.
  • There is an up to 17% performance regression for DeepASR networks at BS=1 on NVIDIA Turing GPUs.
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is an up to 10-11% performance regression on Xavier compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • On Xavier, DLA automatically upgrades INT8 LeakyReLU layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can prevent LeakyReLU layers from running on DLA.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • There is an up to 5% performance drop for Megatron networks in FP32 precision at batch-size = 1 between CUDA 11.8 and CUDA 10.2 on NVIDIA Volta GPUs. This performance drop does not happen on NVIDIA Turing or later GPUs.
  • There is an up to 23% performance drop between H100 and A100 for some ConvNets in TF32 precision when running at the same SM clock frequency. This will be improved in future TensorRT versions.
  • There is an up to 8% performance drop between H100 and A100 for some transformers, including BERT, BART, T5, and GPT2, in FP16 precision at BS=1 when running at the same SM clock frequency. This will be improved in future TensorRT versions.
  • H100 performance for some ConvNets in TF32 precision is not fully optimized. This will be improved in future TensorRT versions.
  • There is an up to 6% performance drop for ResNeXt-50 QAT networks in INT8, FP16, and FP32 precision at batch-size = 1 compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • H100 performance for some Transformers in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.
  • H100 performance for some ConvNets containing depthwise convolutions (like QuartzNets and EfficientDet-D0) in INT8 precision is not fully optimized. This will be improved in future TensorRT versions.
  • H100 performance for some LSTMs in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.
  • H100 performance for some 3DUnets is not fully optimized. This will be improved in future TensorRT versions.
  • There is an up to 6% performance drop for OpenRoadNet networks in TF32 precision compared to TensorRT 8.4 on NVIDIA Ampere architecture GPUs.
  • There is an up to 6% performance drop for T5 networks in FP32 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs due to a functionality fix.
  • There is an up to 5% performance drop for UNet networks in INT8 precision with explicit quantization on CUDA 11.x compared to CUDA 10.2 on NVIDIA Turing GPUs.
  • There is an up to 6% performance drop for WaveRNN networks in FP16 precision compared to TensorRT 8.4 on CUDA 11.8 on NVIDIA Volta GPUs. Downgrading CUDA to CUDA 11.6 fixes the issue.
  • There is an up to 13% performance drop for Megatron networks in FP16 precision on Tesla T4 GPUs when disableExternalTacticSourcesForCore0805 is enabled.
  • There is an up to 16% performance drop for LSTM networks in FP32 precision compared to TensorRT 8.4 on NVIDIA Pascal GPUs.
  • There is an up to 17% performance drop for LSTM on Windows in FP16 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • There is an up to 7% performance drop for Artifact Reduction networks involving Deconvolution ops in INT8 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • With the kFASTER_DYNAMIC_SHAPES_0805 preview feature enabled on the GPT style decoder models, there can be an up to 20% performance regression for odd sequence lengths only compared to TensorRT without the use of the preview feature.

2.5. TensorRT Release 8.5.1

These are the TensorRT 8.5.1 Release Notes and are applicable to x86 Linux, Windows, JetPack, and PowerPC Linux users. This release incorporates Arm® based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Key Features and Enhancements

This TensorRT release includes the following key features and enhancements.
  • Added support for NVIDIA Hopper™ (H100) architecture. TensorRT now supports compute capability 9.0 deep learning kernels for FP32, TF32, FP16, and INT8, using the H100 Tensor Cores and delivering increased MMA throughput over A100. These kernels also benefit from new H100 features such as Asynchronous Transaction Barriers, Tensor Memory Accelerator (TMA), and Thread Block Clusters for increased efficiency.
  • Added support for NVIDIA Ada Lovelace architecture. TensorRT now supports compute capability 8.9 deep learning kernels for FP32, TF32, FP16, and INT8.
  • Ubuntu 22.04 packages are now provided in both the CUDA network repository and in the local repository format starting with this release.
  • Shapes of tensors can now depend on values computed on the GPU. For example, the last dimension of the output tensor from INonZeroLayer depends on how many input values are non-zero. For more information, refer to Dynamically Shaped Output.
  • TensorRT supports named input dimensions. In an ONNX model, two dimensions with the same named dimension parameter are considered equal. For more information, refer to Named Dimensions.
  • TensorRT supports offloading the IShuffleLayer to DLA. Refer to Layer Support and Restrictions for details on the restrictions for running IShuffleLayer on DLA.
  • Added the following layers:
    • IGatherLayer, ISliceLayer, IConstantLayer, and IConcatenationLayer have been updated to support boolean types.
    • INonZeroLayer, INMSLayer (non-max suppression), IOneHotLayer, and IGridSampleLayer.
    For more information, refer to the NVIDIA TensorRT Operator’s Reference.
  • TensorRT supports heuristic-based builder tactic selection. This is controlled by nvinfer1::BuilderFlag::kENABLE_TACTIC_HEURISTIC. For more information, refer to Tactic Selection Heuristic.
  • The builder timing cache has been updated to support transformer-based networks such as BERT and GPT. For more information, refer to Timing Cache. A short sketch of using a timing cache appears after this feature list.
  • TensorRT supports the RoiAlign ONNX operator through the newly added RoiAlign plug-in. Both opset-10 and opset-16 versions of the operator are supported. For more information about the supported ONNX operators, refer to GitHub.
  • TensorRT supports disabling the external tactic sources (cuDNN and cuBLAS) in the core library, while still allowing cuDNN and cuBLAS to be used inside a plug-in, by setting the preview feature flag nvinfer1::PreviewFeature::kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 using IBuilderConfig::setPreviewFeature.
  • TensorRT supports a new preview feature, nvinfer1::PreviewFeature::kFASTER_DYNAMIC_SHAPES_0805, which aims to reduce build time, runtime, and memory requirements for dynamically shaped transformer-based networks.
  • TensorRT supports Lazy Module Loading, a CUDA feature that can significantly reduce the amount of GPU memory consumed. Refer to the NVIDIA CUDA Programming Guide and Lazy Module Loading for more information.
  • TensorRT supports the persistent cache, a CUDA feature that allows data to be cached persistently in L2. Refer to Persistent Cache Management for more information.
  • TensorRT supports a preview feature API, which is a mechanism that enables you to opt in to specific experimental features. Refer to Preview Features for more information.
  • The following C++ API functions were added:
    • ITensor::setDimensionName()
    • ITensor::getDimensionName()
    • IResizeLayer::setCubicCoeff()
    • IResizeLayer::getCubicCoeff()
    • IResizeLayer::setExcludeOutside()
    • IResizeLayer::getExcludeOutside()
    • IBuilderConfig::setPreviewFeature()
    • IBuilderConfig::getPreviewFeature()
    • ICudaEngine::getTensorShape()
    • ICudaEngine::getTensorDataType()
    • ICudaEngine::getTensorLocation()
    • ICudaEngine::isShapeInferenceIO()
    • ICudaEngine::getTensorIOMode()
    • ICudaEngine::getTensorBytesPerComponent()
    • ICudaEngine::getTensorComponentsPerElement()
    • ICudaEngine::getTensorFormat()
    • ICudaEngine::getTensorFormatDesc()
    • ICudaEngine::getProfileShape()
    • ICudaEngine::getNbIOTensors()
    • ICudaEngine::getIOTensorName()
    • IExecutionContext::getTensorStrides()
    • IExecutionContext::setInputShape()
    • IExecutionContext::getTensorShape()
    • IExecutionContext::setTensorAddress()
    • IExecutionContext::getTensorAddress()
    • IExecutionContext::setInputTensorAddress()
    • IExecutionContext::getOutputTensorAddress()
    • IExecutionContext::inferShapes()
    • IExecutionContext::setInputConsumedEvent()
    • IExecutionContext::getInputConsumedEvent()
    • IExecutionContext::setOutputAllocator()
    • IExecutionContext::getOutputAllocator()
    • IExecutionContext::getMaxOutputSize()
    • IExecutionContext::setTemporaryStorageAllocator()
    • IExecutionContext::getTemporaryStorageAllocator()
    • IExecutionContext::enqueueV3()
    • IExecutionContext::setPersistentCacheLimit()
    • IExecutionContext::getPersistentCacheLimit()
    • IExecutionContext::setNvtxVerbosity()
    • IExecutionContext::getNvtxVerbosity()
    • INetworkDefinition::addOneHot()
    • INetworkDefinition::addNonZero()
    • INetworkDefinition::addGridSample()
    • INetworkDefinition::addNMS()
  • The following C++ classes were added:
    • IOneHotLayer
    • IGridSampleLayer
    • INonZeroLayer
    • INMSLayer
    • IOutputAllocator
  • The following C++ enum values were added:
    • InterpolationMode::kCUBIC
    • FillOperation::kRANDOM_NORMAL
    • BuilderFlag::kREJECT_EMPTY_ALGORITHMS
    • BuilderFlag::kENABLE_TACTIC_HEURISTIC
    • TacticSource::kJIT_CONVOLUTIONS
    • DataType::kUINT8
  • The following C++ enum classes were added:
    • TensorIOMode
    • PreviewFeature
  • The following Python API functions/properties were added:
    • ITensor.set_dimension_name()
    • ITensor.get_dimension_name()
    • IResizeLayer.cubic_coeff
    • IResizeLayer.exclude_outside
    • IBuilderConfig.set_preview_feature()
    • IBuilderConfig.get_preview_feature()
    • ICudaEngine.get_tensor_shape()
    • ICudaEngine.get_tensor_dtype()
    • ICudaEngine.get_tensor_location()
    • ICudaEngine.is_shape_inference_io()
    • ICudaEngine.get_tensor_mode()
    • ICudaEngine.get_tensor_bytes_per_component()
    • ICudaEngine.get_tensor_components_per_element()
    • ICudaEngine.get_tensor_format()
    • ICudaEngine.get_tensor_format_desc()
    • ICudaEngine.get_tensor_profile_shape()
    • ICudaEngine.num_io_tensors
    • ICudaEngine.get_tensor_name()
    • IExecutionContext.get_tensor_strides()
    • IExecutionContext.set_input_shape()
    • IExecutionContext.get_tensor_shape()
    • IExecutionContext.set_tensor_address()
    • IExecutionContext.get_tensor_address()
    • IExecutionContext.infer_shapes()
    • IExecutionContext.set_input_consumed_event()
    • IExecutionContext.get_input_consumed_event()
    • IExecutionContext.set_output_allocator()
    • IExecutionContext.get_output_allocator()
    • IExecutionContext.get_max_output_size()
    • IExecutionContext.temporary_allocator
    • IExecutionContext.execute_async_v3()
    • IExecutionContext.persistent_cache_limit
    • IExecutionContext.nvtx_verbosity
    • INetworkDefinition.add_one_hot()
    • INetworkDefinition.add_non_zero()
    • INetworkDefinition.add_grid_sample()
    • INetworkDefinition.add_nms()
  • The following Python classes were added:
    • IOneHotLayer
    • IGridSampleLayer
    • INonZeroLayer
    • INMSLayer
    • IOutputAllocator
  • The following Python enum values were added:
    • InterpolationMode.CUBIC
    • FillOperation.RANDOM_NORMAL
    • BuilderFlag.REJECT_EMPTY_ALGORITHMS
    • BuilderFlag.ENABLE_TACTIC_HEURISTIC
    • TacticSource.JIT_CONVOLUTIONS
    • DataType.UINT8
  • The following Python enum classes were added:
    • TensorIOMode
    • PreviewFeature
  • Removed the TensorRT layers chapter from the NVIDIA TensorRT Developer Guide appendix section and created a standalone NVIDIA TensorRT Operator’s Reference document.
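As a sketch of the timing cache usage mentioned in the feature list above (the builder, network, config, and file path handling are assumptions; error checking is omitted), a cache can be created, attached to the builder configuration, and serialized after the build for reuse:
    // Sketch: build with a timing cache and persist it so later builds of the same
    // network can reuse measured tactic timings. Error checking is omitted.
    #include "NvInfer.h"
    #include <fstream>
    #include <memory>

    void buildWithTimingCache(nvinfer1::IBuilder& builder,
                              nvinfer1::INetworkDefinition& network,
                              nvinfer1::IBuilderConfig& config,
                              char const* cachePath)
    {
        using namespace nvinfer1;
        // Start from an empty cache; a previously serialized blob could be passed instead.
        std::unique_ptr<ITimingCache> cache{config.createTimingCache(nullptr, 0)};
        config.setTimingCache(*cache, /*ignoreMismatch=*/false);

        std::unique_ptr<IHostMemory> plan{builder.buildSerializedNetwork(network, config)};

        // Persist the populated cache for future builds.
        std::unique_ptr<IHostMemory> blob{cache->serialize()};
        std::ofstream out(cachePath, std::ios::binary);
        out.write(static_cast<char const*>(blob->data()), blob->size());
    }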

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
  • APIs deprecated in TensorRT 8.4 will be retained until at least 2/2023.
  • APIs deprecated in TensorRT 8.5 will be retained until at least 9/2023.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax where the mode is chosen automatically based on the shape of the input tensor, where:
    • the first mode triggers when all non-batch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may result in errors of a small degree. Also, batch size greater than 1 is supported only for DLA 3.9.0 and later. Refer to DLA Supported Layers for more information.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as, "Unable to load library: nvinfer_builder_resource.dll", if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the CUDA 11.x build with a driver that supports CUDA 11.1 or later (r455 or later).
  • The DLA compiler is capable of removing identity transposes, but it cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshape). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes with an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless you merge the transposes together manually in the model definition in advance.
  • In QAT networks, group convolutions that have a Q/DQ pair before but no Q/DQ pair after could previously run in INT8-IN-FP32-OUT mixed precision. However, on NVIDIA Hopper™ architecture GPUs, the required GPU kernels may be missing and such convolutions fall back to FP32-IN-FP32-OUT when the input channel count is small. This will be fixed in a future release.
  • On PowerPC platforms, samples that depend on TensorFlow, ONNX Runtime, and PyTorch are unable to run due to missing Python module dependencies. These frameworks have not been built for PowerPC and/or published to standard repositories.

Deprecated and Removed Features

The following features are deprecated in TensorRT 8.5.1:
  • TensorRT 8.5 will be the last release supporting NVIDIA Kepler (SM 3.x) devices. Support for Maxwell (SM 5.x) devices will be dropped in TensorRT 9.0.

Fixed Issues

  • TensorRT’s optimizer would sometimes incorrectly retarget Concat layer inputs produced by Cast layers, resulting in data corruption. This has been fixed in this release.
  • There was an up to 5% performance drop for the ShuffleNet network compared to TensorRT 8.2 when running in INT8 precision on NVIDIA Ampere architecture GPUs. This has been fixed in this release.
  • There was an up to 10% performance difference for the WaveRNN network between different OS when running in FP16 precision on NVIDIA Ampere architecture GPUs. This issue has been fixed in this release.
  • When the TensorRT static library was used to build engines and the NVPTXCompiler static library was used outside of the TensorRT core library at the same time, it was possible to trigger a crash of the process in rare cases. This issue has been fixed in this release.
  • There was a known issue where, when ProfilingVerbosity was set to kDETAILED, the enqueueV2() call could take up to 2 ms longer compared to ProfilingVerbosity=kNONE or kLAYER_NAMES_ONLY. Now, you can use the setNvtxVerbosity() API to disable the costly detailed NVTX generation at runtime without the need to rebuild the engine; a short sketch appears after this list.
  • There was a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode
    • up to 10% in FP16 mode on NVIDIA Maxwell® and NVIDIA Pascal® GPUs
    This issue has been fixed in this release.
  • There was an issue when performing PTQ with TensorRT where tensors with rank > 4 could cause some layers to trigger an assertion about invalid Region Dims. Empty tensors in the network could also cause a segmentation fault within the INT8 calibration process. These issues were fixed in this release.
  • When performing an L2_Normalization in float16 precision, there was undefined behavior occurring from a fusion. This fusion could be disabled by marking the input to the L2_Normalization as a network output. This issue has been fixed in this release.
  • TensorRT used to incorrectly allow subnetworks of the input network with > 16 I/O tensors to be offloaded to DLA, because intermediate tensors of the subnetwork that were also full-network outputs were not counted as I/O tensors. This has been fixed. You may infrequently experience slightly increased fragmentation.
  • There was a ~19% performance drop on NVIDIA Turing GPUs for the DRIVENet network in FP16 precision compared to TensorRT 8.4. This regression has been fixed in this release.
  • On NVIDIA Hopper GPUs, the QKV plugin, which is used by the open-source BERT demo, would fail for fixed sequence lengths of 128 and 384 with FP16. This issue has been fixed in this release.
  • To compile the DLA samples, you can now use the commands listed in the README.
  • In convolution or GEMM layers, some extra tactic information would be printed out. This issue has been fixed in this release.
  • When using QAT, horizontal fusion of convolutions at the end of the net would incorrectly propagate the quantization scales of the weights, resulting in incorrect outputs. This issue has been fixed in this release.
  • With the TensorRT ONNX runtime, building an engine for a large graph would fail because the implementation of a foreign node could not be found. This issue has been fixed in this release.
  • The BatchedNMSDynamicPlugin supports dynamic batch only. The behavior was undefined if other dimensions were set to be dynamic. This issue has been fixed in this release.
  • On NVIDIA Hopper GPUs, the QKV plug-in, which is used by the open-source BERT demo, would produce inaccurate results for sequence lengths other than 128 and 384. This issue has been fixed in this release.
  • A new tactic source kJIT_CONVOLUTIONS was added, however, enabling or disabling it had no impact as it was still in development. This issue has been fixed in this release.
  • There was a known issue when using INT8 calibration for networks with ILoop or IIfConditional layers. This issue has been fixed in this release.
  • There was a ~17% performance drop on NVIDIA Ampere GPUs for the inflated 3D video classification network in TF32 precision compared to TensorRT 8.4.
  • There were up to 25% performance drops for various networks on SBSA systems compared to TensorRT 8.4.
  • There was a ~15% performance drop on NVIDIA Ampere GPUs for the ResNet_v2_152 network in TF32 precision compared to TensorRT 8.4.
  • There was a 17% performance drop for networks containing Deconv+Concat or Slice+Deconv patterns.
  • There was a ~9% performance drop on Volta and Turing GPUs for the WaveRNN network.
  • Some networks would see a small increase in deserialization time. This issue has been fixed in this release.
  • When using QAT, horizontal fusion of two or more convolutions that have quantized inputs and non-quantized outputs would result in incorrect weights quantization. This issue has been fixed in this release.
  • When invoking IQuantizeLayer::setAxis with the axis set to -1, the graph optimization process triggered an assertion. This issue has been fixed in this release.
  • If the values (not just dimensions) of an output tensor from a plugin were used to compute a shape, the engine would fail to build. This issue has been fixed in this release.
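For the NVTX verbosity fix noted above, the following sketch (the execution context is an assumption) shows how detailed NVTX generation can be disabled at runtime without rebuilding the engine:
    // Sketch: keep the engine built with ProfilingVerbosity::kDETAILED, but turn off
    // NVTX emission at runtime to avoid the extra enqueue overhead.
    // "context" is assumed to be an existing IExecutionContext.
    #include "NvInfer.h"

    void quietNvtx(nvinfer1::IExecutionContext& context)
    {
        context.setNvtxVerbosity(nvinfer1::ProfilingVerbosity::kNONE);
    }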

Announcements

  • In the next TensorRT release, CUDA toolkit 10.2 support will be dropped.
  • TensorRT 8.5 will be the last release supporting NVIDIA Kepler (SM 3.x) devices. Support for Maxwell (SM 5.x) devices will be dropped in TensorRT 9.0.
  • In the next TensorRT release, cuDNN, cuBLAS, and cuBLASLt tactic sources will be turned off by default in builder profiling. TensorRT plans to remove the cuDNN, cuBLAS, and cuBLASLt dependency in future releases. Use the PreviewFeature flag kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 to evaluate the functional and performance impact of disabling cuBLAS and cuDNN and report back to TensorRT if there are critical regressions in your use cases.
  • TensorRT Python wheel files before TensorRT 8.5, such as TensorRT 8.4, were published to the NGC PyPI repo. Starting with TensorRT 8.5, Python wheels will instead be published to upstream PyPI. This will make it easier to install TensorRT because it requires no prerequisite steps. Also, the name of the Python package for installation has changed from nvidia-tensorrt to just tensorrt.
  • The C++ and Python API documentation in previous releases was included inside the tar file packaging. This release no longer bundles the documentation inside the tar file since the online documentation can be updated post release and avoids encountering mistakes found in stale documentation inside the packages.

Known Issues

Functional
  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. The recommendation to suppress the issues is to provide a Valgrind suppression file with the following contents when running the Valgrind memory leak check tool. Add the option --keep-debuginfo=yes to the Valgrind command line to suppress these errors.
    {
      Memory leak errors with dlopen.
       Memcheck:Leak
       match-leak-kinds: definite
       ...
       fun:*dlopen*
       ...
    }
    {
        Memory leak errors with nvrtc
        Memcheck:Leak
        match-leak-kinds: definite
        fun:malloc
        obj:*libnvrtc.so*
        ...
    }
    
  • The Python sample yolov3_onnx has a known issue when installing the requirements with Python 3.10. The recommendation is to use a Python version < 3.10 when running the sample.
  • The auto-tuner assumes that the number of indices returned by INonZeroLayer is half of the number of input elements. Thus, networks that depend on tighter assumptions for correctness may fail to build.
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.
  • One of the deconvolution algorithms sourced from cuDNN exhibits non-deterministic execution. Disabling cuDNN tactics will prevent this algorithm from being chosen (refer to IBuilderConfig::setTacticSources).
  • TensorRT in FP16 mode does not perform cast operations correctly when only the output types are set, but not the layer precisions.
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9 or newer. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9 or newer.
  • You may see the following error:
    "Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory"
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit), however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
  • When using DLA, an elementwise, unary, or activation layer immediately followed by a scale layer may lead to accuracy degradation in INT8 mode. Note that this is a pre-existing issue also found in previous releases rather than a regression.
  • When using DLA, INT8 convolutions followed by FP16 layers may cause accuracy degradation. In such cases, either change the convolution to FP16 or the subsequent layer to INT8.
  • When using the algorithm selector API, the HWC1 and HWC4 DLA formats are both reported as TensorFormat::kDLA_HWC4.
  • For transformer decoder based models (such as GPT2) with sequence length as dynamic, TensorRT 8.5 requires additional workspace (up to 2x) as compared to previous releases.
Performance
  • There is a ~12% performance drop on NVIDIA Ampere architecture GPUs for the BERT network on Windows systems.
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is an up to 22% performance drop for Jasper networks compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta or NVIDIA Turing GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 5% performance drop for the InceptionV4 network compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 27% performance drop for BART compared to TensorRT 8.2 when running with both FP16 and INT8 precisions enabled on T4. This performance drop can be fixed by disabling the INT8 precision flag.
  • There is an up to 10% performance drop for the SegResNet network compared to TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere architecture GPUs due to a cuDNN regression in the InstanceNormalization plug-in. This will be fixed in a future TensorRT release. You can work around the regression by reverting the cuDNN version to cuDNN 8.2.1.
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to when running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows 32% drop when the network runs on DLA as compared to when the last SoftMax layer runs on a GPU.
  • There is an up to 20% performance variation between different engines built from the same network for some LSTM networks due to unstable tactic selections.
  • There is an up to 11% performance variation for some LSTM networks during inference depending on the order of CUDA stream creation on NVIDIA Turing GPUs. This will be fixed in r525 drivers.
  • Due to the difference in DLA hardware specification between NVIDIA Orin and Xavier, a relative increase in latency is expected when running DLA FP16 operations involving convolution (which includes deconvolution, fully-connected, and concat) on NVIDIA Orin as compared to running on Xavier. At the same DLA clocks and memory bandwidth, INT8 convolution operations on NVIDIA Orin are expected to be about 4x faster than on Xavier, whereas FP16 convolution operations on NVIDIA Orin are expected to be about 40% slower than on Xavier.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • For transformer-based networks such as BERT and GPT, TensorRT can consume CPU memory up to 10 times the model size during compilation.
  • There is an up to 17% performance regression for DeepASR networks at BS=1 on NVIDIA Turing GPUs.
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is an up to 10-11% performance regression on Xavier compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • On Xavier, DLA automatically upgrades INT8 LeakyReLU layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can prevent LeakyReLU layers from running on DLA.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • There is an up to 5% performance drop for Megatron networks in FP32 precision at batch-size = 1 between CUDA 11.8 and CUDA 10.2 on NVIDIA Volta GPUs. This performance drop does not happen on NVIDIA Turing or later GPUs.
  • There is an up to 23% performance drop between H100 and A100 for some ConvNets in TF32 precision when running at the same SM clock frequency. This will be improved in future TensorRT versions.
  • There is an up to 8% performance drop between H100 and A100 for some transformers, including BERT, BART, T5, and GPT2, in FP16 precision at BS=1 when running at the same SM clock frequency. This will be improved in future TensorRT versions.
  • H100 performance for some ConvNets in TF32 precision is not fully optimized. This will be improved in future TensorRT versions.
  • There is an up to 6% performance drop for ResNeXt-50 QAT networks in INT8, FP16, and FP32 precision at batch-size = 1 compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • H100 performance for some Transformers in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.
  • H100 performance for some ConvNets containing depthwise convolutions (like QuartzNets and EfficientDet-D0) in INT8 precision is not fully optimized. This will be improved in future TensorRT versions.
  • H100 performance for some LSTMs in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.
  • H100 performance for some 3DUnets is not fully optimized. This will be improved in future TensorRT versions.
  • There is an up to 6% performance drop for OpenRoadNet networks in TF32 precision compared to TensorRT 8.4 on NVIDIA Ampere architecture GPUs.
  • There is an up to 6% performance drop for T5 networks in FP32 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs due to a functionality fix.
  • There is an up to 5% performance drop for UNet networks in INT8 precision with explicit quantization on CUDA 11.x compared to CUDA 10.2 on NVIDIA Turing GPUs.
  • There is an up to 6% performance drop for WaveRNN networks in FP16 precision compared to TensorRT 8.4 on CUDA 11.8 on NVIDIA Volta GPUs. Downgrading CUDA to CUDA 11.6 fixes the issue.
  • There is an up to 13% performance drop for Megatron networks in FP16 precision on Tesla T4 GPUs when disableExternalTacticSourcesForCore0805 is enabled.
  • There is an up to 16% performance drop for LSTM networks in FP32 precision compared to TensorRT 8.4 on NVIDIA Pascal GPUs.
  • There is an up to 17% performance drop for LSTM on Windows in FP16 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • There is an up to 7% performance drop for Artifact Reduction networks involving Deconvolution ops in INT8 precision compared to TensorRT 8.4 on NVIDIA Volta GPUs.
  • With the kFASTER_DYNAMIC_SHAPES_0805 preview feature enabled on the GPT style decoder models, there can be an up to 20% performance regression for odd sequence lengths only compared to TensorRT without the use of the preview feature.
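
For reference, the kFASTER_DYNAMIC_SHAPES_0805 preview feature is toggled on the builder configuration, so the impact of the regression above can be compared by building the engine with and without the feature. The following is a minimal sketch, assuming an existing IBuilderConfig named config; the function name is illustrative only.

    #include "NvInfer.h"

    // Minimal sketch: toggle the kFASTER_DYNAMIC_SHAPES_0805 preview feature on an
    // existing builder configuration. Leaving it disabled (the default) avoids the
    // odd-sequence-length regression described above.
    void configureFasterDynamicShapes(nvinfer1::IBuilderConfig& config, bool enable)
    {
        config.setPreviewFeature(nvinfer1::PreviewFeature::kFASTER_DYNAMIC_SHAPES_0805, enable);
    }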

2.6. TensorRT Release 8.4.3

These are the TensorRT 8.4.3 Release Notes and are applicable to x86 Linux and Windows users. This release incorporates Arm® based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
  • APIs deprecated in TensorRT 8.4 will be retained until at least 2/2023.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax where the mode is chosen automatically based on the shape of the input tensor, where:
    • the first mode triggers when all non-batch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may result in errors of a small degree. Also, batch size greater than 1 is supported only for DLA 3.9.0 and later. Refer to DLA Supported Layers for more information.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as, "Unable to load library: nvinfer_builder_resource.dll", if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • The builder may require up to 60% more memory to build an engine.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or later).
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in NVIDIA JetPack 4.5 for ResNet-like networks on NVIDIA DLA on Xavier platforms when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues. NVIDIA Orin platforms are not affected by this.

Fixed Issues

  • When parsing networks with the ONNX Expand operator on a scalar input, TensorRT would error out. This issue has been fixed in this release.
  • The custom ClipPlugin used in the uff_custom_plugin sample had an issue with a plugin parameter not being serialized, leading to a failure when the plugin needed to be deserialized. This issue has been fixed with proper serialization/deserialization.
  • When working with transformer-based networks with multiple dynamic dimensions, if the network had shuffle operations that caused one or more dimensions to become a coalesced dimension (a combination of multiple dynamic dimensions), and if this shuffle was further used in a reduction operation such as a MatrixMultiply layer, it could potentially lead to corruption of results. This issue has been fixed in this release.
  • When working with recurrent networks containing Loops and Fill layers, it was possible that the engine may have failed to build. This issue has been fixed in this release.
  • In some rare cases when converting a MatrixMultiply layer to a Convolution layer for optimization purposes, shape inference could fail. This issue has been fixed in this release.
  • In some cases, Tensor memory was not zero initialized for vectorized dimensions. This resulted in NaN in the output tensor during engine execution. This issue has been fixed in this release.
  • For the HuggingFace demos, the T5-3B model had only been verified on A100, and was not expected to work on A10, T4, and so on. This issue has been fixed in this release.
  • Certain spatial dimensions may have caused crashes during DLA optimization for models using single-channel inputs. This issue has been fixed in this release.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may have created an internal error causing the application to crash while building the engine. This issue has been fixed in this release.
  • For some networks using sparsity, TensorRT may have produced inaccurate results. This issue has been fixed in this release.

Announcements

  • CUDA 11.7 added a feature called Lazy loading; however, this feature is not supported by TensorRT 8.4 because the CUDA 11.x binaries were built with CUDA Toolkit 11.6.

Known Issues

Functional
  • When performing an L2_Normalization in float16 precision, there is undefined behavior occurring from a fusion. This fusion can be disabled by marking the input to the L2_Normalization as a network output.
  • When performing PTQ with TensorRT with tensors rank > 4, some layers may cause an assertion about invalid Region Dims. This can be worked around by fusing the index layers into the 4th dimension so that the tensor has rank 4.
  • SM75 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes which quantize the failing layers.
  • When the TensorRT static library is used to build engines and the NVPTXCompiler static library is also used outside of the TensorRT core library at the same time, it is possible to trigger a crash of the process in rare cases.
  • TensorRT should only allow up to a total of 16 I/O tensors for a single subnetwork offloaded to DLA. However, there is a leak in the logic that incorrectly allows > 16 I/O tensors. You may need to manually specify the per layer device to avoid the creation of subnetworks with over 16 I/O tensors, for successful engine construction. This restriction will be properly reinstated in a future release.
  • One of the deconvolution algorithms sourced from cuDNN exhibits non-deterministic execution. Disabling cuDNN tactics will prevent this algorithm from being chosen (refer to IBuilderConfig::setTacticSources; a minimal sketch follows this list).
  • Due to ABI compatibility issues, static builds are not supported on SBSA platforms.
  • TensorRT in FP16 mode does not perform cast operations correctly when only the output types are set, but not the layer precisions.
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • There is a known issue when ProfilingVerbosity is set to kDETAILED: the enqueueV2() call may take up to 2 ms longer compared to ProfilingVerbosity=kNONE or kLAYER_NAMES_ONLY.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9 or newer. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9 or newer.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • You may see the following error:
    "Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory"
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit), however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
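
As noted in the deconvolution item above, cuDNN can be removed from the tactic sources so that the non-deterministic algorithm is never selected. The following is a minimal sketch, assuming an existing IBuilderConfig named config; the function name is illustrative only.

    #include "NvInfer.h"
    #include <cstdint>

    // Minimal sketch: clear the cuDNN bit from the tactic sources so that the
    // non-deterministic cuDNN deconvolution algorithm cannot be chosen.
    void disableCudnnTactics(nvinfer1::IBuilderConfig& config)
    {
        nvinfer1::TacticSources sources = config.getTacticSources();
        sources &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUDNN));
        config.setTacticSources(sources);
    }
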
Performance
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is an up to 22% performance drop for Jasper networks compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta or NVIDIA Turing GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 5% performance drop for the InceptionV4 network compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 27% performance drop for BART compared to TensorRT 8.2 when running with both FP16 and INT8 precisions enabled on T4. This performance drop can be fixed by disabling the INT8 precision flag.
  • There is an up to 5% performance drop for the ShuffleNet network compared to TensorRT 8.2 when running in INT8 precision on NVIDIA Ampere Architecture GPUs. This will be fixed in a future TensorRT release.
  • There is an up to 10% performance drop for the SegResNet network compared to TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere Architecture GPUs due to a cuDNN regression in the InstanceNormalization plug-in. This will be fixed in a future TensorRT release. You can work around the regression by reverting the cuDNN version to cuDNN 8.2.1.
  • There is an up to 10% performance difference for the WaveRNN network between different operating systems when running in FP16 precision on NVIDIA Ampere Architecture GPUs. This will be fixed in a future TensorRT release.
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to when running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows 32% drop when the network runs on DLA as compared to when the last SoftMax layer runs on a GPU.
  • There is an up to 7% performance regression for the 3D-UNet networks compared to TensorRT 8.4 EA when running in INT8 precision on NVIDIA Orin due to a functionality fix.
  • There is an up to 20% performance variation between different engines built from the same network for some LSTM networks when running on Windows due to unstable tactic selections.
  • Some networks may see a small increase in deserialization time.
  • Due to the difference in DLA hardware specification between NVIDIA Orin and Xavier, a relative increase in latency is expected when running DLA FP16 operations involving convolution (which includes deconvolution, fully-connected, and concat) on NVIDIA Orin as compared to running on Xavier. At the same DLA clocks and memory bandwidth, INT8 convolution operations on NVIDIA Orin are expected to be about 4x faster than on Xavier, whereas FP16 convolution operations on NVIDIA Orin are expected to be about 40% slower than on Xavier.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • For transformer-based networks such as BERT and GPT, TensorRT can consume CPU memory up to 10 times the model size during compilation.
  • There is an up to 17% performance regression for DeepASR networks at BS=1 on NVIDIA Turing GPUs.
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and NVIDIA Pascal GPUs.
  • There is an up to 10-11% performance regression on Xavier compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • On Xavier, DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.

2.7. TensorRT Release 8.4.2

These are the TensorRT 8.4.2 Release Notes and are applicable to x86 Linux and Windows users. This release incorporates Arm® based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Key Features and Enhancements

This TensorRT release includes the following key features and enhancements.
  • Added samples:
    • tensorflow_object_detection_api, which demonstrates the conversion and execution of the TensorFlow Object Detection API Model Zoo models with TensorRT. For information about how this sample works, sample code, and step-by-step instructions on how to run and verify its output, refer to the GitHub: tensorflow_object_detection_api/README.md file.
    • detectron2, which demonstrates the conversion and execution of the Detectron 2 Model Zoo Mask R-CNN R50-FPN 3x model with TensorRT. For information about how this sample works, sample code, and step-by-step instructions on how to run and verify its output, refer to the GitHub: detectron2/README.md file.

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
  • APIs deprecated in TensorRT 8.4 will be retained until at least 2/2023.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax where the mode is chosen automatically based on the shape of the input tensor, where:
    • the first mode triggers when all non-batch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may result in errors of a small degree. Also, batch size greater than 1 is supported only for DLA 3.9.0 and later. Refer to DLA Supported Layers for more information.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as, "Unable to load library: nvinfer_builder_resource.dll", if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • The builder may require up to 60% more memory to build an engine.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or later).
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in NVIDIA JetPack 4.5 for ResNet-like networks on NVIDIA DLA on Xavier platforms when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues. NVIDIA Orin platforms are not affected by this.

Fixed Issues

  • The standalone Python wheel files for TensorRT 8.4.1 were much larger than necessary. We have removed some duplication within the Python wheel files, which has resulted in a file size reduction.
  • When parsing networks with random fill nodes defined within conditionals, TensorRT would error out. The issue has been fixed and these networks can now successfully compile.
  • When using multiple Convolution layers using the same input and wrapped with Q/DQ layers, TensorRT could have produced inaccurate results. This issue has been fixed in this release.
  • Calling a Max Reduction on a shape tensor that has a non-power-of-two volume of index dimensions could produce undefined results. This issue has been fixed in this release.
  • There was a known regression with the encoder model. The encoder model could be built successfully with TensorRT 8.2 but would fail with TensorRT 8.4. This issue has been fixed in this release.
  • An assertion error occurred when the constant folding of Boolean type for the slice operation was not enabled. The constant folding of Boolean type for the slice op is now enabled.
  • Parsing ONNX models with conditional nodes that contained the same initializer names would sometimes produce incorrect results. This issue has been fixed in this release.
  • An assertion in TensorRT occurred when horizontally fusing Convolution or Matrix Multiplication operations that have weights in different precisions. This issue has been fixed in this release.
  • Certain models including but not limited to those with loops or conditionals were susceptible to an allocation-related assertion failure due to a race condition. This issue has been fixed in this release.
  • When using the IAlgorithmSelector interface, if BuilderFlag::kREJECT_EMPTY_ALGORITHMS was not set, an assertion occurred when the number of algorithms was zero. This issue has been fixed in this release.

Announcements

  • CUDA 11.7 added a feature called Lazy loading; however, this feature is not supported by TensorRT 8.4 because the CUDA 11.x binaries were built with CUDA Toolkit 11.6.

Known Issues

Functional
  • When performing an L2_Normalization in float16 precision, there is undefined behavior occurring from a fusion. This fusion can be disabled by marking the input to the L2_Normalization as a network output.
  • When performing PTQ with TensorRT with tensors rank > 4, some layers may cause an assertion about invalid Region Dims. This can be worked around by fusing the index layers into the 4th dimension so that the tensor has rank 4.
  • SM75 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes which quantize the failing layers.
  • For some networks using sparsity, TensorRT may produce inaccurate results.
  • When the TensorRT static library is used to build engines and the NVPTXCompiler static library is also used outside of the TensorRT core library at the same time, it is possible to trigger a crash of the process in rare cases.
  • TensorRT should only allow up to a total of 16 I/O tensors for a single subnetwork offloaded to DLA. However, there is a leak in the logic that incorrectly allows > 16 I/O tensors. You may need to manually specify the per layer device to avoid the creation of subnetworks with over 16 I/O tensors, for successful engine construction. This restriction will be properly reinstated in a future release.
  • One of the deconvolution algorithms sourced from cuDNN exhibits non-deterministic execution. Disabling cuDNN tactics will prevent this algorithm from being chosen (refer to IBuilderConfig::setTacticSources).
  • Some models may fail on SBSA platforms when using statically linked binaries.
  • For the HuggingFace demos, the T5-3B model has only been verified on A100, and is not expected to work on A10, T4, and so on.
  • TensorRT in FP16 mode does not perform cast operations correctly when only the output types are set, but not the layer precisions (a minimal sketch of setting both together follows this list).
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • There is a known issue when ProfilingVerbosity is set to kDETAILED, the enqueueV2() call may take up to 2ms compared to ProfilingVerbosity=kNONE or kLAYER_NAMES_ONLY.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9 or newer. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9 or newer.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • You may see the following error:
    "Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory"
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit), however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
  • For some networks, using batch sizes larger than 32 may cause accuracy degradation on DLA.
  • Certain spatial dimensions may cause crashes during DLA optimization for models using single-channel inputs.
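
For the FP16 cast issue above, one way to express the intent explicitly is to set the layer precision together with the output type rather than setting the output type alone. The following is a minimal sketch, assuming an existing IBuilderConfig named config and an ILayer named layer; the function name is illustrative only.

    #include "NvInfer.h"

    // Minimal sketch: pin a layer to FP16 by setting both the layer precision and
    // its output type, and ask the builder to obey the per-layer constraints.
    void pinLayerToHalf(nvinfer1::IBuilderConfig& config, nvinfer1::ILayer& layer)
    {
        config.setFlag(nvinfer1::BuilderFlag::kFP16);
        config.setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);
        layer.setPrecision(nvinfer1::DataType::kHALF);
        layer.setOutputType(0, nvinfer1::DataType::kHALF);
    }
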
Performance
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is an up to 22% performance drop for Jasper networks compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta or NVIDIA Turing GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 5% performance drop for the InceptionV4 network compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 27% performance drop for BART compared to TensorRT 8.2 when running with both FP16 and INT8 precisions enabled on T4. This performance drop can be fixed by disabling the INT8 precision flag.
  • There is an up to 5% performance drop for the ShuffleNet network compared to TensorRT 8.2 when running in INT8 precision on NVIDIA Ampere Architecture GPUs. This will be fixed in a future TensorRT release.
  • There is an up to 10% performance drop for the SegResNet network compared to TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere Architecture GPUs due to a cuDNN regression in the InstanceNormalization plug-in. This will be fixed in a future TensorRT release. You can work around the regression by reverting the cuDNN version to cuDNN 8.2.1.
  • There is an up to 10% performance difference for the WaveRNN network between different operating systems when running in FP16 precision on NVIDIA Ampere Architecture GPUs. This will be fixed in a future TensorRT release.
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to when running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows 32% drop when the network runs on DLA as compared to when the last SoftMax layer runs on a GPU.
  • There is an up to 7% performance regression for the 3D-UNet networks compared to TensorRT 8.4 EA when running in INT8 precision on NVIDIA Orin due to a functionality fix.
  • There is an up to 20% performance variation between different engines built from the same network for some LSTM networks when running on Windows due to unstable tactic selections.
  • Some networks may see a small increase in deserialization time.
  • Due to the difference in DLA hardware specification between NVIDIA Orin and Xavier, a relative increase in latency is expected when running DLA FP16 operations involving convolution (which includes deconvolution, fully-connected, and concat) on NVIDIA Orin as compared to running on Xavier. At the same DLA clocks and memory bandwidth, INT8 convolution operations on NVIDIA Orin are expected to be about 4x faster than on Xavier, whereas FP16 convolution operations on NVIDIA Orin are expected to be about 40% slower than on Xavier.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • For transformer-based networks such as BERT and GPT, TensorRT can consume CPU memory up to 10 times the model size during compilation.
  • There is an up to 17% performance regression for DeepASR networks at BS=1 on NVIDIA Turing GPUs.
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and NVIDIA Pascal GPUs.
  • There is an up to 10-11% performance regression on Xavier compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • On Xavier, DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.

2.8. TensorRT Release 8.4.1

These are the TensorRT 8.4.1 Release Notes and are applicable to x86 Linux and Windows users. This release incorporates Arm® based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT releases as well as the following additional changes.

These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Key Features and Enhancements

This TensorRT release includes the following key features and enhancements.
  • Added sampleOnnxMnistCoordConvAC, which contains custom CoordConv layers. It converts a model trained on the MNIST dataset in ONNX format to a TensorRT network and runs inference on the network. Scripts for generating the ONNX model are also provided.
  • The DLA slice layer is supported for DLA 3.9.0 and later. The DLA SoftMax layer is supported for DLA 1.3.8.0 and later. Refer to DLA Supported Layers for more information.
  • Added a Revision History section to the NVIDIA TensorRT Developer Guide to help identify content that’s been added or updated since the previous release.
  • Reduced engine file size and runtime memory use on some networks with large spatial dimension convolution or deconvolution layers.
  • The following C++ API functions and enums were added:
  • The following Python API functions and enums were added:
  • Improved the performance of some convolutional neural networks trained in TensorFlow and exported using the tf2onnx tool or running in TF-TRT.
  • Added the --layerPrecisions and --layerOutputTypes flags to the trtexec tool to allow you to specify layer-wise precision constraints and layer-wise output type constraints.
  • Added the --memPoolSize flag to the trtexec tool to allow you to specify the size of the workspace as well as the DLA memory pools using a unified interface.
  • Added a new interface to customize and query the sizes of the three DLA memory pools: managed SRAM, local DRAM, and global DRAM. For consistency with past behavior, the pool sizes apply per-subgraph (that is, per-loadable). Upon loadable compilation success, the builder reports the actual amount of memory used per pool by each loadable, thus allowing for fine-tuning; upon failure due to insufficient memory a message will be emitted.

    There are also changes outside the scope of DLA: the existing API to specify and query the workspace size (setMaxWorkspaceSize, getMaxWorkspaceSize) has been deprecated and integrated into the new API. Also, the default workspace size has been updated to the device-global memory size, and the TensorRT samples have had their specific workspace sizes removed in favor of the new default value. Refer to Customizing DLA Memory Pools for more information. A minimal sketch of this interface is shown after this list.

  • Added support for NVIDIA BlueField®-2 data processing units (DPUs), both A100X and A30X variants when using the Arm Server Base System Architecture (SBSA) packages.
  • Added support for NVIDIA JetPack 5.0 users. NVIDIA Xavier and NVIDIA Orin™ based devices are supported.
  • Added support for the dimensions labeled with the same subscript in IEinsumLayer to be broadcastable.
  • Added asymmetric padding support for 3D or dilated deconvolution layers on sm70+ GPUs, when the accumulation of kernel size is equal to or less than 32.
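
The following is a minimal sketch of the DLA memory pool and workspace interface described above, assuming an existing IBuilderConfig named config; the pool sizes are illustrative only.

    #include "NvInfer.h"
    #include <cstddef>

    // Minimal sketch: configure the workspace and per-loadable DLA memory pools
    // through the unified memory pool interface, then query one limit back.
    void configureMemoryPools(nvinfer1::IBuilderConfig& config)
    {
        // Workspace pool (replaces the deprecated setMaxWorkspaceSize API).
        config.setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 1ULL << 30);

        // Per-loadable DLA pools; the sizes here are illustrative only.
        config.setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_MANAGED_SRAM, 1ULL << 20);
        config.setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_LOCAL_DRAM, 1ULL << 30);
        config.setMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_GLOBAL_DRAM, 1ULL << 29);

        // The configured limits can be queried back for verification.
        std::size_t const sramLimit = config.getMemoryPoolLimit(nvinfer1::MemoryPoolType::kDLA_MANAGED_SRAM);
        static_cast<void>(sramLimit);
    }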

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
  • APIs deprecated in TensorRT 8.4 will be retained until at least 2/2023.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • There are two modes of DLA softmax where the mode is chosen automatically based on the shape of the input tensor, where:
    • the first mode triggers when all non-batch, non-axis dimensions are 1, and
    • the second mode triggers in other cases if valid.

    The second of the two modes is supported only for DLA 3.9.0 and later. It involves approximations that may result in errors of a small degree. Also, batch size greater than 1 is supported only for DLA 3.9.0 and later. Refer to DLA Supported Layers for more information.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
  • You may encounter an error such as, "Unable to load library: nvinfer_builder_resource.dll", if using Python 3.9.10 on Windows. You can work around this issue by downgrading to an earlier version of Python 3.9.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • The builder may require up to 60% more memory to build an engine.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or later).
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in NVIDIA JetPack 4.5 for ResNet-like networks on NVIDIA DLA on Xavier platforms when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues. NVIDIA Orin platforms are not affected by this.

Deprecated and Removed Features

The following features are deprecated in TensorRT 8.4.1:
  • Removed sampleNMT.
  • The End-To-End Host Latency metric in trtexec output has been removed to avoid confusion. Use the “Host Latency” metric instead as the performance metric. For more information, refer to Benchmarking Network.
  • CentOS Linux 8 has reached End-of-Life on Dec 31, 2021. Support for this OS will be deprecated in the next TensorRT release. CentOS Linux 8 support will be completely removed in a future release.
  • In previous TensorRT releases, PDF documentation was included inside the TensorRT package. The PDF documentation has been removed from the package in favor of online documentation, which is updated regularly. Online documentation can be found at https://docs.nvidia.com/deeplearning/tensorrt/index.html.
  • The TensorRT shared library files no longer have RUNPATH set to $ORIGIN. This setting was causing unintended behavior for some users. If you relied on this setting before you may have trouble with missing library dependencies when loading TensorRT. It is preferred that you manage your own library search path using LD_LIBRARY_PATH or a similar method.

Fixed Issues

  • If the TensorRT Python bindings were used without a GPU present, such as when the NVIDIA Container Toolkit is not installed or enabled before running Docker, then you may have encountered an infinite loop that required the process to be killed in order to terminate the application. This issue has been fixed in this release.
  • The EngineInspector detailed layer information always showed batch size = 1 when the engine was built with implicit batch dimensions. This issue has been fixed in this release.
  • The IElementWiseLayer and IUnaryLayer layers can accept different input datatypes depending on the operation that is used. The documentation was updated to explicitly show which datatypes are supported. For more information, refer to IElementWiseLayer and IUnaryLayer.
  • When running ONNX models with dynamic shapes, there was a potential accuracy issue if the dimension names of the inputs that were expected to be the same were not. For example, if a model had two 2D inputs of which the dimension semantics were both batch and seqlen, and in the ONNX model, the dimension name of the two inputs were different, there was a potential accuracy issue when running with dynamic shapes. This issue has been fixed in this release.
  • There was an up to 15% performance regression for networks with a Pooling layer located before or after a Concatenate layer. This regression has been fixed in this release.
  • The engine building time for networks using 3D convolution, like 3d_unet, is up to 500% longer compared to TensorRT 8.0 because many fast kernels were added, which lengthens the profiling time.
  • TensorRT bundled a version of libnvptxcompiler_static.a inside libnvinfer_static.a. If an application links with a different version of PTXJIT than the version used to build TensorRT, it may lead to symbol conflicts or undesired behavior. This issue has been fixed in this release. TensorRT no longer archives the public libnvptxcompiler_static.a and libnvrtc_static.a libraries into libnvinfer_static.a.
  • There was an up to 10% performance regression for ResNeXt networks with small batch (1 or 2) in FP32 compared to TensorRT 6 on Xavier. This regression has been fixed in this release.
  • TensorRT could have experienced some instability when running networks containing TopK layers on T4 under Azure VM. This issue has been fixed in this release.
  • There was a potential memory leak while running models containing the Einsum op. This issue has been fixed in this release.
  • On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was artificially restricting the amount of available memory has been fixed. A side effect was that the TensorRT optimizer was able to choose layer implementations that use more memory, which could cause the OOM Killer to trigger for networks where it previously did not. This issue has been fixed in this release.
  • TensorRT had limited support for fusing IConstantLayer and IShuffleLayer. In explicit-quantization mode, the weights of Convolutions and FullyConnected layers had to be fused. Therefore, if a weights shuffle pattern was not supported, it could lead to a failure to quantize the layer. This issue has been fixed in this release.
  • Networks that used certain pointwise operations not preceded by convolutions or deconvolutions and followed by slicing on spatial dimensions could crash in the optimizer. This issue has been fixed in this release.
  • When running the Python engine_refit_mnist, network_api_pytorch_mnist, or onnx_packnet samples, you may have encountered Illegal instruction (core dumped) when using the CPU version of PyTorch on Jetson TX2. The READMEs for these samples have been updated with instructions on how to install a GPU-enabled version of PyTorch.
  • Intermittent accuracy issues were observed in sample_mnist with INT8 precision on WSL2. This issue has been fixed in this release.
  • The TensorRT plug-ins library used a logger that was not thread-safe and that could cause data races. This issue has been fixed in this release.
  • For a quantized (QAT) network with ConvTranspose followed by BN, ConvTranspose would be quantized first and then BN would be fused to ConvTranspose. This fusion was wrong and caused incorrect outputs. This issue has been fixed in this release.
  • During the graph optimization, new nodes were added but there was no mechanism preventing the duplication of node names. This issue has been fixed in this release.
  • For some networks with large amounts of weights and activation data, TensorRT failed to compile a subgraph, and that subgraph would fall back to GPU. Now, rather than the whole subgraph falling back to GPU, only the single node that cannot be run on DLA falls back to GPU.
  • There was a known functional issue when running networks containing 3D deconvolution layers on L4T. This issue has been fixed in this release.
  • There was a known functional issue when running networks containing convolution layers on K80. This issue has been fixed in this release.
  • A small portion of the data of the inference results of the LSTM graph of a specific pattern was non-deterministic occasionally. This issue has been fixed in this release.
  • A small portion of the LSTM graph, in which multiple MatMul layers have opA/opB==kTRANSPOSE consuming the same input tensor, may have failed to build the engine. This issue has been fixed in this release.
  • For certain networks built for the Xavier GPU, the deserialized engine may have allocated more GPU memory than necessary. This issue has been fixed in this release.
  • If a network had a Gather layer with both indices and input dynamic, and the optimization profile had a large dynamic range (difference between max and min), TensorRT could request a very large workspace. This issue has been fixed in this release.

Announcements

  • Python support for Windows included in the zip package is ready for production use.
  • CUDA 11.7 added a feature called Lazy loading; however, this feature is not supported by TensorRT 8.4 because the CUDA 11.x binaries were built with CUDA Toolkit 11.6.

Known Issues

Functional
  • Calling a Max Reduction on a shape tensor that has a non-power-of-two volume of index dimensions can produce undefined results. This can be fixed by padding the index dimensions to have a volume equal to a power of two.
  • When performing an L2_Normalization in float16 precision, there is undefined behavior occurring from a fusion. This fusion can be disabled by marking the input to the L2_Normalization as a network output.
  • When performing PTQ with TensorRT with tensors rank > 4, some layers may cause an assertion about invalid Region Dims. This can be worked around by fusing the index layers into the 4th dimension so that the tensor has rank 4.
  • SM75 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes which quantize the failing layers.
  • For some networks using sparsity, TensorRT may produce inaccurate results.
  • When the TensorRT static library is used to build engines and the NVPTXCompiler static library is also used outside of the TensorRT core library at the same time, it is possible to trigger a crash of the process in rare cases.
  • TensorRT should only allow up to a total of 16 I/O tensors for a single subnetwork offloaded to DLA. However, there is a leak in the logic that incorrectly allows > 16 I/O tensors. You may need to manually specify the per layer device to avoid the creation of subnetworks with over 16 I/O tensors, for successful engine construction. This restriction will be properly reinstated in a future release.
  • When using multiple Convolution layers using the same input and wrapped with Q/DQ layers, TensorRT may produce inaccurate results.
  • One of the deconvolution algorithms sourced from cuDNN exhibits non-deterministic execution. Disabling cuDNN tactics will prevent this algorithm from being chosen (refer to IBuilderConfig::setTacticSources).
  • Some models may fail on SBSA platforms when using statically linked binaries.
  • For the HuggingFace demos, the T5-3B model has only been verified on A100, and is not expected to work on A10, T4, and so on.
  • TensorRT in FP16 mode does not perform cast operations correctly when only the output types are set, but not the layer precisions.
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • There is a known issue when ProfilingVerbosity is set to kDETAILED: the enqueueV2() call may take up to 2 ms longer compared to ProfilingVerbosity=kNONE or kLAYER_NAMES_ONLY.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9 or newer. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9 or newer.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • You may see the following error:
    "Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory"
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit), however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
  • For some networks, using batch sizes larger than 32 may cause accuracy degradation on DLA.
  • Certain spatial dimensions may cause crashes during DLA optimization for models using single-channel inputs.
Performance
  • There is a known regression with the encoder model. The encoder model can be built successfully with TensorRT 8.2 but fails with TensorRT 8.4.
  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).
  • There is an up to 22% performance drop for Jasper networks compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta or NVIDIA Turing GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 5% performance drop for the InceptionV4 network compared to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used instead.
  • There is an up to 27% performance drop for BART compared to TensorRT 8.2 when running with both FP16 and INT8 precisions enabled on T4. This performance drop can be fixed by disabling the INT8 precision flag.
  • There is an up to 5% performance drop for the ShuffleNet network compared to TensorRT 8.2 when running in INT8 precision on NVIDIA Ampere Architecture GPUs. This will be fixed in a future TensorRT release.
  • There is an up to 10% performance drop for the SegResNet network compared to TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere Architecture GPUs due to a cuDNN regression in the InstanceNormalization plug-in. This will be fixed in a future TensorRT release. You can work around the regression by reverting the cuDNN version to cuDNN 8.2.1.
  • There is an up to 10% performance difference for the WaveRNN network between different operating systems when running in FP16 precision on NVIDIA Ampere Architecture GPUs. This will be fixed in a future TensorRT release.
  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to when running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows 32% drop when the network runs on DLA as compared to when the last SoftMax layer runs on a GPU.
  • There is an up to 7% performance regression for the 3D-UNet networks compared to TensorRT 8.4 EA when running in INT8 precision on NVIDIA Orin due to a functionality fix.
  • There is an up to 20% performance variation between different engines built from the same network for some LSTM networks when running on Windows due to unstable tactic selections.
  • Some networks may see a small increase in deserialization time.
  • Due to the difference in DLA hardware specification between NVIDIA Orin and Xavier, a relative increase in latency is expected when running DLA FP16 operations involving convolution (which includes deconvolution, fully-connected, and concat) on NVIDIA Orin as compared to running on Xavier. At the same DLA clocks and memory bandwidth, INT8 convolution operations on NVIDIA Orin are expected to be about 4x faster than on Xavier, whereas FP16 convolution operations on NVIDIA Orin are expected to be about 40% slower than on Xavier.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
  • For transformer-based networks such as BERT and GPT, TensorRT can consume CPU memory up to 10 times the model size during compilation.
  • There is an up to 17% performance regression for DeepASR networks at BS=1 on NVIDIA Turing GPUs.
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and NVIDIA Pascal GPUs.
  • There is an up to 10-11% performance regression on Xavier compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • On Xavier, DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.

2.9. TensorRT Release 8.4.0 Early Access (EA)

These are the TensorRT 8.4.0 Early Access (EA) Release Notes and are applicable to x86 Linux and Windows users. This release incorporates ARM® based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes.

These Release Notes are also applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

This EA release is for early testing and feedback. For production use of TensorRT, continue to use TensorRT 8.2.3 or later TensorRT 8.2.x patch.
Note: TensorRT 8.4 EA does not include updates to the CUDA network repository. You should use the local repo installer package instead.

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.
  • Reduce engine file size and runtime memory usage on some networks with large spatial dimension convolution or deconvolution layers.
  • The following C++ API functions and enums were added:
  • The following Python API functions and enums were added:
  • Improved the performance of some convolutional neural networks trained in TensorFlow and exported using the tf2onnx tool or running in TF-TRT.
  • Added the --layerPrecisions and --layerOutputTypes flags to the trtexec tool to allow you to specify layer-wise precision constraints and layer-wise output type constraints.
  • Added the --memPoolSize flag to the trtexec tool to allow you to specify the size of the workspace as well as the DLA memory pools via a unified interface.
  • Added a new interface to customize and query the sizes of the three DLA memory pools: managed SRAM, local DRAM, and global DRAM. For consistency with past behavior, the pool sizes apply per-subgraph (i.e. per-loadable). Upon loadable compilation success, the builder reports the actual amount of memory used per pool by each loadable, thus allowing for fine-tuning; upon failure due to insufficient memory a message will be emitted.

    There are also changes outside the scope of DLA: the existing API to specify and query the workspace size (setMaxWorkspaceSize, getMaxWorkspaceSize) has been deprecated and integrated into the new API. Also, the default workspace size has been updated to the device global memory size, and the TensorRT samples have had their specific workspace sizes removed in favor of the new default value. Refer to Customizing DLA Memory Pools for more information.

  • Added support for NVIDIA BlueField®-2 data processing units (DPUs), both A100X and A30X variants when using the ARM Server Base System Architecture (SBSA) packages.
  • Added support for NVIDIA JetPack 5.0 users. NVIDIA Xavier and NVIDIA Orin™ based devices are supported.
  • Added support for the dimensions labeled with the same subscript in IEinsumLayer to be broadcastable (a minimal sketch follows this list).
  • Added asymmetric padding support for 3D or dilated deconvolution layers on sm70+ GPUs, when the accumulation of kernel size is equal to or less than 32.
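
The following is a minimal sketch of an IEinsumLayer whose repeated batch subscript can broadcast between the two inputs, assuming an existing INetworkDefinition and input tensors; the names and equation are illustrative only.

    #include "NvInfer.h"

    // Minimal sketch: a batched matrix multiply expressed with Einsum. Because
    // dimensions labeled with the same subscript may broadcast, one input may have
    // size 1 on the "b" dimension while the other carries the full batch.
    nvinfer1::ITensor* addBroadcastEinsum(nvinfer1::INetworkDefinition& network,
                                          nvinfer1::ITensor& a, nvinfer1::ITensor& b)
    {
        nvinfer1::ITensor* inputs[] = {&a, &b};
        nvinfer1::IEinsumLayer* einsum = network.addEinsum(inputs, 2, "bij,bjk->bik");
        return einsum->getOutput(0);
    }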

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
  • APIs deprecated in TensorRT 8.4 will be retained until at least 2/2023.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • When static linking with cuDNN, cuBLAS, and cuBLASLt libraries, TensorRT requires CUDA >=11.3.
  • TensorRT attempts to catch GPU memory allocation failure and avoid profiling tactics whose memory requirements would trigger Out of Memory. However, GPU memory allocation failure cannot be handled by CUDA gracefully on some platforms and would lead to an unrecoverable application status. If this happens, consider lowering the specified workspace size if a large size is set, or using the IAlgorithmSelector interface to avoid tactics that require a lot of GPU memory. A minimal selector sketch is shown after this list.
  • 3D Asymmetric Padding is not supported on GPUs older than the NVIDIA Volta GPU architecture (compute capability 7.0).
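
The IAlgorithmSelector workaround mentioned above can be implemented as a small filter over candidate tactics. This is a minimal sketch, assuming the selectAlgorithms/reportAlgorithms signatures and IAlgorithm::getWorkspaceSize as documented for this release; the class name and the 1 GiB budget are illustrative only.

    #include <cstdint>
    #include "NvInfer.h"

    // Hedged sketch: keep only tactics whose workspace requirement fits a budget.
    class MemoryBudgetSelector : public nvinfer1::IAlgorithmSelector
    {
    public:
        int32_t selectAlgorithms(nvinfer1::IAlgorithmContext const& context,
            nvinfer1::IAlgorithm const* const* choices, int32_t nbChoices,
            int32_t* selection) noexcept override
        {
            int32_t nbSelected = 0;
            for (int32_t i = 0; i < nbChoices; ++i)
            {
                // Assumed 1 GiB workspace budget; adjust to your platform.
                if (choices[i]->getWorkspaceSize() <= (1ULL << 30))
                {
                    selection[nbSelected++] = i;
                }
            }
            // Returning 0 selections lets TensorRT fall back to its own choice.
            return nbSelected;
        }

        void reportAlgorithms(nvinfer1::IAlgorithmContext const* const* contexts,
            nvinfer1::IAlgorithm const* const* choices, int32_t nbAlgorithms) noexcept override
        {
            // Reporting is not needed for this sketch.
        }
    };

The selector is then attached with IBuilderConfig::setAlgorithmSelector before building the engine.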

Deprecated And Removed Features

The following features are deprecated in TensorRT 8.4.0 EA:
  • The following C++ API functions and classes were deprecated:
    • IFullyConnectedLayer
    • getMaxWorkspaceSize
    • setMaxWorkspaceSize
  • The following Python API functions and classes were deprecated:
    • IFullyConnectedLayer
    • get_max_workspace_size
    • set_max_workspace_size
  • The --workspace flag in trtexec has been deprecated. When neither --workspace nor --memPoolSize is specified, TensorRT now allocates as much workspace as available GPU memory by default, instead of the 16 MB default workspace size limit used by trtexec in TensorRT 8.2. To limit the workspace size, use the --memPoolSize=workspace:<size> flag instead.
  • The IFullyConnectedLayer operation is deprecated. Typically, you should replace it with IMatrixMultiplyLayer. The MatrixMultiply layer does not currently support all data layouts supported by the FullyConnected layer, so additional work may be required when using BuilderFlag::kDIRECT_IO if the input of the MatrixMultiply layer is a network I/O tensor:
    • If the MatrixMultiply layer is forced to INT8 precision via a combination of
      • ILayer::setPrecision(DataType::kINT8)
      • IBuilderConfig::setFlag(BuilderFlag::kOBEY_PRECISION_CONSTRAINTS)
      the engine will fail to build.
    • If the MatrixMultiply layer is preferred to run on DLA and GPU fallback is allowed via a combination of
      • IBuilderConfig->setDeviceType(matrixMultiplyLayer,
            DeviceType::kDLA)
      • IBuilderConfig->setFlag(BuilderFlag::kGPU_FALLBACK)
      the layer will fall back to run on the GPU.
    • If the MatrixMultiply layer is required to run on DLA and GPU fallback is not allowed via
      • IBuilderConfig->setDeviceType(matrixMultiplyLayer,
            DeviceType::kDLA)
      the engine will fail to build.

      To resolve these issues, either relax one of the constraints, or use IConvolutionLayer to create a Convolution 1x1 layer to replace IFullyConnectedLayer.

      Refer to the MNIST API samples (C++, Python) for examples of migrating from IFullyConnectedLayer to IMatrixMultiplyLayer; a minimal migration sketch is also shown below.
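
The following is a minimal migration sketch, assuming an NCHW input and FC-style kernel weights laid out as K rows of C*H*W values; the helper name and its parameters are illustrative placeholders, not TensorRT API.

    #include "NvInfer.h"

    // Hedged migration sketch only: replaces a deprecated
    // network.addFullyConnected(input, k, kernel, bias) call with
    // Shuffle + Constant + MatrixMultiply + ElementWise kSUM.
    nvinfer1::ITensor* addFullyConnectedAsMatMul(
        nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input,
        int32_t c, int32_t h, int32_t w, int32_t k,
        nvinfer1::Weights kernel, nvinfer1::Weights bias)
    {
        // Collapse the {N, C, H, W} input to {N, C*H*W} so it can feed a MatMul.
        auto* reshape = network.addShuffle(input);
        reshape->setReshapeDimensions(nvinfer1::Dims2{-1, c * h * w});

        // The FC kernel as a {K, C*H*W} constant (the layout addFullyConnected
        // expects), transposed inside the MatMul instead of reordering the data.
        auto* weights = network.addConstant(nvinfer1::Dims2{k, c * h * w}, kernel);

        // {N, C*H*W} x {C*H*W, K} -> {N, K} (second operand transposed)
        auto* matMul = network.addMatrixMultiply(
            *reshape->getOutput(0), nvinfer1::MatrixOperation::kNONE,
            *weights->getOutput(0), nvinfer1::MatrixOperation::kTRANSPOSE);

        // Broadcast-add the bias, stored as a {1, K} constant.
        auto* biasConst = network.addConstant(nvinfer1::Dims2{1, k}, bias);
        auto* biasAdd = network.addElementWise(
            *matMul->getOutput(0), *biasConst->getOutput(0),
            nvinfer1::ElementWiseOperation::kSUM);

        // If downstream layers expect the FC output shape {N, K, 1, 1}, add another
        // IShuffleLayer here to restore the trailing unit dimensions.
        return biasAdd->getOutput(0);
    }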

Fixed Issues

  • The EngineInspector detailed layer information always showed batch size = 1 when the engine was built with implicit batch dimension. This issue has been fixed in this release.
  • The IElementWiseLayer and IUnaryLayer layers can accept different input datatypes depending on the operation that is used. The documentation was updated to explicitly show which datatypes are supported. For more information, refer to IElementWiseLayer and IUnaryLayer.
  • When running ONNX models with dynamic shapes, there was a potential accuracy issue if the dimension names of the inputs that were expected to be the same were not. For example, if a model had two 2D inputs of which the dimension semantics were both batch and seqlen, and in the ONNX model, the dimension name of the two inputs were different, there was a potential accuracy issue when running with dynamic shapes. This issue has been fixed in this release.
  • There was an up to 15% performance regression for networks with a Pooling layer located before or after a Concatenate layer. This regression has been fixed in this release.
  • The engine building time for the networks using 3D convolution, like 3d_unet, is up to 500% longer compared to TensorRT 8.0 due to many fast kernels being added in, which enlarges the profiling time.

Known Issues

Functional
  • There is a known functional issue when running networks containing 3D deconvolution layers on L4T.
  • There is a known functional issue when running networks containing convolution layers on K80.
  • For LSTM graphs with a specific pattern, a small portion of the inference results is occasionally non-deterministic.
  • If a network has a Gather layer with both indices and input dynamic and the optimization profile has a large dynamic range (difference between max and min), TensorRT could request a very large workspace.
  • For the HuggingFace demos, the T5-3B model has only been verified on A100, and is not expected to work on A10, T4, etc.
  • For a quantized (QAT) network with ConvTranspose followed by BN, ConvTranspose will be quantized first and then BN will be fused to ConvTranspose. This fusion is wrong and causes incorrect outputs.
  • LSTM graphs in which multiple MatMul layers with opA/opB==kTRANSPOSE consume the same input tensor may fail to build the engine.
  • During graph optimization, new nodes are added, but there is no mechanism to prevent duplicate node names.
  • TensorRT in FP16 mode does not perform cast operations correctly when only the output types are set, but not the layer precisions. A hedged workaround sketch is shown after this list.
  • TensorRT does not preserve precision for operations that are imported from ONNX models in FP16 mode.
  • The TensorRT plugins library uses a logger that is not thread-safe, which can cause data races.
  • There is a potential memory leak while running models containing the Einsum op.
  • There is a known issue that, when ProfilingVerbosity is set to kDETAILED, the enqueueV2() call may take up to 2 ms, compared to ProfilingVerbosity=kNONE or kLAYER_NAMES_ONLY.
  • TensorRT may experience some instabilities when running networks containing TopK layers on T4 under Azure VM.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • For some networks with large amounts of weights and activation data, DLA may fail compiling a subgraph, and that subgraph will fall back to the GPU.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or above).
  • On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was artificially restricting the amount of available memory has been fixed. A side effect is that the TensorRT optimizer is able to choose layer implementations that use more memory, which can cause the OOM Killer to trigger for networks where it previously didn't. To work around this problem, use the IAlgorithmSelector interface to avoid layer implementations that require a lot of memory, or use the layer precision API to reduce precision of large tensors and use STRICT_TYPES, or reduce the size of the input tensors to the builder by reducing batch or other higher dimensions.
  • TensorRT bundles a version of libnvptxcompiler_static.a inside libnvinfer_static.a. If an application links with a different version of PTXJIT than the version used to build TensorRT, it may lead to symbol conflicts or undesired behavior.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9.
  • TensorRT has limited support for fusing IConstantLayer and IShuffleLayer. In explicit-quantization mode, the weights of Convolutions and Fully-Connected layers must be fused. Therefore, if a weights-shuffle is not supported, it may lead to failure to quantize the layer.
  • For DLA networks where a convolution layer consumes an NHWC network input, the compute precision of the convolution layer must match the data type of the input tensor.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • When running the Python engine_refit_mnist, network_api_pytorch_mnist, or onnx_packnet samples, you may encounter Illegal instruction (core dumped) when using the CPU version of PyTorch on Jetson TX2. The workaround is to install a GPU enabled version of PyTorch as per the instructions in the sample READMEs.
  • Intermittent accuracy issues are observed in sample_mnist with INT8 precision on WSL2.
  • You may see the following error:
    Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit), however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
  • For some networks, using batch sizes larger than 32 may cause accuracy degradation on DLA.
  • Certain spatial dimensions may cause crashes during DLA optimization for models using single-channel inputs.
  • Networks that use certain pointwise operations not preceded by convolutions or deconvolutions and followed by slicing on spatial dimensions may crash in the optimizer.
  • The builder may require up to 60% more memory to build an engine.
  • If the TensorRT Python bindings are used without a GPU present, such as when the NVIDIA Container Toolkit is not installed or enabled before running Docker, then you may encounter an infinite loop which requires the process to be killed in order to terminate the application.
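
As a hedged workaround sketch for the FP16 cast issue noted in the list above, the layer precision can be set together with the output types and the precision constraints made binding. The function name is illustrative, and applying kHALF to every layer is shown purely for brevity; in practice, restrict this to the layers you want pinned to FP16.

    #include "NvInfer.h"

    // Hedged sketch: pin both the compute precision and the output types so that
    // casts are inserted consistently, then make the constraints binding.
    void pinLayerPrecisionToHalf(nvinfer1::INetworkDefinition& network,
                                 nvinfer1::IBuilderConfig& config)
    {
        for (int32_t i = 0; i < network.getNbLayers(); ++i)
        {
            nvinfer1::ILayer* layer = network.getLayer(i);
            layer->setPrecision(nvinfer1::DataType::kHALF);
            for (int32_t j = 0; j < layer->getNbOutputs(); ++j)
            {
                layer->setOutputType(j, nvinfer1::DataType::kHALF);
            }
        }
        config.setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);
    }
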
Performance
  • For certain networks built for the Xavier GPU, the deserialized engine may allocate more GPU memory than necessary.
  • Some networks may see a small increase in deserialization time.
  • Due to the difference in DLA hardware specification between Orin and Xavier, a relative increase in latency is expected when running DLA FP16 operations involving convolution (which includes deconvolution, fully-connected, and concat) on Orin as compared to running on Xavier. At the same DLA clocks and memory bandwidth, INT8 convolution operations on Orin are expected to be about 4x faster than on Xavier, whereas FP16 convolution operations on Orin are expected to be about 40% slower than on Xavier.
  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package release notes for details.
  • For transformer based networks such as BERT and GPT, TensorRT can consume CPU memory up to 10 times the model size during compilation.
  • There is an up to 17% performance regression for DeepASR networks at BS=1 on Turing GPUs.
  • If a Pointwise operation has two inputs, a fusion may not be possible, leading to lower performance. For example, MatMul and Sigmoid can typically be fused into ConvActFusion, but not in this scenario.
  • There is an up to 15% performance regression for MaskRCNN-ResNet-101 on Turing GPUs in INT8 precision.
  • There is an up to 23% performance regression for Jasper networks on Volta and Turing GPUs in FP32 precision.
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and Pascal GPUs.
  • There is an up to 10-11% performance regression on Xavier:
    • compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
    • compared to TensorRT 6 for ResNeXt networks with small batch (1 or 2) in FP32.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues.
  • DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 21% performance drop compared to TensorRT 8.0 for SSD-Inception2 networks on NVIDIA Volta GPUs.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.

2.10. TensorRT Release 8.2.5

These are the TensorRT 8.2.5 Release Notes and are applicable to x86 Linux and Windows users. This release incorporates ARM® based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT release as well as the following additional changes.

These Release Notes are also applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Fixed Issues

  • A fast configuration for the SoftMax kernel was not enabled when the kernel was ported from cuDNN. This performance regression has been fixed in this release.
  • The Scale kernel previously handled strides incorrectly (due to concat and slice elision). This issue has been fixed in this release.

Known Issues

Functional
  • TensorRT attempts to catch GPU memory allocation failure and avoid profiling tactics whose memory requirements would trigger Out of Memory. However, GPU memory allocation failure cannot be handled by CUDA gracefully on some platforms and would lead to an unrecoverable application status. If this happens, consider lowering the specified workspace size if a large size is set, or using the IAlgorithmSelector interface to avoid tactics that require a lot of GPU memory.
  • TensorRT may experience some instabilities when running networks containing TopK layers on T4 under Azure VM.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • When running ONNX models with dynamic shapes, there is a potential accuracy issue if the dimension names of the inputs that are expected to be the same are not. For example, if a model has two 2D inputs of which the dimension semantics are both batch and seqlen, and in the ONNX model, the dimension names of the two inputs are different, there is a potential accuracy issue when running with dynamic shapes. Ensure that the dimension semantics match when exporting ONNX models from frameworks.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • For some networks with large amounts of weights and activation data, DLA may fail compiling a subgraph, and that subgraph will fall back to the GPU.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or above).
  • On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was artificially restricting the amount of available memory has been fixed. A side effect is that the TensorRT optimizer is able to choose layer implementations that use more memory, which can cause the OOM Killer to trigger for networks where it previously didn't. To work around this problem, use the IAlgorithmSelector interface to avoid layer implementations that require a lot of memory, or use the layer precision API to reduce precision of large tensors and use STRICT_TYPES, or reduce the size of the input tensors to the builder by reducing batch or other higher dimensions.
  • TensorRT bundles a version of libnvptxcompiler_static.a inside libnvinfer_static.a. If an application links with a different version of PTXJIT than the version used to build TensorRT, it may lead to symbol conflicts or undesired behavior.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9.
  • TensorRT has limited support for fusing IConstantLayer and IShuffleLayer. In explicit-quantization mode, the weights of Convolutions and Fully-Connected layers must be fused. Therefore, if a weights-shuffle is not supported, it may lead to failure to quantize the layer.
  • For DLA networks where a convolution layer consumes an NHWC network input, the compute precision of the convolution layer must match the data type of the input tensor.
  • Hybrid precision is not supported with the Pooling layer. The data type of the input and output tensors should be the same as the layer precision.
  • When running the Python engine_refit_mnist, network_api_pytorch_mnist, or onnx_packnet samples, you may encounter Illegal instruction (core dumped) when using the CPU version of PyTorch on Jetson TX2. The workaround is to install a GPU enabled version of PyTorch as per the instructions in the sample READMEs.
  • Intermittent accuracy issues are observed in sample_mnist with INT8 precision on WSL2.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • You may see the following error:
    "Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory"
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit), however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
Performance
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and Pascal GPUs.
  • There is an up to 8% performance regression compared to TensorRT 7.1 for some networks with heavy FullyConnected operation like VGG16 on NVIDIA Jetson Nano™.
  • There is an up to 10-11% performance regression on Xavier:
    • compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
    • compared to TensorRT 6 for ResNeXt networks with small batch (1 or 2) in FP32.
  • For networks that use deconvolution with a large kernel size, the engine build time for this layer can increase significantly on Xavier. It can also lead to the "launch timed out and was terminated" error message on Jetson Nano/TX1.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues.
  • DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • The builder may require up to 60% more memory to build an engine.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 21% performance drop compared to TensorRT 8.0 for SSD-Inception2 networks on NVIDIA Volta GPUs.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • There is an up to 25% performance drop for networks using the InstanceNorm plugin. This issue is being investigated.
  • The engine building time for the networks using 3D convolution, like 3d_unet, is up to 500% longer compared to TensorRT 8.0 due to many fast kernels being added in, which enlarges the profiling time.

2.11. TensorRT Release 8.2.4

These are the TensorRT 8.2.4 Release Notes and are applicable to x86 Linux and Windows users. This release incorporates ARM® based CPU cores for Server Base System Architecture (SBSA) users on Linux only. This release includes several fixes from the previous TensorRT release as well as the following additional changes.

These Release Notes are also applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Fixed Issues

  • UBSan issues had not been discussed in the documentation. We’ve added a new section that discusses Issues With Undefined Behavior Sanitizer in the NVIDIA TensorRT Developer Guide.
  • There was a functional issue when a horizontal merge was followed by a concat layer with its axis on a non-channel dimension. This issue has been fixed in this release.
  • For a network with floating-point output, when the configuration allows using INT8 in the engine, TensorRT has a heuristic for avoiding excess quantization noise in the output. Previously, the heuristic assumed that plugins were capable of floating-point output if needed, and otherwise the engine failed to build. Now, the engine will build, although without trying to avoid quantization noise from an INT8 output from a plugin. Furthermore, a plugin with an INT8 output that is connected to a network output of type INT8 now works.
  • TensorRT was incorrectly computing the size of tensors during memory allocation computations. This occurred in cases where dynamic shapes triggered an integer overflow on the max opt dimension when accumulating the volumes of all the network I/O tensors.
  • TensorRT incorrectly performed horizontal fusion of batched Matmuls along the batch dimension. The issue is fixed in this release.
  • In some cases TensorRT failed to find a tactic for Pointwise. The issue is fixed in this release.
  • There was a functional issue in fused reduction kernels which would lead to an accuracy drop in WeNet transformer encoder layers. The issue is fixed in this release.
  • There were functional issues when two layers' inputs (or outputs) shared the same IQuantization/IDequantization layer. The issue is fixed in this release.
  • When fusing Convolution+Quantization or Pointwise+Quantization and the output type is constrained to INT8, you had to specify the precision for Convolution and Pointwise operations for other fusions to work correctly. If the Convolution and Pointwise precision had not been configured yet, it would have to be float because INT8 precision requires explicitly fusing with Dequantization. The issue is fixed in this release.
  • There was a known crash when building certain large GPT2-XL model variants. The issue is fixed in this release.

Known Issues

Functional
  • TensorRT attempts to catch GPU memory allocation failure and avoid profiling tactics whose memory requirements would trigger Out of Memory. However, GPU memory allocation failure cannot be handled by CUDA gracefully on some platforms and would lead to an unrecoverable application status. If this happens, consider lowering the specified workspace size if a large size is set, or using the IAlgorithmSelector interface to avoid tactics that require a lot of GPU memory.
  • TensorRT may experience some instabilities when running networks containing TopK layers on T4 under Azure VM.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • When running ONNX models with dynamic shapes, there is a potential accuracy issue if the dimension names of the inputs that are expected to be the same are not. For example, if a model has two 2D inputs of which the dimension semantics are both batch and seqlen, and in the ONNX model, the dimension names of the two inputs are different, there is a potential accuracy issue when running with dynamic shapes. Ensure that the dimension semantics match when exporting ONNX models from frameworks.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • For some networks with large amounts of weights and activation data, DLA may fail compiling a subgraph, and that subgraph will fall back to the GPU.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or above).
  • On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was artificially restricting the amount of available memory has been fixed. A side effect is that the TensorRT optimizer is able to choose layer implementations that use more memory, which can cause the OOM Killer to trigger for networks where it previously didn't. To work around this problem, use the IAlgorithmSelector interface to avoid layer implementations that require a lot of memory, or use the layer precision API to reduce precision of large tensors and use STRICT_TYPES, or reduce the size of the input tensors to the builder by reducing batch or other higher dimensions.
  • TensorRT bundles a version of libnvptxcompiler_static.a inside libnvinfer_static.a. If an application links with a different version of PTXJIT than the version used to build TensorRT, it may lead to symbol conflicts or undesired behavior.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9.
  • TensorRT has limited support for fusing IConstantLayer and IShuffleLayer. In explicit-quantization mode, the weights of Convolutions and Fully-Connected layers must be fused. Therefore, if a weights-shuffle is not supported, it may lead to failure to quantize the layer.
  • For DLA networks where a convolution layer consumes an NHWC network input, the compute precision of the convolution layer must match the data type of the input tensor.
  • Hybrid precision is not supported with the Pooling layer. The data type of the input and output tensors should be the same as the layer precision.
  • When running the Python engine_refit_mnist, network_api_pytorch_mnist, or onnx_packnet samples, you may encounter Illegal instruction (core dumped) when using the CPU version of PyTorch on Jetson TX2. The workaround is to install a GPU enabled version of PyTorch as per the instructions in the sample READMEs.
  • Intermittent accuracy issues are observed in sample_mnist with INT8 precision on WSL2.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • You may see the following error:
    "Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory"
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit), however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
Performance
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and Pascal GPUs.
  • There is an up to 8% performance regression compared to TensorRT 7.1 for some networks with heavy FullyConnected operation like VGG16 on NVIDIA Jetson Nano™.
  • There is an up to 10-11% performance regression on Xavier:
    • compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
    • compared to TensorRT 6 for ResNeXt networks with small batch (1 or 2) in FP32.
  • For networks that use deconvolution with a large kernel size, the engine build time for this layer can increase significantly on Xavier. It can also lead to the "launch timed out and was terminated" error message on Jetson Nano/TX1.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues.
  • DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • The builder may require up to 60% more memory to build an engine.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 21% performance drop compared to TensorRT 8.0 for SSD-Inception2 networks on NVIDIA Volta GPUs.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • There is an up to 25% performance drop for networks using the InstanceNorm plugin. This issue is being investigated.
  • The engine building time for the networks using 3D convolution, like 3d_unet, is up to 500% longer compared to TensorRT 8.0 due to many fast kernels being added in, which enlarges the profiling time.

2.12. TensorRT Release 8.2.3

These are the TensorRT 8.2.3 Release Notes and are applicable to x86 Linux and Windows users. This release incorporates ARM® based CPU cores for Server Base System Architecture (SBSA) users on Linux only.

These release notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Fixed Issues

  • There was a known issue where using a custom allocator with the allocation resizing introduced in TensorRT 8 would trigger an assert about a p.second failure. This was caused by the application passing back to TensorRT the exact same pointer from the re-allocation routine. The assertion has been fixed so that this is now treated as a valid use case.
  • There was an up to 15% performance regression for networks with a Pooling layer located before or after a Concatenate layer. This issue has been fixed in this release.
  • There was an up to 20% performance regression for INT8 QAT networks with a Padding layer located before the Q/DQ and Convolution layer. This issue has been fixed in this release.
  • TensorRT previously did not support explicit quantization (that is, Q/DQ) for batched matrix multiplication. This fix introduces support for the special case of quantized batched matrix multiplication when matrix B is constant and can be squeezed to a 2D matrix. Specifically, in the supported configuration, matrix A (the data) can have shape (BS, M, K), where BS is the batch size, and matrix B (the weights) can have shape (1, K, N). The output has shape (BS, M, N), which is computed by broadcasting the weights across the batch dimension. Quantized batched matrix multiplication has two pairs of Q/DQ nodes that quantize the input data and the weights.
  • An incorrect fusion of two transpose operations caused an assertion to trigger while building the model. This issue has been fixed in this release.

Known Issues

Functional
  • TensorRT attempts to catch GPU memory allocation failure and avoid profiling tactics whose memory requirements would trigger Out of Memory. However, GPU memory allocation failure cannot be handled by CUDA gracefully on some platforms and would lead to an unrecoverable application status. If this happens, consider lowering the specified workspace size if a large size is set, or using the IAlgorithmSelector interface to avoid tactics that require a lot of GPU memory.
  • TensorRT may experience some instabilities when running networks containing TopK layers on T4 under Azure VM.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • When running ONNX models with dynamic shapes, there is a potential accuracy issue if the dimension names of the inputs that are expected to be the same are not. For example, if a model has two 2D inputs of which the dimension semantics are both batch and seqlen, and in the ONNX model, the dimension names of the two inputs are different, there is a potential accuracy issue when running with dynamic shapes. Ensure that the dimension semantics match when exporting ONNX models from frameworks.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • For some networks with large amounts of weights and activation data, DLA may fail compiling a subgraph, and that subgraph will fall back to the GPU.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or above).
  • On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was artificially restricting the amount of available memory has been fixed. A side effect is that the TensorRT optimizer is able to choose layer implementations that use more memory, which can cause the OOM Killer to trigger for networks where it previously didn't. To work around this problem, use the IAlgorithmSelector interface to avoid layer implementations that require a lot of memory, or use the layer precision API to reduce precision of large tensors and use STRICT_TYPES, or reduce the size of the input tensors to the builder by reducing batch or other higher dimensions.
  • TensorRT bundles a version of libnvptxcompiler_static.a inside libnvinfer_static.a. If an application links with a different version of PTXJIT than the version used to build TensorRT, it may lead to symbol conflicts or undesired behavior.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9.
  • TensorRT has limited support for fusing IConstantLayer and IShuffleLayer. In explicit-quantization mode, the weights of Convolutions and Fully-Connected layers must be fused. Therefore, if a weights-shuffle is not supported, it may lead to failure to quantize the layer.
  • For DLA networks where a convolution layer consumes an NHWC network input, the compute precision of the convolution layer must match the data type of the input tensor.
  • Hybrid precision is not supported with the Pooling layer. The data type of the input and output tensors should be the same as the layer precision.
  • When running the Python engine_refit_mnist, network_api_pytorch_mnist, or onnx_packnet samples, you may encounter Illegal instruction (core dumped) when using the CPU version of PyTorch on Jetson TX2. The workaround is to install a GPU enabled version of PyTorch as per the instructions in the sample READMEs.
  • Intermittent accuracy issues are observed in sample_mnist with INT8 precision on WSL2.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • You may see the following error:
    Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit), however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
Performance
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and Pascal GPUs.
  • There is an up to 8% performance regression compared to TensorRT 7.1 for some networks with heavy FullyConnected operation like VGG16 on NVIDIA Jetson Nano™.
  • There is an up to 10-11% performance regression on Xavier:
    • compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
    • compared to TensorRT 6 for ResNeXt networks with small batch (1 or 2) in FP32.
  • For networks that use deconvolution with a large kernel size, the engine build time for this layer can increase significantly on Xavier. It can also lead to the "launch timed out and was terminated" error message on Jetson Nano/TX1.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues.
  • DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • The builder may require up to 60% more memory to build an engine.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 21% performance drop compared to TensorRT 8.0 for SSD-Inception2 networks on NVIDIA Volta GPUs.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • There is an up to 25% performance drop for networks using the InstanceNorm plugin. This issue is being investigated.
  • The engine building time for the networks using 3D convolution, like 3d_unet, is up to 500% longer compared to TensorRT 8.0 due to many fast kernels being added in, which enlarges the profiling time.

2.13. TensorRT Release 8.2.2

These are the TensorRT 8.2.2 Release Notes and are applicable to x86 Linux and Windows users. This release incorporates ARM® based CPU cores for Server Base System Architecture (SBSA) users on Linux only.

These release notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Deprecated API Lifetime

  • APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Fixed Issues

  • In order to install TensorRT using the pip wheel file, you had to ensure that the pip version was less than 20. An older version of pip was required to work around an issue with the CUDA 11.4 wheel meta packages that TensorRT depends on. This issue has been fixed in this release.
  • An empty directory named deserializeTimer under the samples directory was left in the package by accident. This issue has been fixed in this release.
  • For some transformer based networks built with PyTorch Multi-head Attention API, the performance could have been up to 45% slower than similar networks built with other APIs due to different graph patterns. This issue has been fixed in this release.
  • IShuffleLayer applied to the output of IConstantLayer was incorrectly transformed when the constant did not have type kFLOAT, sometimes causing build failures. This issue has been fixed in this release.
  • ONNX models with MatMul operations that used QuantizeLinear/DequantizeLinear operations to quantize the weights, and pre-transpose the weights (that is, do not use a separate Transpose operation) would suffer from accuracy errors due to a bug in the quantization process. This issue has been fixed in this release.

Known Issues

Functional
  • TensorRT attempts to catch GPU memory allocation failure and avoid profiling tactics whose memory requirements would trigger Out of Memory. However, GPU memory allocation failure cannot be handled by CUDA gracefully on some platforms and would lead to an unrecoverable application status. If this happens, consider lowering the specified workspace size if a large size is set, or using the IAlgorithmSelector interface to avoid tactics that require a lot of GPU memory.
  • TensorRT may experience some instabilities when running networks containing TopK layers on T4 under Azure VM. To work around this issue, disable CUBLAS_LT kernels with --tacticSources=-CUBLAS_LT (setTacticSources). A sketch of the corresponding builder-configuration change is shown after this list.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • When running ONNX models with dynamic shapes, there is a potential accuracy issue if the dimension names of the inputs that are expected to be the same are not. For example, if a model has two 2D inputs of which the dimension semantics are both batch and seqlen, and in the ONNX model, the dimension names of the two inputs are different, there is a potential accuracy issue when running with dynamic shapes. Ensure that the dimension semantics match when exporting ONNX models from frameworks.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • For some networks with large amounts of weights and activation data, DLA may fail compiling a subgraph, and that subgraph will fall back to the GPU.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using the build for 11.x and >= 11.1 driver (455 or above).
  • On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was artificially restricting the amount of available memory has been fixed. A side effect is that the TensorRT optimizer is able to choose layer implementations that use more memory, which can cause the OOM Killer to trigger for networks where it previously didn't. To work around this problem, use the IAlgorithmSelector interface to avoid layer implementations that require a lot of memory, or use the layer precision API to reduce precision of large tensors and use STRICT_TYPES, or reduce the size of the input tensors to the builder by reducing batch or other higher dimensions.
  • TensorRT bundles a version of libnvptxcompiler_static.a inside libnvinfer_static.a. If an application links with a different version of PTXJIT than the version used to build TensorRT, it may lead to symbol conflicts or undesired behavior.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9.
  • TensorRT has limited support for fusing IConstantLayer and IShuffleLayer. In explicit-quantization mode, the weights of Convolutions and Fully-Connected layers must be fused. Therefore, if a weights-shuffle is not supported, it may lead to failure to quantize the layer.
  • For DLA networks where a convolution layer consumes an NHWC network input, the compute precision of the convolution layer must match the data type of the input tensor.
  • Hybrid precision is not supported with the Pooling layer. The data type of the input and output tensors should be the same as the layer precision.
  • When running the Python engine_refit_mnist, network_api_pytorch_mnist, or onnx_packnet samples, you may encounter Illegal instruction (core dumped) when using the CPU version of PyTorch on Jetson TX2. The workaround is to install a GPU enabled version of PyTorch as per the instructions in the sample READMEs.
  • Intermittent accuracy issues are observed in sample_mnist with INT8 precision on WSL2.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • You may see the following error:
    Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit), however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
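
A hedged sketch of the builder-side equivalent of the --tacticSources=-CUBLAS_LT workaround referenced in the list above, assuming the TacticSource::kCUBLAS_LT enumerator and the getTacticSources/setTacticSources bit-mask interface; the function name is illustrative only.

    #include <cstdint>
    #include "NvInfer.h"

    // Hedged sketch: remove the cuBLASLt tactic source from the builder config,
    // mirroring trtexec --tacticSources=-CUBLAS_LT.
    void disableCublasLt(nvinfer1::IBuilderConfig& config)
    {
        uint32_t sources = config.getTacticSources();
        sources &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUBLAS_LT));
        config.setTacticSources(sources);
    }
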
Performance
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is an up to 15% performance regression for networks with a Pooling layer located before or after a Concatenate layer.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and Pascal GPUs.
  • There is an up to 8% performance regression compared to TensorRT 7.1 for some networks with heavy FullyConnected operation like VGG16 on NVIDIA Jetson Nano™.
  • There is an up to 10-11% performance regression on Xavier:
    • compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
    • compared to TensorRT 6 for ResNeXt networks with small batch (1 or 2) in FP32.
  • For networks that use deconvolution with a large kernel size, the engine build time for this layer can increase significantly on Xavier. It can also lead to the "launch timed out and was terminated" error message on Jetson Nano/TX1.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues.
  • DLA automatically upgrades INT8 LeakyRelu layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can disable LeakyReLU layers from running on DLA.
  • The builder may require up to 60% more memory to build an engine.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 21% performance drop compared to TensorRT 8.0 for SSD-Inception2 networks on NVIDIA Volta GPUs.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • There is an up to 25% performance drop for networks using the InstanceNorm plugin. This issue is being investigated.
  • The engine building time for networks using 3D convolution, like 3d_unet, is up to 500% longer compared to TensorRT 8.0 because many additional fast kernels were added, which increases the profiling time.

2.14. TensorRT Release 8.2.1

These are the TensorRT 8.2.1 release notes and are applicable to x86 Linux and Windows users, as well as Arm®-based CPU cores for Server Base System Architecture (SBSA) users on Linux only.

These release notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.
  • WSL (Windows Subsystem for Linux) 2 is released as a preview feature in this TensorRT 8.2.1 GA release.

Deprecated API Lifetime

  • APIs deprecated prior to TensorRT 8.0 will be removed in TensorRT 9.0.
  • APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
  • APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.

Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.

Compatibility

Limitations

  • DLA does not support hybrid precision for the pooling layer; the data type of the input and output tensors must be the same as the layer precision, that is, either all INT8 or all FP16.

Deprecated And Removed Features

The following features are deprecated in TensorRT 8.2.1:
  • BuilderFlag::kSTRICT_TYPES is deprecated. Its functionality has been split into separate controls for precision constraints, reformat-free I/O, and failure of IAlgorithmSelector::selectAlgorithms. This change enables users who need only one of the subfeatures to build engines without encumbering the optimizer with the other subfeatures. In particular, precision constraints are sometimes necessary for engine accuracy; however, reformat-free I/O risks slowing down an engine with no benefit. For more information, refer to BuilderFlags (C++, Python), and see the migration sketch after this list.
  • The LSTM plugin has been removed. In addition, the Persistent LSTM Plugin section has also been removed from the NVIDIA TensorRT Developer Guide.
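
For reference, a minimal C++ sketch of migrating off the deprecated flag is shown below; it assumes an IBuilderConfig* named config, and the flag names follow the BuilderFlags documentation referenced above. Enable only the subfeatures your engine actually needs.

    // Previously, a single flag controlled precision constraints, reformat-free I/O,
    // and selector failure behavior:
    //     config->setFlag(nvinfer1::BuilderFlag::kSTRICT_TYPES);
    // The behavior is now split into narrower flags:
    config->setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);      // fail the build if a precision constraint cannot be met
    // config->setFlag(nvinfer1::BuilderFlag::kPREFER_PRECISION_CONSTRAINTS); // honor precisions when possible, otherwise fall back
    // config->setFlag(nvinfer1::BuilderFlag::kDIRECT_IO);                    // reformat-free network I/O
    // config->setFlag(nvinfer1::BuilderFlag::kREJECT_EMPTY_ALGORITHMS);      // fail if IAlgorithmSelector::selectAlgorithms selects nothing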

Fixed Issues

  • Closed the performance gap between linking with TensorRT static libraries and linking with TensorRT dynamic libraries on x86_64 Linux CUDA-11.x platforms.
  • When building a DLA engine with:
    • networks containing tensors with fewer than four dimensions, some DLA subgraph I/O tensors lacked shuffle layers, which caused engine compilation to fail. This issue has been fixed in this release.
    • a kSUB ElementWise operation whose input has fewer than four dimensions, a scale node was inserted. If the scale could not run on DLA, engine compilation failed. This issue has been fixed in this release.
  • There was a known issue with the engine_refit_mnist sample on Windows. The fix was to edit the engine_refit_mnist/sample.py source file and move the import model line before the sys.path.insert() line. This issue has been fixed in this release and no edit is needed.
  • There was an up to 22% performance regression compared to TensorRT 8.0 for WaveRNN networks on NVIDIA Volta and NVIDIA Turing GPUs. This issue has been fixed in this release.
  • There was an up to 21% performance regression compared to TensorRT 8.0 for BERT-like networks on NVIDIA Jetson Xavier platforms. This issue has been fixed in this release.
  • There was an up to 25% performance drop for networks using the InstanceNorm plugin. This issue has been fixed in this release.
  • There was an accuracy bug resulting in low mAP score with YOLO-like QAT networks where a QuantizeLinear operator was immediately followed by a Concat operator. This accuracy issue has been fixed in this release.
  • There was a bug in TensorRT 8.2.0 EA where if a shape tensor is used by two different nodes, it can sometimes lead to a functional or an accuracy issue. This issue has been fixed in this release.
  • There was a known 2% accuracy regression with NasNet Mobile network with NVIDIA Turing GPUs. This issue has been fixed in this release. (not applicable for Jetson platforms)
  • There could have been build failures with IEinsumLayer when an input subscript label corresponds to a static dimension in one tensor but dynamic dimension in another. This issue has been fixed in this release.
  • There was an up to 8% performance regression compared to TensorRT 8.0.3 for Cortana ASR on NVIDIA Ampere Architecture GPUs with CUDA graphs and a single stream of execution. This issue has been fixed in this release.
  • Boolean input/output tensors were only supported when using explicit batch dimensions. This has been fixed in this release.
  • There was a possibility of a CUBLAS_STATUS_EXECUTION_FAILED error when running with cuBLAS/cuBLASLt libraries from CUDA 11.4 update 1 and CUDA 11.4 update 2 on Linux-based platforms. This happened only for use cases where cuBLAS is loaded and unloaded multiple times. The workaround was to add the following environment variable before launching your application:
    LD_PRELOAD=libcublasLt.so:libcublasLt.so your_application

    This issue has been fixed in this release.

  • There was a known ~6% - ~29% performance regression on Google® BERT compared to version 8.0.1.6 on NVIDIA A100 GPUs. This issue has been fixed in this release.
  • There was an up to 6% performance regression compared to TensorRT 8.0.3 for Deep Recommender on Tesla V100, NVIDIA Quadro® GV100, and NVIDIA TITAN V. This issue has been fixed in this release.

Announcements

  • The sample sample_reformat_free_io has been renamed to sample_io_formats, and revised to remove the deprecated flag BuilderFlag::kSTRICT_TYPES. Reformat-free I/O is still available with BuilderFlag::kDIRECT_IO, but generally should be avoided since it can result in a slower than necessary engine, and can cause a build to fail if the target platform lacks the kernels to enable building an engine with reformat-free I/O.
  • The NVIDIA TensorRT Release Notes PDF will no longer be available in the product package after this release. The release notes will remain available online.

Known Issues

Functional
  • TensorRT attempts to catch GPU memory allocation failure and avoid profiling tactics whose memory requirements would trigger Out of Memory. However, GPU memory allocation failure cannot be handled by CUDA gracefully on some platforms and would lead to an unrecoverable application status. If this happens, consider lowering the specified workspace size if a large size is set, or using the IAlgorithmSelector interface to avoid tactics that require a lot of GPU memory.
  • TensorRT may experience some instabilities when running networks containing TopK layers on T4 under Azure VM. To work around this issue, disable CUBLAS_LT kernels with --tacticSources=-CUBLAS_LT (setTacticSources); a C++ sketch of this workaround follows this list.
  • Under certain conditions on WSL2, an INetwork with Convolution layers that can be horizontally fused before a Concat layer may create an internal error causing the application to crash while building the engine. As a workaround, build your network on Linux instead of WSL2.
  • When running ONNX models with dynamic shapes, there is a potential accuracy issue if the dimension names of inputs that are expected to be the same do not match. For example, if a model has two 2D inputs whose dimension semantics are both batch and seqlen, but the dimension names of the two inputs differ in the ONNX model, there is a potential accuracy issue when running with dynamic shapes. Ensure the dimension semantics match when exporting ONNX models from frameworks.
  • There is a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform.
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • For some networks with large amounts of weights and activation data, DLA may fail compiling a subgraph, and that subgraph will fall back to the GPU.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using a CUDA 11.x build with a driver that supports CUDA 11.1 or later (r455 or above).
  • On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was artificially restricting the amount of available memory has been fixed. A side effect is that the TensorRT optimizer is able to choose layer implementations that use more memory, which can cause the OOM Killer to trigger for networks where it previously didn't. To work around this problem, use the IAlgorithmSelector interface to avoid layer implementations that require a lot of memory, or use the layer precision API to reduce precision of large tensors and use STRICT_TYPES, or reduce the size of the input tensors to the builder by reducing batch or other higher dimensions.
  • For some transformer based networks built with PyTorch Multi-head Attention API, the performance may be up to 45% slower than similar networks built with other APIs due to different graph patterns.
  • TensorRT bundles a version of libnvptxcompiler_static.a inside libnvinfer_static.a. If an application links with a different version of PTXJIT than the version used to build TensorRT, it may lead to symbol conflicts or undesired behavior.
  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • TensorFlow 1.x is not supported for Python 3.9. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9.
  • TensorRT has limited support for fusing IConstantLayer and IShuffleLayer. In explicit-quantization mode, the weights of Convolutions and Fully-Connected layers must be fused. Therefore, if a weights-shuffle is not supported, it may lead to failure to quantize the layer.
  • For DLA networks where a convolution layer consumes an NHWC network input, the compute precision of the convolution layer must match the data type of the input tensor.
  • Hybrid precision is not supported with the Pooling layer. The data type of the input and output tensors must be the same as the layer precision.
  • When installing PyCUDA, NumPy must be installed first and as a separate step:
    python3 -m pip install numpy 
    python3 -m pip install pycuda
    For more information, refer to the NVIDIA TensorRT Installation Guide.
  • When running the Python engine_refit_mnist, network_api_pytorch_mnist, or onnx_packnet samples, you may encounter Illegal instruction (core dumped) when using the CPU version of PyTorch on Jetson TX2. The workaround is to install a GPU enabled version of PyTorch as per the instructions in the sample READMEs.
  • If an IPluginV2 layer produces kINT8 outputs that are also output tensors of the INetworkDefinition and those network outputs are declared with a floating-point type, an explicit cast is required to convert the network outputs back to a floating-point format. For example:
    // out_tensor is of type nvinfer1::DataType::kINT8
    auto* castLayer = network->addIdentity(*out_tensor);      // identity layer performs the cast
    castLayer->setOutputType(0, nvinfer1::DataType::kFLOAT);  // request a float output from the cast
    auto* new_out_tensor = castLayer->getOutput(0);           // use this tensor as the network output
  • Intermittent accuracy issues are observed in sample_mnist with INT8 precision on WSL2.
  • The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and ONNX-GraphSurgeon wheels do not install their dependencies automatically; when installing them, ensure you install the dependencies manually using pip, or install the wheels instead.
  • You may see the following error:
    Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
            object file: No such file or directory
    after installing TensorRT from the network repo. cuDNN depends on the RPM dependency libcublas.so.11()(64bit); however, this dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest CUDA release. The library search path will not be set up correctly and cuDNN will be unable to find the cuBLAS libraries. The workaround is to install the latest libcublas-11-x package manually.
  • There is a known issue on Windows with the Python sample uff_ssd when converting the frozen TensorFlow graph into UFF. You can generate the UFF model on Linux or in a container and copy it over to work around this issue. Once generated, copy the UFF file to \path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
  • ONNX models with MatMul operations that use QuantizeLinear/DequantizeLinear operations to quantize the weights, and pre-transpose the weights (i.e. do not use a separate Transpose operation) will suffer from accuracy errors due to a bug in the quantization process.
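
As a sketch of the TopK workaround above (disabling the cuBLASLt tactic source through the builder configuration), assuming an IBuilderConfig* named config; the command-line equivalent is trtexec --tacticSources=-CUBLAS_LT.

    // Read the currently enabled tactic sources and clear the cuBLASLt bit.
    uint32_t sources = config->getTacticSources();
    sources &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUBLAS_LT));
    config->setTacticSources(static_cast<nvinfer1::TacticSources>(sources));
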
Performance
  • There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6 on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16 mode.
  • There is an up to 15% performance regression for networks with a Pooling layer located before or after a Concatenate layer.
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on NVIDIA Maxwell® and Pascal GPUs.
  • There is an up to 8% performance regression compared to TensorRT 7.1 for some networks with heavy FullyConnected operation like VGG16 on NVIDIA Jetson Nano™.
  • There is an up to 10-11% performance regression on Xavier:
    • compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
    • compared to TensorRT 6 for ResNeXt networks with small batch (1 or 2) in FP32.
  • For networks that use deconvolution with a large kernel size, the engine build time for this layer can increase significantly on Xavier. It can also lead to the launch timed out and was terminated error message on Jetson Nano/TX1.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues.
  • DLA automatically upgrades INT8 LeakyReLU layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can prevent LeakyReLU layers from running on DLA.
  • The builder may require up to 60% more memory to build an engine.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 21% performance drop compared to TensorRT 8.0 for SSD-Inception2 networks on NVIDIA Volta GPUs.
  • There is an up to 5% performance drop for networks using sparsity in FP16 precision.
  • There is an up to 25% performance drop for networks using the InstanceNorm plugin. This issue is being investigated.
  • The engine building time for networks using 3D convolution, like 3d_unet, is up to 500% longer compared to TensorRT 8.0 because many additional fast kernels were added, which increases the profiling time.

2.15. TensorRT Release 8.2.0 Early Access (EA)

These are the TensorRT 8.2.0 Early Access (EA) release notes and are applicable to x86 Linux and Windows users, as well as ARM Server Base System Architecture (SBSA) users on Linux only.

These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.

Breaking API Changes

  • Between TensorRT 8.0 EA and TensorRT 8.0 GA, the function prototype for getLogger() has been moved from NvInferRuntimeCommon.h to NvInferRuntime.h. You may need to update your application source code if you’re using getLogger() and were previously only including NvInferRuntimeCommon.h. Since the logger is no longer global, calling the method on IRuntime, IRefitter, or IBuilder is recommended instead.
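
A minimal sketch of the change follows; myLogger stands in for a user-supplied ILogger, and the per-object accessor is assumed to also be named getLogger().

    // The prototype now lives in NvInferRuntime.h (pulled in by NvInfer.h) rather than NvInferRuntimeCommon.h.
    #include "NvInfer.h"

    // Preferred: retrieve the logger from the object that owns it instead of relying on a global.
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(myLogger);
    nvinfer1::ILogger* logger = builder->getLogger();   // accessor name assumed; see the API documentation
    logger->log(nvinfer1::ILogger::Severity::kINFO, "builder is ready");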

Compatibility

Deprecated And Removed Features

The following features are deprecated in TensorRT 8.2.0 EA:
  • Removed sampleMLP.
  • The enums of ProfilingVerbosity have been updated to show their functionality more explicitly:
    • ProfilingVerbosity::kDEFAULT has been deprecated in favor of ProfilingVerbosity::kLAYER_NAMES_ONLY.
    • ProfilingVerbosity::kVERBOSE has been deprecated in favor of ProfilingVerbosity::kDETAILED.
  • Several flags of trtexec have been deprecated:
    • --explicitBatch flag has been deprecated and has no effect. When the input model is in UFF or in Caffe prototxt format, the implicit batch dimension mode is used automatically; when the input model is in ONNX format, the explicit batch mode is used automatically.
    • --explicitPrecision flag has been deprecated and has no effect. When the input ONNX model contains Quantization/Dequantization nodes, TensorRT automatically uses explicit precision mode.
    • --nvtxMode=[verbose|default|none] has been deprecated in favor of --profilingVerbosity=[detailed|layer_names_only|none] to show its functionality more explicitly.
  • Relocated the content from the Best Practices For TensorRT Performance document to the NVIDIA TensorRT Developer Guide. Removed redundancy between the two documents and updated the reference information. Refer to Performance Best Practices for more information.
  • IPaddingLayer is deprecated in TensorRT 8.2 and will be removed in TensorRT 10.0. Use ISliceLayer to pad the tensor instead; it supports non-constant fill values, reflect and clamp padding modes, and padding outputs with dynamic shapes. A short sketch follows this list.
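
As a rough sketch of padding with ISliceLayer, assuming a 4D NCHW tensor input, an existing network definition network, and an optional scalar tensor padValue supplying the fill value; all of these names, and the example dimensions, are placeholders.

    // Example input dimensions (placeholders).
    int32_t n = 1, c = 3, h = 224, w = 224;

    // Pad H and W by one element on each side: a negative start extends before the data,
    // and the output size accounts for the added elements.
    nvinfer1::Dims4 start{0, 0, -1, -1};
    nvinfer1::Dims4 size{n, c, h + 2, w + 2};
    nvinfer1::Dims4 stride{1, 1, 1, 1};
    nvinfer1::ISliceLayer* pad = network->addSlice(*input, start, size, stride);
    pad->setMode(nvinfer1::SliceMode::kFILL);   // out-of-bounds elements are filled (kREFLECT/kCLAMP also available)
    pad->setInput(4, *padValue);                // optional: tensor providing the fill value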

Fixed Issues

  • Closed the performance gap between linking with TensorRT static libraries and linking with TensorRT dynamic libraries.
  • In the previous release, the TensorRT ARM SBSA cross packages in the CUDA network repository could not be installed because cuDNN ARM SBSA cross packages were not available, which is a dependency of the TensorRT cross packages. The cuDNN ARM SBSA cross packages have been made available, which resolves this dependency issue.
  • There was an up to 6% performance regression compared to TensorRT 7.2.3 for WaveRNN in FP16 on Volta and Turing platforms. This issue has been fixed in this release.
  • There was a known accuracy issue of GoogLeNet variants with NVIDIA Ampere GPUs where TF32 mode was enabled by default on Windows. This issue has been fixed in this release. (not applicable for Jetson platforms)
  • There was an up to 10% performance regression when TensorRT was used with CUDNN 8.1 or 8.2. When CUDNN 8.0 was used, the performance was restored. This issue has been fixed in this release. (not applicable for Jetson platforms)
  • The new Python sample efficientdet was only available in the OSS release. The sample has been added to the core package in this release.
  • The performance of IReduceLayer has been improved significantly when the output size of the IReduceLayer is small.
  • There was an up to 15% performance regression compared to TensorRT 7.2.3 for path perception network (Pathnet) in FP32. This issue has been fixed in this release.

Announcements

  • Python support for Windows included in the zip package is considered a preview release and not ready for production use.

Known Issues

Functional
  • The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA 10.x. If selected, it will fall back to using cuBLAS. (not applicable for Jetson platforms)
  • For some networks with large amounts of weights and activation data, DLA may fail compiling a subgraph, and that subgraph will fall back to the GPU.
  • Under some conditions, RNNv2Layer can require a larger workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all supported tactics. Consider increasing the workspace size to work around this issue.
  • CUDA graph capture will capture inputConsumed and profiler events only when using a CUDA 11.x build with a driver that supports CUDA 11.1 or later (r455 or above).
  • On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was artificially restricting the amount of available memory has been fixed. A side effect is that the TensorRT optimizer is able to choose layer implementations that use more memory, which can cause the OOM Killer to trigger for networks where it previously didn't. To work around this problem, use the IAlgorithmSelector interface to avoid layer implementations that require a lot of memory (see the sketch after this list), or use the layer precision API to reduce precision of large tensors and use STRICT_TYPES, or reduce the size of the input tensors to the builder by reducing batch or other higher dimensions.
  • For some transformer based networks built with PyTorch MultiheadAttention API, the performance may be up to 45% slower than similar networks built with other APIs due to different graph patterns.
  • When building a DLA engine with:
    • networks containing tensors with fewer than four dimensions, some DLA subgraph I/O tensors may lack shuffle layers, which will cause engine compilation to fail.
    • a kSUB ElementWise operation whose input has fewer than four dimensions, a scale node is inserted. If the scale cannot run on DLA, engine compilation will fail.
  • There is a known 2% accuracy regression with NasNet Mobile network with NVIDIA Turing GPUs. (not applicable for Jetson platforms)
  • TensorRT bundles a version of libnvptxcompiler_static.a inside libnvinfer_static.a. If an application links with a different version of PTXJIT than the version used to build TensorRT, it may lead to symbol conflicts or undesired behavior.
  • Boolean input/output tensors are supported only when using explicit batch dimensions.
  • There is a possibility of a CUBLAS_STATUS_EXECUTION_FAILED error when running with cuBLAS/cuBLASLt libraries from CUDA 11.4 update 1 and CUDA 11.4 update 2 on Linux-based platforms. This happens only for use cases where cuBLAS is loaded and unloaded multiple times. The workaround is to add the following environment variable before launching your application:
    LD_PRELOAD=libcublasLt.so:libcublasLt.so your_application

    This issue will be resolved in a future CUDA 11.4 update.

  • Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470. (not applicable for Jetson platforms)
  • There may be build failures with IEinsumLayer when an input subscript label corresponds to a static dimension in one tensor but dynamic dimension in another.
  • TensorFlow 1.x is not supported for Python 3.9. Any Python samples that depend on TensorFlow 1.x cannot be run with Python 3.9.
  • TensorRT has limited support for fusing ConstantLayer and ShuffleLayer. In explicit-quantization mode, the weights of Convolutions and Fully-Connected layers must be fused. Therefore, if a weights-shuffle is not supported, it may lead to failure to quantize the layer.
  • There is a known issue with the engine_refit_mnist sample on Windows. To fix the issue, edit the engine_refit_mnist/sample.py source file and move the import model line before the sys.path.insert() line.
  • An empty directory named deserializeTimer under the samples directory was left in the package by accident. This empty directory can be ignored and does not indicate that the package is corrupt or that files are missing. This issue will be corrected in the next release.
  • For DLA networks where a convolution layer consumes an NHWC network input, the compute precision of the convolution layer must match the data type of the input tensor.
  • In order to install TensorRT using the pip wheel file, ensure that the pip version is less than 20. An older version of pip is required to work around an issue with the CUDA 11.4 wheel meta packages that TensorRT depends on. This issue is being worked on and should be resolved in a future CUDA release.
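
As a sketch of the IAlgorithmSelector workaround referenced in the integrated-GPU memory item above, the following hypothetical selector keeps only tactics whose workspace requirement stays under a caller-chosen cap; everything except the TensorRT interfaces themselves is illustrative.

    #include "NvInfer.h"
    #include <cstddef>
    #include <cstdint>

    // Hypothetical selector that rejects tactics needing more scratch memory than a fixed cap.
    class MemoryCappedSelector : public nvinfer1::IAlgorithmSelector
    {
    public:
        explicit MemoryCappedSelector(std::size_t capBytes) : mCap(capBytes) {}

        int32_t selectAlgorithms(nvinfer1::IAlgorithmContext const&, nvinfer1::IAlgorithm const* const* choices,
            int32_t nbChoices, int32_t* selection) noexcept override
        {
            int32_t kept = 0;
            for (int32_t i = 0; i < nbChoices; ++i)
            {
                if (choices[i]->getWorkspaceSize() <= mCap)
                {
                    selection[kept++] = i;   // keep only tactics under the memory cap
                }
            }
            return kept;                     // returning 0 lets TensorRT fall back to its default choice
        }

        void reportAlgorithms(nvinfer1::IAlgorithmContext const* const*, nvinfer1::IAlgorithm const* const*,
            int32_t) noexcept override
        {
        }

    private:
        std::size_t mCap;
    };

    // Usage: MemoryCappedSelector selector(1ull << 30);   // for example, a 1 GiB cap
    //        config->setAlgorithmSelector(&selector);     // selector must outlive the build
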
Performance
  • There is a performance regression compared to TensorRT 7.1 for some networks dominated by FullyConnected with activation and bias operations:
    • up to 12% in FP32 mode. This will be fixed in a future release.
    • up to 10% in FP16 mode on Maxwell and Pascal GPUs.
  • There is an up to 8% performance regression compared to TensorRT 7.1 for some networks with heavy FullyConnected operation like VGG16 on Nano.
  • There is an up to 10-11% performance regression on Xavier:
    • compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
    • compared to TensorRT 6 for ResNeXt networks with small batch (1 or 2) in FP32.
  • For networks that use deconvolution with a large kernel size, the engine build time for this layer can increase significantly on Xavier. It can also lead to the launch timed out and was terminated error message on Jetson Nano/TX1.
  • There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA 11.0. (not applicable for Jetson platforms)
  • There is an up to 10% performance regression compared to TensorRT 7.2.3 in JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges of the inputs of the ElementWise ADD layers are different. This is due to a fix for a bug in DLA where it ignored the dynamic range of the second input of the ElementWise ADD layers and caused some accuracy issues.
  • DLA automatically upgrades INT8 LeakyReLU layers to FP16 to preserve accuracy. Thus, latency may be worse compared to an equivalent network using a different activation like ReLU. To mitigate this, you can prevent LeakyReLU layers from running on DLA.
  • The builder may require up to 60% more memory to build an engine.
  • There is a known ~6% - ~29% performance regression on Google BERT compared to version 8.0.1.6 on NVIDIA A100 GPUs.
  • There is an up to 6% performance regression compared to TensorRT 8.0.3 for Deep Recommender on Tesla V100, Quadro GV100, and Titan V.
  • There is an up to 22% performance regression compared to TensorRT 8.0 for WaveRNN networks on Volta and Turing GPUs. This issue is being investigated.
  • There is up to 8% performance regression compared to TensorRT 8.0.3 for Cortana ASR on NVIDIA Ampere GPUs with CUDA graphs and a single stream of execution.
  • There is an up to 21% performance regression compared to TensorRT 8.0 for BERT-like networks on Xavier platforms. This issue is being investigated.
  • There is an up to 126% performance drop when running some ConvNets on DLA in parallel to the other DLA and the iGPU on Xavier platforms, compared to running on DLA alone.
  • There is an up to 21% performance drop compared to TensorRT 8.0 for SSD-Inception2 networks on Volta GPUs.
  • There is an up to 25% performance drop for networks using the InstanceNorm plugin. This issue is being investigated.

2.16. TensorRT Release 8.0.3

These are the TensorRT 8.0.3 release notes. This is a bug fix release supporting Linux x86 and Windows users.

These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Fixed Issues

  • Fixed an invalid fusion assertion problem in the fusion optimization pass.
  • Fixed other miscellaneous issues seen in proprietary networks.
  • Fixed a CUDA 11.4 NVRTC issue during kernel generation on Windows.

2.17. TensorRT Release 8.0.2

These are the TensorRT 8.0.2 release notes. This is the initial release supporting A100 for ARM server users. Only a subset of networks has been validated on ARM with A100. This is a network repository release only.

These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

2.18. TensorRT Release 8.0.1

These are the TensorRT 8.0.1 release notes and are applicable to x86 Linux and Windows users, as well as PowerPC and ARM Server Base System Architecture (SBSA) users on Linux only.

These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).

This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.

Key Features And Enhancements

This TensorRT release includes the following key features and enhancements.

Breaking API Changes

  • Support for Python 2 has been dropped. This means that TensorRT will no longer include wheels for Python 2, and Python samples will not work with Python 2.
  • All APIs have been marked as noexcept where appropriate. The IErrorRecorder interface has been fully integrated into the API for error reporting. The Logger is used only as a fallback when the ErrorRecorder is not provided by the user.
  • Callbacks are now marked noexcept; therefore, implementations must also be marked noexcept. TensorRT has never catered to exceptions thrown by callbacks, but this is now captured in the API.
  • Methods that take parameters of type void** where the array of pointers is unmodifiable are now changed to take type void*const*.
  • Dims is now a type alias for class Dims32. Code t