TensorRT Release Notes
NVIDIA TensorRT is a C++ library that facilitates high-performance inference on NVIDIA GPUs. It is designed to work in conjunction with the deep learning frameworks that are commonly used for training. TensorRT focuses specifically on running an already-trained network quickly and efficiently on a GPU to generate a result, a process also known as inference. These release notes describe the key features, software enhancements and improvements, and known issues for the TensorRT 10.7.0 product package.
For previously released TensorRT documentation, refer to the TensorRT Archives.
1. TensorRT Release 10.x.x
1.1. TensorRT Release 10.7.0
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for instructions on updating your code to remove the use of deprecated features.
1.2. TensorRT Release 10.6.0
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for instructions on updating your code to remove the use of deprecated features.
1.3. TensorRT Release 10.5.0
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for instructions on updating your code to remove the use of deprecated features.
1.4. TensorRT Release 10.4.0
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for instructions on updating your code to remove the use of deprecated features.
1.5. TensorRT Release 10.3.0
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for instructions on updating your code to remove the use of deprecated features.
1.6. TensorRT Release 10.2.0
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Key Features and Enhancements
- Added support for normal FP8 Convolutions on Hopper GPUs.
- Added support for FP8 MHA fusion for SeqLen>512 on Hopper GPUs.
- Improved InstanceNorm and GroupNorm fusions for StableDiffusion models.
- Improved DRAM utilization for LayerNorm, pointwise, and data movements (for example, Concats, Slices, Reshapes, Transposes) kernels on GPUs with HBM memory.
- Added new APIs in INetworkDefinition for fine-grained control of refittable weights:
  - markWeightsRefittable to mark weights as refittable
  - unmarkWeightsRefittable to unmark weights as refittable
  - areWeightsMarkedRefittable to query whether a weight is marked as refittable
This fine-grained refit control is valid only when the new kREFIT_INDIVIDUAL builder flag is used during the engine build. It also works with kSTRIP_PLAN, enabling the construction of a weight-stripped engine that can be updated from a fine-tuned checkpoint when all weights are marked as refittable in the fine-tuned checkpoint.
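The fine-grained refit workflow above can be sketched with the TensorRT Python API, which exposes snake_case equivalents of the C++ methods named above. This is an illustrative sketch only, not a complete program: the network-population step and the weights name ("conv1.weight") are placeholders, and building an engine requires a working GPU and TensorRT installation.

```python
# Sketch: fine-grained refit control (TensorRT 10.2+ API surface).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()

# ... populate the network here; weights names such as "conv1.weight"
# below are hypothetical placeholders ...

config = builder.create_builder_config()
# Fine-grained refit control requires the new kREFIT_INDIVIDUAL flag.
config.set_flag(trt.BuilderFlag.REFIT_INDIVIDUAL)

# Mark only the weights you intend to update later from a
# fine-tuned checkpoint.
network.mark_weights_refittable("conv1.weight")
assert network.are_weights_marked_refittable("conv1.weight")

# Optionally combine with kSTRIP_PLAN for a weight-stripped engine.
# config.set_flag(trt.BuilderFlag.STRIP_PLAN)

engine_bytes = builder.build_serialized_network(network, config)
```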
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
1.7. TensorRT Release 10.1.0
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Announcements
- ONNX GraphSurgeon is no longer included in the TensorRT package. Remove any ONNX GraphSurgeon Debian or RPM packages installed by a previous release and install ONNX GraphSurgeon using pip instead. For more information, refer to onnx-graphsurgeon.
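The migration described above amounts to removing the OS-level package and installing from PyPI. The OS package name below is an assumption based on the pip package name; verify it against your installed package list.

```shell
# Remove the previously installed OS package (name assumed; check with
# `dpkg -l | grep graphsurgeon` or `rpm -qa | grep graphsurgeon`):
sudo apt-get remove onnx-graphsurgeon   # Debian/Ubuntu
# sudo yum remove onnx-graphsurgeon     # RHEL-based distributions

# Install ONNX GraphSurgeon from PyPI:
python3 -m pip install onnx-graphsurgeon
```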
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
1.8. TensorRT Release 10.0.1
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
1.9. TensorRT Release 10.0.0 Early Access (EA)
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
2. TensorRT Release 9.x.x
2.1. TensorRT Release 9.3.0
This GA release is for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only. For applications other than LLMs, or other GPU platforms, continue to use TensorRT 8.6.1 for production use.
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
2.2. TensorRT Release 9.2.0
This GA release is for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only. For applications other than LLMs, or other GPU platforms, continue to use TensorRT 8.6.1 for production use.
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
2.3. TensorRT Release 9.1.0
This GA release is for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only. For applications other than LLMs, or other GPU platforms, continue to use TensorRT 8.6.1 for production use.
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
2.4. TensorRT Release 9.0.1
This GA release is for Large Language Models (LLMs) on A100, A10G, L4, L40, and H100 GPUs only. For applications other than LLMs, or other GPU platforms, continue to use TensorRT 8.6.1 for production use.
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
Fixed Issues
- There was an up to 9% performance regression compared to TensorRT 8.5 on YOLOv3 batch size 1 in FP32 precision on NVIDIA Ada Lovelace GPUs. This issue has been fixed.
- There was an up to 13% performance regression compared to TensorRT 8.5 on GPT2 without kv-cache in FP16 precision when dynamic shapes are used on NVIDIA Volta and NVIDIA Ampere GPUs. The workaround was to set the kFASTER_DYNAMIC_SHAPES_0805 preview flag to false. This issue has been fixed.
- For transformer-based networks such as BERT and GPT, TensorRT could consume CPU memory of up to 2 times the model size during compilation, plus additional memory for any alternate formats timed. This issue has been fixed.
- In some cases, when using the OBEY_PRECISION_CONSTRAINTS builder flag and the required type was set to FP32, the network could fail with a missing tactic due to an incorrect optimization converting the output of an operation to FP16. This could be resolved by removing the OBEY_PRECISION_CONSTRAINTS option. This issue has been fixed.
- TensorRT could fail on devices with small RAM and swap space. This could be resolved by ensuring that the combined RAM and swap space is at least 7 times the size of the network; for example, at least 21 GB of combined CPU memory for a 3 GB network. This issue has been fixed.
- If an IShapeLayer was used to get the output shape of an INonZeroLayer, engine building would likely fail. This issue has been fixed.
- If a one-dimension INT8 input was used for a Unary or ElementWise operation, engine building would throw an internal error “Could not find any implementation for node”. This issue has been fixed.
- Using the compute sanitizer racecheck tool could cause the process to be terminated unexpectedly. The root cause was a false alarm in the tool. The issue could be bypassed with --kernel-regex-exclude kns=scudnn_winograd. This issue has been fixed.
- TensorRT in FP16 mode did not perform cast operations correctly when only the output types were set, but not the layer precisions. This issue has been fixed.
- There was a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform. This issue has been fixed.
- When using DLA, INT8 convolutions followed by FP16 layers could cause accuracy degradation. In such cases, you could either change the convolution to FP16 or the subsequent layer to INT8. This issue has been fixed.
- Using the compute sanitizer tool from CUDA 12.0 could report a cudaErrorIllegalInstruction error on Hopper GPUs in unusual scenarios. This could be ignored. This issue has been fixed.
- There was an up to 53% engine build time increase for ViT networks in FP16 precision compared to TensorRT 8.6 on NVIDIA Ampere GPUs. This issue has been fixed.
- There was an up to 93% engine build time increase for BERT networks in FP16 precision compared to TensorRT 8.6 on NVIDIA Ampere GPUs. This issue has been fixed.
- There was an up to 22% performance drop for GPT2 networks in FP16 precision with kv-cache stacked together compared to TensorRT 8.6 on NVIDIA Ampere GPUs. This issue has been fixed.
- There was an up to 28% performance regression compared to TensorRT 8.5 on Transformer networks in FP16 precision on NVIDIA Volta GPUs, and up to 85% performance regression on NVIDIA Pascal GPUs. Disabling the kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 preview flag was a workaround. This issue has been fixed.
- There was an issue on NVIDIA A2 GPUs due to a known hang that could impact multiple networks. This issue has been fixed.
- When using hardware compatibility features, TensorRT could potentially fail while compiling transformer-based networks such as BERT. This issue has been fixed.
- The ONNX parser incorrectly used the InstanceNormalization plugin when processing ONNX InstanceNormalization layers with FP16 scale and bias inputs, which led to unexpected NaN and infinite outputs. This issue has been fixed.
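Several of the fixed issues above involve the interaction between the OBEY_PRECISION_CONSTRAINTS builder flag, per-layer precisions, and output types. As a rough sketch of how these controls fit together in the Python API (the network-population step and the layer handle are placeholders, and the program needs a working TensorRT installation to run):

```python
# Sketch: pinning layer precision under OBEY_PRECISION_CONSTRAINTS.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()
# ... populate the network; keep a handle to a layer, e.g.:
# layer = network.add_elementwise(a, b, trt.ElementWiseOperation.SUM)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# Require the builder to honor explicit precision settings rather
# than treating them as hints.
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

# Set both the layer precision and the output type; setting only the
# output type, without the layer precision, was the pattern affected
# by the fixed FP16 cast issue noted above.
# layer.precision = trt.float32
# layer.set_output_type(0, trt.float32)

engine_bytes = builder.build_serialized_network(network, config)
```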
3. TensorRT Release 8.x.x
3.1. TensorRT Release 8.6.1
These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
3.2. TensorRT Release 8.5.3
These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
Announcements
- In the next TensorRT release, cuDNN, cuBLAS, and cuBLASLt tactic sources will be turned off by default in builder profiling. TensorRT plans to remove the cuDNN, cuBLAS, and cuBLASLt dependency in future releases. Use the PreviewFeature flag kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 to evaluate the functional and performance impact of disabling cuBLAS and cuDNN and report back to TensorRT if there are critical regressions in your use cases.
- TensorRT Python wheel files before TensorRT 8.5, such as TensorRT 8.4, were published to the NGC PyPI repo. Starting with TensorRT 8.5, Python wheels will instead be published to upstream PyPI. This will make it easier to install TensorRT because it requires no prerequisite steps. Also, the name of the Python package for installation has changed from nvidia-tensorrt to just tensorrt.
- In previous releases, the C++ and Python API documentation was included inside the tar file packaging. This release no longer bundles the documentation inside the tar file, because the online documentation can be updated after release, which avoids mistakes found in stale documentation bundled inside the packages.
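The kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 evaluation described above is done through the builder configuration's preview-feature interface. A minimal sketch using the Python API (which drops the k prefix from the C++ enumerator names; running it requires a working TensorRT 8.5+ installation):

```python
# Sketch: opting in to a preview feature ahead of the default change.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Disable the cuDNN/cuBLAS/cuBLASLt tactic sources now to evaluate the
# functional and performance impact before the default flips.
config.set_preview_feature(
    trt.PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805, True
)
```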
3.3. TensorRT Release 8.4.3
These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
3.4. TensorRT Release 8.2.5
These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
- APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
- APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
- APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
3.5. TensorRT Release 8.0.3
These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).
This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Notices
Notice
This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.
Arm
Arm, AMBA and Arm Powered are registered trademarks of Arm Limited. Cortex, MPCore and Mali are trademarks of Arm Limited. "Arm" is used to represent Arm Holdings plc; its operating company Arm Limited; and the regional subsidiaries Arm Inc.; Arm KK; Arm Korea Limited.; Arm Taiwan Limited; Arm France SAS; Arm Consulting (Shanghai) Co. Ltd.; Arm Germany GmbH; Arm Embedded Technologies Pvt. Ltd.; Arm Norway, AS and Arm Sweden AB.
HDMI
HDMI, the HDMI logo, and High-Definition Multimedia Interface are trademarks or registered trademarks of HDMI Licensing LLC.
Blackberry/QNX
Copyright © 2020 BlackBerry Limited. All rights reserved.
Trademarks, including but not limited to BLACKBERRY, EMBLEM Design, QNX, AVIAGE, MOMENTICS, NEUTRINO and QNX CAR are the trademarks or registered trademarks of BlackBerry Limited, used under license, and the exclusive rights to such trademarks are expressly reserved.
Trademarks
NVIDIA, the NVIDIA logo, and BlueField, CUDA, DALI, DRIVE, Hopper, JetPack, Jetson AGX Xavier, Jetson Nano, Maxwell, NGC, Nsight, Orin, Pascal, Quadro, Tegra, TensorRT, Triton, Turing and Volta are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.