TensorRT Release Notes
NVIDIA TensorRT is a C++ library that facilitates high performance inference on NVIDIA GPUs. It is designed to work in connection with deep learning frameworks that are commonly used for training. TensorRT focuses specifically on running an already trained network quickly and efficiently on a GPU for the purpose of generating a result; also known as inferencing. These release notes describe the key features, software enhancements and improvements, and known issues for the TensorRT 9.2.0 product package.
For previously released TensorRT documentation, refer to the TensorRT Archives.
2. TensorRT Release 9.x.x
1.1. TensorRT Release 9.2.0
This GA release is for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only. For applications other than LLMs, or other GPU platforms, continue to use TensorRT 8.6.1 for production use.
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
1.2. TensorRT Release 9.1.0
This GA release is for Large Language Models (LLMs) on NVIDIA A100, A10G, L4, L40, L40S, H100 GPUs, and NVIDIA GH200 Grace Hopper™ Superchip only. For applications other than LLMs, or other GPU platforms, continue to use TensorRT 8.6.1 for production use.
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
1.3. TensorRT Release 9.0.1
This GA release is for Large Language Models (LLMs) on A100, A10G, L4, L40, and H100 GPUs only. For applications other than LLMs, or other GPU platforms, continue to use TensorRT 8.6.1 for production use.
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
Fixed Issues
- There was an up to 9% performance regression compared to TensorRT 8.5 on Yolov3 batch size 1 in FP32 precision on NVIDIA Ada Lovelace GPUs. This issue has been fixed.
- There was an up to 13% performance regression compared to TensorRT 8.5 on GPT2 without kv-cache in FP16 precision when dynamic shapes are used on NVIDIA Volta and NVIDIA Ampere GPUs. The workaround was to set the kFASTER_DYNAMIC_SHAPES_0805 preview flag to false. This issue has been fixed.
- For transformer-based networks such as BERT and GPT, TensorRT could consume CPU memory up to 2 times the model size during compilation plus any additional formats timed. This issue has been fixed.
- In some cases, when using the OBEY_PRECISION_CONSTRAINTS builder flag and the required type was set to FP32, the network could fail with a missing tactic due to an incorrect optimization converting the output of an operation to FP16. This could be resolved by removing the OBEY_PRECISION_CONSTRAINTS option. This issue has been fixed.
- TensorRT could fail on devices with small RAM and swap space. This could be resolved by ensuring the RAM and swap space is at least 7 times the size of the network. For example, at least 21 GB of combined CPU memory for a 3 GB network. This issue has been fixed.
- If an IShapeLayer was used to get the output shape of an INonZeroLayer, engine building would likely fail. This issue has been fixed.
- If a one-dimension INT8 input was used for a Unary or ElementWise operation, engine building would throw an internal error “Could not find any implementation for node”. This issue has been fixed.
- Using the compute sanitizer racecheck tool could cause the process to be terminated unexpectedly. The root cause was a wrong false alarm. The issue could be bypassed with --kernel-regex-exclude kns=scudnn_winograd. This issue has been fixed.
- TensorRT in FP16 mode did not perform cast operations correctly when only the output types were set, but not the layer precisions. This issue has been fixed.
- There was a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform. This issue has been fixed.
- When using DLA, INT8 convolutions followed by FP16 layers could cause accuracy degradation. In such cases, you could either change the convolution to FP16 or the subsequent layer to INT8. This issue has been fixed.
- Using the compute sanitizer tool from CUDA 12.0 could report a cudaErrorIllegalInstruction error on Hopper GPUs in unusual scenarios. This could be ignored. This issue has been fixed.
- There was an up to 53% engine build time increase for ViT networks in FP16 precision compared to TensorRT 8.6 on NVIDIA Ampere GPUs. This issue has been fixed.
- There was an up to 93% engine build time increase for BERT networks in FP16 precision compared to TensorRT 8.6 on NVIDIA Ampere GPUs. This issue has been fixed.
- There was an up to 22% performance drop for GPT2 networks in FP16 precision with kv-cache stacked together compared to TensorRT 8.6 on NVIDIA Ampere GPUs. This issue has been fixed.
- There was an up to 28% performance regression compared to TensorRT 8.5 on Transformer networks in FP16 precision on NVIDIA Volta GPUs, and up to 85% performance regression on NVIDIA Pascal GPUs. Disabling the kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 preview flag was a workaround. This issue has been fixed.
- There was an issue on NVIDIA A2 GPUs due to a known hang that can impact multiple networks. This issue has been fixed.
- When using hardware compatibility features, TensorRT would potentially fail while compiling transformer based networks such as BERT. This issue has been fixed.
- The ONNX parser incorrectly used the InstanceNormalization plugin when processing ONNX InstanceNormalization layers with FP16 scale and bias inputs, which led to unexpected NaN and infinite outputs. This issue has been fixed.
1.4. TensorRT Release 9.0.0 Early Access (EA)
This EA release is for early testing and feedback for Large Language Models (LLMs) on A100, A10G, L4, L40, and H100 GPUs only. For production use of TensorRT, applications other than LLMs, or other GPU platforms, continue to use TensorRT 8.6.1.
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
Fixed Issues
- There was an up to 9% performance regression compared to TensorRT 8.5 on Yolov3 batch size 1 in FP32 precision on NVIDIA Ada Lovelace GPUs. This issue has been fixed.
- There was an up to 13% performance regression compared to TensorRT 8.5 on GPT2 without kv-cache in FP16 precision when dynamic shapes are used on NVIDIA Volta and NVIDIA Ampere GPUs. The workaround was to set the kFASTER_DYNAMIC_SHAPES_0805 preview flag to false. This issue has been fixed.
- For transformer-based networks such as BERT and GPT, TensorRT could consume CPU memory up to 2 times the model size during compilation plus any additional formats timed. This issue has been fixed.
- In some cases, when using the OBEY_PRECISION_CONSTRAINTS builder flag and the required type was set to FP32, the network could fail with a missing tactic due to an incorrect optimization converting the output of an operation to FP16. This could be resolved by removing the OBEY_PRECISION_CONSTRAINTS option. This issue has been fixed.
- TensorRT could fail on devices with small RAM and swap space. This could be resolved by ensuring the RAM and swap space is at least 7 times the size of the network. For example, at least 21 GB of combined CPU memory for a 3 GB network. This issue has been fixed.
- If an IShapeLayer was used to get the output shape of an INonZeroLayer, engine building would likely fail. This issue has been fixed.
- If a one-dimension INT8 input was used for a Unary or ElementWise operation, engine building would throw an internal error “Could not find any implementation for node”. This issue has been fixed.
- Using the compute sanitizer racecheck tool could cause the process to be terminated unexpectedly. The root cause was a wrong false alarm. The issue could be bypassed with --kernel-regex-exclude kns=scudnn_winograd. This issue has been fixed.
- TensorRT in FP16 mode did not perform cast operations correctly when only the output types were set, but not the layer precisions. This issue has been fixed.
- There was a known functional issue (fails with a CUDA error during compilation) with networks using ILoop layers on the WSL platform. This issue has been fixed.
- When using DLA, INT8 convolutions followed by FP16 layers could cause accuracy degradation. In such cases, you could either change the convolution to FP16 or the subsequent layer to INT8. This issue has been fixed.
- Using the compute sanitizer tool from CUDA 12.0 could report a cudaErrorIllegalInstruction error on Hopper GPUs in unusual scenarios. This could be ignored. This issue has been fixed.
TensorRT Release 8.x.x
2.1. TensorRT Release 8.6.1
These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
2.2. TensorRT Release 8.6.0 Early Access (EA)
These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
This EA release is for early testing and feedback. For production use of TensorRT, continue to use TensorRT 8.5.3.
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
2.3. TensorRT Release 8.5.3
These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
Announcements
- In the next TensorRT release, cuDNN, cuBLAS, and cuBLASLt tactic sources will be turned off by default in builder profiling. TensorRT plans to remove the cuDNN, cuBLAS, and cuBLASLt dependency in future releases. Use the PreviewFeature flag kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 to evaluate the functional and performance impact of disabling cuBLAS and cuDNN and report back to TensorRT if there are critical regressions in your use cases.
- TensorRT Python wheel files before TensorRT 8.5, such as TensorRT 8.4, were published to the NGC PyPI repo. Starting with TensorRT 8.5, Python wheels will instead be published to upstream PyPI. This will make it easier to install TensorRT because it requires no prerequisite steps. Also, the name of the Python package for installation has changed from nvidia-tensorrt to just tensorrt.
- The C++ and Python API documentation in previous releases was included inside the tar file packaging. This release no longer bundles the documentation inside the tar file since the online documentation can be updated post release and avoids encountering mistakes found in stale documentation inside the packages.
2.4. TensorRT Release 8.5.2
These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
Announcements
- In the next TensorRT release, CUDA toolkit 10.2 support will be dropped.
- TensorRT 8.5 will be the last release supporting NVIDIA Kepler (SM 3.x) devices. Support for Maxwell (SM 5.x) devices will be dropped in TensorRT 9.0.
- In the next TensorRT release, cuDNN, cuBLAS, and cuBLASLt tactic sources will be turned off by default in builder profiling. TensorRT plans to remove the cuDNN, cuBLAS, and cuBLASLt dependency in future releases. Use the PreviewFeature flag kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 to evaluate the functional and performance impact of disabling cuBLAS and cuDNN and report back to TensorRT if there are critical regressions in your use cases.
- TensorRT Python wheel files before TensorRT 8.5, such as TensorRT 8.4, were published to the NGC PyPI repo. Starting with TensorRT 8.5, Python wheels will instead be published to upstream PyPI. This will make it easier to install TensorRT because it requires no prerequisite steps. Also, the name of the Python package for installation has changed from nvidia-tensorrt to just tensorrt.
- The C++ and Python API documentation in previous releases was included inside the tar file packaging. This release no longer bundles the documentation inside the tar file since the online documentation can be updated post release and avoids encountering mistakes found in stale documentation inside the packages.
2.5. TensorRT Release 8.5.1
These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
Announcements
- In the next TensorRT release, CUDA toolkit 10.2 support will be dropped.
- TensorRT 8.5 will be the last release supporting NVIDIA Kepler (SM 3.x) devices. Support for Maxwell (SM 5.x) devices will be dropped in TensorRT 9.0.
- In the next TensorRT release, cuDNN, cuBLAS, and cuBLASLt tactic sources will be turned off by default in builder profiling. TensorRT plans to remove the cuDNN, cuBLAS, and cuBLASLt dependency in future releases. Use the PreviewFeature flag kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 to evaluate the functional and performance impact of disabling cuBLAS and cuDNN and report back to TensorRT if there are critical regressions in your use cases.
- TensorRT Python wheel files before TensorRT 8.5, such as TensorRT 8.4, were published to the NGC PyPI repo. Starting with TensorRT 8.5, Python wheels will instead be published to upstream PyPI. This will make it easier to install TensorRT because it requires no prerequisite steps. Also, the name of the Python package for installation has changed from nvidia-tensorrt to just tensorrt.
- The C++ and Python API documentation in previous releases was included inside the tar file packaging. This release no longer bundles the documentation inside the tar file since the online documentation can be updated post release and avoids encountering mistakes found in stale documentation inside the packages.
2.6. TensorRT Release 8.4.3
These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
2.7. TensorRT Release 8.4.2
These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
2.8. TensorRT Release 8.4.1
These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
2.9. TensorRT Release 8.4.0 Early Access (EA)
These Release Notes are also applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
2.10. TensorRT Release 8.2.5
These Release Notes are also applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
- APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
- APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
- APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
2.11. TensorRT Release 8.2.4
These Release Notes are also applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
- APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
- APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
- APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
2.12. TensorRT Release 8.2.3
These release notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
- APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
- APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
- APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
2.13. TensorRT Release 8.2.2
These release notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
- APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
- APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
- APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
2.14. TensorRT Release 8.2.1
These release notes are applicable to workstation, server, and NVIDIA JetPack™ users unless appended specifically with (not applicable for Jetson platforms).
This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
Deprecated API Lifetime
- APIs deprecated prior to TensorRT 8.0 will be removed in TensorRT 9.0.
- APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
- APIs deprecated in TensorRT 8.2 will be retained until at least 11/2022.
Refer to the API documentation (C++, Python) for how to update your code to remove the use of deprecated features.
2.15. TensorRT Release 8.2.0 Early Access (EA)
These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).
This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
2.16. TensorRT Release 8.0.3
These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).
This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
2.17. TensorRT Release 8.0.2
These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).
This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.
2.18. TensorRT Release 8.0.1
This is the TensorRT 8.0.1 release notes and is applicable to x86 Linux and Windows users, as well as PowerPC and ARM Server Base System Architecture (SBSA) users on Linux only.
These release notes are applicable to workstation, server, and JetPack users unless appended specifically with (not applicable for Jetson platforms).
This release includes several fixes from the previous TensorRT 8.x.x release as well as the following additional changes. For previous TensorRT documentation, refer to the NVIDIA TensorRT Archived Documentation.