Architecture Overview#
This section provides an overview of TensorRT’s architecture, design principles, and ecosystem. It introduces key concepts and complementary tools that work alongside TensorRT for optimized inference deployment.
Samples#
The Sample Support Guide illustrates many of the topics discussed in this section.
Complementary GPU Features#
Multi-Instance GPU (MIG) is a feature of NVIDIA GPUs with the NVIDIA Ampere architecture or later. It enables user-directed partitioning of a single GPU into multiple smaller GPUs.
The physical partitions provide dedicated compute and memory slices with quality of service, and they support independent execution of parallel workloads on fractions of the GPU.
For TensorRT applications with low GPU utilization, MIG can produce higher throughput with little or no latency impact. The optimal partitioning scheme is application-specific.
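As a minimal sketch, one way to pin a TensorRT process to a single MIG slice is to restrict CUDA device enumeration before any CUDA work begins. The MIG UUID and plan file name below are placeholders; list the real MIG devices on a system with nvidia-smi -L:

```python
import os

# Placeholder MIG device UUID; substitute one reported by `nvidia-smi -L`.
# This must be set before CUDA is initialized in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-4b1e7f6c-0000-0000-0000-000000000000"

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# "model.plan" is a placeholder for a previously built engine;
# execution is then confined to the selected MIG slice.
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
```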
Complementary Software#
The NVIDIA Triton Inference Server is a higher-level library providing optimized inference across CPUs and GPUs. It provides capabilities for starting and managing multiple models, as well as REST and gRPC endpoints for serving inference.
NVIDIA DALI provides high-performance primitives for preprocessing image, audio, and video data. TensorRT inference can be integrated as a custom operator in a DALI pipeline. A working example of TensorRT inference integrated into DALI is available on GitHub: DALI.
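The following is a minimal sketch of the preprocessing side of such a pipeline; the directory path, batch size, and image size are placeholder values, and the resulting batches would then be consumed by TensorRT inference, for example through the custom-operator integration shown in the linked example:

```python
from nvidia.dali import fn, pipeline_def

# Placeholder settings; adjust to your dataset and model input shape.
@pipeline_def(batch_size=8, num_threads=2, device_id=0)
def preprocess():
    jpegs, labels = fn.readers.file(file_root="images/")
    images = fn.decoders.image(jpegs, device="mixed")  # GPU-accelerated decode
    images = fn.resize(images, resize_x=224, resize_y=224)
    return images, labels

pipe = preprocess()
pipe.build()
images, labels = pipe.run()  # feed these preprocessed batches to TensorRT inference
```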
Torch-TensorRT (Torch-TRT) is a PyTorch-TensorRT compiler that converts PyTorch modules into TensorRT engines.
Internally, the PyTorch modules are converted into TorchScript/FX modules based on the selected Intermediate Representation (IR). The compiler selects subgraphs of the PyTorch graphs to be accelerated through TensorRT while leaving Torch to execute the rest of the graph natively.
The result is still a PyTorch module that you can execute as usual. For examples, refer to GitHub: Examples for Torch-TRT.
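A minimal sketch of this workflow, assuming a toy module and an input shape chosen only for illustration:

```python
import torch
import torch_tensorrt

# Toy module for illustration; any eval-mode nn.Module on the GPU works the same way.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
).eval().cuda()

# TensorRT-compatible subgraphs are lowered to TensorRT engines;
# the remaining operations keep running in PyTorch.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.float16},
)

# The compiled result is still a callable PyTorch module.
output = trt_model(torch.randn(1, 3, 224, 224, device="cuda"))
```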
The Model Optimizer is a unified library of advanced model optimization techniques such as quantization, pruning, and distillation. It compresses deep learning models for downstream deployment frameworks such as TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
Model Optimizer replaces the PyTorch Quantization Toolkit and TensorFlow-Quantization Toolkit, which are no longer maintained.
To quantize TensorFlow models, export to ONNX and then use Model Optimizer to quantize the model.
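As a sketch of post-training quantization with Model Optimizer, where the model, batch shapes, and calibration loop below are illustrative stand-ins:

```python
import torch
import modelopt.torch.quantization as mtq  # from the nvidia-modelopt package

# Illustrative model; substitute the model you intend to deploy.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).cuda()

# Calibration loop: run a few representative batches so activation
# statistics can be collected (random data here, real data in practice).
def forward_loop(m):
    for _ in range(8):
        m(torch.randn(4, 128, device="cuda"))

# Post-training INT8 quantization; other presets (for example, FP8) are also provided.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```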
TensorRT is integrated with NVIDIA’s profiling tool, NVIDIA Nsight Systems.
TensorRT’s core functionalities are now accessible through NVIDIA’s Nsight Deep Learning Designer, an IDE for ONNX model editing, performance profiling, and TensorRT engine building.
A restricted subset of TensorRT is certified for use in NVIDIA DRIVE products. Some APIs are marked for use only in NVIDIA DRIVE and are not supported for general use.
ONNX#
TensorRT’s primary means of importing a trained model from a framework is the ONNX interchange format. TensorRT ships with an ONNX parser library to assist in importing models. Where possible, the parser is backward compatible to opset 9. The ONNX Model Opset Version Converter can assist in resolving incompatibilities.
The GitHub version can support later opsets than the version shipped with TensorRT. Refer to the ONNX-TensorRT operator support matrix for the latest information on the supported opset and operators. For TensorRT deployment, we recommend exporting to the latest available ONNX opset.
The ONNX operator support list for TensorRT can be found on GitHub: Supported ONNX Operators.
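As a minimal sketch, importing an ONNX model and building a serialized engine with the Python API looks like the following; the file names are placeholders, and the explicit-batch default assumes a recent TensorRT release:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit batch is the default in recent releases
parser = trt.OnnxParser(network, logger)

# "model.onnx" is a placeholder for your exported model.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)
```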
PyTorch natively supports ONNX export. For TensorFlow, use tf2onnx.
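For example, a minimal PyTorch export might look like the following, where the toy model, shapes, and opset version are illustrative:

```python
import torch

# Toy model for illustration; export your trained model the same way.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3, padding=1),
    torch.nn.ReLU(),
).eval()

dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=17,  # prefer the newest opset your toolchain supports
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```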
After exporting a model to ONNX, a good first step is to run constant folding using Polygraphy. This often resolves TensorRT conversion issues in the ONNX parser and simplifies the workflow. For details, refer to this example. In some cases, further modification of the ONNX model can be necessary, such as replacing subgraphs with plugins or reimplementing unsupported operations in terms of supported ones. To make this process easier, use ONNX-GraphSurgeon.
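A minimal sketch of this step with ONNX-GraphSurgeon, using placeholder file names:

```python
import onnx
import onnx_graphsurgeon as gs

# Equivalent command-line route with Polygraphy:
#   polygraphy surgeon sanitize model.onnx --fold-constants -o folded.onnx

graph = gs.import_onnx(onnx.load("model.onnx"))

# Fold constant subgraphs, then drop nodes and tensors that no longer
# contribute to the graph outputs, and restore topological order.
graph.fold_constants()
graph.cleanup().toposort()

onnx.save(gs.export_onnx(graph), "folded.onnx")
```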
Code Analysis Tools#
For guidance using the Valgrind and Clang sanitizer tools with TensorRT, refer to the Troubleshooting section.
API Versioning#
The TensorRT version number (MAJOR.MINOR.PATCH) follows Semantic Versioning 2.0.0 for its public APIs and library ABIs. Version numbers change as follows:
MAJOR version when making incompatible API or ABI changes.
MINOR version when adding functionality in a backward-compatible manner.
PATCH version when making backward-compatible bug fixes.
Semantic versioning does not extend to serialized objects. To reuse plan files and timing caches, version numbers must match across major, minor, patch, and build versions. Some exceptions exist for the safety runtime as detailed in the NVIDIA DriveOS Developer Guide.
Calibration caches can typically be reused within a major version, but compatibility beyond a specific patch version is not guaranteed.
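As a small illustrative guard, an application that caches plan files can check the installed TensorRT version before attempting to deserialize one; the expected version string below is a placeholder:

```python
import tensorrt as trt

# Placeholder: record the exact TensorRT version used when the engine was built.
EXPECTED_VERSION = "10.0.1"

if trt.__version__ != EXPECTED_VERSION:
    raise RuntimeError(
        f"Cached engine was built with TensorRT {EXPECTED_VERSION}, "
        f"but {trt.__version__} is installed; rebuild the engine."
    )
```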
Deprecation Policy#
Deprecation informs developers that some APIs and tools are no longer recommended. TensorRT has the following deprecation policy, beginning with version 8.0:
Deprecation notices are communicated in the Release Notes.
When using the C++ API:
API functions are marked with the TRT_DEPRECATED_API macro.
Enums are marked with the TRT_DEPRECATED_ENUM macro.
All other locations are marked with the TRT_DEPRECATED macro.
Classes, functions, and objects will have a statement documenting when they were deprecated.
When using the Python API, deprecated methods and classes will issue deprecation warnings at runtime if they are used (see the sketch after this list).
TensorRT provides a 12-month migration period after the deprecation.
APIs and tools continue to work during the migration period.
After the migration period ends, APIs and tools are removed in a manner consistent with semantic versioning.
For any APIs and tools specifically deprecated in TensorRT 7.x, the 12-month migration period starts from the TensorRT 8.0 GA release date.
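As a minimal sketch of how these Python deprecation warnings can be surfaced during testing, the standard warnings module can escalate them to errors; this is a generic Python technique, not a TensorRT-specific API:

```python
import warnings

# Escalate deprecation warnings to errors (useful in CI) so that calls to
# deprecated TensorRT Python APIs fail loudly before the migration period ends.
warnings.simplefilter("error", DeprecationWarning)

import tensorrt as trt  # import after configuring the warning filter

# Any use of a deprecated TensorRT method or class from here on raises an error.
```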
Hardware Support Lifetime#
TensorRT 8.5.3 was the last release supporting NVIDIA Kepler (SM 3.x) and NVIDIA Maxwell (SM 5.x) devices. These devices are no longer supported in TensorRT 8.6. NVIDIA Pascal (SM 6.x) devices were deprecated in TensorRT 8.6. TensorRT 10.4 was the last release supporting NVIDIA Volta (SM 7.0) devices. Refer to the Support Matrix section for more information.
Support#
Support, resources, and information about TensorRT can be found online at https://developer.nvidia.com/tensorrt. This includes blogs, samples, and more.
You can also access the NVIDIA DevTalk TensorRT forum at https://devtalk.nvidia.com/default/board/304/tensorrt/ for all things related to TensorRT. The forum is a place to find answers, make connections, and join discussions with customers, developers, and TensorRT engineers.
Reporting Bugs#
NVIDIA appreciates all types of feedback. If you encounter any problems, follow the instructions in the Reporting TensorRT Issues section to report the issues.