Overview#

This section demonstrates how to use the C++ and Python APIs to implement the most common deep learning layers. It shows how to take an existing model built with a deep learning framework and build a TensorRT engine using the provided parsers.

Samples#

The Sample Support Guide illustrates many of the topics discussed in this section.

Complementary GPU Features#

Multi-Instance GPU, or MIG, is a feature of NVIDIA GPUs based on the NVIDIA Ampere architecture or later that enables user-directed partitioning of a single GPU into multiple smaller GPUs. The physical partitions provide dedicated compute and memory slices with quality of service and independent execution of parallel workloads on fractions of the GPU. For TensorRT applications with low GPU utilization, MIG can produce higher throughput with little or no latency impact. The optimal partitioning scheme is application-specific.

Complementary Software#

The NVIDIA Triton Inference Server is a higher-level library providing optimized inference across CPUs and GPUs. It can start and manage multiple models, and it exposes REST and gRPC endpoints for serving inference.
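As a rough illustration of the REST endpoint mentioned above, the following sketch builds (but does not send) an HTTP inference request using only the Python standard library. It assumes the server follows the KServe v2 inference protocol path `/v2/models/<name>/infer` and listens for HTTP on port 8000. The server address, model name, and tensor name are hypothetical placeholders.

```python
import json
import urllib.request


def build_infer_request(base_url, model_name, input_name, shape, data, datatype="FP32"):
    """Build (but do not send) an HTTP inference request.

    Assumes a KServe v2 style endpoint; all argument values are supplied by the caller.
    """
    body = {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": datatype,
                "data": list(data),
            }
        ]
    }
    return urllib.request.Request(
        url=f"{base_url}/v2/models/{model_name}/infer",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server address, model name, and input tensor name.
request = build_infer_request(
    "http://localhost:8000", "my_model", "input", [1, 4], [0.1, 0.2, 0.3, 0.4]
)
# To send it to a running server: urllib.request.urlopen(request)
```

For production use, the dedicated Triton client libraries are likely a better fit than hand-built requests; this sketch only shows the shape of the exchange.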

NVIDIA DALI provides high-performance primitives for preprocessing image, audio, and video data. TensorRT inference can be integrated as a custom operator in a DALI pipeline. A working example of TensorRT inference integrated into DALI can be found on GitHub: DALI.

Torch-TensorRT (Torch-TRT) is a PyTorch-TensorRT compiler that converts PyTorch modules into TensorRT engines. Internally, the PyTorch modules are converted into TorchScript/FX modules based on the selected Intermediate Representation (IR). The compiler selects subgraphs of the PyTorch graphs to be accelerated by TensorRT while leaving Torch to execute the rest of the graph natively. The result is still a PyTorch module that you can execute as usual. For examples, refer to GitHub: Examples for Torch-TRT.

The TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, and distillation. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs. TensorRT Model Optimizer replaces the PyTorch Quantization Toolkit and TensorFlow-Quantization Toolkit, which are no longer maintained. To quantize TensorFlow models, export to ONNX and then use Model Optimizer to quantize the model.

TensorRT is integrated with NVIDIA’s profiling tool, NVIDIA Nsight Systems.

TensorRT’s core functionalities are now accessible via NVIDIA’s Nsight Deep Learning Designer, an IDE for ONNX model editing, performance profiling, and TensorRT engine building.

A restricted subset of TensorRT is certified for use in NVIDIA DRIVE products. Some APIs are marked for use only in NVIDIA DRIVE and are not supported for general use.

ONNX#

TensorRT’s primary means of importing a trained model from a framework is the ONNX interchange format. TensorRT ships with an ONNX parser library to assist in importing models. Where possible, the parser is backward compatible back to opset 9; the ONNX Model Opset Version Converter can assist in resolving incompatibilities.

The GitHub version of the parser may support later opsets than the version shipped with TensorRT. Refer to the ONNX-TensorRT operator support matrix for the latest information on the supported opset and operators. For TensorRT deployment, we recommend exporting to the latest available ONNX opset.

The ONNX operator support list for TensorRT can be found on GitHub: Supported ONNX Operators.
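The ONNX import path described above can be sketched with the TensorRT Python API as follows. This is a minimal sketch, not a complete build script: it assumes the `tensorrt` Python package is installed, uses default builder settings, and takes hypothetical file paths from the caller. The import is done inside the function so that the module can be loaded on machines without TensorRT.

```python
def build_engine_from_onnx(onnx_path, plan_path):
    """Parse an ONNX file and write a serialized TensorRT engine (plan file)."""
    import tensorrt as trt  # imported lazily so this module loads without TensorRT

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)

    # Explicit-batch flag: needed by the ONNX parser in TensorRT 8.x. In
    # TensorRT 10.x networks are always explicit batch and the flag is deprecated.
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)

    # Feed the raw ONNX bytes to the parser and surface any parser errors.
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    # Build with a default configuration and save the serialized engine.
    config = builder.create_builder_config()
    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("Engine build failed")
    with open(plan_path, "wb") as f:
        f.write(serialized)
    return plan_path
```

A call such as `build_engine_from_onnx("model.onnx", "model.plan")` (both file names are placeholders) requires a supported NVIDIA GPU and driver at build time.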

PyTorch natively supports ONNX export. For TensorFlow, the recommended method is tf2onnx.

After exporting a model to ONNX, a good first step is to run constant folding using Polygraphy. This often solves TensorRT conversion issues in the ONNX parser and simplifies the workflow. For details, refer to this example. In some cases, modifying the ONNX model may be necessary, such as replacing subgraphs with plugins or reimplementing unsupported operations in terms of other operations. To make this process easier, you can use ONNX-GraphSurgeon.
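The constant-folding step can be scripted by invoking the Polygraphy command-line tool from Python. The sketch below assumes the `polygraphy` CLI is installed and on `PATH`; the helper names and file names are hypothetical. The command itself is built separately from its execution so that it can be inspected or logged first.

```python
import shutil
import subprocess


def fold_constants_command(src, dst):
    """Return the Polygraphy CLI invocation that folds constants in an ONNX model."""
    return ["polygraphy", "surgeon", "sanitize", src, "--fold-constants", "-o", dst]


def fold_constants(src, dst):
    """Run constant folding; raises if the polygraphy CLI is not available."""
    if shutil.which("polygraphy") is None:
        raise FileNotFoundError("polygraphy CLI not found on PATH")
    subprocess.run(fold_constants_command(src, dst), check=True)
    return dst


# Hypothetical input and output file names.
command = fold_constants_command("model.onnx", "model_folded.onnx")
```

The folded model can then be passed to the ONNX parser in place of the original export.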

Code Analysis Tools#

For guidance using the Valgrind and Clang sanitizer tools with TensorRT, refer to the Troubleshooting section.

API Versioning#

The TensorRT version number (MAJOR.MINOR.PATCH) follows Semantic Versioning 2.0.0 for its public APIs and library ABIs. Version numbers change as follows:

MAJOR version when making incompatible API or ABI changes
MINOR version when adding functionality in a backward-compatible manner
PATCH version when making backward-compatible bug fixes

Note that semantic versioning does not extend to serialized objects. To reuse plan files and timing caches, version numbers must match across major, minor, patch, and build versions (with some exceptions for the safety runtime as detailed in the NVIDIA DRIVE OS Developer Guide). Calibration caches can typically be reused within a major version, but compatibility beyond a specific patch version is not guaranteed.
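The difference between the two rules above (semantic versioning for APIs, exact matching for plan files and timing caches) can be illustrated with a small sketch. The class and function names are hypothetical helpers, not part of TensorRT, and the version strings are made-up examples. The safety-runtime exceptions and the looser calibration-cache rule are not modeled.

```python
from typing import NamedTuple


class TrtVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    build: int = 0

    @classmethod
    def parse(cls, text):
        """Parse 'MAJOR.MINOR.PATCH' or 'MAJOR.MINOR.PATCH.BUILD'."""
        parts = [int(p) for p in text.split(".")]
        if not 3 <= len(parts) <= 4:
            raise ValueError(f"expected 3 or 4 version components, got {text!r}")
        return cls(*parts)


def api_compatible(built_against, installed):
    """Semantic-versioning rule: same MAJOR, and the installed library is not older."""
    return installed.major == built_against.major and installed[:3] >= built_against[:3]


def plan_reusable(plan_version, runtime_version):
    """Serialized-object rule: major, minor, patch, and build must all match."""
    return plan_version == runtime_version
```

For example, under these assumptions an application built against 10.0.1 is API compatible with an installed 10.3.0, but a plan file built with 10.3.0.26 is not reusable with a 10.3.0.27 runtime.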

Deprecation Policy#

Deprecation informs developers that some APIs and tools are no longer recommended. Beginning with version 8.0, TensorRT has the following deprecation policy:

Deprecation notices are communicated in the Release Notes.
When using the C++ API:
- API functions are marked with the TRT_DEPRECATED_API macro.
- Enums are marked with the TRT_DEPRECATED_ENUM macro.
- All other locations are marked with the TRT_DEPRECATED macro.
- Classes, functions, and objects will have a statement documenting when they were deprecated.
When using the Python API, deprecated methods and classes will issue deprecation warnings at runtime if they are used.
TensorRT provides a 12-month migration period after the deprecation.
APIs and tools continue to work during the migration period.
After the migration period ends, APIs and tools are removed in a manner consistent with semantic versioning.

For any APIs and tools specifically deprecated in TensorRT 7.x, the 12-month migration period starts from the TensorRT 8.0 GA release date.
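Because the Python API reports deprecations as runtime warnings, it can be useful to collect them during testing so that migration work is scheduled before the 12-month period ends. The sketch below uses only the standard `warnings` module and assumes the warnings are raised with the standard `DeprecationWarning` category. `legacy_call` is a hypothetical stand-in for a deprecated method, not a TensorRT function.

```python
import warnings


def legacy_call():
    """Hypothetical stand-in for a deprecated API method."""
    warnings.warn("legacy_call is deprecated", DeprecationWarning, stacklevel=2)
    return 42


def collect_deprecations(func):
    """Run func and return (result, list of deprecation messages it emitted)."""
    with warnings.catch_warnings(record=True) as caught:
        # Make sure deprecation warnings are never suppressed while recording.
        warnings.simplefilter("always", DeprecationWarning)
        result = func()
    messages = [
        str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)
    ]
    return result, messages


result, messages = collect_deprecations(legacy_call)
```

An alternative for continuous integration is to run the interpreter with `-W error::DeprecationWarning`, which turns each such warning into an exception.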

Hardware Support Lifetime#

TensorRT 8.5.3 was the last release supporting NVIDIA Kepler (SM 3.x) and NVIDIA Maxwell (SM 5.x) devices. These devices are no longer supported in TensorRT 8.6. NVIDIA Pascal (SM 6.x) devices were deprecated in TensorRT 8.6. TensorRT 10.4 was the last release supporting NVIDIA Volta (SM 7.0) devices. For more information, refer to the Support Matrix section.
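The cutoffs listed above can be encoded as a small lookup, for example to warn early in a deployment script. This is a sketch with hypothetical names; the table contains only the releases stated in the paragraph above. NVIDIA Pascal is omitted because the text gives its deprecation release but not its last supporting release. A `None` result means only that the device is not listed as dropped here, so the Support Matrix remains the authoritative source.

```python
# (architecture, SM version prefix, last supporting TensorRT release), from the text above.
LAST_SUPPORTING_RELEASE = [
    ("Kepler", "3.", (8, 5, 3)),
    ("Maxwell", "5.", (8, 5, 3)),
    ("Volta", "7.0", (10, 4)),
]


def dropped_in(sm, trt_version):
    """Return the architecture name if `sm` is listed as unsupported in `trt_version`.

    `sm` is a compute capability string such as "7.0"; `trt_version` is a
    (major, minor, patch) tuple. Returns None if the device is not listed as dropped.
    """
    for name, prefix, last in LAST_SUPPORTING_RELEASE:
        # Compare only as many components as the cutoff specifies.
        if sm.startswith(prefix) and trt_version[: len(last)] > last:
            return name
    return None
```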

Support#

Support, resources, and information about TensorRT can be found online at https://developer.nvidia.com/tensorrt. This includes blogs, samples, and more.

In addition, you can access the NVIDIA DevTalk TensorRT forum at https://devtalk.nvidia.com/default/board/304/tensorrt/ for all things related to TensorRT. The forum is a place to find answers, make connections, and join discussions with customers, developers, and TensorRT engineers.

Reporting Bugs#

NVIDIA appreciates all types of feedback. If you encounter any problems, follow the instructions in the Reporting TensorRT Issues section to report them.