NVIDIA TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference. It is designed to work in a complementary fashion with training frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on running an already-trained network quickly and efficiently on NVIDIA hardware.

TensorRT includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. The core of NVIDIA TensorRT is a C++ library that facilitates high-performance inference on NVIDIA GPUs. TensorRT takes a trained network, which consists of a network definition and a set of trained parameters, and produces a highly optimized runtime engine that performs inference for that network. Refer to the following TensorRT product documentation for more information.
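As a quick orientation before diving into those documents, here is a minimal Python sketch of that build flow: import a trained ONNX model, invoke the builder, and write out the optimized engine. The file names are placeholders and error handling is reduced to the essentials.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    # "model.onnx" is a placeholder for your trained network.
    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("failed to parse the ONNX model")

    config = builder.create_builder_config()
    serialized_engine = builder.build_serialized_network(network, config)
    with open("model.engine", "wb") as f:
        f.write(serialized_engine)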

Documentation Center
These documents provide information regarding the current NVIDIA TensorRT 8.6.0 Early Access (EA) release.
Getting Started
This NVIDIA TensorRT Quick Start Guide is a starting point for developers who want to try out the TensorRT SDK; specifically, it demonstrates how to quickly construct an application to run inference on a TensorRT engine.
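A minimal sketch of such an application, assuming a serialized engine file such as the model.engine produced above, a single input and single output with fixed shapes (the (1, 4) and (1, 8) shapes are placeholders), and pycuda for device buffers:

    import numpy as np
    import pycuda.autoinit  # noqa: F401 -- creates a CUDA context on import
    import pycuda.driver as cuda
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open("model.engine", "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()

    # Host buffers; shapes are placeholders for your model's I/O.
    h_input = np.random.random((1, 4)).astype(np.float32)
    h_output = np.empty((1, 8), dtype=np.float32)

    # Device buffers, one per binding, in binding order.
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)

    cuda.memcpy_htod(d_input, h_input)
    context.execute_v2([int(d_input), int(d_output)])
    cuda.memcpy_dtoh(h_output, d_output)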
NVIDIA TensorRT is a C++ library that facilitates high-performance inference on NVIDIA GPUs. It is designed to work in conjunction with deep learning frameworks that are commonly used for training. TensorRT focuses specifically on running an already-trained network quickly and efficiently on a GPU to generate a result, a process also known as inference. These release notes describe the key features, software enhancements and improvements, and known issues for the latest TensorRT release.
These support matrices list the platforms, features, and hardware capabilities supported by the NVIDIA TensorRT APIs, parsers, and layers.
This NVIDIA TensorRT Installation Guide provides the installation requirements, a list of what is included in the TensorRT package, and step-by-step instructions for installing TensorRT.
Inference Library
This is the API Reference documentation for the NVIDIA TensorRT library. The following set of APIs allows developers to import pre-trained models, calibrate networks for INT8, and build and deploy optimized networks with TensorRT. Networks can be imported from ONNX. They may also be created programmatically using the C++ or Python API by instantiating individual layers and setting parameters and weights directly.
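A minimal sketch of that programmatic route in Python; the single matrix-multiply-plus-ReLU layer and the constant weights are illustrative only:

    import numpy as np
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

    # y = relu(x @ W), with W supplied directly as weights.
    x = network.add_input("x", trt.float32, (1, 4))
    w = network.add_constant((4, 8), trt.Weights(np.ones((4, 8), dtype=np.float32)))
    mm = network.add_matrix_multiply(x, trt.MatrixOperation.NONE,
                                     w.get_output(0), trt.MatrixOperation.NONE)
    relu = network.add_activation(mm.get_output(0), trt.ActivationType.RELU)
    network.mark_output(relu.get_output(0))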
This NVIDIA TensorRT Developer Guide demonstrates how to use the C++ and Python APIs to implement the most common deep learning layers. It shows how you can take an existing model built with a deep learning framework and build a TensorRT engine using the provided parsers. The Developer Guide also provides step-by-step instructions for common user tasks such as creating a TensorRT network definition, invoking the TensorRT builder, serializing and deserializing an engine, and feeding the engine with data to perform inference, all while using either the C++ or Python API.
In TensorRT, operators represent distinct flavors of mathematical and programmatic operations. The following sections describe every operator that TensorRT supports. The minimum workspace required by TensorRT depends on the operators used by the network. A suggested minimum build-time setting is 16 MB. Regardless of the maximum workspace value provided to the builder, TensorRT will allocate at runtime no more than the workspace it requires.
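In the Python API, this limit is set on the builder configuration; a short sketch using the 16 MB suggestion above:

    import tensorrt as trt

    builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
    config = builder.create_builder_config()
    # 16 MB build-time workspace ceiling; at runtime TensorRT allocates
    # no more than the network actually requires.
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 16 << 20)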
This Samples Support Guide provides an overview of all the supported NVIDIA TensorRT samples included on GitHub and in the product package. The TensorRT samples specifically help in areas such as recommenders, machine comprehension, character recognition, image classification, and object detection.
Optimized Frameworks
The TensorRT container is an easy-to-use container for TensorRT development. The container allows you to build, modify, and execute TensorRT samples. These release notes provide a list of key features, packaged software in the container, software enhancements and improvements, and known issues for the latest and earlier releases. The TensorRT container is released monthly to provide you with the latest NVIDIA deep learning software libraries and GitHub code contributions that have been sent upstream. The libraries and contributions have all been tested, tuned, and optimized.
Tools
This is the Python API documentation for ONNX GraphSurgeon. ONNX GraphSurgeon provides a convenient way to create and modify ONNX models.
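A small sketch of a typical edit, assuming a model containing single-output Dropout nodes to bypass (file names are placeholders):

    import onnx
    import onnx_graphsurgeon as gs

    graph = gs.import_onnx(onnx.load("model.onnx"))

    # Bypass each Dropout by wiring its producer straight to its consumers,
    # then prune the now-dangling nodes and re-sort the graph.
    for node in [n for n in graph.nodes if n.op == "Dropout"]:
        node.i().outputs = node.outputs
        node.outputs.clear()

    graph.cleanup().toposort()
    onnx.save(gs.export_onnx(graph), "model_modified.onnx")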
This is the Python API documentation for Polygraphy. Polygraphy is a toolkit designed to assist in running and debugging deep learning models in various frameworks.
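A minimal sketch of the Python API, which lazily builds a TensorRT engine from an ONNX model and runs inference on it; the model path and the input name "x" are placeholders:

    import numpy as np
    from polygraphy.backend.trt import (EngineFromNetwork, NetworkFromOnnxPath,
                                        TrtRunner)

    # Lazy loaders: the engine is built on first use inside the runner.
    build_engine = EngineFromNetwork(NetworkFromOnnxPath("model.onnx"))

    with TrtRunner(build_engine) as runner:
        outputs = runner.infer(feed_dict={"x": np.ones((1, 4), dtype=np.float32)})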
PyTorch-Quantization is a toolkit for training and evaluating PyTorch models with simulated quantization. Quantization can be added to the model automatically or manually, allowing the model to be tuned for accuracy and performance. The quantized model can be exported to ONNX and imported into an upcoming version of TensorRT.
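A short sketch of the automatic mode; the torchvision model is just an example stand-in:

    import torchvision
    from pytorch_quantization import quant_modules

    # Replace supported torch.nn layers with quantized counterparts for all
    # models constructed after this call.
    quant_modules.initialize()
    model = torchvision.models.resnet50(pretrained=True)
    # ... calibrate, fine-tune, then export to ONNX with Q/DQ nodes ...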
NVIDIA TensorFlow Quantization Toolkit provides a simple API to quantize a given Keras model. Initially, the network is trained on the target dataset until fully converged. The quantization step consists of inserting Q/DQ nodes in the pretrained network to simulate quantization during training. The network is then retrained for a few epochs to recover accuracy in a step called fine-tuning.
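A sketch of that flow, assuming the toolkit's quantize_model entry point and an example pretrained Keras model standing in for your own converged network:

    import tensorflow as tf
    from tensorflow_quantization import quantize_model

    # A pretrained, converged model stands in for your own network.
    model = tf.keras.applications.ResNet50(weights="imagenet")

    # Insert Q/DQ nodes to simulate quantization during training.
    q_model = quantize_model(model)
    # ... fine-tune q_model for a few epochs, then export to ONNX ...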
Archives

These archives provide access to previously released TensorRT documentation versions.