TensorRT Overview

The core of NVIDIA® TensorRT™ is a C++ library that facilitates high-performance inference on NVIDIA graphics processing units (GPUs). TensorRT takes a trained network, which consists of a network definition and a set of trained parameters, and produces a highly optimized runtime engine that performs inference for that network.

You can describe a TensorRT network by using a C++ or Python API, or you can import an existing Caffe, ONNX, or TensorFlow model by using one of the provided parsers.
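As a rough illustration of the parser path, the sketch below parses an ONNX model and builds a serialized engine. It is a minimal sketch, not the definitive workflow: it assumes the TensorRT 8.x Python bindings (`tensorrt.Builder`, `tensorrt.OnnxParser`, `builder.build_serialized_network`), and the helper name `build_engine_from_onnx` is invented for this example. Older releases use different builder-config calls (for example `config.max_workspace_size`).

```python
def build_engine_from_onnx(onnx_path, workspace_bytes=1 << 30):
    """Sketch: parse an ONNX model and build a serialized TensorRT engine.

    Assumes the TensorRT 8.x Python bindings; the function name is a
    hypothetical helper, not a TensorRT API.
    """
    import tensorrt as trt  # requires an NVIDIA GPU and a TensorRT install

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    # ONNX models require an explicit-batch network definition.
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    # Cap the scratch memory the optimizer may use while timing kernels.
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE,
                                 workspace_bytes)
    # Returns a serialized engine plan, ready to save or deserialize.
    return builder.build_serialized_network(network, config)
```

The serialized plan returned here is what you would typically write to disk and later load with the TensorRT runtime for inference.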

TensorRT provides C++ and Python APIs for expressing deep learning models through the Network Definition API, as well as parsers for loading predefined models, which TensorRT can then optimize and run on an NVIDIA GPU. TensorRT applies graph optimizations, layer fusion, and other optimizations, and finds the fastest implementation of the model by leveraging a diverse collection of highly optimized kernels. TensorRT also supplies a runtime that you can use to execute the network on all NVIDIA GPUs from the NVIDIA Kepler™ generation onward.
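The runtime side of this workflow can be sketched as follows: deserialize a previously built engine plan and create an execution context for inference. This is a hedged sketch assuming the TensorRT 8.x Python bindings (`tensorrt.Runtime`, `deserialize_cuda_engine`); the helper name `load_engine` is hypothetical.

```python
def load_engine(plan_path):
    """Sketch: load a serialized TensorRT engine plan with the runtime.

    Assumes the TensorRT 8.x Python bindings; `load_engine` is a
    hypothetical helper name, not a TensorRT API.
    """
    import tensorrt as trt  # requires an NVIDIA GPU and a TensorRT install

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open(plan_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    # An execution context holds the per-inference state for this engine;
    # one engine can serve multiple contexts (e.g., one per CUDA stream).
    context = engine.create_execution_context()
    return engine, context
```

In practice you would then bind input and output device buffers to the context and enqueue inference on a CUDA stream.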

TensorRT also includes optional high-speed, mixed-precision capabilities that were introduced with the NVIDIA Tegra™ X1 and extended with the NVIDIA Pascal™, NVIDIA Volta™, and NVIDIA Turing™ architectures.

The TensorRT container allows TensorRT samples to be built, modified, and executed. For more information about the TensorRT samples, see the TensorRT Sample Support Guide.

For a complete list of installation options and instructions, refer to Installing TensorRT.