Core Concepts

TensorRT Workflow

The general TensorRT workflow consists of 3 steps:

  1. Populate a tensorrt.INetworkDefinition either with a parser or by using the TensorRT Network API (see tensorrt.INetworkDefinition for more details). The tensorrt.Builder can be used to generate an empty tensorrt.INetworkDefinition.
  2. Use the tensorrt.Builder to build a tensorrt.ICudaEngine using the populated tensorrt.INetworkDefinition.
  3. Create a tensorrt.IExecutionContext from the tensorrt.ICudaEngine and use it to perform optimized inference.
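
In code, these three steps map onto the Python API roughly as follows. This is a minimal, unscoped sketch (the model path is a placeholder and error handling is omitted); the example at the end of this section shows the recommended with-scoped form.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Step 1: populate an INetworkDefinition (here via the ONNX parser).
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("model.onnx", 'rb') as f:  # placeholder model path
    parser.parse(f.read())

# Step 2: build an ICudaEngine from the populated network.
engine = builder.build_cuda_engine(network)

# Step 3: create an IExecutionContext and use it for inference.
context = engine.create_execution_context()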

Classes Overview

Logger

Most other TensorRT classes use a logger to report errors, warnings and informative messages. TensorRT provides a basic tensorrt.Logger implementation, but it can be extended for more advanced functionality.
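
For example, a custom logger can be written by deriving from tensorrt.ILogger and implementing its log method. The sketch below simply prints every message; the formatting shown is illustrative, not part of the API.

import tensorrt as trt

class MyLogger(trt.ILogger):
    def __init__(self):
        # The base class must be initialized explicitly.
        trt.ILogger.__init__(self)

    def log(self, severity, msg):
        # Forward TensorRT messages wherever is convenient; here we just print them.
        print("[TensorRT][{}] {}".format(severity, msg))

logger = MyLogger()
builder = trt.Builder(logger)  # any class that takes a logger accepts the custom one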

Engine and Context

The tensorrt.ICudaEngine is the primary element of TensorRT. It is used to generate a tensorrt.IExecutionContext that can perform inference.
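
For example, a built engine can be serialized to a plan file and later deserialized with a tensorrt.Runtime before creating an execution context. The sketch below assumes an already built engine and the TRT_LOGGER defined in the example at the end of this section; "mnist.plan" is a placeholder path.

# Serialize a built engine to a plan file.
with open("mnist.plan", 'wb') as f:
    f.write(engine.serialize())

# Later (even in a separate process), deserialize the plan and run inference.
with open("mnist.plan", 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
    with runtime.deserialize_cuda_engine(f.read()) as engine:
        with engine.create_execution_context() as context:
            pass  # perform inference with the context here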

Builder

The tensorrt.Builder is used to build a tensorrt.ICudaEngine. To do so, it must be provided with a populated tensorrt.INetworkDefinition.
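
As a sketch of typical builder configuration with this API version (the values shown are only examples):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
    builder.max_batch_size = 1            # largest batch size the engine must support
    builder.max_workspace_size = 1 << 30  # 1 GiB of scratch memory for layer algorithms
    # ... populate the network here, with a parser or the Network API ...
    engine = builder.build_cuda_engine(network)  # returns None if the build fails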

Network

The tensorrt.INetworkDefinition represents a computational graph. In order to populate the network, TensorRT provides a suite of parsers for a variety of Deep Learning frameworks. It is also possible to populate the network manually using the Network API.
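
The sketch below builds a toy network with the Network API; the layer choices, shapes, and randomly generated NumPy weights are purely illustrative.

import numpy as np
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
    # Declare the network input (name, dtype, and CHW shape are examples).
    input_tensor = network.add_input("input", trt.float32, (1, 28, 28))
    # Illustrative weights; in practice these come from your trained model.
    weights = np.random.rand(10, 1 * 28 * 28).astype(np.float32)
    bias = np.random.rand(10).astype(np.float32)
    # Add a fully connected layer followed by a ReLU activation.
    fc = network.add_fully_connected(input_tensor, 10, weights, bias)
    relu = network.add_activation(fc.get_output(0), trt.ActivationType.RELU)
    # Mark the output tensor so the engine returns it.
    network.mark_output(relu.get_output(0))
    builder.max_workspace_size = 1 << 30
    engine = builder.build_cuda_engine(network)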

Parsers

Parsers are used to populate a tensorrt.INetworkDefinition from a model trained in a Deep Learning framework.
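
For example, with the ONNX parser, parse() returns False on failure and the recorded errors can then be inspected (the model path below is a placeholder):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, \
        trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open("model.onnx", 'rb') as model:  # placeholder model path
        if not parser.parse(model.read()):
            # On failure, the parser records one error object per problem it found.
            for i in range(parser.num_errors):
                print(parser.get_error(i))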

TensorRT Object Lifetime Management

The legacy bindings required explicit destroy() calls to properly deallocate memory. The new API automatically frees memory when objects go out of scope; even so, it is generally desirable to destroy objects as soon as they are no longer required. The preferred way to manage object lifetimes with the new Python API is to scope objects with with ... as ... clauses. For example, a typical inference pipeline using the ONNX parser might look something like this (inference code omitted for clarity):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
ONNX_MODEL = "mnist.onnx"

def build_engine():
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, \
        trt.OnnxParser(network, TRT_LOGGER) as parser:
        # Configure the builder here.
        builder.max_workspace_size = 2**30
        # In this example we use the ONNX parser, but this step should be adapted
        # to your needs: it could instead use the Caffe/UFF parser, or even the
        # Network API to build a TensorRT network manually.
        with open(ONNX_MODEL, 'rb') as model:
            parser.parse(model.read())
        # Build and return the engine. Note that the builder,
        # network and parser are destroyed when this function returns.
        return builder.build_cuda_engine(network)

def do_inference():
    with build_engine() as engine, engine.create_execution_context() as context:
        # Allocate buffers and create a CUDA stream before inference.
        # This should only be done once.
        pass
        # Preprocess input (if required), then copy to the GPU, do inference,
        # and copy the output back to the host.
        pass