Optimizing and Profiling with TensorRT

The NVIDIA TensorRT SDK provides high-performance inference for machine learning models. Models exported from TAO can be optimized and profiled directly with TensorRT using trtexec, a command-line wrapper for quickly prototyping models with TensorRT without writing your own inference application. It serves three main purposes:

Benchmarking (profiling) networks on random or user-provided input data
Generating serialized TensorRT engines from models. In the case of TAO, models are in ONNX or UFF format.
Generating a serialized timing cache from the TensorRT builder

trtexec has several command-line flags that customize the inputs, outputs, and TensorRT build configuration of a model, including network precision, layer-wise precision, and the number of profiling iterations.

These are the most commonly used CLI arguments:

--onnx=<model>: Specify the input ONNX model.
--uff=<model>: Specify the input UFF model.
--output=<tensor>: Specify output tensor names. Only required if the input models are in UFF.
--maxBatch=<BS>: Specify the maximum batch size to build the engine with. Only needed if the input models are in UFF or Caffe formats. If the input model is in ONNX format, use the --minShapes, --optShapes, and --maxShapes flags to control the range of input shapes, including batch size.
--minShapes=<shapes>: The minimum shape of the input tensor. This input is formatted as “<input_node_name>:NxCxHxW”, where input_node_name is the name of the model’s input node and N, C, H, and W are the input batch size, channels, height, and width of the tensor. For models with multiple input nodes, separate the entries with commas. Only required if the model is in ONNX format.
--optShapes=<shapes>: The optimum shape of the input tensor, in the same “<input_node_name>:NxCxHxW” format as --minShapes. For models with multiple input nodes, separate the entries with commas. Only required if the model is in ONNX format.
--maxShapes=<shapes>: The maximum shape of the input tensor, in the same “<input_node_name>:NxCxHxW” format as --minShapes. For models with multiple input nodes, separate the entries with commas. Only required if the model is in ONNX format.
--saveEngine=<file>: Specify the path to save the engine to.
--fp16, --int8, --noTF32, and --best: Specify network-level precision.
--timingCacheFile=<file>: Specify the timing cache to load from and save to.
--verbose: Turn on verbose logging.
--skipInference: Build and save the engine without running inference.
--useDLACore=N: Use the specified DLA core for layers that support DLA.
--allowGPUFallback: Allow layers unsupported on DLA to run on GPU instead.
--loadEngine=<file>: Load the engine from a serialized plan file instead of building it from input ONNX, UFF, or Caffe model.
--batch=<N>: Specify the batch size to run the inference with. Only needed if the input models are in UFF or Caffe formats. If the input model is in ONNX format or if the engine is built with an explicit batch dimension, use --shapes instead.
--shapes=<shapes>: Specify the input shapes to run the inference with.
--calib=<file>: Read the INT8 calibration cache file.
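
As a rough sketch of how these flags combine, the script below assembles a trtexec command that builds an FP16 engine from an ONNX model with a dynamic batch dimension. The model path, engine path, input node name (input_1), and the 3x544x960 resolution are illustrative placeholders, not values taken from any particular TAO model; substitute the ones that match your export. The script prints the command and runs it only when trtexec and the model file are actually present.

```shell
#!/bin/sh
# Sketch: build an FP16 engine from an ONNX model with a dynamic batch size.
# Every value below is a hypothetical placeholder; adjust it to your model.
set -eu

MODEL="model.onnx"          # hypothetical path to the exported ONNX model
ENGINE="model.fp16.engine"  # hypothetical path for the serialized engine
INPUT="input_1"             # hypothetical input node name
CHW="3x544x960"             # hypothetical channels x height x width

# Store the command in the positional parameters so that each flag
# remains a single, correctly quoted argument.
set -- trtexec \
  --onnx="$MODEL" \
  --minShapes="${INPUT}:1x${CHW}" \
  --optShapes="${INPUT}:8x${CHW}" \
  --maxShapes="${INPUT}:16x${CHW}" \
  --fp16 \
  --saveEngine="$ENGINE" \
  --skipInference

# Show the command for review.
echo "$@"

# Run it only if trtexec is on PATH and the model file exists.
if command -v trtexec >/dev/null 2>&1 && [ -f "$MODEL" ]; then
  "$@"
fi
```

Here the batch dimension ranges from 1 to 16 while the other dimensions stay fixed. Dropping --skipInference would make trtexec run inference on the new engine right after building it.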

The trtexec tool ships as part of the NVIDIA TensorRT SDK. Refer to the official TensorRT documentation for installation and usage details on each platform.
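
Once trtexec is installed, a previously saved engine can be profiled without rebuilding it. The following is a minimal sketch using only the --loadEngine and --shapes flags described above; the engine path, input node name (input_1), and input shape are hypothetical placeholders. The shape passed to --shapes is assumed to lie within the minimum and maximum shapes the engine was built with.

```shell
#!/bin/sh
# Sketch: profile an existing serialized engine at one fixed input shape.
# Every value below is a hypothetical placeholder; adjust it to your engine.
set -eu

ENGINE="model.fp16.engine"          # hypothetical serialized engine file
SHAPE="input_1:8x3x544x960"         # hypothetical <input_node_name>:NxCxHxW

# Store the command in the positional parameters to keep quoting intact.
set -- trtexec \
  --loadEngine="$ENGINE" \
  --shapes="$SHAPE"

# Show the command for review.
echo "$@"

# Run it only if trtexec is on PATH and the engine file exists.
if command -v trtexec >/dev/null 2>&1 && [ -f "$ENGINE" ]; then
  "$@"
fi
```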

TRTEXEC instructions by Model