Optimizing and Profiling with TensorRT

The NVIDIA TensorRT SDK facilitates high-performance inference for machine learning models. As of TAO version 5.0.0, models exported via the tao model <model_name> export endpoint can be optimized and profiled directly with TensorRT using the trtexec tool. trtexec is a command-line wrapper that lets you quickly prototype and run models with TensorRT without writing your own inference application. It serves three main purposes:

Benchmarking (profiling) networks on random or user-provided input data
Generating serialized TensorRT engines from models. In the case of TAO, models are in ONNX or UFF format.
Generating a serialized timing cache from the TensorRT builder

trtexec provides command-line flags to customize the inputs, outputs, and TensorRT build configuration of a model, including network precision, layer-wise precision, and the number of iterations to run when profiling.

These are the most commonly used CLI arguments:

--onnx=<model>: Specify the input ONNX model.
--uff=<model>: Specify the input UFF model.
--output=<tensor>: Specify the output tensor names. Only required if the input model is in UFF format.
--maxBatch=<BS>: Specify the maximum batch size to build the engine with. Only needed if the input model is in UFF or Caffe format. If the input model is in ONNX format, use the --minShapes, --optShapes, and --maxShapes flags to control the range of input shapes, including batch size.
--minShapes=<shapes>: Specify the minimum shape of the input tensor. The value is formatted as “<input_node_name>:NxCxHxW”, where input_node_name is the name of the model’s input node and N, C, H, and W are the batch size, channel count, height, and width of the input tensor. For models with multiple input nodes, provide a comma-separated list. Only required if the model is in ONNX format.
--optShapes=<shapes>: Specify the optimum shape of the input tensor. The value is formatted as “<input_node_name>:NxCxHxW”, where input_node_name is the name of the model’s input node and N, C, H, and W are the batch size, channel count, height, and width of the input tensor. For models with multiple input nodes, provide a comma-separated list. Only required if the model is in ONNX format.
--maxShapes=<shapes>: Specify the maximum shape of the input tensor. The value is formatted as “<input_node_name>:NxCxHxW”, where input_node_name is the name of the model’s input node and N, C, H, and W are the batch size, channel count, height, and width of the input tensor. For models with multiple input nodes, provide a comma-separated list. Only required if the model is in ONNX format.
--saveEngine=<file>: Specify the path to save the engine to.
--fp16, --int8, --noTF32, and --best: Specify network-level precision.
--timingCacheFile=<file>: Specify the timing cache to load from and save to.
--verbose: Turn on verbose logging.
--skipInference: Build and save the engine without running inference.
--useDLACore=N: Use the specified DLA core for layers that support DLA.
--allowGPUFallback: Allow layers unsupported on DLA to run on the GPU instead.
--loadEngine=<file>: Load the engine from a serialized plan file instead of building it from an input ONNX, UFF, or Caffe model.
--batch=<N>: Specify the batch size to run inference with. Only needed if the input model is in UFF or Caffe format. If the input model is in ONNX format, or if the engine is built with an explicit batch dimension, use --shapes instead.
--shapes=<shapes>: Specify the input shapes to run the inference with.
--calib=<file>: Read the INT8 calibration cache file.
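
As a rough illustration of how the shape and build flags listed above fit together, the following Python sketch assembles a trtexec command line that builds an engine from an ONNX model with a dynamic batch dimension. The helper functions are hypothetical and are not part of TAO or TensorRT; the input node name input_1, the 3x544x960 dimensions, and the file paths are placeholder assumptions that must be replaced with values from your own model.

```python
import shlex


def format_shape(input_name, batch, channels, height, width):
    """Format one input as '<input_node_name>:NxCxHxW' for the shape flags."""
    return f"{input_name}:{batch}x{channels}x{height}x{width}"


def build_engine_command(onnx_path, engine_path, input_name, chw,
                         min_batch, opt_batch, max_batch, fp16=False):
    """Return a trtexec argument list that builds and saves an engine."""
    # TensorRT optimization profiles require min <= opt <= max per dimension;
    # only the batch dimension varies in this sketch.
    if not min_batch <= opt_batch <= max_batch:
        raise ValueError("expected min_batch <= opt_batch <= max_batch")
    channels, height, width = chw
    args = ["trtexec", f"--onnx={onnx_path}"]
    for flag, batch in (("--minShapes", min_batch),
                        ("--optShapes", opt_batch),
                        ("--maxShapes", max_batch)):
        shape = format_shape(input_name, batch, channels, height, width)
        args.append(f"{flag}={shape}")
    if fp16:
        args.append("--fp16")
    # --skipInference builds and saves the engine without running inference.
    args += [f"--saveEngine={engine_path}", "--skipInference"]
    return args


# Placeholder values, assumed for illustration only.
cmd = build_engine_command(
    onnx_path="/path/to/model.onnx",
    engine_path="/path/to/model.engine",
    input_name="input_1",
    chw=(3, 544, 960),
    min_batch=1, opt_batch=8, max_batch=16,
    fp16=True,
)
print(shlex.join(cmd))
```

The printed string can be pasted into a shell where trtexec is available, or the argument list can be passed to subprocess.run(cmd, check=True). Building the command as a list keeps each flag separate, which avoids quoting problems if a path contains spaces.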

The trtexec endpoint is available as part of the TAO Deploy container and the tao deploy mode.

Note

TAO Converter, which has been deprecated for x86 devices, is still required for deploying to Jetson devices. TAO Converter is distributed as a separate binary for x86 and Jetson platforms. The tao-converter binaries are available as an NGC resource.

TRTEXEC Instructions by Model