Optimizing and Profiling with TensorRT
The NVIDIA TensorRT SDK facilitates high-performance inference for machine learning models. As of TAO Toolkit version 5.0.0, models exported via the `tao model <model_name> export` endpoint can be directly optimized and profiled with TensorRT using the `trtexec` tool, a command-line wrapper that helps you quickly utilize and prototype models with TensorRT without writing your own inference application. It serves three main purposes:
- Benchmarking (profiling) networks on random or user-provided input data
- Generating serialized TensorRT engines from models (in the case of TAO Toolkit, models are in ONNX or UFF format)
- Generating a serialized timing cache from the TensorRT builder
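For example, a minimal profiling run on random input data needs only the model file. In the sketch below, the model path is a placeholder for your own exported TAO model:

```sh
# Minimal benchmark: build a default engine and profile it on random
# inputs. The model path is a placeholder for your exported TAO model.
trtexec --onnx=/workspace/model.onnx
```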
`trtexec` has several command-line flags that customize the inputs, outputs, and TensorRT build configuration of the models, including the network precision, layer-wise precision, and the number of iterations to run during profiling.
These are the most commonly used CLI arguments:
- `--onnx=<model>`: Specify the input ONNX model.
- `--uff=<model>`: Specify the input UFF model.
- `--output=<tensor>`: Specify the output tensor names. Only required if the input model is in UFF format.
- `--maxBatch=<BS>`: Specify the maximum batch size to build the engine with. Only needed if the input model is in UFF or Caffe format. If the input model is in ONNX format, use the `--minShapes`, `--optShapes`, and `--maxShapes` flags to control the range of input shapes, including the batch size.
- `--minShapes=<shapes>`: The minimum shape of the input tensor, formatted as `<input_node_name>:NxCxHxW`, where `input_node_name` is the name of the model's input node and N, C, H, and W are the input batch size, channels, height, and width of the tensor. Shapes can be comma-separated for models with multiple input nodes. Only required if the model is in ONNX format.
- `--optShapes=<shapes>`: The optimal shape of the input tensor, in the same format as `--minShapes`. Only required if the model is in ONNX format.
- `--maxShapes=<shapes>`: The maximum shape of the input tensor, in the same format as `--minShapes`. Only required if the model is in ONNX format.
- `--saveEngine=<file>`: Specify the path to save the engine to.
- `--fp16`, `--int8`, `--noTF32`, and `--best`: Specify the network-level precision.
- `--timingCacheFile=<file>`: Specify the timing cache to load from and save to.
- `--verbose`: Turn on verbose logging.
- `--skipInference`: Build and save the engine without running inference.
- `--useDLACore=N`: Use the specified DLA core for layers that support DLA.
- `--allowGPUFallback`: Allow layers unsupported on DLA to run on the GPU instead.
- `--loadEngine=<file>`: Load the engine from a serialized plan file instead of building it from an input ONNX, UFF, or Caffe model.
- `--batch=<N>`: Specify the batch size to run inference with. Only needed if the input model is in UFF or Caffe format. If the input model is in ONNX format, or if the engine is built with an explicit batch dimension, use `--shapes` instead.
- `--shapes=<shapes>`: Specify the input shapes to run inference with.
- `--calib=<file>`: Read an INT8 calibration cache file.
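Putting the most common flags together, the following sketch builds an FP16 engine from an exported ONNX model over a range of batch sizes, saves it, and profiles it. The input node name `input` and all paths are placeholder assumptions; substitute your model's actual input node name and file locations:

```sh
# Build and profile an FP16 engine with a dynamic batch size of 1-16.
# "input" and all paths are placeholders; replace them with your
# model's actual input node name and file locations.
trtexec --onnx=/workspace/model.onnx \
        --minShapes=input:1x3x224x224 \
        --optShapes=input:8x3x224x224 \
        --maxShapes=input:16x3x224x224 \
        --fp16 \
        --timingCacheFile=/workspace/timing.cache \
        --saveEngine=/workspace/model.engine
```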
The `trtexec` endpoint is available as part of the TAO Deploy container and through `tao deploy` mode.
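As a sketch of running the tool from the TAO Deploy container, you can invoke `trtexec` inside the image directly. The image tag and mount paths below are assumptions; check NGC for the current TAO Deploy image:

```sh
# Run trtexec inside the TAO Deploy container.
# The image tag "5.0.0-deploy" and the mounted paths are placeholders.
docker run --rm --gpus all \
    -v /local/models:/workspace/models \
    nvcr.io/nvidia/tao/tao-deploy:5.0.0-deploy \
    trtexec --onnx=/workspace/models/model.onnx \
            --saveEngine=/workspace/models/model.engine
```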
TAO Converter, which has been deprecated for x86 devices, is still required for deploying to Jetson devices. TAO Converter is distributed as separate binaries for the x86 and Jetson platforms; the tao-converter binaries are available as an NGC resource.
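For Jetson deployment, a representative `tao-converter` invocation looks like the following sketch. The key, input dimensions, output node names, and paths are placeholders for a DetectNet_v2-style `.etlt` model; the correct values vary per network, so consult the deployment page for your specific model:

```sh
# Convert an encrypted .etlt model to a TensorRT engine on Jetson.
# $KEY, the input dims (-d), the output nodes (-o), and the paths are
# placeholders; they depend on the specific model being deployed.
tao-converter -k $KEY \
              -d 3,544,960 \
              -o output_cov/Sigmoid,output_bbox/BiasAdd \
              -t fp16 \
              -e /workspace/model.engine \
              /workspace/model.etlt
```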
- TRTEXEC with ActionRecognitionNet
- TRTEXEC with BodyPoseNet
- TRTEXEC with CenterPose
- TRTEXEC with Classification TF1/TF2/PyT
- TRTEXEC with Deformable-DETR
- TRTEXEC with DetectNet-v2
- TRTEXEC with DINO
- TRTEXEC with DSSD
- TRTEXEC with EfficientDet TF1/TF2
- TRTEXEC with Facial Landmarks Estimation
- TRTEXEC with Faster RCNN
- TRTEXEC with LPRNet
- TRTEXEC with Metric Learning Recognition
- TRTEXEC with Mask RCNN
- TRTEXEC with Multitask Classification
- TRTEXEC with OCDNet
- TRTEXEC with OCRNet
- TRTEXEC with PointPillars
- TRTEXEC with PoseClassificationNet
- TRTEXEC with ReIdentificationNet
- TRTEXEC with ReIdentificationNet Transformer
- TRTEXEC with RetinaNet
- TRTEXEC with Segformer
- TRTEXEC with SiameseOI
- TRTEXEC with SSD
- TRTEXEC with UNet
- TRTEXEC with YOLO_v3
- TRTEXEC with YOLO_v4
- TRTEXEC with YOLO_v4_tiny
- TRTEXEC with VisualChangeNet