Abstract

This guide provides instructions on how to accelerate inference in TensorFlow with TensorRT (TF-TRT).

1. Integration Overview

TensorFlow™ integration with TensorRT™ (TF-TRT) optimizes and executes compatible subgraphs, allowing TensorFlow to execute the remaining graph. While you can still use TensorFlow's wide and flexible feature set, TensorRT will parse the model and apply optimizations to the portions of the graph wherever possible. Therefore, the workflow includes importing a trained TensorFlow model (graph and weights), freezing the graph, creating an optimized graph with TensorRT, importing it back as the default graph, and running inference.

After freezing the TensorFlow graph for inference, you request TensorRT to optimize TensorFlow's subgraphs. TensorRT then replaces each supported subgraph with a TensorRT optimized node, producing a frozen graph that runs in TensorFlow for inference.

TensorFlow executes the graph for all supported areas and calls TensorRT to execute TensorRT optimized nodes. TensorRT performs several important transformations and optimizations to the neural network graph. First, layers with unused output are eliminated to avoid unnecessary computation. Next, where possible, convolution, bias, and ReLU layers are fused to form a single layer. Another transformation is horizontal layer fusion, or layer aggregation, along with the required division of aggregated layers to their respective outputs. Horizontal layer fusion improves performance by combining layers that take the same source tensor and apply the same operations with similar parameters.
Note: These graph optimizations do not change the underlying computation in the graph; instead, they look to restructure the graph to perform the operations much faster and more efficiently.

TF-TRT is part of the TensorFlow binary, which means when you install tensorflow-gpu, you will be able to use TF-TRT too.

1.1. Introduction

TensorFlow

TensorFlow is an open-source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) that flow between them. This flexible architecture lets you deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device without rewriting code.

TensorFlow was originally developed by researchers and engineers working on the Google Brain team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks (DNNs) research. The system is general enough to be applicable in a wide variety of other domains, as well.

For visualizing TensorFlow results, the Docker® image also contains TensorBoard. TensorBoard is a suite of visualization tools. For example, you can view the training histories as well as what the model looks like.

For information about the optimizations and changes that have been made to TensorFlow, see the TensorFlow Deep Learning Frameworks Release Notes.

TensorRT

The core of NVIDIA TensorRT is a C++ library that facilitates high performance inference on NVIDIA graphics processing units (GPUs). TensorRT takes a trained network, which consists of a network definition and a set of trained parameters, and produces a highly optimized runtime engine which performs inference for that network.

You can describe a TensorRT network using a C++ or Python API, or you can import an existing Caffe, ONNX, or TensorFlow model using one of the provided parsers.

The TensorRT API includes import methods to help you express your trained deep learning models for TensorRT to optimize and run. By applying graph optimizations and layer fusion, TensorRT finds the fastest implementation of your models, leveraging a diverse collection of highly optimized kernels and a runtime that you can use to execute this network in an inference context.

TensorRT includes an infrastructure that allows you to take advantage of the high speed mixed precision capabilities of Pascal, Volta, and Turing GPUs as an optional optimization.

For information about the optimizations and changes that have been made to TensorRT, see the TensorRT Release Notes. For specific TensorRT product documentation, see TensorRT documentation.

1.2. Benefits Of Integrating TensorFlow With TensorRT

TensorRT optimizes the largest subgraphs possible in the TensorFlow graph. The more compute in the subgraph, the greater the benefit obtained from TensorRT. For best performance, you want most of the graph optimized and replaced with the fewest possible TensorRT nodes. Based on the operations in your graph, it’s possible that the final graph might have more than one TensorRT node.

With the TensorFlow API, you can specify the minimum number of nodes in a subgraph for it to be converted to a TensorRT node. Any subgraph with fewer than the specified number of nodes will not be converted to a TensorRT engine, even if it is compatible with TensorRT. This can be useful for models containing small compatible subgraphs separated by incompatible nodes, which would otherwise lead to tiny TensorRT engines.

1.3. What Capabilities Does TF-TRT Provide?

The Python TF-TRT API that can be used to optimize a TensorFlow frozen graph is create_inference_graph. This function has a number of parameters to configure the optimization.
def create_inference_graph(input_graph_def,
                           outputs,
                           max_batch_size=1,
                           max_workspace_size_bytes=2 << 20,
                           precision_mode="fp32",
                           minimum_segment_size=3,
                           is_dynamic_op=False,
                           maximum_cached_engines=1,
                           cached_engine_batches=[],
                           rewriter_config=None,
                           input_saved_model_dir=None,
                           input_saved_model_tags=None,
                           output_saved_model_dir=None,
                           session_config=None):
  """Python wrapper for the TRT transformation.
Where:
input_graph_def
This parameter is the GraphDef object that contains the model to be transformed.
outputs
This parameter lists the output nodes in the graph. Tensors which are not marked as outputs are considered to be transient values that may be optimized away by the builder.
max_batch_size
This parameter specifies the maximum batch size for which TensorRT will optimize. At runtime, a smaller batch size may be chosen, but a larger batch size is not supported.
max_workspace_size_bytes
TensorRT operators often require temporary workspace. This parameter limits the maximum size that any layer in the network can use. If insufficient scratch is provided, it is possible that TensorRT may not be able to find an implementation for a given layer.
precision_mode
TF-TRT only supports models trained in FP32, in other words all the weights of the model should be stored in FP32 precision. That said, TensorRT can convert tensors and weights to lower precisions during the optimization. The precision_mode parameter sets the precision mode, which can be one of fp32, fp16, or int8. Precision lower than FP32, meaning FP16 and INT8, improves the performance of inference. The FP16 mode uses Tensor Cores or half-precision hardware instructions, if possible. The INT8 precision mode uses integer hardware instructions.
minimum_segment_size
This parameter determines the minimum number of TensorFlow nodes required for a TensorRT engine, which means TensorFlow subgraphs that have fewer nodes than this number will not be converted to TensorRT. Therefore, in general smaller numbers such as 5 are preferred. This parameter can also be used to change the minimum number of nodes in the optimized INT8 engines, altering the final optimized graph to fine-tune result accuracy.
is_dynamic_op
If this parameter is set to True, the conversion and building the TensorRT engines will happen during the runtime, which would be necessary if there are tensors in the graphs with unknown initial shapes or dynamic shapes.
Note: Conversion during runtime increases the latency, so you should only use this option if the model is small enough that the conversion doesn’t block inference.
maximum_cached_engines
The maximum number of cached TensorRT engines in dynamic TensorRT ops.
cached_engine_batches
The batch sizes used to pre-create cached engines.
rewriter_config
A RewriterConfig proto to append the TensorRTOptimizer to. If None, it will create one with default settings.
input_saved_model_dir
The directory to load the SavedModel containing the input graph to transform. Used only when input_graph_def is None.
input_saved_model_tags
A list of tags used to identify the MetaGraphDef of the SavedModel to load.
output_saved_model_dir
If not None, construct a SavedModel using the returned GraphDef and save it to the specified directory. This option only works when the input graph is loaded from a SavedModel, in other words, when input_saved_model_dir is specified and input_graph_def is None.
session_config
The ConfigProto used to create a Session. If not specified, a default ConfigProto will be used.
  • Returns:
    • New GraphDef with TRTEngineOps placed in graph replacing subgraphs.
  • Raises:
    • ValueError: If the provided precision mode is invalid.
    • RuntimeError: If the returned status message is malformed.
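For example, a typical call on a frozen GraphDef might look like the following sketch. The graph path, output node names, and sizes are placeholders; adjust them for your model.
# Minimal sketch of calling create_inference_graph on a frozen GraphDef.
# The path, output names, and sizes below are placeholders.
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt
with tf.gfile.GFile("/path/to/frozen_graph.pb", "rb") as f:
    frozen_graph = tf.GraphDef()
    frozen_graph.ParseFromString(f.read())
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=["your_output_node_names"],   # placeholder output node names
    max_batch_size=16,                    # largest batch you will run at inference
    max_workspace_size_bytes=1 << 30,     # up to 1 GB of scratch space per engine
    precision_mode="fp16",                # one of fp32, fp16, or int8
    minimum_segment_size=3,               # skip subgraphs with fewer than 3 nodes
    is_dynamic_op=False)                  # build engines at conversion time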

2. Prerequisites And Dependencies

TF-TRT is part of the TensorFlow binary, which means when you install tensorflow-gpu, you will be able to use TF-TRT too. There are two ways to install TF-TRT:
  • If you pull a TensorFlow container, for example 18.10, you will have all the software dependencies required by TF-TRT, therefore, you don’t need to install anything inside the container.
    Note: Some scripts under the nvidia-examples directory require additional dependencies, such as the Python requests package. Refer to the README files for installation instructions for these additional dependencies.
Or:
  • Follow these instructions to compile TensorFlow with TensorRT integration from its source.

2.1. Support Matrix

Table 1. Support matrix for required software
TensorRT 5.0.2: Ubuntu 18.04 or 16.04; CUDA Toolkit 10.0.130; TensorFlow 1.12.0; container version 18.12
TensorRT 5.0.2: Ubuntu 18.04, 16.04, or 14.04; CUDA Toolkit 10.0.130; TensorFlow 1.12.0-rc2; container version 18.11
TensorRT 5.0.0 Release Candidate (RC): Ubuntu 18.04, 16.04, or 14.04; CUDA Toolkit 10.0.130; TensorFlow 1.10
TensorRT 4.0.1: Ubuntu 16.04 or 14.04; CUDA Toolkit 9.0
TensorRT 3.0.4: Ubuntu 16.04 or 14.04; CUDA Toolkit 9.0; TensorFlow 1.7

Support matrix for TensorRT layers

For support matrix tables about the layers, and for a description of each supported TensorRT layer, see TensorRT Layers in the TensorRT documentation.

3. Using TF-TRT

The minimal setup in Python for TensorFlow integration with TensorRT requires that you import the TensorRT module and, in addition to your regular inference workflow, create a TensorRT inference graph, using the create_inference_graph function. For more information about this function, see What Capabilities Does TF-TRT Provide?.

The specific workflows for creating a TensorRT inference graph from a TensorFlow model vary slightly depending on the format of the representation of your model. We describe three such workflows below: for a SavedModel, for a frozen graph, and for separate MetaGraph and checkpoint files.

3.1. Supported Ops

The following table lists the operators that are supported by TF-TRT and shows which versions of TensorRT support each op.
Table 2. TF-TRT supported ops
Op TensorRT 5.0.2 TensorRT 5.0.0 RC TensorRT 4.0.1 TensorRT 3.0.4
Abs Yes Yes Yes Yes
Add Yes Yes Yes Yes
AvgPool Yes Yes Yes No
BatchMatMul Yes Yes Yes No
BiasAdd Yes Yes Yes Yes
ConcatV2 Yes Yes Yes Yes
Const Yes Yes Yes Yes
Conv2D Yes Yes Yes Yes
DepthwiseConv2dNative Yes Yes Yes Yes
Div Yes Yes Yes Yes
Exp Yes Yes Yes Yes
FusedBatchNorm Yes Yes Yes Yes
Identity Yes Yes Yes Yes
Log Yes Yes Yes Yes
MatMul Yes Yes Yes No
Max Yes Yes Yes No
MaxPool Yes Yes Yes No
Maximum Yes Yes Yes Yes
Mean Yes Yes Yes Yes
Min Yes Yes Yes No
Minimum Yes Yes Yes No
Mul Yes Yes Yes Yes
Neg Yes Yes Yes Yes
Pad Yes Yes Yes Yes
Prod Yes Yes Yes No
RealDiv Yes Yes Yes Yes
Reciprocal Yes Yes Yes Yes
Relu Yes Yes Yes Yes
Relu6 Yes No No No
Rsqrt Yes Yes Yes Yes
Snapshot Yes Yes Yes Yes
Softmax Yes Yes Yes No
Sqrt Yes Yes Yes Yes
Sub Yes Yes Yes Yes
Sum Yes Yes Yes No
TopKV2 Yes Yes Yes No
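As a quick sanity check before conversion, you can count how many ops in your frozen graph fall outside this table. The following is an illustrative sketch, not part of TF-TRT; the supported-op set is copied from the TensorRT 5.0.2 column above and the file path is a placeholder.
# Illustrative helper: report ops in a frozen GraphDef that are outside the
# TensorRT 5.0.2 column of Table 2.
import collections
import tensorflow as tf
TRT_502_SUPPORTED_OPS = {
    "Abs", "Add", "AvgPool", "BatchMatMul", "BiasAdd", "ConcatV2", "Const",
    "Conv2D", "DepthwiseConv2dNative", "Div", "Exp", "FusedBatchNorm",
    "Identity", "Log", "MatMul", "Max", "MaxPool", "Maximum", "Mean", "Min",
    "Minimum", "Mul", "Neg", "Pad", "Prod", "RealDiv", "Reciprocal", "Relu",
    "Relu6", "Rsqrt", "Snapshot", "Softmax", "Sqrt", "Sub", "Sum", "TopKV2"}
def report_unsupported_ops(graph_def):
    # Count each op type and print the ones outside the supported set.
    counts = collections.Counter(node.op for node in graph_def.node)
    for op, count in sorted(counts.items()):
        if op not in TRT_502_SUPPORTED_OPS:
            print("unsupported op: %s (count: %d)" % (op, count))
with tf.gfile.GFile("/path/to/your/frozen/graph.pb", "rb") as f:  # placeholder path
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
report_unsupported_ops(graph_def)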

3.2. TF-TRT Workflow With A SavedModel

If you have a SavedModel representation of your TensorFlow model, you can create a TensorRT inference graph directly from your SavedModel, for example:
# Import TensorFlow and TensorRT
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt
# Inference with TF-TRT `SavedModel` workflow:
graph = tf.Graph()
with graph.as_default():
    with tf.Session() as sess:
        # Create a TensorRT inference graph from a SavedModel:
        trt_graph = trt.create_inference_graph(
            input_saved_model_dir="/path/to/your/saved/model",
            input_saved_model_tags=["your_saved_model_tags"],
            max_batch_size=your_batch_size,
            max_workspace_size_bytes=max_GPU_mem_size_for_TRT,
            precision_mode="your_precision_mode")
        # Import the TensorRT graph into a new graph and run:
        output_node = tf.import_graph_def(
            trt_graph,
            return_elements=["your_outputs"])
        sess.run(output_node)
Where, in addition to max_batch_size, max_workspace_size_bytes and precision_mode, you need to supply the following arguments to create_inference_graph:
input_saved_model_dir
Path to your SavedModel directory.
input_saved_model_tags
A list of tags used to identify the MetaGraphDef of the SavedModel to load.
And where:
["your_outputs"]
A list of the name strings for the final result nodes in your graph.

3.3. TF-TRT Workflow With A Frozen Graph

If you have a frozen graph of your TensorFlow model, you first need to load the frozen graph file and parse it to create a deserialized GraphDef. Then you can use the GraphDef to create a TensorRT inference graph, for example:
# Import TensorFlow and TensorRT
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt
# Inference with TF-TRT frozen graph workflow:
graph = tf.Graph()
with graph.as_default():
    with tf.Session() as sess:
        # First deserialize your frozen graph:
        with tf.gfile.GFile("/path/to/your/frozen/graph.pb", "rb") as f:
            graph_def = tf.GraphDef()
            graph_def.ParseFromString(f.read())
        # Now you can create a TensorRT inference graph from your
        # frozen graph:
        trt_graph = trt.create_inference_graph(
            input_graph_def=graph_def,
            outputs=["your_output_node_names"],
            max_batch_size=your_batch_size,
            max_workspace_size_bytes=max_GPU_mem_size_for_TRT,
            precision_mode="your_precision_mode")
        # Import the TensorRT graph into a new graph and run:
        output_node = tf.import_graph_def(
            trt_graph,
            return_elements=["your_outputs"])
        sess.run(output_node)
Where, again in addition to max_batch_size, max_workspace_size_bytes, and precision_mode, you need to supply the following argument to create_inference_graph:
outputs
A list of the name strings for the final result nodes in your graph.
And where:
"/path/to/your/frozen/graph.pb"
Path to the frozen graph of your model.
["your_outputs"]
A list of the name strings for the final result nodes in your graph. Same as outputs above.
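If you want to avoid repeating the conversion in later sessions, note that create_inference_graph returns a standard GraphDef, so you can serialize the converted graph to disk and reload it later with the same deserialization code shown above. A minimal sketch; the output path is a placeholder:
# Optionally serialize the converted graph for reuse in later sessions:
with tf.gfile.GFile("/path/to/your/trt_graph.pb", "wb") as f:
    f.write(trt_graph.SerializeToString())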

3.4. TF-TRT Workflow With MetaGraph And Checkpoint Files

If you don’t have a SavedModel or a frozen graph representation of your TensorFlow model but have separate MetaGraph and checkpoint files, you first need to use these to create a frozen graph to then feed into the create_inference_graph function, for example:
# Import TensorFlow and TensorRT
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt
# Inference with TF-TRT `MetaGraph` and checkpoint files workflow:
graph = tf.Graph()
with graph.as_default():
    with tf.Session() as sess:
        # First create a `Saver` object (for saving and rebuilding a
        # model) and import your `MetaGraphDef` protocol buffer into it:
        saver = tf.train.import_meta_graph("/path/to/your/model.ckpt.meta")
        # Then restore your training data from checkpoint files:
        saver.restore(sess, "/path/to/your/model.ckpt")
        # Finally, freeze the graph:
        your_outputs = ["your_output_node_names"]
        frozen_graph = tf.graph_util.convert_variables_to_constants(
            sess,
            tf.get_default_graph().as_graph_def(),
            output_node_names=your_outputs)
        # Now you can create a TensorRT inference graph from your
        # frozen graph:
        trt_graph = trt.create_inference_graph(
            input_graph_def=frozen_graph,
            outputs=your_outputs,
            max_batch_size=your_batch_size,
            max_workspace_size_bytes=max_GPU_mem_size_for_TRT,
            precision_mode="your_precision_mode")
        # Import the TensorRT graph into a new graph and run:
        output_node = tf.import_graph_def(
            trt_graph,
            return_elements=your_outputs)
        sess.run(output_node)
Where, again in addition to max_batch_size, max_workspace_size_bytes and precision_mode, you need to supply the following argument to create_inference_graph:
outputs
A list of the name strings for the final result nodes in your graph.
And where:
"/path/to/your/model.ckpt.meta"
Path to the MetaGraphDef protocol buffer of your model. This is usually created and saved during training.
"/path/to/your/model.ckpt"
Path to your latest checkpoint file saved during training.
your_outputs
A list of the name strings for the final result nodes in your graph; it is passed both to output_node_names when freezing the graph and to the outputs argument above.

3.5. Running Inference With TF-TRT For Image Classification

The NVIDIA TensorFlow Docker containers include example Python scripts for using a few popular pre-trained image classification models to run inference on the ImageNet validation set. The following sections walk you through how to use these scripts.

3.5.1. Setting Up The Environment

We’ll be working in a recent NVIDIA TensorFlow Docker container (18.09 or 18.10) with TensorFlow 1.10 or newer.
  1. Clone the tensorflow/tensorrt repository.
    git clone https://github.com/tensorflow/tensorrt.git
  2. Clone the tensorflow/models repository.
    git clone https://github.com/tensorflow/models.git
  3. Add the directory where we cloned the tensorflow/models repository to PYTHONPATH to install tensorflow/models.
    cd models
    export PYTHONPATH="$PYTHONPATH:$PWD"
    
  4. Run the TensorFlow Slim setup script.
    cd research/slim
    python setup.py install
    
  5. Install the requests package.
    pip install requests
Note: The PYTHONPATH environment variable is not saved between different shell sessions. To avoid having to set PYTHONPATH in each new shell session, you can add the following line to your .bashrc file:
export PYTHONPATH="$PYTHONPATH:/path/to/tensorflow/models"

3.5.2. Obtaining The ImageNet Data

The example Python scripts under tensorrt/tftrt/examples/image-classification support only the TFRecord format for inputting data. The scripts also assume that validation TFRecords are named according to the pattern: validation-*-of-00128. To download and process the ImageNet data, you can:
  • Use the download_imagenet.sh script provided by TF Slim in the tensorflow/models repository under research/slim/datasets. Consult the README file under research/slim for instructions on how to use this script.
Or
  • Use the scripts provided in the nvidia-examples/build_imagenet_data directory. Follow the README file in that directory for instructions on how to use these scripts.

3.5.3. General Script Usage

The main Python script under tensorrt/tftrt/examples/image-classification is image_classification.py. Assuming that the ImageNet validation data are located under /data/imagenet/train-val-tfrecord, we can evaluate inference with TF-TRT integration using the pre-trained ResNet V1 50 model as follows:
python image_classification.py --model resnet_v1_50 \
    --data_dir /data/imagenet/train-val-tfrecord \
    --use_trt \
    --precision fp16
Where:
--model
Which model to use to run inference, in this case ResNet V1 50.
--data_dir
Path to the ImageNet TFRecord validation files.
--use_trt
Convert the graph to a TensorRT graph.
--precision
Precision mode to use, in this case FP16.
The full set of parameters to the image_classification.py script follows.
-h, --help
Show a help message and exit.
--model
Which model to use to run inference. Currently supported models are mobilenet_v1, mobilenet_v2, nasnet_mobile, nasnet_large, resnet_v1_50, resnet_v2_50, vgg_16, vgg_19, inception_v3, inception_v4.
--data_dir
Path to the directory containing validation set TFRecord files.
--calib_data_dir
Path to the directory containing TFRecord files for calibrating int8.
--download_dir
Path to the directory where downloaded model checkpoints will be stored. (Default: ./data).
--use_trt
Convert the graph to a TensorRT graph.
--use_trt_dynamic_op
Generate dynamic TensorRT operations that build the TensorRT graph and engine at run time.
--precision
Precision mode to use. One of fp32 (default), fp16, or int8. fp16 and int8 only work in conjunction with --use_trt.
--batch_size
Number of images per batch. (Default: 8).
--minimum_segment_size
Minimum number of TensorFlow ops in a TensorRT engine. (Default: 2).
--num_iterations
How many iterations (batches) to evaluate. If not supplied, the whole set will be evaluated.
--display_every
Number of iterations executed between two consecutive displays of metrics. (Default: 100).
--use_synthetic
Generate one batch of random data and use it at every iteration.
--num_warmup_iterations
Number of initial iterations skipped from timing. (Default: 50).
--num_calib_inputs
Number of inputs (for example, images) used for calibration (the last batch is skipped in case it is not full). (Default: 500).
--cache
Save the graph to a graphs directory where the script is located. If a converted graph is present on disk under the graphs directory, it is loaded instead of building the graph again.

3.5.4. Script Output

The script first loads the pre-trained model. If given the parameter --use_trt, the model is converted to a TensorRT graph, and the script displays:
  • the number of nodes before conversion (num_nodes(native_tf))
  • the number of nodes after conversion (num_nodes(tftrt_total))
  • the number of separate TensorRT nodes (num_nodes(trt_only))
  • how long the conversion took (time(s)(trt_conversion))
For example:
num_nodes(native_tf): 741
num_nodes(tftrt_total): 14
num_nodes(trt_only): 2
time(s)(trt_conversion): ***
Note: For a list of supported operations that can be converted to a TensorRT graph, see Supported Ops.
The script then begins running inference on the ImageNet validation set, displaying run times of each iteration after the interval defined by the --display_every parameter (default: 100):
running inference...
    step 100/6250, iter_time(ms)=**.****, images/sec=***
    step 200/6250, iter_time(ms)=**.****, images/sec=***
    step 300/6250, iter_time(ms)=**.****, images/sec=***
On completion, the script prints overall accuracy and timing information over the inference session:
results of resnet_v1_50:
    accuracy: 75.91
    images/sec: ***
    99th_percentile(ms): ***
    total_time(s): ***
    latency_mean(ms): ***
The accuracy metric measures the percentage of predictions from inference that match the labels on the ImageNet validation set. The remaining metrics capture various performance measurements:
  • number of images processed per second (images/sec)
  • total time of the inference session (total_time(s))
  • the mean duration for each iteration (latency_mean(ms))
  • the slowest duration for an iteration (99th_percentile(ms))
If you save this output from image_classification.py, you can verify the accuracy metric by passing it to the check_accuracy.py script:
python check_accuracy.py \
    --input /path/to/your/output/from/image_classification.py \
    --tolerance 0.1
Note: The accuracy metrics for most of the pre-trained models in check_accuracy.py are taken from the following source: https://github.com/tensorflow/models/tree/master/research/slim#pre-trained-models.

The metrics for the ResNet 50 models are taken from: https://github.com/tensorflow/models/tree/master/official/resnet#pre-trained-model.

The accuracy metric should match the acceptable accuracy of the model used, within the acceptable tolerance for the precision mode used:
checking accuracy...
input: /path/to/your/output/from/image_classification.py
tolerance: 0.1
PASS

The acceptable tolerance for FP32 and FP16 is 0.1%. For INT8 the acceptable tolerance is higher at 1.0%, reflecting the tradeoff between performance and precision.

3.5.5. Visualizing TF-TRT Graphs With TensorBoard

TensorBoard is a suite of visualization tools that make it easier to understand, debug, and optimize TensorFlow programs. You can use TensorBoard with image_classification.py to display how the pre-trained TensorFlow graphs are optimized with TensorRT integration.

After running inference on the ImageNet validation set, image_classification.py writes Estimator outputs used to evaluate the model (such as checkpoints and event files) to a ./model_dir directory in the same directory where image_classification.py is located. You can launch a TensorBoard session on this directory with the following command:
tensorboard --logdir ./model_dir
Note: In order to view a TensorBoard session running in a Docker container, you need to run the container with the --publish parameter to publish the port that TensorBoard uses (6006) to the machine hosting the container. The --publish parameter takes the form of --publish [host machine IP address]:[host machine port]:[container port]. For example, --publish 0.0.0.0:6006:6006 publishes TensorBoard’s port 6006 to the host machine at port 6006 over all network interfaces (0.0.0.0). If you run a Docker container with this parameter, you can then access a running TensorBoard session at http://[IP address of host machine]:6006.

You can then navigate a web browser to port 6006 on the machine hosting the Docker container (http://[IP address of host machine]:6006), where you can see an interactive visualization of the graph.

You can then run image_classification.py with the --use_trt parameter and run TensorBoard on ./model_dir again to see how the ResNet V1 50 graph changes when converted to a TensorRT graph. First, Figure 1 shows the graph of image_classification.py running inference with native TensorFlow using the ResNet V1 50 model (--model resnet_v1_50).
Figure 1. TensorBoard Visualization of Inference with Native TensorFlow Using ResNet V1 50.
This visualization displays all the nodes created by image_classification.py for running and evaluating inference using ResNet V1 50, so there are additional nodes for loading and saving data and for evaluating inference. If you double click the resnet_model node, you can see the nodes specific to the ResNet V1 50 model, as shown in Figure 2.
Figure 2. TensorBoard Visualization of ResNet V1 50 as a Native TensorFlow Graph.
Notice that the resnet_model subgraph contains 459 nodes as a native TensorFlow graph. You can then run image_classification.py with the --use_trt parameter, and run TensorBoard on the ./model_dir again to see that now the number of nodes in the resnet_model subgraph has been reduced from 459 to 4 when converted to a TensorRT graph as shown in Figure 3.
Figure 3. TensorBoard Visualization of ResNet V1 50 Converted to a TensorRT Graph.

3.6. Using TF-TRT With ResNet V1 50

Here we walk through how to use the example Python scripts in the NVIDIA TensorFlow Docker containers under the nvidia-examples/inference/image-classification directory with the ResNet V1 50 model.

Using TF-TRT with precision modes lower than FP32, that is, FP16 and INT8, improves the performance of inference. The FP16 precision mode uses Tensor Cores or half-precision hardware instructions, if possible, while the INT8 precision mode uses Tensor Cores or integer hardware instructions. INT8 mode also requires running a calibration step, which the image_classification.py script does automatically.

Below we use image_classification.py to compare the accuracy and timing performance of all the precision modes when running inference using the ResNet V1 50 model.

3.6.1. Tutorial: Native TensorFlow Using FP32

This is our baseline session running inference using native TensorFlow without TensorRT integration/conversion.
python image_classification.py --model resnet_v1_50 \
    --data_dir /data/imagenet/train-val-tfrecord \
    --download_dir ./data \
    --cache 2>&1 | tee output_tf_native_resnet_v1_50
Note: We use the --download_dir and --cache parameters above to allow the script to cache checkpoint and frozen graph files to use with future sessions. Also, we use tee to capture the output to a file in order to check the accuracy with check_accuracy.py.
Results:
results of resnet_v1_50:
    accuracy: 75.90
    images/sec: ***
    99th_percentile(ms): ***
    total_time(s): ***
    latency_mean(ms): ***
Note: The accuracy metrics for the ResNet 50 models are taken from: https://github.com/tensorflow/models/tree/master/official/resnet#pre-trained-model.

3.6.2. Tutorial: TF-TRT Using FP32

With this session, we use the same precision mode as in our native TensorFlow session (FP32), but this time we use the --use_trt parameter to convert the graph to a TensorRT optimized graph.
python image_classification.py --model resnet_v1_50 \
    --data_dir /data/imagenet/train-val-tfrecord \
    --use_trt \
    --download_dir ./data \
    --cache 2>&1 | tee output_tftrt_fp32_resnet_v1_50
Before the script starts running inference, it converts the TensorFlow graph to a TensorRT optimized graph with fewer nodes. Here the ResNet V1 50 model gets reduced from 741 native TensorFlow nodes to 14 total TF-TRT nodes:
num_nodes(native_tf): 741
num_nodes(tftrt_total): 14
num_nodes(trt_only): 2
time(s)(saving_frozen_graph): ***
time(s)(trt_conversion): ***
Note: For a list of supported operations that can be converted to a TensorRT graph, see Supported Ops.
Results:
results of resnet_v1_50:
    accuracy: 75.90
    images/sec: ***
    99th_percentile(ms): ***
    total_time(s): ***
    latency_mean(ms): ***

3.6.3. Tutorial: TF-TRT Using FP16

With this session, we continue to use TF-TRT conversion, but we reduce the precision mode to FP16, allowing the use of Tensor Cores for performance improvements during inference, while preserving accuracy within the acceptable tolerance level (0.1%).
python image_classification.py --model resnet_v1_50 \
    --data_dir /data/imagenet/train-val-tfrecord \
    --use_trt \
    --precision fp16 \
    --download_dir ./data \
    --cache 2>&1 | tee output_tftrt_fp16_resnet_v1_50
Again, we see that the native TensorFlow graph gets converted to a TensorRT graph, from 741 native TensorFlow nodes to 14 total TF-TRT nodes:
num_nodes(native_tf): 741
num_nodes(tftrt_total): 14
num_nodes(trt_only): 2
time(s)(saving_frozen_graph): ***
time(s)(trt_conversion): ***
Results:
results of resnet_v1_50:
    accuracy: 75.91
    images/sec: ***
    99th_percentile(ms): ***
    total_time(s): ***
    latency_mean(ms): ***

3.6.4. Tutorial: TF-TRT Using INT8

For this session we continue to use TF-TRT conversion, and we reduce the precision further to INT8 for faster computation. Because INT8 has significantly lower precision and dynamic range than FP32, the INT8 precision mode requires an additional calibration step before performing the type conversion. In this calibration step, inference is first run with FP32 precision on a calibration dataset to generate many INT8 quantizations of the weights and activations in the trained TensorFlow graph, and the quantizations that minimize information loss are chosen. For more details on the calibration process, see the 8-bit Inference with TensorRT presentation.

The calibration dataset should closely reflect the distribution of the problem dataset. In this walkthrough, we use the same ImageNet validation TFRecords as the calibration data, with the parameter --calib_data_dir /data/imagenet/train-val-tfrecord.

python image_classification.py --model resnet_v1_50 \
    --data_dir /data/imagenet/train-val-tfrecord \
    --use_trt \
    --precision int8 \
    --calib_data_dir /data/imagenet/train-val-tfrecord \
    --download_dir ./data \
    --cache 2>&1 | tee output_tftrt_int8_resnet_v1_50
Again, we see that the native TensorFlow graph gets converted to a TensorRT graph, from 741 native TensorFlow nodes to 14 total TF-TRT nodes:
num_nodes(native_tf): 741
num_nodes(tftrt_total): 14
num_nodes(trt_only): 2
time(s)(saving_frozen_graph): ***
time(s)(trt_conversion): ***
Results:
results of resnet_v1_50:
    accuracy: 75.90
    images/sec: ***
    99th_percentile(ms): ***
    total_time(s): ***
    latency_mean(ms): ***
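The image_classification.py script performs the calibration step internally. If you are converting a graph yourself with the TF-TRT Python API, the INT8 flow looks roughly like the following sketch; frozen_graph, the output names, and the sizes are placeholders, and the code that feeds calibration batches through the calibration graph is elided.
# Rough sketch of a manual INT8 conversion with calibration (contrib API).
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt
calib_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,            # placeholder frozen GraphDef
    outputs=["your_output_node_names"],
    max_batch_size=your_batch_size,
    max_workspace_size_bytes=max_GPU_mem_size_for_TRT,
    precision_mode="int8")
# Run inference on representative calibration data with calib_graph here
# (for example, import it with tf.import_graph_def and feed calibration
# batches), then convert the calibrated graph to the final INT8 graph:
trt_graph = trt.calib_graph_to_infer_graph(calib_graph)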

3.7. Verified Models

We have verified that the following image classification models work with TF-TRT. Refer to the release notes for any related issues on these models.

Preliminary tests have been performed on other types of models, for example, object detection, translation, recommender systems, and reinforcement learning, which can potentially be optimized with TF-TRT. We will continue to publish more details on them.

In the following table, we’ve listed the accuracy numbers for each model that we validate against. Our validation runs inference on the whole ImageNet validation dataset and provides the top-1 accuracy.
Table 3. Verified Models
Model | Native TensorFlow FP32 (Volta and Turing) | TF-TRT FP32 (Volta and Turing) | TF-TRT FP16 (Volta and Turing) | TF-TRT INT8 (Volta) | TF-TRT INT8 (Turing)
MobileNet v1 71.01 71.01 70.99 69.45 69.42
MobileNet v2 74.08 74.08 74.07 73.87 73.90
NASNet - Large 82.72 82.71 82.70 Work in progress 82.66
NASNet - Mobile 73.97 73.85 73.87 Work in progress 73.35
ResNet50 v1 75.90 75.90 75.92 68.19 68.19
ResNet50 v2 76.06 75.98 75.96 71.09 71.07
VGG16 70.89 70.89 70.91 70.84 70.78
VGG19 71.01 71.01 71.01 70.82 70.90
Inception v3 77.99 77.99 77.97 77.86 77.85
Inception v4 80.19 80.19 80.19 79.21 Work in progress

4. Samples

For specific tutorials and samples, see nvidia-examples/inference inside the TensorFlow container. For more information, see NGC TensorFlow container and the TensorFlow User Guide.

The TensorFlow samples include the following features:
  • Download checkpoints or pre-trained models from the TensorFlow model zoo.
  • Run inference using either native TensorFlow or TF-TRT.
  • Achieve accuracy that matches the accuracy obtained by the TensorFlow Slim or TensorFlow official scripts.
  • Report metrics including throughput, latency (mean and 99th percentile), node conversion rates, top-1 accuracy, and total time.
  • Support the FP32, FP16, and INT8 precision modes for TF-TRT.
  • Work with TFRecord datasets only; the scripts have been tested with the ImageNet dataset.
  • Run benchmarks with synthetic data in order to measure the performance of inference alone, independent of the I/O pipeline.

5. Best Practices

Ensure you are familiar with the following best practice guidelines:
Batch normalization
The FusedBatchNorm operator is supported, which means this operator is converted to the relevant TensorRT batch normalization layers. This operator has an argument named is_training, which is a boolean indicating whether the operation is for training or inference. The operator is converted to TensorRT only if is_training=False.

When converting a model from Keras, ensure you call the function keras.backend.set_learning_phase(0) so that your batch normalization layers are built in inference mode and are therefore eligible to be converted. We recommend calling this function at the very beginning of your Python script, right after import keras.
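For example, with a Keras application model the call ordering looks like the following minimal sketch; the use of keras.applications.ResNet50 here is just an illustrative assumption, not a requirement.
# Put Keras in inference mode before building the model so that batch
# normalization layers are created with is_training=False.
import keras
keras.backend.set_learning_phase(0)   # must come before the model is built
model = keras.applications.ResNet50(weights="imagenet")  # example model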

Batch size
TensorRT optimizes the graph for a maximum batch size that must be provided during the conversion. During inference, while using this same maximum batch size provides the best performance, batch sizes smaller than this maximum may not give the best possible performance for that batch size. Also, running inference with batch sizes larger than the maximum batch size is not supported by TensorRT.
Conversion on the target machine
You need to execute the conversion on the machine on which you will run inference. This is because TensorRT optimizes the graph by using the available GPUs and thus the optimized graph may not perform well on a different GPU.
I/O bound
Inference workloads can have very low latency, especially if the model is small and the inference engine is optimized with TF-TRT. In such cases, we often see that the bottleneck in the pipeline is loading inputs from disk or network (such as JPEG images or TFRecords) and preprocessing them before feeding them into the inference engine. We recommend always profiling your inference at runtime to find the bottlenecks. A common way to improve the I/O pipeline is to use all available optimizations in I/O libraries such as tf.data or NVIDIA DALI, including multi-threaded I/O and image processing, and performing image processing on GPUs.
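For example, a minimal tf.data input pipeline with parallel reads, parallel preprocessing, and prefetching might look like the following sketch; parse_fn is a placeholder for your own TFRecord decoding and preprocessing function, and the file pattern is an assumption based on the dataset layout used earlier in this guide.
# Minimal tf.data sketch: parallel reads, parallel map, and prefetching.
import tensorflow as tf
def parse_fn(serialized_example):
    # Placeholder: decode and preprocess one serialized TFRecord example here.
    return serialized_example
filenames = tf.gfile.Glob("/data/imagenet/train-val-tfrecord/validation-*")
dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=8)
dataset = dataset.map(parse_fn, num_parallel_calls=8)  # decode and preprocess in parallel
dataset = dataset.batch(8)
dataset = dataset.prefetch(buffer_size=2)              # overlap I/O with compute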
INT8 calibration conversion
After you convert the graph with INT8 precision mode, the converted TensorRT graph needs to be calibrated. The calibration is a very slow process that can take 1 hour.
Memory usage
TensorRT allocates memory using an allocator provided by TensorFlow. This means that if you specify the fraction of GPU memory allowed for TensorFlow (using the per_process_gpu_memory_fraction parameter of GPUOptions), then TensorRT can use the remaining memory. For example, per_process_gpu_memory_fraction=0.67 allocates 67% of GPU memory for TensorFlow, making the remaining 33% available for TensorRT engines. The max_workspace_size_bytes parameter can also be used to set the maximum amount of memory that TensorRT should allocate.
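A minimal sketch of sharing memory between TensorFlow and TensorRT follows; frozen_graph and the output names are placeholders, and the memory fraction and workspace size are example values.
# Reserve about 67% of GPU memory for TensorFlow; TensorRT engines can use the
# remaining memory. Pass the same config to the conversion and the session.
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.67)
session_config = tf.ConfigProto(gpu_options=gpu_options)
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,            # placeholder frozen GraphDef
    outputs=["your_output_node_names"],
    max_workspace_size_bytes=1 << 30,        # cap TensorRT scratch space
    precision_mode="fp32",
    session_config=session_config)
sess = tf.Session(config=session_config)     # use the same config for inference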
Minimum segment size
TensorRT does not convert TensorFlow subgraphs with fewer nodes than the value defined by the minimum_segment_size parameter. Therefore, to achieve the best performance we recommend using the smallest possible value of minimum_segment_size for which the converter doesn’t crash. There are two ways to determine this smallest possible value.
  • You can start by setting minimum_segment_size to a large number such as 50 and decrease this number until the converter crashes. With this method you should see better performance as you decrease the minimum_segment_size parameter.
Or
  • You can start with minimum_segment_size set to a small number such as 2 or 3 and increase this number until the converter completes its process without crashing.
Number of nodes
Each TensorFlow graph has a certain number of nodes. The TF-TRT conversion always reduces the number of nodes by replacing subsets of those nodes with single TensorRT nodes. For example, converting a TensorFlow graph of ResNet with 743 nodes could result in a new graph with 19 nodes, out of which 1 node is a TensorRT node that will be executed by a TensorRT engine. A good way to find out whether any optimization has happened, or how much of the graph is optimized, is to compare the number of nodes before and after the conversion. We expect >90% of nodes to be replaced by TensorRT nodes for the supported models.
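A quick way to check this in Python, given the GraphDef objects from before and after the conversion (frozen_graph and trt_graph below are placeholders for your own graphs), is a sketch like the following.
# Compare node counts before and after conversion; TRTEngineOp is the op type
# of the TensorRT nodes placed in the converted graph.
num_native_nodes = len(frozen_graph.node)
num_trt_total = len(trt_graph.node)
num_trt_engines = len([n for n in trt_graph.node if n.op == "TRTEngineOp"])
print("num_nodes(native_tf): %d" % num_native_nodes)
print("num_nodes(tftrt_total): %d" % num_trt_total)
print("num_nodes(trt_only): %d" % num_trt_engines)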
Tensor Cores
If you have a GPU with Tensor Core capability, you can simply set the precision mode to FP16 during the conversion, and then TensorRT will run the relevant operators on Tensor Cores.
Note:
  • Not all GPUs support the ops required for all precisions.
  • Tensor Cores can be used only for MatMul and convolutions if the dimensions are multiples of 8. To verify whether Tensor Cores are being used in your inference, you can profile your inference run with nvprof and check if all the GEMM CUDA kernels (GEMM is used by MatMul and convolution) have 884 in their name.

Notices

Notice

THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION REFERENCED IN THIS GUIDE IS PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the product described in this guide shall be limited in accordance with the NVIDIA terms and conditions of sale for the product.

THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE, AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE (INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.

NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.

Trademarks

NVIDIA, the NVIDIA logo, and cuBLAS, CUDA, cuDNN, cuFFT, cuSPARSE, DIGITS, DGX, DGX-1, DGX Station, GRID, Jetson, Kepler, NVIDIA GPU Cloud, Maxwell, NCCL, NVLink, Pascal, Tegra, TensorRT, Tesla and Volta are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.