



TF-TRT leverages many of TensorRT’s capabilities to accelerate inference. Some of these capabilities are:

Mixed precision execution (FP32, FP16, and INT8)

INT8 quantization

Dynamic Batch and Input shapes

This section gives an overview of these capabilities and provides usage examples and best practices for each.



TensorRT can convert tensors and weights to lower precisions during optimization for faster inference. The precision_mode argument sets the precision mode, which can be one of FP32, FP16, or INT8. Precisions lower than FP32, such as FP16 and INT8, can extract higher performance out of TensorRT engines. The FP16 mode uses Tensor Cores or half-precision hardware instructions, if possible. The INT8 precision mode uses integer hardware instructions.

Users are encouraged to try the reduced precision modes, FP16 and INT8. FP16 usually improves performance without substantial accuracy loss; models trained with automatic mixed precision (AMP) should have no loss. INT8 precision mode typically gives the best performance. However, the quantization error induced by INT8 quantization may introduce an accuracy drop in some models. See the Quantization section to understand this effect in more detail and how to mitigate it.
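
To get a feel for the rounding that FP16 introduces, the following minimal sketch round-trips a few values through IEEE 754 half precision using only the Python standard library. It is independent of TF-TRT, and the helper name to_fp16 is made up for this illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest half-precision value and return it as a float."""
    # The "e" format code packs a number as IEEE 754 binary16.
    return struct.unpack("e", struct.pack("e", x))[0]


for value in (0.1, 1.0, 1000.1, 2049.0):
    rounded = to_fp16(value)
    print(f"{value} -> {rounded} (abs error {abs(value - rounded):.6g})")
```

Small values keep several significant digits, while larger values lose their fractional part. This is one reason to validate a converted model on real data.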

The precision level should be selected according to the application requirements (performance, memory consumption, accuracy). Regardless of the choice, validating the model after conversion to TensorRT is always recommended.

Users should note that selecting a lower precision mode does not mean that the whole network will run in that precision. For each layer, TensorRT selects the fastest implementation available in the chosen precision or a higher one (see reduced precision).

Quantization in deep learning refers to transforming a model’s parameters so that computation is performed at lower precision. This is a popular optimization that reduces the size of deep learning models, thereby speeding up inference and reducing power consumption. It is useful in all deployments, but can be essential for deployment on embedded devices with lower computational power, such as the NVIDIA Jetson.

To illustrate quantization with an example, imagine multiplying 3.999x2.999 and 4x3. The latter (integer quantized) operation is considerably faster to perform than the former. This is the speedup one strives to achieve by quantizing the numbers to a lower precision. Simply put, quantization is a process of mapping input values from a large range and fine granularity to output values in a smaller range and coarser granularity, thereby reducing precision.
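
The mapping described above can be sketched in a few lines of plain Python. This is a simplified symmetric scheme with a single scale factor, assumed here for illustration only; the function names are hypothetical and this is not TensorRT's implementation.

```python
def quantize(values, scale):
    """Map floats to integers in [-127, 127] using one scale factor."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map quantized integers back to approximate float values."""
    return [q * scale for q in q_values]


values = [3.999, 2.999, -0.5, 0.013]
# Choose the scale so that the largest magnitude maps to 127.
scale = max(abs(v) for v in values) / 127

q = quantize(values, scale)
restored = dequantize(q, scale)

for v, qi, r in zip(values, q, restored):
    print(f"{v:>7} -> {qi:>4} -> {r:.4f}")
```

Each restored value differs from the original by at most half a quantization step (scale / 2). Values much smaller than the step, such as 0.013 here, collapse to zero.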

In the context of deep learning, we often train models using the 32-bit floating-point representation (FP32), as we can take advantage of a wider range of numbers. During model quantization, the model data (network parameters and activations) are converted from this floating-point representation to a lower-precision representation, typically 8-bit integers (INT8). Unfortunately, this approximation may result in lower model accuracy.

The main quantization method used in TF-TRT is Post-Training Quantization (PTQ). As the name suggests, Post Training Quantization is a technique used on a previously trained model to reduce the size of the model and gain throughput benefits while mitigating the cost to the model accuracy.

Since small rounding errors can propagate through the network and become increasingly impactful for the model accuracy, different quantization techniques, like quantization-aware training (QAT), have been developed to mitigate this effect. QAT is currently experimental in TF-TRT.



Post-Training Quantization (PTQ) is called INT8 calibration in the context of TensorRT.

During the calibration stage, TensorRT uses the supplied “calibration” input data to estimate the best scale and bias values for each tensor of the network, given its dynamic range and value distribution. TF-TRT stores the collected information in the converted model.
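
As a rough intuition for what calibration collects, the sketch below tracks the largest magnitude seen for one tensor across calibration batches and derives a scale from it. This max-abs rule is a deliberate simplification assumed for illustration; TensorRT's calibrators use more elaborate statistics, and the class name MaxAbsCalibrator is hypothetical.

```python
class MaxAbsCalibrator:
    """Toy calibrator: remembers the largest magnitude observed for one tensor."""

    def __init__(self):
        self.max_abs = 0.0

    def observe(self, batch):
        self.max_abs = max(self.max_abs, max(abs(v) for v in batch))

    def scale(self):
        # The largest observed magnitude maps to the INT8 value 127.
        return self.max_abs / 127.0


def quantize(value, scale):
    """Quantize one value, saturating at the INT8 limits."""
    return max(-127, min(127, round(value / scale)))


calibration_batches = [[0.2, -1.5, 0.7], [2.54, 0.1, -0.3], [1.0, -2.0, 0.05]]

calib = MaxAbsCalibrator()
for batch in calibration_batches:
    calib.observe(batch)

scale = calib.scale()
print("scale:", scale)
print("in range:     ", quantize(-1.5, scale))
print("out of range: ", quantize(5.0, scale))  # saturates
```

A value outside the calibrated range saturates at the integer limit, which is why the calibration data should cover the range of inputs expected at inference time.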

The TF-TRT workflow for using PTQ is fairly straightforward. A user needs to do the following:

During converter instantiation, set the precision mode to INT8 and set the use_calibration flag to True.

During the conversion process, the user needs to pass in a representative input dataloader for calibration. It is important that the calibration input data represent the range of inputs that the model is expected to operate on for calibration to produce meaningful scale factors for activations; the more data, the more accurate the quantization. For example, the test data set (or some subset of it) is often a good data source.

Users should note that the calibration data is always expected to have a single shape.
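
One way to satisfy the single-shape requirement is to yield only full batches. The sketch below shows that idea, with plain Python lists standing in for tensors; the helper name calibration_batches is hypothetical, and in real use each batch would be an array or tensor.

```python
def calibration_batches(samples, batch_size, num_batches):
    """Yield up to num_batches full batches, all with the same shape.

    A trailing partial batch is dropped so that every yielded batch
    has exactly batch_size rows.
    """
    for i in range(num_batches):
        batch = samples[i * batch_size:(i + 1) * batch_size]
        if len(batch) < batch_size:
            break
        # One list entry per model input (a single input here).
        yield [batch]


# 70 samples with 2 features each: only two full batches of 32 fit.
samples = [[float(i), float(i) + 0.5] for i in range(70)]
batches = list(calibration_batches(samples, batch_size=32, num_batches=10))
shapes = {(len(b[0]), len(b[0][0])) for b in batches}

print(len(batches), "batches with shapes", shapes)
```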

The following Python example demonstrates calibration:

    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    # SAVED_MODEL_DIR (path to the input SavedModel) and x_test (an array of
    # test inputs) are assumed to be defined elsewhere.

    # Instantiate the TF-TRT converter
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=SAVED_MODEL_DIR,
        precision_mode=trt.TrtPrecisionMode.INT8,
        use_calibration=True
    )

    # Use data from the test/validation set to perform INT8 calibration
    BATCH_SIZE = 32
    NUM_CALIB_BATCHES = 10

    def calibration_input_fn():
        for i in range(NUM_CALIB_BATCHES):
            start_idx = i * BATCH_SIZE
            end_idx = (i + 1) * BATCH_SIZE
            x = x_test[start_idx:end_idx, :]
            yield [x]

    # Convert the model with valid calibration data
    func = converter.convert(calibration_input_fn=calibration_input_fn)

    # Input for dynamic shapes profile generation
    MAX_BATCH_SIZE = 128

    def input_fn():
        batch_size = MAX_BATCH_SIZE
        x = x_test[0:batch_size, :]
        yield [x]

    # Build the engine
    converter.build(input_fn=input_fn)

    OUTPUT_SAVED_MODEL_DIR = "./models/tftrt_saved_model"
    converter.save(output_saved_model_dir=OUTPUT_SAVED_MODEL_DIR)

    converter.summary()

    # Run some inferences!
    for step in range(10):
        start_idx = step * BATCH_SIZE
        end_idx = (step + 1) * BATCH_SIZE

        print(f"Step: {step}")
        x = x_test[start_idx:end_idx, :]
        func(x)

A TensorFlow model can have input tensors with fixed or dynamic shapes. Before training, we typically set most of the input dimensions to a fixed value. For example, a model that takes a batch of 8 input images with 224x224 resolution and three color channels could have an input tensor with a fixed shape of [8, 224, 224, 3]. Depending on the model architecture, we can leave some of the input dimensions dynamic to allow inference with a wider range of input shapes. Typical examples of dynamic input shapes are:

Batch size – for example for an image classification model, the network input tensor can be [?, 224, 224, 3], where the batch size is unknown during model definition and is allowed to take different values during runtime.

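Image size – for example for a fully convolutional network, the network input tensor can be [8, ?, ?, 3], where the image height and width are unknown during model definition.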

Sequence length of transformer models. For example a BERT encoder has input tensors with shape [N, S], where N is the batch size and S is the sequence length, and both of these dimensions can be dynamic.
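
The "?" notation above can be read as a wildcard. As a minimal sketch, the hypothetical helper below checks whether a concrete shape fits a signature in which None marks a dynamic dimension (the same convention TensorFlow uses when printing shapes).

```python
def is_compatible(signature, shape):
    """True if a concrete shape matches a signature; None marks a dynamic dim."""
    if len(signature) != len(shape):
        return False
    return all(s is None or s == d for s, d in zip(signature, shape))


# Dynamic batch size
print(is_compatible((None, 224, 224, 3), (16, 224, 224, 3)))
# Dynamic image size, fixed batch size of 8
print(is_compatible((8, None, None, 3), (8, 512, 384, 3)))
print(is_compatible((8, None, None, 3), (4, 512, 384, 3)))
# BERT-style [N, S] input with both dimensions dynamic
print(is_compatible((None, None), (2, 384)))
```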

TF-TRT supports models with dynamic shape via user-provided information about the range of input shapes that the converted model should support.

By default, TF-TRT allows a dynamic batch size (this is also called implicit batch mode). The maximum batch size (N) is the batch size that was used to build the engines for the converted model, and such a model supports any batch size in the range [1..N]. If we run inference with a larger batch size, TF-TRT builds another engine to handle it. This has a significant performance impact, because engine building is expensive. The allow_build_at_runtime and maximum_cached_engines conversion parameters control TF-TRT’s runtime engine-building behavior.
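
The cost of exceeding the built batch size can be illustrated with a toy model of an engine cache. This is only a sketch of the behavior described above, not TF-TRT's actual implementation. The class ToyEngineCache and its max_engines and allow_build arguments are hypothetical and only loosely mirror the two conversion parameters mentioned in the text.

```python
from collections import OrderedDict


class ToyEngineCache:
    """Toy model: each engine serves batch sizes from 1 up to its max batch."""

    def __init__(self, max_engines=1, allow_build=True):
        self.max_engines = max_engines
        self.allow_build = allow_build
        self.engines = OrderedDict()  # max batch size -> label, oldest first
        self.builds = 0               # number of (expensive) engine builds

    def run(self, batch_size):
        # Reuse a cached engine whose maximum batch size is large enough.
        for max_batch in list(self.engines):
            if batch_size <= max_batch:
                self.engines.move_to_end(max_batch)
                return self.engines[max_batch]
        if not self.allow_build:
            return None  # no usable engine; what happens next is not modeled
        # Build a new engine and evict the least recently used one if needed.
        self.builds += 1
        self.engines[batch_size] = f"engine(max_batch={batch_size})"
        if len(self.engines) > self.max_engines:
            self.engines.popitem(last=False)
        return self.engines[batch_size]


cache = ToyEngineCache(max_engines=1)
for batch_size in (8, 4, 1, 16, 8):
    print(batch_size, "->", cache.run(batch_size), "| builds so far:", cache.builds)
```

In this model only the requests with batch sizes 8 and 16 trigger a build; every smaller batch reuses an existing engine.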

To support dynamic input dimensions other than the batch dimension, we need to enable dynamic shape mode by passing use_dynamic_shape=True argument to the converter. The dynamic shape mode in TF-TRT utilizes TensorRT’s dynamic shape feature to improve the conversion rate of networks and handle networks with unknown input shapes efficiently. An increased conversion rate means that more of the network can be run in TensorRT. This improves the performance of such networks when used with TF-TRT.

Apart from enabling the use_dynamic_shape flag, TF-TRT needs to be provided information about the range of shapes that are expected during inference, as in the following example.

    import numpy as np
    import tensorflow as tf
    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    # bert_saved_model_path is assumed to point to a BERT SavedModel directory.

    # Instantiate the TF-TRT converter
    PROFILE_STRATEGY = "Optimal"
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=bert_saved_model_path,
        precision_mode=trt.TrtPrecisionMode.FP32,
        use_dynamic_shape=True,
        dynamic_shape_profile_strategy=PROFILE_STRATEGY)

    # Convert the model to TF-TRT
    converter.convert()

    VOCAB_SIZE = 30522  # Model specific, look in the model README.

    # Build engines for input sequence lengths of 128, and 384.
    input_shapes = [[(1, 128), (1, 128), (1, 128)],
                    [(1, 384), (1, 384), (1, 384)]]

    def input_fn():
        for shapes in input_shapes:
            # return a list of input tensors
            yield [tf.convert_to_tensor(
                       np.random.randint(low=0, high=VOCAB_SIZE, size=x,
                                         dtype=np.int32))
                   for x in shapes]

    converter.build(input_fn)

Before the converted model is saved, its engines are built for a certain range of input shapes by calling build() with the input_fn. Unlike calibration inputs, these inputs do not need to represent real input data: for most models, only the input shapes matter. Models with data-dependent shapes are the exception.

The example above illustrates a BERT-like model, which has three input tensors. Our input_fn defines two different input shapes, one with sequence length 128 and one with sequence length 384.

Dynamic inputs can be further specified with the dynamic_shape_profile_strategy argument. This parameter selects the strategy for defining optimization profiles for TensorRT (where “optimization profile” is TensorRT’s terminology for describing input shape information). The following are options for optimization profiles:

Range: create one profile that works for inputs with dimension values in the range of [min_dims, max_dims], where min_dims and max_dims are derived from the provided inputs.

Optimal: create one profile for each input. The profile only works for inputs with the same dimensions as the input it is created for. The GPU engine will be run with optimal performance with such inputs.

Range+Optimal: create the profiles for both Range and Optimal.

ImplicitBatchModeCompatible: create the profiles that will produce the same GPU engines as the implicit_batch_mode would produce.

The following image and table illustrate how the profile strategy influences the range of shapes accepted by the converted model.

| Input Shapes | Dynamic Shape Profile Strategy | Output Profiles | Use Case |
|---|---|---|---|
| [8, 128], [4, 384] | Range | [4-8, 128-384] | Handles a range of inputs for both dimensions |
| [8, 128], [4, 384] | Optimal | [8, 128], [4, 384] | Best performance for concrete input shapes |
| [8, 128], [4, 384] | Range+Optimal | [8, 128], [4, 384], [4-8, 128-384] | Best performance for the concrete inputs, handles any input in the range |
| [8, 128], [4, 384] | ImplicitBatchModeCompatible | [1-8, 128], [1-4, 384] | Flexible batch size for each sequence length |
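
The output profiles listed above can be reproduced with a short pure-Python sketch. It assumes a single-input model and represents each profile as a (min_dims, max_dims) pair, leaving out the other details of a TensorRT optimization profile. The helper names make_profiles and accepts are hypothetical and are not part of the TF-TRT API.

```python
def make_profiles(shapes, strategy):
    """Return a list of (min_dims, max_dims) pairs for the given strategy."""
    shapes = [tuple(s) for s in shapes]
    range_profile = (
        tuple(min(dims) for dims in zip(*shapes)),
        tuple(max(dims) for dims in zip(*shapes)),
    )
    optimal = [(s, s) for s in shapes]
    if strategy == "Range":
        return [range_profile]
    if strategy == "Optimal":
        return optimal
    if strategy == "Range+Optimal":
        return optimal + [range_profile]
    if strategy == "ImplicitBatchModeCompatible":
        # The batch dimension (index 0) may vary from 1 up to the given value.
        return [((1,) + s[1:], s) for s in shapes]
    raise ValueError(f"unknown strategy: {strategy}")


def accepts(profiles, shape):
    """True if some profile covers every dimension of the concrete shape."""
    return any(
        all(lo <= d <= hi for lo, d, hi in zip(min_dims, shape, max_dims))
        for min_dims, max_dims in profiles
    )


shapes = [(8, 128), (4, 384)]
for strategy in ("Range", "Optimal", "Range+Optimal", "ImplicitBatchModeCompatible"):
    profiles = make_profiles(shapes, strategy)
    print(f"{strategy}: {profiles} | accepts (6, 256): {accepts(profiles, (6, 256))}")
```

In this sketch a shape such as (6, 256) is covered only by the strategies that include the Range profile.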

If only a small number of concrete input shapes are expected, then it is recommended to use the “Optimal” strategy.

If build() is not called, then the TensorRT engine is created when the converted model is first run for inference. In that case the engine is built with the default profile strategy, Range, using the input shape of that first inference, so that min_dims=max_dims.

Note: Users can set use_dynamic_shape=True for graphs that have static inputs, and it often results in an improved conversion rate.

The following is a simple Python example demonstrating conversion of a BERT model with random inputs.



    # Prerequisite: Install the python module below before running this example.
    # pip install -q tf-models-official

    import tensorflow as tf
    import tensorflow_hub as hub

    tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'
    bert_saved_model_path = './models/bert_base'

    bert_model = hub.load(tfhub_handle_encoder)
    tf.saved_model.save(bert_model, bert_saved_model_path)

    import numpy as np

    from tensorflow.python.saved_model import signature_constants
    from tensorflow.python.saved_model import tag_constants
    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    # Instantiate the TF-TRT converter
    PROFILE_STRATEGY = "Optimal"
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=bert_saved_model_path,
        precision_mode=trt.TrtPrecisionMode.FP32,
        use_dynamic_shape=True,
        dynamic_shape_profile_strategy=PROFILE_STRATEGY)

    # Convert the model to TF-TRT
    converter.convert()

    VOCAB_SIZE = 30522  # Model specific, look in the model README.

    # Build engines for input sequence lengths of 128, and 384.
    input_shapes = [[(1, 128), (1, 128), (1, 128)],
                    [(1, 384), (1, 384), (1, 384)]]

    def input_fn():
        for shapes in input_shapes:
            # return a list of input tensors
            yield [tf.convert_to_tensor(
                       np.random.randint(low=0, high=VOCAB_SIZE, size=x,
                                         dtype=np.int32))
                   for x in shapes]

    converter.build(input_fn)

    # Save the converted model
    bert_trt_path = "./models/tftrt_bert_base"
    converter.save(bert_trt_path)
    converter.summary()

    # Some helper functions
    def get_func_from_saved_model(saved_model_dir):
        saved_model_loaded = tf.saved_model.load(
            saved_model_dir, tags=[tag_constants.SERVING])
        graph_func = saved_model_loaded.signatures[
            signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
        return graph_func, saved_model_loaded

    def get_random_input(batch_size, seq_length):
        # Generate random input data
        mask = tf.convert_to_tensor(
            np.ones((batch_size, seq_length), dtype=np.int32))
        type_id = tf.convert_to_tensor(
            np.zeros((batch_size, seq_length), dtype=np.int32))
        word_id = tf.convert_to_tensor(
            np.random.randint(0, VOCAB_SIZE, size=[batch_size, seq_length],
                              dtype=np.int32))
        return {'input_mask': mask,
                'input_type_ids': type_id,
                'input_word_ids': word_id}

    # Get a random input tensor
    input_tensor = get_random_input(1, 128)

    # Specify the output tensor interested in. This output is the 'classifier'
    result_key = 'bert_encoder_1'

    trt_func, _ = get_func_from_saved_model(bert_trt_path)

    ## Let's run some inferences!
    for i in range(0, 10):
        print(f"Step: {i}")
        preds = trt_func(**input_tensor)
        result = preds[result_key]

TensorFlow supports executing models in C++, including TF-TRT converted models. The C++ API for the TF-TRT converter is presently experimental.



First convert the model via the TF-TRT Python APIs.

The C++ workflow for loading and running models is explained below for a model with synthetic data.

1. Initialize the global states required by TensorFlow
2. Load the saved model and initialize the TF session
3. Set up inputs
4. Set up outputs
5. Run the inference
6. Release resources

An example showing the above steps:

Initialize the global states required by TensorFlow and the TF session.

    // We need to call this to set up global state for TensorFlow.
    tensorflow::port::InitMain(argv[0], &argc, &argv);
    if (argc > 1) {
      LOG(ERROR) << "Unknown argument " << argv[1] << "\n" << usage;
      return -1;
    }

Load the saved model and initialize the TF session.

    // Some helper functions

    // Returns info for nodes listed in the signature definition.
    std::vector<tensorflow::TensorInfo> GetNodeInfo(
        const google::protobuf::Map<string, tensorflow::TensorInfo>& signature) {
      std::vector<tensorflow::TensorInfo> info;
      for (const auto& item : signature) {
        info.push_back(item.second);
      }
      return info;
    }

    // Load the `SavedModel` located at `model_dir`.
    Status LoadModel(const string& model_dir, const string& signature_key,
                     tensorflow::SavedModelBundle* bundle,
                     std::vector<tensorflow::TensorInfo>* input_info,
                     std::vector<tensorflow::TensorInfo>* output_info) {
      tensorflow::RunOptions run_options;
      tensorflow::SessionOptions sess_options;

      tensorflow::OptimizerOptions* optimizer_options =
          sess_options.config.mutable_graph_options()->mutable_optimizer_options();
      optimizer_options->set_opt_level(tensorflow::OptimizerOptions::L0);
      optimizer_options->set_global_jit_level(tensorflow::OptimizerOptions::OFF);

      // Use the setter here: calling force_gpu_compatible() only reads the
      // current value and has no effect.
      sess_options.config.mutable_gpu_options()->set_force_gpu_compatible(true);
      TF_RETURN_IF_ERROR(tensorflow::LoadSavedModel(sess_options, run_options,
                                                    model_dir, {"serve"}, bundle));

      // Get input and output names
      auto signature_map = bundle->GetSignatures();
      const tensorflow::SignatureDef& signature = signature_map[signature_key];
      *input_info = GetNodeInfo(signature.inputs());
      *output_info = GetNodeInfo(signature.outputs());

      return Status::OK();
    }

Set up inputs. Here we are using synthetic data and placing it on the device ahead of inference.

    // Create random inputs matching `input_info`
    Status SetupInputs(
        int32_t batch_size, int32_t input_size,
        std::vector<tensorflow::TensorInfo>& input_info,
        std::vector<std::pair<std::string, tensorflow::Tensor>>* inputs) {
      for (auto& info : input_info) {
        // Set input batch size
        auto* shape = info.mutable_tensor_shape();
        shape->mutable_dim(0)->set_size(batch_size);

        // Set dynamic dims to static size
        for (int i = 1; i < shape->dim_size(); i++) {
          auto* dim = shape->mutable_dim(i);
          if (dim->size() < 0) {
            dim->set_size(input_size);
          }
        }

        // Allocate memory and fill host tensor
        Tensor input_tensor(info.dtype(), *shape);
        std::fill_n((uint8_t*)input_tensor.data(),
                    input_tensor.AllocatedBytes(), 1);

        inputs->push_back({info.name(), input_tensor});
      }
      return Status::OK();
    }

Set up outputs.

    // Get output tensor names based on `output_info`.
    Status SetupOutputs(std::vector<tensorflow::TensorInfo>& output_info,
                        std::vector<string>* output_names,
                        std::vector<Tensor>* outputs) {
      for (auto& info : output_info) {
        output_names->push_back(info.name());
        outputs->push_back({});
      }
      return Status::OK();
    }

Run the inference.

    // Setup inputs
    std::vector<std::pair<std::string, tensorflow::Tensor>> inputs;
    TFTRT_ENSURE_OK(SetupInputs(batch_size, input_size, input_info, &inputs));

    // Setup outputs
    std::vector<string> output_names;
    std::vector<Tensor> outputs;
    TFTRT_ENSURE_OK(SetupOutputs(output_info, &output_names, &outputs));

    int num_iterations = 10;
    for (int i = 0; i < num_iterations; i++) {
      LOG(INFO) << "Step: " << i;
      TFTRT_ENSURE_OK(
          bundle.session->Run(inputs, output_names, {}, &outputs));
    }

Here is a program that ties all of the above functions together:

    // This program assumes the helper functions shown above, `using`
    // declarations for tensorflow::Flag, Status, Tensor and string, and a
    // TFTRT_ENSURE_OK macro that returns from main on a non-OK Status.
    int main(int argc, char* argv[]) {
      // Parse arguments
      string model_path = "/path/to/model/";
      string signature_key = "serving_default";
      int32_t batch_size = 64;
      int32_t input_size = 128;
      std::vector<Flag> flag_list = {
          Flag("model_path", &model_path, "graph to be executed"),
          Flag("signature_key", &signature_key, "the serving signature to use"),
          Flag("batch_size", &batch_size, "batch size to use for inference"),
          Flag("input_size", &input_size, "shape to use for -1 input dims"),
      };
      string usage = tensorflow::Flags::Usage(argv[0], flag_list);
      const bool parse_result = tensorflow::Flags::Parse(&argc, argv, flag_list);
      if (!parse_result) {
        LOG(ERROR) << usage;
        return -1;
      }

      // We need to call this to set up global state for TensorFlow.
      tensorflow::port::InitMain(argv[0], &argc, &argv);
      if (argc > 1) {
        LOG(ERROR) << "Unknown argument " << argv[1] << "\n" << usage;
        return -1;
      }

      // Setup TF session
      tensorflow::SavedModelBundle bundle;
      std::vector<tensorflow::TensorInfo> input_info;
      std::vector<tensorflow::TensorInfo> output_info;
      TFTRT_ENSURE_OK(
          LoadModel(model_path, signature_key, &bundle, &input_info, &output_info));

      // Setup inputs
      std::vector<std::pair<std::string, tensorflow::Tensor>> inputs;
      TFTRT_ENSURE_OK(SetupInputs(batch_size, input_size, input_info, &inputs));

      // Setup outputs
      std::vector<string> output_names;
      std::vector<Tensor> outputs;
      TFTRT_ENSURE_OK(SetupOutputs(output_info, &output_names, &outputs));

      int num_iterations = 10;
      for (int i = 0; i < num_iterations; i++) {
        LOG(INFO) << "Step: " << i;
        TFTRT_ENSURE_OK(
            bundle.session->Run(inputs, output_names, {}, &outputs));
      }

      return 0;
    }