Abstract

This TensorRT 6.0.1 Developer Guide demonstrates how to use the C++ and Python APIs for implementing the most common deep learning layers. It shows how you can take an existing model built with a deep learning framework and use that to build a TensorRT engine using the provided parsers. The Developer Guide also provides step-by-step instructions for common user tasks such as, creating a TensorRT network definition, invoking the TensorRT builder, serializing and deserializing, and how to feed the engine with data and perform inference; all while using either the C++ or Python API.

For previously released TensorRT developer documentation, see TensorRT Archives.

1. What Is TensorRT?

The core of TensorRT™ is a C++ library that facilitates high performance inference on NVIDIA graphics processing units (GPUs). It is designed to work in a complementary fashion with training frameworks such as TensorFlow, Caffe, PyTorch, MXNet, etc. It focuses specifically on running an already trained network quickly and efficiently on a GPU for the purpose of generating a result (a process that is referred to in various places as scoring, detecting, regression, or inference).

Some training frameworks such as TensorFlow have integrated TensorRT so that it can be used to accelerate inference within the framework. Alternatively, TensorRT can be used as a library within a user application. It includes parsers for importing existing models from Caffe, ONNX, or TensorFlow, and C++ and Python APIs for building models programmatically.

Figure 1. TensorRT is a high performance neural network inference optimizer and runtime engine for production deployment.

TensorRT optimizes the network by combining layers and optimizing kernel selection for improved latency, throughput, power efficiency and memory consumption. If the application specifies, it will additionally optimize the network to run in lower precision, further increasing performance and reducing memory requirements.

The following figure shows TensorRT defined as part high-performance inference optimizer and part runtime engine. It can take in neural networks trained on these popular frameworks, optimize the neural network computation, generate a light-weight runtime engine (which is the only thing you need to deploy to your production environment), and it will then maximize the throughput, latency, and performance on these GPU platforms.

Figure 2. TensorRT is a programmable inference accelerator.

The TensorRT API includes implementations for the most common deep learning layers. For more information about the layers, see TensorRT Layers. You can also use the C++ Plugin API or Python Plugin API to provide implementations for infrequently used or more innovative layers that are not supported out-of-the-box by TensorRT.

1.1. Benefits Of TensorRT

After the neural network is trained, TensorRT enables the network to be compressed, optimized and deployed as a runtime without the overhead of a framework.

TensorRT combines layers, optimizes kernel selection, and also performs normalization and conversion to optimized matrix math depending on the specified precision (FP32, FP16 or INT8) for improved latency, throughput, and efficiency.

For deep learning inference, there are 5 critical factors that are used to measure software:
Throughput
The volume of output within a given period. Often measured in inferences/second or samples/second, per-server throughput is critical to cost-effective scaling in data centers.
Efficiency
Amount of throughput delivered per unit-power, often expressed as performance/watt. Efficiency is another key factor to cost effective data center scaling, since servers, server racks and entire data centers must operate within fixed power budgets.
Latency
Time to execute an inference, usually measured in milliseconds. Low latency is critical to delivering rapidly growing, real-time inference-based services.
Accuracy
A trained neural network’s ability to deliver the correct answer. For image classification based usages, the critical metric is expressed as a top-5 or top-1 percentage.
Memory usage
The host and device memory that need to be reserved to do inference on a network depends on the algorithms used. This constrains what networks and what combinations of networks can run on a given inference platform. This is particularly important for systems where multiple networks are needed and memory resources are limited - such as cascading multi-class detection networks used in intelligent video analytics and multi-camera, multi-network autonomous driving systems.
Alternatives to using TensorRT include:
  • Using the training framework itself to perform inference.
  • Writing a custom application that is designed specifically to execute the network using low level libraries and math operations.

Using the training framework to perform inference is easy, but tends to result in much lower performance on a given GPU than would be possible with an optimized solution like TensorRT. Training frameworks tend to implement more general purpose code which stress generality and when they are optimized the optimizations tend to focus on efficient training.

Higher efficiency can be obtained by writing a custom application just to execute a neural network, however it can be quite labor intensive and require quite a bit of specialized knowledge to reach a high level of performance on a modern GPU. Furthermore, optimizations that work on one GPU may not translate fully to other GPUs in the same family and each generation of GPU may introduce new capabilities that can only be leveraged by writing new code.

TensorRT solves these problems by combining an API with a high level of abstraction from the specific hardware details and an implementation which is developed and optimized specifically for high throughput, low latency, and low device memory footprint inference.

1.1.1. Who Can Benefit From TensorRT

TensorRT is intended for use by engineers who are responsible for building features and applications based on new or existing deep learning models or deploying models into production environments. These deployments might be into servers in a datacenter or cloud, in an embedded device, robot or vehicle, or application software which will run on users workstations.

TensorRT has been used successfully across a wide range of scenarios, including:
Robots
Companies sell robots using TensorRT to run various kinds of computer vision models to autonomously guide an unmanned aerial system flying in dynamic environments.
Autonomous Vehicles
TensorRT is used to power computer vision in the NVIDIA Drive products.
Scientific and Technical Computing
A popular technical computing package embeds TensorRT to enable high throughput execution of neural network models.
Deep Learning Training and Deployment Frameworks
TensorRT is included in several popular Deep Learning Frameworks including TensorFlow and MXNet. For TensorFlow and MXNet container release notes, see TensorFlow Release Notes and MXNet Release Notes.
Video Analytics
TensorRT is used in NVIDIA’s DeepStream product to power sophisticated video analytics solutions both at the edge with 1 - 16 camera feeds and in the datacenter where hundreds or even thousands of video feeds might come together.
Automatic Speech Recognition
TensorRT is used to power speech recognition on a small tabletop/desktop device. A limited vocabulary is supported on the device with a larger vocabulary speech recognition system available in the cloud.

1.2. Where Does TensorRT Fit?

Generally, the workflow for developing and deploying a deep learning model goes through three phases.
  • Phase 1 is training
  • Phase 2 is developing a deployment solution, and
  • Phase 3 is the deployment of that solution

Phase 1: Training

During the training phase, the data scientists and developers will start with a statement of the problem they want to solve and decide on the precise inputs, outputs and loss function they will use. They will also collect, curate, augment, and probably label the training, test and validation data sets. Then they will design the structure of the network and train the model. During training, they will monitor the learning process which may provide feedback which will cause them to revise the loss function, acquire or augment the training data. At the end of this process, they will validate the model performance and save the trained model. Training and validation is usually done using DGX-1™ , Titan, or Tesla datacenter GPUs.

TensorRT is generally not used during any part of the training phase.

Phase 2: Developing A Deployment Solution

During the second phase, the data scientists and developers will start with the trained model and create and validate a deployment solution using this trained model. Breaking this phase down into steps, you get:
  1. Think about how the neural network functions within the larger system of which it is a part of and design and implement an appropriate solution. The range of systems that might incorporate neural networks are tremendously diverse. Examples include:
    • the autonomous driving system in a vehicle
    • a video security system on a public venue or corporate campus
    • the speech interface to a consumer device
    • an industrial production line automated quality assurance system
    • an online retail system providing product recommendations, or
    • a consumer web service offering entertaining filters users can apply to uploaded images.
    Determine what your priorities are. Given the diversity of different systems that you could implement, there are a lot of things that may need to be considered for designing and implementing the deployment architecture.
    • Do you have a single network or many networks? For example, Are you developing a feature or system that is based on a single network (face detection), or will your system be comprised of a mixture or cascade of different models, or perhaps a more general facility that serves up a collection model that may be provided by the end user?
    • What device or compute element will you use to run the network? CPU, GPU, other, or a mixture? If the model is going to run on a GPU, is it a single type of GPU, or do you need to design an application that can run on a variety of GPUs?
    • How is data going to get to the models? What is the data pipeline? Is the data coming in from a camera or sensor, from a series of files, or being uploaded over a network connection?
    • What pre-processing will be done? What format will the data come in? If it is an image does it need to be cropped, rotated? If it is text what character set is it and are all characters allowed as inputs to the model? Are there any special tokens?
    • What latency and throughput requirements will you have?
    • Will you be able to batch together multiple requests?
    • Will you need multiple instances of a single network to achieve the required overall system throughput and latency?
    • What will you do with the output of the network?
    • What post processing steps are needed?

    TensorRT provides a fast, modular, compact, robust, reliable inference engine that can support the inference needs within the deployment architecture.

  2. After the data scientists and developers define the architecture of their inference solution, by which they determine what their priorities are, they then build an inference engine from the saved network using TensorRT. There are a number of ways to do this depending on the training framework used and the network architecture. Generally, this means you need to take the saved neural network and parse it from its saved format into TensorRT using the ONNX parser (see Figure 3), Caffe parser, or TensorFlow/UFF parser.
    Figure 3. ONNX Workflow V1
  3. After the network is being parsed, you’ll need to consider optimization options -- batch size, workspace size and mixed precision. These options are chosen and specified as part of the TensorRT build step where you actually build an optimized inference engine based on your network. Subsequent sections of this guide provides detailed instructions and numerous examples on this part of the workflow, parsing your model into TensorRT and choosing the optimization parameters (see Figure 4).
    Figure 4. TensorRT optimizes trained neural network models to produce a deployment-ready runtime inference engine.
  4. After you’ve created an inference engine using TensorRT, you’ll want to validate that it reproduces the results of the model as measured during the training process. If you have chosen FP32 or FP16 it should match the results quite closely. If you have chosen INT8 there may be a small gap between the accuracy achieved during training and the inference accuracy.
  5. Write out the inference engine in a serialized format. This is also called a plan file.

Phase 3: Deploying A Solution

The TensorRT library will be linked into the deployment application which will call into the library when it wants an inference result. To initialize the inference engine, the application will first deserialize the model from the plan file into an inference engine.

TensorRT is usually used asynchronously, therefore, when the input data arrives, the program calls an enqueue function with the input buffer and the buffer in which TensorRT should put the result.

1.3. How Does TensorRT Work?

To optimize your model for inference, TensorRT takes your network definition, performs optimizations including platform specific optimizations, and generates the inference engine. This process is referred to as the build phase. The build phase can take considerable time, especially when running on embedded platforms. Therefore, a typical application will build an engine once, and then serialize it as a plan file for later use.
Note: The generated plan files are not portable across platforms or TensorRT versions. Plans are specific to the exact GPU model they were built on (in addition to the platforms and the TensorRT version) and must be re-targeted to the specific GPU in case you want to run them on a different GPU.
The build phase performs the following optimizations on the layer graph:
  • Elimination of layers whose outputs are not used
  • Elimination of operations which are equivalent to no-op
  • Fusion of convolution, bias and ReLU operations
  • Aggregation of operations with sufficiently similar parameters and the same source tensor (for example, the 1x1 convolutions in GoogleNet v5’s inception module)
  • Merging of concatenation layers by directing layer outputs to the correct eventual destination.

The builder also modifies the precision of weights if necessary. When generating networks in 8-bit integer precision, it uses a process called calibration to determine the dynamic range of intermediate activations, and hence the appropriate scaling factors for quantization.

In addition, the build phase also runs layers on dummy data to select the fastest from its kernel catalog, and performs weight pre-formatting and memory optimization where appropriate.

For more information, see Working With Mixed Precision.

1.4. What Capabilities Does TensorRT Provide?

TensorRT enables developers to import, calibrate, generate, and deploy optimized networks. Networks can be imported directly from Caffe, or from other frameworks via the UFF or ONNX formats. They may also be created programmatically by instantiating individual layers and setting parameters and weights directly.

Users can also run custom layers through TensorRT using the Plugin interface. The GraphSurgeon utility provides the ability to map TensorFlow nodes to custom layers in TensorRT, thus enabling inference for many TensorFlow networks with TensorRT.

TensorRT provides a C++ implementation on all supported platforms, and a Python implementation on x86, aarch64, and ppc64le.

The key interfaces in the TensorRT core library are:
Network Definition
The Network Definition interface provides methods for the application to specify the definition of a network. Input and output tensors can be specified, layers can be added, and there is an interface for configuring each supported layer type. As well as layer types, such as convolutional and recurrent layers, and a Plugin layer type allows the application to implement functionality not natively supported by TensorRT. For more information about the Network Definition, see Network Definition API.
Builder
The Builder interface allows creation of an optimized engine from a network definition. It allows the application to specify the maximum batch and workspace size, the minimum acceptable level of precision, timing iteration counts for autotuning, and an interface for quantizing networks to run in 8-bit precision. For more information about the Builder, see Builder API.
Engine
The Engine interface allows the application to execute inference. It supports synchronous and asynchronous execution, profiling, and enumeration and querying of the bindings for the engine inputs and outputs. A single engine can have multiple execution contexts, allowing a single set of trained parameters to be used for the simultaneous execution of multiple batches. For more information about the Engine, see Execution API.
TensorRT provides parsers for importing trained networks to create network definitions:
Caffe Parser
This parser can be used to parse a Caffe network created in BVLC Caffe or NVCaffe 0.16. It also provides the ability to register a plugin factory for custom layers. For more details on the C++ Caffe Parser, see NvCaffeParser or the Python Caffe Parser.

UFF Parser
This parser can be used to parse a network in UFF format. It also provides the ability to register a plugin factory and pass field attributes for custom layers. For more details on the C++ UFF Parser, see NvUffParser or the Python UFF Parser.

ONNX Parser
This parser can be used to parse an ONNX model. For more details on the C++ ONNX Parser, see NvONNXParser or the Python ONNX Parser.
Restriction: Since the ONNX format is quickly developing, you may encounter a version mismatch between the model version and the parser version. The ONNX Parser shipped with TensorRT 5.1.x supports ONNX IR (Intermediate Representation) version 0.0.3, opset version 9.

Note: Additionally, some TensorRT Caffe and ONNX parsers and plugins can be found on GitHub.

1.5. How Do I Get TensorRT?

For step-by-step instructions on how to install TensorRT, see the TensorRT Installation Guide.

2. Using The C++ API

The following sections highlight the TensorRT user goals and tasks that you can perform using the C++ API. Further details are provided in the Samples Support Guide and are linked to below where appropriate.

The assumption is that you are starting with a trained model. This chapter will cover the following necessary steps in using TensorRT:
  • Creating a TensorRT network definition from your model
  • Invoking the TensorRT builder to create an optimized runtime engine from the network
  • Serializing and deserializing the engine so that it can be rapidly recreated at runtime
  • Feeding the engine with data to perform inference

C++ API vs Python API

In essence, the C++ API and the Python API should be close to identical in supporting your needs. The C++ API should be used in any performance critical scenarios, as well as in situations where safety is important, for example, like in automotive.

The main benefit of the Python API is that data preprocessing and postprocessing is easy to use because you’re able to use a variety of libraries like NumPy and SciPy. For more information about the Python API, see Using The Python API.

2.1. Instantiating TensorRT Objects in C++

In order to run inference, you need to use the IExecutionContext object. In order to create an object of type IExecutionContext, you first need to create an object of type ICudaEngine (the engine).

The engine can be created in one of two ways:
  • via the network definition from the user model. In this case, the engine can be optionally serialized and saved for later use.
  • by reading the serialized engine from the disk. In this case, the performance is better, since the steps of parsing the model and creating intermediate objects are bypassed.
An object of type ILogger needs to be created globally. It is used as an argument to various methods of TensorRT API. A simple example demonstrating the creation of the logger is shown here:
class Logger : public ILogger           
 {
     void log(Severity severity, const char* msg) override
     {
         // suppress info-level messages
         if (severity != Severity::kINFO)
             std::cout << msg << std::endl;
     }
 } gLogger;

A global TensorRT API method called createInferBuilder(gLogger) is used to create an object of type IBuilder. For more information, see IBuilder class reference.

A method called createNetwork defined for IBuilder is used to create an object of type INetworkDefinition.

One of the available parsers is created (Caffe, ONNX, or UFF) using the INetwork definition as the input:
  • ONNX: auto parser = nvonnxparser::createParser(*network, gLogger);
  • Caffe: auto parser = nvcaffeparser1::createCaffeParser();
  • UFF: auto parser = nvuffparser::createUffParser();
A method called parse() from the object of type IParser is called to read the model file and populate the TensorRT network.

A method called buildCudaEngine() of IBuilder is called to create an object of ICudaEngine type.

The engine can be optionally serialized and dumped into the file.

The execution context is used to perform inference.

If the serialized engine is preserved and saved to a file, you can bypass most of the steps described above.

A global TensorRT API method called createInferRuntime(gLogger) is used to create an object of type IRuntime.

The rest of the inference is identical for those two usage models.

Even though it is possible to avoid creating the CUDA context, (the default context will be created for you), it is not advisable. It is recommended to create and configure the CUDA context before creating a runtime or builder object.

The builder or runtime will be created with the GPU context associated with the creating thread. Although a default context will be created if it does not already exist, it is advisable to create and configure the CUDA context before creating a runtime or builder object.

2.2. Creating A Network Definition In C++

The first step in performing inference with TensorRT is to create a TensorRT network from your model. The easiest way to achieve this is to import the model using the TensorRT parser library, which supports serialized models in the following samples:

An alternative is to define the model directly using the TensorRT API. This requires you to make a small number of API calls to define each layer in the network graph, and to implement your own import mechanism for the model’s trained parameters.

In either case, you will explicitly need to tell TensorRT which tensors are required as outputs of inference. Tensors which are not marked as outputs are considered to be transient values that may be optimized away by the builder. There is no restriction on the number of output tensors, however, marking a tensor as an output may prohibit some optimizations on that tensor. Inputs and output tensors must also be given names (using ITensor::setName()). At inference time, you will supply the engine with an array of pointers to input and output buffers. In order to determine in which order the engine expects these pointers, you can query using the tensor names.

An important aspect of a TensorRT network definition is that it contains pointers to model weights, which are copied into the optimized engine by the builder. If a network was created via a parser, the parser will own the memory occupied by the weights, and so the parser object should not be deleted until after the builder has run.

2.2.1. Creating A Network Definition From Scratch Using The C++ API

Instead of using a parser, you can also define the network directly to TensorRT via the network definition API. This scenario assumes that the per-layer weights are ready in host memory to pass to TensorRT during the network creation.

In the following example, we will create a simple network with Input, Convolution, Pooling, FullyConnected, Activation and SoftMax layers. To see the code in totality, refer to Building A Simple MNIST Network Layer By Layer (sampleMNISTAPI) located in the opensource/sampleMNISTAPI directory in the GitHub repository.

  1. Create the builder and the network:
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
  2. Add the Input layer to the network, with the input dimensions. A network can have multiple inputs, although in this sample there is only one:
    auto data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{1, INPUT_H, INPUT_W});
    
  3. Add the Convolution layer with hidden layer input nodes, strides and weights for filter and bias. In order to retrieve the tensor reference from the layer, we can use:
    auto conv1 = network->addConvolution(*data->getOutput(0), 20, DimsHW{5, 5}, weightMap["conv1filter"], weightMap["conv1bias"]);
    conv1->setStride(DimsHW{1, 1});
    Note: Weights passed to TensorRT layers are in host memory.
  4. Add the Pooling layer:
    auto pool1 = network->addPooling(*conv1->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});
    pool1->setStride(DimsHW{2, 2});
  5. Add the FullyConnected and Activation layers:
    auto ip1 = network->addFullyConnected(*pool1->getOutput(0), 500, weightMap["ip1filter"], weightMap["ip1bias"]);
    auto relu1 = network->addActivation(*ip1->getOutput(0), ActivationType::kRELU);
    
  6. Add the SoftMax layer to calculate the final probabilities and set it as the output:
    auto prob = network->addSoftMax(*relu1->getOutput(0));
    prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);
  7. Mark the output:
    network->markOutput(*prob->getOutput(0));
    

2.2.2. Importing A Model Using A Parser In C++

To import a model using the C++ Parser API, you will need to perform the following high-level steps:
  1. Create the TensorRT builder and network.
    IBuilder* builder = createInferBuilder(gLogger);
    nvinfer1::INetworkDefinition* network = builder->createNetwork();
    For an example on how to create the logger, see Instantiating TensorRT Objects in C++.
  2. Create the TensorRT parser for the specific format.
    ONNX
    auto parser = nvonnxparser::createParser(*network, gLogger);
    UFF
    auto parser = nvuffparser::createUffParser();
    Caffe
    auto parser = nvcaffeparser1::createCaffeParser();
  3. Use the parser to parse the imported model and populate the network.
    parser->parse(args);
    The specific args depend on what format parser is used. For more information, refer to the parsers documented in the TensorRT API.

The builder must be created before the network because it serves as a factory for the network. Different parsers have different mechanisms for marking network outputs.

2.2.3. Importing A Caffe Model Using The C++ Parser API

The following steps illustrate how to import a Caffe model using the C++ Parser API. For more information, see "Hello World" For TensorRT (sampleMNIST) located in the GitHub repository.
  1. Create the builder and network:
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
    
  2. Create the Caffe parser:
    ICaffeParser* parser = createCaffeParser();
  3. Parse the imported model:
    const IBlobNameToTensor* blobNameToTensor = parser->parse("deploy_file" , "modelFile", *network, DataType::kFLOAT);
    

    This populates the TensorRT network from the Caffe model. The final argument instructs the parser to generate a network whose weights are 32-bit floats. Using DataType::kHALF would generate a model with 16-bit weights instead.

    In addition to populating the network definition, the parser returns a dictionary that maps from Caffe blob names to TensorRT tensors. Unlike Caffe, a TensorRT network definition has no notion of in-place operation. When a Caffe model uses an in-place operation, the TensorRT tensor returned in the dictionary corresponds to the last write to that blob. For example, if a convolution writes to a blob and is followed by an in-place ReLU, that blob’s name will map to the TensorRT tensor which is the output of the ReLU.

  4. Specify the outputs of the network:
    for (auto& s : outputs)
        network->markOutput(*blobNameToTensor->find(s.c_str()));

2.2.4. Importing A TensorFlow Model Using The C++ UFF Parser API

Note: For new projects, it’s recommended to use the TF-TRT integration as a method for converting your TensorFlow network to use TensorRT for inference. For integration instructions, see Accelerating Inference In TF-TRT User Guide and its Release Notes.

Importing from the TensorFlow framework requires you to convert the TensorFlow model into intermediate format UFF (Universal Framework Format). For more information about the conversion, see Converting A Frozen Graph To UFF.

The following steps illustrate how to import a TensorFlow model using the C++ Parser API. For more information about the UFF import, see Importing A TensorFlow Model And Running Inference (sampleUffMNIST) located in the GitHub repository.

  1. Create the builder and network:
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
  2. Create the UFF parser:
    IUFFParser* parser = createUffParser();
  3. Declare the network inputs and outputs to the UFF parser:
    parser->registerInput("Input_0", DimsCHW(1, 28, 28), UffInputOrder::kNCHW);
    parser->registerOutput("Binary_3");
    
  4. Parse the imported model to populate the network:
    parser->parse(uffFile, *network, nvinfer1::DataType::kFLOAT);

2.2.5. Importing An ONNX Model Using The C++ Parser API

Restriction: Since the ONNX format is quickly developing, you may encounter a version mismatch between the model version and the parser version. The ONNX Parser shipped with TensorRT 5.1.x supports ONNX IR (Intermediate Representation) version 0.0.3, opset version 9.

In general, the newer version of the ONNX Parser is designed to be backward compatible, therefore, encountering a model file produced by an earlier version of ONNX exporter should not cause a problem. There could be some exceptions when the changes were not backward compatible. In this case, convert the earlier ONNX model file into a later supported version. For more information on this subject, see ONNX Model Opset Version Converter.

It is also possible that the user model was generated by an exporting tool supporting later opsets than supported by the ONNX parser shipped with TensorRT. In this case, check whether the latest version of TensorRT released to GitHub, onnx-tensorrt, supports the required version. The supported version is defined by the BACKEND_OPSET_VERSION variable in onnx_trt_backend.cpp. Download and build the latest version of ONNX TensorRT Parser from the GitHub. The instructions for building can be found here: TensorRT backend for ONNX.

The following steps illustrate how to import an ONNX model using the C++ Parser API. For more information about the ONNX import, see "Hello World" For TensorRT From ONNX (sampleOnnxMNIST) located in the GitHub repository.
  1. Create the builder and network.
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
    
  2. Create the ONNX parser.
    nvonnxparser::IONNXParser* parser = 
    nvonnxparser::createONNXParser(*network, gLogger);
    
  3. Ingest the model:
    parser->parseFromFile(onnx_filename, 
    ILogger::Severity::kWARNING);

2.3. Building An Engine In C++

The next step is to invoke the TensorRT builder to create an optimized runtime. One of the functions of the builder is to search through its catalog of CUDA kernels for the fastest implementation available, and thus it is necessary to use the same GPU for building as that on which the optimized engine will run.

The builder has many properties that you can set in order to control such things as the precision at which the network should run, and autotuning parameters such as how many times TensorRT should time each kernel when ascertaining which is fastest (more iterations lead to longer runtimes, but less susceptibility to noise.) You can also query the builder to find out what reduced precision types are natively supported by the hardware.

Two particularly important properties are the maximum batch size and the maximum workspace size.
  • The maximum batch size specifies the batch size for which TensorRT will optimize. At runtime, a smaller batch size may be chosen.
  • Layer algorithms often require temporary workspace. This parameter limits the maximum size that any layer in the network can use. If an insufficient scratch is provided, it is possible that TensorRT may not be able to find an implementation for a given layer.
  1. Build the engine using the builder object:
    builder->setMaxBatchSize(maxBatchSize);
    IBuilderConfig * config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1 << 20);
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    

    When the engine is built, TensorRT makes copies of the weights.

  2. Dispense with the network, builder, and parser if using one.
    parser->destroy();
    network->destroy();
    config->destroy();
    builder->destroy();

2.4. Serializing A Model In C++

To serialize, you are transforming the engine into a format to store and use at a later time for inference. To use for inference, you would simply deserialize the engine. Serializing and deserializing are optional. Since creating an engine from the Network Definition can be time consuming, you could avoid rebuilding the engine every time the application reruns by serializing it once and deserializing it while inferencing. Therefore, after the engine is built, users typically want to serialize it for later use.
It is not absolutely necessary to serialize and deserialize a model before using it for inference – if desirable, the engine object can be used for inference directly.
Note: Serialized engines are not portable across platforms or TensorRT versions. Engines are specific to the exact GPU model they were built on (in addition to the platforms and the TensorRT version).
  1. Run the builder as a prior offline step and then serialize:
    IHostMemory *serializedModel = engine->serialize();
    // store model to disk
    // <…>
    serializedModel->destroy();
    
  2. Create a runtime object to deserialize:
    IRuntime* runtime = createInferRuntime(gLogger);
    ICudaEngine* engine = runtime->deserializeCudaEngine(modelData, modelSize, nullptr);
    

    The final argument is a plugin layer factory for applications using custom layers. For more information, see Extending TensorRT With Custom Layers.

2.5. Performing Inference In C++

The following steps illustrate how to perform inference in C++ now that you have an engine.
  1. Create some space to store intermediate activation values. Since the engine holds the network definition and trained parameters, additional space is necessary. These are held in an execution context:
    IExecutionContext *context = engine->createExecutionContext();

    An engine can have multiple execution contexts, allowing one set of weights to be used for multiple overlapping inference tasks. For example, you can process images in parallel CUDA streams using one engine and one context per stream. Each context will be created on the same GPU as the engine.

  2. Use the input and output blob names to get the corresponding input and output index:
    int inputIndex = engine->getBindingIndex(INPUT_BLOB_NAME);
    int outputIndex = engine->getBindingIndex(OUTPUT_BLOB_NAME);
    
  3. Using these indices, set up a buffer array pointing to the input and output buffers on the GPU:
    void* buffers[2];
    buffers[inputIndex] = inputbuffer;
    buffers[outputIndex] = outputBuffer;
    
  4. TensorRT execution is typically asynchronous, so enqueue the kernels on a CUDA stream:
    context->enqueue(batchSize, buffers, stream, nullptr);

    It is common to enqueue asynchronous memcpy() before and after the kernels to move data from the GPU if it is not already there. The final argument to enqueue() is an optional CUDA event which will be signaled when the input buffers have been consumed and their memory may be safely reused.

    To determine when the kernel (and possibly memcpy()) are complete, use standard CUDA synchronization mechanisms such as events, or waiting on the stream.

2.6. Memory Management In C++

TensorRT provides two mechanisms to allow the application more control over device memory.

By default, when creating an IExecutionContext, persistent device memory is allocated to hold activation data. To avoid this allocation, call createExecutionContextWithoutDeviceMemory. It is then the application’s responsibility to call IExecutionContext::setDeviceMemory() to provide the required memory to run the network. The size of the memory block is returned by ICudaEngine::getDeviceMemorySize().

In addition, the application can supply a custom allocator for use during build and runtime by implementing the IGpuAllocator interface. Once the interface is implemented, call
setGpuAllocator(&allocator);
on the IBuilder or IRuntime interfaces. All device memory will then allocated and freed through this interface.

2.7. Refitting An Engine

TensorRT can refit an engine with new weights, without having to rebuild it. The engine must be built as “refittable”. Because of the way the engine is optimized, if you change some weights, you may have to supply some other weights too. The interface can tell you what additional weights need to be supplied.

  1. Request a refittable engine before building it:
    ...
    builder->setRefittable(true); 
    builder->buildCudaEngine(network);
    
  2. Create a refitter object:
    ICudaEngine* engine = ...;
    IRefitter* refitter = createInferRefitter(*engine,gLogger)
    
  3. Update the weights that you want to update. For example, to update the kernel weights for a convolution layer named “MyLayer”:
    Weights newWeights = ...;
    refitter->setWeights("MyLayer",WeightsRole::kKERNEL,
                        newWeights);
    

    The new weights should have the same count as the original weights used to build the engine.

    setWeights returns false if something went wrong, such as a wrong layer name or role, or a change in the weights count.

  4. Find out what other weights must be supplied. This typically requires two calls to IRefitter::getMissing, first to get the number of Weights objects that must be supplied, and second to get their layers and roles.
    const int n = refitter->getMissing(0, nullptr, nullptr);
    std::vector<const char*> layerNames(n);
    std::vector<WeightsRole> weightsRoles(n);
    refitter->getMissing(n, layerNames.data(), 
                            weightsRoles.data());
    
  5. Supply the missing weights, in any order:
    for (int i = 0; i < n; ++i)
        refitter->setWeights(layerNames[i], weightsRoles[i],
                             Weights{...});
    

    Supplying only the missing weights will not generate a need for any more weights. Supplying any additional weights may trigger the need for yet more weights.

  6. Update the engine with all the weights that are provided:
    bool success = refitter->refitCudaEngine();
    assert(success);
    

    If success is false, check the log for a diagnostic, perhaps about weights that are still missing.

  7. Destroy the refitter:
    refitter->destroy();

The updated engine behaves is if it had been built from a network updated with the new weights.

To see all refittable weights in an engine, use refitter->getAll(...); similarly to how getMissing was used in step 3.

3. Using The Python API

The following sections highlight the TensorRT user goals and tasks that you can perform using the Python API. These sections focus on using the Python API without any frameworks. Further details are provided in the Samples Support Guide and are linked to below where appropriate.

The assumption is that you are starting with a trained model. This chapter will cover the following necessary steps in using TensorRT:
  • Creating a TensorRT network definition from your model
  • Invoking the TensorRT builder to create an optimized runtime engine from the network
  • Serializing and deserializing the engine so that it can be rapidly recreated at runtime
  • Feeding the engine with data to perform inference

Python API vs C++ API

In essence, the C++ API and the Python API should be close to identical in supporting your needs. The main benefit of the Python API is that data preprocessing and postprocessing is easy to use because you’re able to use a variety of libraries like NumPy and SciPy.

The C++ API should be used in situations where safety is important, for example, like in automotive. For more information about the C++ API, see Using The C++ API.

For more information about how to optimize performance using Python, see How Do I Optimize My Python Performance? from the TensorRT Best Practices guide.

3.1. Importing TensorRT Into Python

  1. Import TensorRT:
    import tensorrt as trt
  2. Implement a logging interface through which TensorRT reports errors, warnings, and informational messages. The following code shows how to implement the logging interface. In this case, we have suppressed informational messages, and report only warnings and errors. There is a simple logger included in the TensorRT Python bindings.
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

3.2. Creating A Network Definition In Python

The first step in performing inference with TensorRT is to create a TensorRT network from your model. The easiest way to achieve this is to import the model using the TensorRT parser library, (see Importing A Model Using A Parser In Python, Importing From Caffe Using Python, Importing From TensorFlow Using Python, and Importing From ONNX Using Python), which supports serialized models in the following formats:
  • Caffe (both BVLC and NVCaffe)
  • ONNX 1.0 and 1.1, and
  • UFF (used for TensorFlow)

An alternative is to define the model directly using the TensorRT Network API, (see Creating A Network Definition From Scratch Using The Python API). This requires you to make a small number of API calls to define each layer in the network graph, and to implement your own import mechanism for the model’s trained parameters.

Note: The TensorRT Python API is not available for all platforms. For more information, see TensorRT Support Matrix

3.2.1. Creating A Network Definition From Scratch Using The Python API

When creating a network, you must first define the engine and create a builder object for inference. The Python API is used to create a network and engine from the Network APIs. The network definition reference is used to add various layers to the network. For more information about using the Python API to create a network and engine, see the "Hello World" For TensorRT Using PyTorch And Python (network_api_pytorch_mnist) sample.

The following code illustrates how to create a simple network with Input, Convolution, Pooling, FullyConnected, Activation and SoftMax layers.

# Create the builder and network
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
	# Configure the network layers based on the weights provided. In this case, the weights are imported from a pytorch model. 
	# Add an input layer. The name is a string, dtype is a TensorRT dtype, and the shape can be provided as either a list or tuple.
	input_tensor = network.add_input(name=INPUT_NAME, dtype=trt.float32, shape=INPUT_SHAPE)

	# Add a convolution layer
	conv1_w = weights['conv1.weight'].numpy()
	conv1_b = weights['conv1.bias'].numpy()
	conv1 = network.add_convolution(input=input_tensor, num_output_maps=20, kernel_shape=(5, 5), kernel=conv1_w, bias=conv1_b)
	conv1.stride = (1, 1)

	pool1 = network.add_pooling(input=conv1.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))
	pool1.stride = (2, 2)
	conv2_w = weights['conv2.weight'].numpy()
	conv2_b = weights['conv2.bias'].numpy()
	conv2 = network.add_convolution(pool1.get_output(0), 50, (5, 5), conv2_w, conv2_b)
	conv2.stride = (1, 1)

	pool2 = network.add_pooling(conv2.get_output(0), trt.PoolingType.MAX, (2, 2))
	pool2.stride = (2, 2)

	fc1_w = weights['fc1.weight'].numpy()
	fc1_b = weights['fc1.bias'].numpy()
	fc1 = network.add_fully_connected(input=pool2.get_output(0), num_outputs=500, kernel=fc1_w, bias=fc1_b)

	relu1 = network.add_activation(fc1.get_output(0), trt.ActivationType.RELU)

	fc2_w = weights['fc2.weight'].numpy()
	fc2_b = weights['fc2.bias'].numpy()
	fc2 = network.add_fully_connected(relu1.get_output(0), OUTPUT_SIZE, fc2_w, fc2_b)

	fc2.get_output(0).name =OUTPUT_NAME
	network.mark_output(fc2.get_output(0))

3.2.2. Importing A Model Using A Parser In Python

To import a model using a parser, you will need to perform the following high-level steps:
  1. Create the TensorRTbuilder and network.
  2. Create the TensorRT parser for the specific format.
  3. Use the parser to parse the imported model and populate the network.
For examples regarding each of these steps and sample code, see Importing From Caffe Using Python, Importing From TensorFlow Using Python, and Importing From ONNX Using Python.

The builder must be created before the network because it serves as a factory for the network. Different parsers have different mechanisms for marking network outputs. For more information, see the UFF Parser API, Caffe Parser API, and ONNX Parser API.

3.2.3. Importing From Caffe Using Python

The following steps illustrate how to import a Caffe model directly using the CaffeParser and the Python API. For more information, see the Introduction To Importing Caffe, TensorFlow And ONNX Models Into TensorRT Using Python (introductory_parser_samples) sample.
  1. Import TensorRT.
    import tensorrt as trt
  2. Define the data type. In this example, we will use float32.
    datatype = trt.float32
  3. Additionally, define some paths. Change the following paths to reflect where you placed the model included with the samples:
    deploy_file = 'data/mnist/mnist.prototxt'
    model_file = 'data/mnist/mnist.caffemodel'
    
  4. Create the builder, network, and parser:
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.CaffeParser() as parser:
    model_tensors = parser.parse(deploy=deploy_file, model=model_file, network=network, dtype=datatype)
    

    The parser returns the model_tensors, which is a table containing the mapping from tensor names to ITensor objects.

3.2.4. Importing From TensorFlow Using Python

The following steps illustrate how to import a TensorFlow model directly using the UffParser and the Python API. This sample can be found in the <site-packages>/tensorrt/samples/python/end_to_end_tensorflow_mnist directory. For more information, see the "Hello World" For TensorRT Using TensorFlow And Python (end_to_end_tensorflow_mnist) sample.
  1. Import TensorRT:
    import tensorrt as trt
  2. Create a frozen TensorFlow model for the tensorflow model. The instructions on freezing a TensorFlow model into a stream can be found in Freezing A TensorFlow Graph.
  3. Use the UFF converter to convert a frozen tensorflow model to a UFF file. Typically, this is as simple as:
    convert-to-uff frozen_inference_graph.pb
    

    Depending on how you installed TensorRT, the convert-to-uff utility might not be installed in your system path. In this case, invoke the underlying Python script directly. It should be located in the bin directory of the UFF module; for example, ~/.local/lib/python2.7/site-packages/uff/bin/convert_to_uff.py.

    To find the location of the UFF module, run the python -c “import uff; print(uff.__path__)” command.

    Alternatively, you can use the UFF Parser API and convert the TensorFlow GraphDef directly.

  4. Define some paths. Change the following paths to reflect where you placed the model that is included with the samples:
    model_file = '/data/mnist/mnist.uff'
  5. Create the builder, network, and parser:
    with builder = trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
        	parser.register_input("Placeholder", (1, 28, 28))
        	parser.register_output("fc2/Relu")
    parser.parse(model_file, network)
    

3.2.5. Importing From ONNX Using Python

The following steps illustrate how to import an ONNX model directly using the OnnxParser and the Python API. For more information, see the Introduction To Importing Caffe, TensorFlow And ONNX Models Into TensorRT Using Python (introductory_parser_samples) sample.
Restriction:

Since the ONNX format is quickly developing, you may encounter a version mismatch between the model version and the parser version. The OnnxParser shipped with TensorRT 5.1.x supports ONNX IR (Intermediate Representation) version 0.0.3, opset version 9.

In general, the newer version of the OnnxParser is designed to be backward compatible, therefore, encountering a model file produced by an earlier version of ONNX exporter should not cause a problem. There could be some exceptions when the changes were not backward compatible. In this case, convert the earlier ONNX model file into a later supported version. For more information on this subject, see ONNX Model Opset Version Converter.

It is also possible that the user model was generated by an exporting tool supporting later opsets than supported by the ONNX parser shipped with TensorRT. In this case, check whether the latest version of TensorRT released to GitHub, onnx-tensorrt, supports the required version. For more information, see the Object Detection With The ONNX TensorRT Backend In Python (yolov3_onnx) sample.

The supported version is defined by the BACKEND_OPSET_VERSION variable in onnx_trt_backend.cpp. Download and build the latest version of ONNXTensorRT Parser from GitHub. The instructions for building can be found here: TensorRT backend for ONNX.

  1. Import TensorRT:
    import tensorrt as trt
  2. Create the build, network, and parser:
    with builder = trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open(model_path, 'rb') as model:
    parser.parse(model.read())
    

3.2.6. Importing From PyTorch And Other Frameworks

Using TensorRT with PyTorch (or any other framework with NumPy compatible weights) involves replicating the network architecture using the TensorRT API, (see Creating A Network Definition From Scratch Using The Python API), and then copying the weights from PyTorch. For more information, see Working With PyTorch And Other Frameworks.
Note: On Ubuntu 14.04 and CentOS, loading the torch module and TensorRT at the same time may cause segmentation faults.

To perform inference, follow the instructions outlined in Performing Inference In Python.

3.3. Building An Engine In Python

One of the functions of the builder is to search through its catalog of CUDA kernels for the fastest implementation available, and thus it is necessary to use the same GPU for building as that on which the optimized engine will run.

The builder has many properties that you can set in order to control such things as the precision at which the network should run, and autotuning parameters such as how many times TensorRT should time each kernel when ascertaining which is fastest (more iterations leads to longer runtimes, but less susceptibility to noise.) You can also query the builder to find out what mixed precision types are natively supported by the hardware.

Two particularly important properties are the maximum batch size and the maximum workspace size.
  • The maximum batch size specifies the batch size for which TensorRT will optimize. At runtime, a smaller batch size may be chosen.
  • Layer algorithms often require temporary workspace. This parameter limits the maximum size that any layer in the network can use. If insufficient scratch is provided, it is possible that TensorRT may not be able to find an implementation for a given layer.

For more information about building an engine in Python, see the Introduction To Importing Caffe, TensorFlow And ONNX Models Into TensorRT Using Python (introductory_parser_samples) sample.

  1. Build the engine using the builder object:
    builder.max_batch_size = max_batch_size
    builder.max_workspace_size = 1 << 20 # This determines the amount of memory available to the builder when building an optimized engine and should generally be set as high as possible.
    with trt.Builder(TRT_LOGGER) as builder, builder.create_builder_config() as config, builder.build_cuda_engine(network, config) as engine:
    # Do inference here.
    

    When the engine is built, TensorRT makes copies of the weights.

  2. Perform inference. To perform inference, follow the instructions outlined in Performing Inference In Python.

3.4. Serializing A Model In Python

When you serialize, you are transforming the engine into a format to store and use at a later time for inference. To use for inference, you would simply deserialize the engine. Serializing and deserializing are optional. Since creating an engine from the Network Definition can be time consuming, you could avoid rebuilding the engine every time the application reruns by serializing it once and deserializing it while inferencing. Therefore, after the engine is built, users typically want to serialize it for later use.
From here onwards, you can either serialize the engine or you can use the engine directly for inference. Serializing and deserializing a model is an optional step before using it for inference - if desirable, the engine object can be used for inference directly.
Note: Serialized engines are not portable across platforms or TensorRT versions. Engines are specific to the exact GPU model they were built on (in addition to the platforms and the TensorRT version).
  1. Serialize the model to a modelstream:
    serialized_engine = engine.serialize()
  2. Deserialize modelstream to perform inference. Deserializing requires creation of a runtime object:
    with trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(serialized_engine)
    

The final argument is a plugin layer factory for applications using custom layers, and is optional otherwise. More details can be found in Extending TensorRT With Custom Layers.

It is also possible to save a serialized engine to a file, and read it back from the file:
  1. Serialize the engine and write to a file:
    with open(“sample.engine”, “wb”) as f:
    		f.write(engine.serialize())
  2. Read the engine from the file and deserialize:
    with open(“sample.engine”, “rb”) as f, trt.Runtime(TRT_LOGGER) as runtime:
    		engine = runtime.deserialize_cuda_engine(f.read())
    

3.5. Performing Inference In Python

The following steps illustrate how to perform inference in Python, now that you have an engine.
  1. Allocate some host and device buffers for inputs and outputs:
    # Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
    		h_input = cuda.pagelocked_empty(engine.get_binding_shape(0).volume(), dtype=np.float32)
    		h_output = cuda.pagelocked_empty(engine.get_binding_shape(1).volume(), dtype=np.float32)
    		# Allocate device memory for inputs and outputs.
    		d_input = cuda.mem_alloc(h_input.nbytes)
    		d_output = cuda.mem_alloc(h_output.nbytes)
    		# Create a stream in which to copy inputs/outputs and run inference.
    		stream = cuda.Stream()
  2. Create some space to store intermediate activation values. Since the engine holds the network definition and trained parameters, additional space is necessary. These are held in an execution context:
    with engine.create_execution_context() as context:
    		# Transfer input data to the GPU.
    		cuda.memcpy_htod_async(d_input, h_input, stream)
    		# Run inference.
    		context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    		# Transfer predictions back from the GPU.
    		cuda.memcpy_dtoh_async(h_output, d_output, stream)
    		# Synchronize the stream
    		stream.synchronize()
    		# Return the host output. 
    return h_output
    
    

    An engine can have multiple execution contexts, allowing one set of weights to be used for multiple overlapping inference tasks. For example, you can process images in parallel CUDA streams using one engine and one context per stream. Each context will be created on the same GPU as the engine.

4. Extending TensorRT With Custom Layers

TensorRT supports many types of layers and its functionality is continually extended; however, there may be cases in which the layers supported do not cater to the specific needs of a model. In this case, users can extend TensorRT functionalities by implementing custom layers using the IPluginV2Ext class for the C++ and Python API. Custom layers, often referred to as plugins, are implemented and instantiated by an application, and their lifetime must span their use within a TensorRT engine.

4.1. Adding Custom Layers Using The C++ API

A custom layer is implemented by extending the IPluginCreator class and one of TensorRT’s base classes for plugins.

IPluginCreator is a creator class for custom layers using which, users can get plugin name, version and plugin field parameters. It also provides methods to create the plugin object during network build phase and deserialize it during inference.

You must derive your plugin class from one of the base classes for plugins. They have varying expressive power with respect to supporting inputs/outputs with different types/formats or networks with dynamic shapes. The table below summarizes the base classes, ordered from least expressive to most expressive.
Table 1. Base classes, ordered from least expressive to most expressive
  Introduced in TensorRT version? Mixed input/output formats/types Dynamic shapes? Requires extended runtime?
IPluginV2Ext 5.1 Limited No No
IPluginV2IOExt 6.0.1 General No No
IPluginV2DynamicExt 6.0.1 General Yes Yes
All of these base classes include versioning support and helps enable custom layers that support other data formats besides NCHW and single precision. IPluginV2Ext exists in TensorRT 6.0.1 for backwards compatibility with 5.1.
Note: If using either IPluginV2Ext, IPluginV2IOExt, or IPluginV2DynamicExt, it is recommended to always provide an FP32 implementation of the plugin in order to allow the plugin to properly operate with any network.
If writing a new custom layer, we recommend using IPluginV2IOExt if your layer must be usable without the extended runtime or you do not need to support dynamic shapes, otherwise use IPluginV2DynamicExt.
Note: In versions of TensorRT prior to 6.0.1, you derived custom layers from IPluginV2 or IPluginV2Ext. While these APIs are still supported, we highly encourage you to move to IPluginV2IOExt or IPluginV2DynamicExtto be able to use all the new plugin functionalities.
TensorRT also provides the ability to register a plugin by calling REGISTER_TENSORRT_PLUGIN(pluginCreator) which statically registers the Plugin Creator to the Plugin Registry. During runtime, the Plugin Registry can be queried using the extern function getPluginRegistry(). The Plugin Registry stores a pointer to all the registered Plugin Creators and can be used to look up a specific Plugin Creator based on the plugin name and version. The TensorRT library contains plugins that can be loaded into your application. The version of all these plugins is set to 1. The names of these plugins are:
  • RPROI_TRT
  • Normalize_TRT
  • PriorBox_TRT
  • GridAnchor_TRT
  • NMS_TRT
  • LReLU_TRT
  • Reorg_TRT
  • Region_TRT
  • Clip_TRT
Note: To use TensorRT registered plugins in your application, the libnvinfer_plugin.so library must be loaded and all plugins must be registered. This can be done by calling initLibNvInferPlugins(void* logger, const char* libNamespace)() in your application code.
Note: If you have your own plugin library, you can include a similar entry point to register all plugins in the registry under a unique namespace. This ensures there are no plugin name collisions during build time across different plugin libraries.
For more information about these plugins, see the NvInferPlugin.h file for reference.

Using the Plugin Creator, the IPluginCreator::createPlugin() function can be called which returns a plugin object of type IPluginV2. This object can be added to the TensorRT network using addPluginV2() which creates and adds a layer to a network, and then binds the layer to the given plugin. The method also returns a pointer to the layer (of type IPluginV2Layer), which can be used to access the layer or the plugin itself (via getPlugin()).

For example, to add a plugin layer to your network with plugin name set to pluginName and version set to pluginVersion, you can issue the following:
//Use the extern function getPluginRegistry to access the global TensorRT Plugin Registry
auto creator = getPluginRegistry()->getPluginCreator(pluginName, pluginVersion);
const PluginFieldCollection* pluginFC = creator->getFieldNames();
//populate the field parameters (say layerFields) for the plugin layer 
PluginFieldCollection *pluginData = parseAndFillFields(pluginFC, layerFields); 
//create the plugin object using the layerName and the plugin meta data
IPluginV2 *pluginObj = creator->createPlugin(layerName, pluginData);
//add the plugin to the TensorRT network using the network API
auto layer = network.addPluginV2(&inputs[0], int(inputs.size()), pluginObj);
… (build rest of the network and serialize engine)
pluginObj->destroy() // Destroy the plugin object
… (destroy network, engine, builder)
… (free allocated pluginData)
Note:pluginData should allocate the PluginField entries on the heap before passing to createPlugin.
Note: The createPlugin method above will create a new plugin object on the heap and returns a pointer to it. Ensure you destroy the pluginObj, as shown above, to avoid a memory leak.

During serialization, the TensorRT engine will internally store the plugin type, plugin version and namespace (if it exists) for all IPluginV2 type plugins. During deserialization, this information is looked up by the TensorRT engine to find the Plugin Creator from the Plugin Registry. This enables the TensorRT engine to internally call the IPluginCreator::deserializePlugin() method. The plugin object created during deserialization will be destroyed internally by the TensorRT engine by calling IPluginV2::destroy() method.

In previous versions of TensorRT, you had to implement the nvinfer1::IPluginFactory class to call the createPlugin method during deserialization. This is no longer necessary for plugins registered with TensorRT and added using addPluginV2.

4.1.1. Example 1: Adding A Custom Layer Using C++ For Caffe

To add a custom layer in C++, derive it from one of the base classes described in Adding Custom Layers Using The C++ API. Since Caffe does not need dynamic shapes, this example uses IPluginV2IOExt. For Caffe based networks, if using the TensorRT Caffe Parser, you will also derive classes from nvcaffeparser1::IPluginFactoryExt (for plugins of type IPluginExt) and nvinfer1::IPluginFactory. For more information, see Using Custom Layers When Importing A Model From A Framework.

The following sample code adds a new plugin called FooPlugin:
class FooPlugin : public IPluginV2IOExt
{
	...override all pure virtual methods of IPluginV2IOExt with definitions for your plugin.  Do not override the TRT_DEPRECATED methods.
};

class MyPluginFactory : public nvinfer1::IPluginFactory, public nvcaffeparser1::IPluginFactoryExt
{
	...implement all factory methods for your plugin
};
If you are using plugins registered with the TensorRT plugin registry of type IPluginV2, then you do not need to implement the nvinfer1::IPluginFactory class. However, you do need to implement the nvcaffeparser1::IPluginFactoryV2 and IPluginCreator classes instead and register them.
class FooPlugin : public IPluginV2IOExt
{
	...implement all class methods for your plugin
};

class FooPluginFactory : public nvcaffeparser1::IPluginFactoryV2
{
	virtual nvinfer1::IPluginV2* createPlugin(...)
	{
		...create and return plugin object of type FooPlugin
	}
	bool isPlugin(const char* name) 
	{
		...check if layer name corresponds to plugin
	}
}

class FooPluginCreator : public IPluginCreator
{
	...implement all creator methods here
};
REGISTER_TENSORRT_PLUGIN(FooPluginCreator);

The following samples illustrate how to add a custom plugin layer using C++ for Caffe networks:

4.1.2. Example 2: Adding A Custom Layer That Is Not Supported In UFF Using C++

In order to run TensorFlow networks with TensorRT, you must first convert it to the UFF format.

The following steps add a custom plugin layer in C++ for TensorFlow networks:

  1. Implement the IPluginV2 and IPluginCreator classes as shown in Example 1: Adding A Custom Layer Using C++ For Caffe.
  2. Map the TensorFlow operation to the plugin operation. You can use GraphSurgeon for this. For example, refer to the following code snippet to map the TensorFlowRelu6 operation to a plugin:
    import graphsurgeon as gs
    my_relu6 = gs.create_plugin_node(name=”MyRelu6”, op=”Clip_TRT”, clipMin=0.0, clipMax=6.0)
    Namespace_plugin_map = { “tf_relu6” : my_relu6 }
    def preprocess(dynamic_graph):
      dynamic_graph.collapse_namespaces(namespace_plugin_map)
    

    In the above code, tf_relu6 is the name of the Relu6 node in the TensorFlow graph. It maps the tf_relu6 node to a custom plugin node with operation “Clip_TRT” which is the name of the plugin to be used. Save the code above to a file called config.py. If the plugin layer expects parameters, they should be passed in as arguments to gs.create_plugin_node. In this case, clipMin and clipMax are the parameters expected by the clip plugin.

  3. Call the UFF converter with the preprocess -p flag set:
    convert-to-uff frozen_inference_graph.pb -p config.py -t

    This will generate a UFF file with the TensorFlow operations replaced by TensorRT plugin nodes.

  4. Run the pre-processed and converted UFF file with TensorRT using the UFF parser. For details, see Using Custom Layers When Importing A Model From A Framework.

    Object Detection With A TensorFlow SSD Network (sampleUffSSD) located in the GitHub repository, illustrates how to add a custom layer that is not supported in UFF using C++. See config.py in the sample folder for a demonstration of how to pre-process the graph.

4.1.3. Example 3: Adding A Custom Layer With Dynamic Shape Support Using C++

To support dynamic shapes, your plugin must derived from IPluginV2DynamicExt. The factory/creator and registry parts are similar to Example1, therefore the steps will not be repeated here.

BarPlugin is a plugin with two inputs and two outputs where:
  • The first output is a copy of the second input
  • The second output is the concatenation of both inputs, along the first dimension and all types/formats must be the same and be linear formats
BarPlugin needs to be derived as follows:
class BarPlugin : public IPluginV2DynamicExt
{
	...override virtual methods inherited from IPluginV2DynamicExt.
};

The inherited methods are all pure virtual methods, so the compiler will remind you if you forget one.

The four methods that are affected by dynamic shapes are:
  • getOutputDimensions
  • supportsFormatCombination
  • configurePlugin
  • enqueue
The override for getOutputDimensions returns symbolic expressions for the output dimensions in terms of the input dimensions. Build the expressions from the expressions for the inputs, using the IExprBuilder passed in to getOutputDimensions. In the example, the dimensions of the second output are the same as the dimensions of the first input, so no new expression has to be built for case 1.
DimsExprs BarPlugin::getOutputDimensions(int outputIndex, 
    const DimsExprs* inputs, int nbInputs, 
    IExprBuilder& exprBuilder)
{
    switch (outputIndex)
    {
    case 0: 
    {
        // First dimension of output is sum of input 
        // first dimensions.
        DimsExprs output(inputs[0]);
        output.d[0] = 
            exprBuilder.operation(DimensionOperation::kSUM, 
                inputs[0].d[0], inputs[1].d[0]);
	   return output;
    }
    case 1:
        return inputs[0];
    default:
         throw std::invalid_argument(“invalid output”);
}

The override for supportsFormatCombination must indicate whether a format combination is allowed. The interface indexes the inputs/outputs uniformly as “connections”, starting at 0 for the first input, then the rest of the inputs in order, followed by numbering the outputs. In the example, the inputs are connections 0 and 1, and the outputs are connections 2 and 3.

TensorRT uses supportsFormatCombination to ask whether a given combination of formats/types are okay for a connection, given formats/types for lesser indexed connections. So the override can assume that lesser indexed connections have already been vetted and focus on the connection with index pos.
bool BarPlugin::supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) override
{
    assert(0 <= pos && pos < 4);
    const auto* in = inOut;
    const auto* out = inOut + nbInputs;
    switch (pos)
    {
    case 0: in[0].format == TensorFormat::kLINEAR;
    case 1: return in[1].type == in[0].type &&
                   in[0].format == TensorFormat::kLINEAR;
    case 2: return out[0].type == in[0].type &&
                   out[0].format == TensorFormat::kLINEAR;
    case 3: return out[1].type == in[0].type &&
                   out[1].format == TensorFormat::kLINEAR;
    }
    throw std::invalid_argument(“invalid connection number”);
}
The local variables in and out here allow inspecting inOut by input or output number instead of connection number.
Important: The override may inspect the format/type for a connection with an index less than pos, but must never inspect the format/type for a connection with an index greater than pos. The example uses case 3 to check connection 3 against connection 0, and not use case 0 to check connection 0 against connection 3.
TensorRT uses configurePlugin to setup a plugin at runtime. Our plugin doesn’t need configurePlugin to do anything , so it’s a no-op:
void BarPlugin::configurePlugin(
    const DynamicPluginTensorDesc* in, int nbInputs, 
    const DynamicPluginTensorDesc* out, int nbOutputs) override
{
}

If the plugin needed to know the minimum or maximum dimensions it might encounter, it can inspect field DynamicPluginTensorDesc::min or DynamicPluginTensorDesc::max for any input or output. Format and build-time dimension information can be found in DynamicPluginTensorDesc::desc. Any runtime dimensions will appear as -1. The actual dimension is supplied to BarPlugin::enqueue.

Finally, the override BarPlugin::enqueue has to do the work. Since shapes are dynamic, enqueue is handed a PluginTensorDesc that describes the actual dimensions, type, and format of each input and output.

4.1.4. Example 4: Add A Custom Layer With INT8 I/O Support Using C++

To support INT8 I/O, your plugin can be derived from either IPluginV2IOExt or IPluginV2DynamicExt. However, if your application disallows extended runtime, IPluginV2IOExt must be derived. The general steps are similar to Example 1: Adding A Custom Layer Using C++ For Caffe and Example 3: Adding A Custom Layer With Dynamic Shape Support Using C++, therefore the repeated parts (factory/creator and registry) will not be presented here.

UffPoolPluginV2 is a plugin to demonstrate how to extend INT8 I/O for the custom pooling layer. The derivation is as follows:
class UffPoolPluginV2 : public IPluginV2IOExt
{
    ...override virtual methods inherited from IPluginV2IOExt.
};
Most of the pure virtual methods are common to plugins. The main methods that affect INT8 I/O are:
  • supportsFormatCombination
  • configurePlugin
  • enqueue
The override for supportsFormatCombination must indicate which INT8 I/O combination is allowed. The usage of this interface is similar to Example 3: Adding A Custom Layer With Dynamic Shape Support Using C++. In this example, the supported I/O tensor format is linear CHW while INT32 is excluded, but the I/O tensor must have the same data type.
bool UffPoolPluginV2::supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) const override
{
    assert(nbInputs == 1 && nbOutputs == 1 && pos < nbInputs + nbOutputs);
    bool condition = inOut[pos].format == TensorFormat::kLINEAR;
    condition &= inOut[pos].type != DataType::kINT32;
    condition &= inOut[pos].type == inOut[0].type;
    return condition;
}
Important:
  • If INT8 auto-calibration must be used with a network with INT8 I/O plugins, FP32 I/O variant should be supported by the plugin as it is used by FP32 calibration graph.
  • If FP32 I/O variant is not supported, or INT8 auto-calibration is not used, all required INT8 I/O tensors scales should be set explicitly.
  • Auto-calibration won't generate dynamic range for plugin internal tensors. INT8 I/O plugins should calculate their own per-tensor dynamic range for internal tensors for purpose of quantization or dequantization.
TensorRT invokes configurePlugin method to pass the information to the plugin through PluginTensorDesc, which are stored as member variables, serialized and deserialized.
void UffPoolPluginV2::configurePlugin(const PluginTensorDesc* in, int nbInput, const PluginTensorDesc* out, int nbOutput)
{
    ...
    mPoolingParams.mC = mInputDims.d[0];
    mPoolingParams.mH = mInputDims.d[1];
    mPoolingParams.mW = mInputDims.d[2];
    mPoolingParams.mP = mOutputDims.d[1];
    mPoolingParams.mQ = mOutputDims.d[2];
    mInHostScale = in[0].scale >= 0.0f ? in[0].scale : -1.0f;
    mOutHostScale = out[0].scale >= 0.0f ? out[0].scale : -1.0f;
}

Where INT8 I/O scales per tensor can be obtained from PluginTensorDesc::scale.

Finally, the override UffPoolPluginV2::enqueue has to do the work. It includes a collection of core algorithms to execute the custom layer at runtime by using the actual batch size, inputs, outputs, cuDNN stream and the information configured.
int UffPoolPluginV2::enqueue(int batchSize, const void* const* inputs, void** outputs, void* workspace, cudaStream_t stream)
{
    ...
    CHECK(cudnnPoolingForward(mCudnn, mPoolingDesc, &kONE, mSrcDescriptor, input, &kZERO, mDstDescriptor, output));
    ...
    return 0;
}

4.2. Adding Custom Layers Using The Python API

Although the C++ API is the preferred language to implement custom layers; due to easily accessing libraries like CUDA and cuDNN, you can also work with custom layers in Python applications.

You can use the C++ API to create a custom layer, package the layer using pybind11 in Python, then load the plugin into a Python application. For more information, see Creating A Network Definition In Python.

The same custom layer implementation can be used for both C++ and Python. For more information, see the Adding A Custom Layer To Your Caffe Network In TensorRT In Python (fc_plugin_caffe_mnist) sample.

4.2.1. Example 1: Adding A Custom Layer to a TensorRT Network Using Python

Custom layers can be added to any TensorRT network in Python using plugin nodes. The Python API has a function called add_plugin_v2 which enables you to add a plugin node to a network. The following example illustrates this. It creates a simple TensorRT network and adds a Leaky ReLU plugin node by looking up TensorRT Plugin Registry.
import tensorrt as trt
import numpy as np

TRT_LOGGER = trt.Logger()

trt.init_libnvinfer_plugins(TRT_LOGGER, '')
PLUGIN_CREATORS = trt.get_plugin_registry().plugin_creator_list

def get_trt_plugin(plugin_name):
        plugin = None
        for plugin_creator in PLUGIN_CREATORS:
            if plugin_creator.name == plugin_name:
                lrelu_slope_field = trt.PluginField("neg_slope", np.array([0.1], dtype=np.float32), trt.PluginFieldType.FLOAT32)
                field_collection = trt.PluginFieldCollection([lrelu_slope_field])
                plugin = plugin_creator.create_plugin(name=plugin_name, field_collection=field_collection)
        return plugin

def main():
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
        builder.max_workspace_size = 2**20
        input_layer = network.add_input(name="input_layer", dtype=trt.float32, shape=(1, 1))
        lrelu = network.add_plugin_v2(inputs=[input_layer], plugin=get_trt_plugin("LReLU_TRT"))
        lrelu.get_output(0).name = "outputs"
        network.mark_output(lrelu.get_output(0))

4.2.2. Example 2: Adding A Custom Layer That Is Not Supported In UFF Using Python

TensorFlow networks can be converted to UFF format and run with TensorRT using the Python interface. In order to do this, we make use of the GraphSurgeon API. If you are writing your own plugin, you need to implement it in C++ by implementing the IPluginExt and IPluginCreator classes as shown in Example 1: Adding A Custom Layer Using C++ For Caffe.

The following steps illustrate how you can use the UFF Parser to run custom layers using plugin nodes registered with the TensorRT Plugin Registry.

  1. Register the TensorRT plugins by calling trt.init_libnvinfer_plugins(TRT_LOGGER, '') (or load the .so file where you have registered your own plugin).
  2. Prepare the network and check the TensorFlow output:
    tf_sess = tf.InteractiveSession()
    tf_input = tf.placeholder(tf.float32, name="placeholder")
    tf_lrelu = tf.nn.leaky_relu(tf_input, alpha=lrelu_alpha, name="tf_lrelu")
    tf_result = tf_sess.run(tf_lrelu, feed_dict={tf_input: lrelu_args})
    tf_sess.close()
  3. Prepare the namespace mappings. The op name LReLU_TRT corresponds to the Leaky ReLU plugin shipped with TensorRT.
    trt_lrelu = gs.create_plugin_node(name="trt_lrelu", op="LReLU_TRT", negSlope=lrelu_alpha)
    namespace_plugin_map = {
                "tf_lrelu": trt_lrelu
     }
    
  4. Transform the TensorFlow graph using GraphSurgeon and save to UFF:
    dynamic_graph = gs.DynamicGraph(tf_lrelu.graph)
    dynamic_graph.collapse_namespaces(namespace_plugin_map)
  5. Run the UFF parser and compare results with TensorFlow:
    uff_model = uff.from_tensorflow(dynamic_graph.as_graph_def(), ["trt_lrelu"], output_filename=model_path, text=True)
    parser = trt.UffParser()
    parser.register_input("placeholder", [lrelu_args.size])
    parser.register_output("trt_lrelu")
    parser.parse(model_path, trt_network)
    

    For more information, see the Adding A Custom Layer To Your TensorFlow Network In TensorRT In Python (uff_custom_plugin) sample.

4.3. Using Custom Layers When Importing A Model From A Framework

TensorRT parsers use the layer operation field to identify if a particular layer in the network is a TensorFlow supported operation.

TensorFlow

Compared to previous releases of TensorRT, there are several changes with how custom layers in TensorFlow can be run with the TensorRT UFF parser. For TensorFlow models, use the UFF converter to convert your graph to a UFF file. In this process, if the network contains plugin layers it is also necessary to map the operation field of those layers to the corresponding registered plugin names in TensorRT. These plugins can either be plugins shipped with TensorRT or custom plugins that you have written. The plugin field names in the network should also match the fields expected by the plugin. This can be done using GraphSurgeon, as explained in Preprocessing A TensorFlow Graph Using the Graph Surgeon API, and as demonstrated in Object Detection With A TensorFlow SSD Network (sampleUffSSD), located in the GitHub repository, by using a config file with the UFF converter.

The UFF Parser will look up the Plugin Registry for every unsupported operation. If it finds a match with any of the registered plugin names, the parser will parse the plugin field parameters from the input network and create a plugin object using them. This object is then added to the network. In previous versions of TensorRT, you had to implement the nvuffparser::IPluginFactoryExt and manually pass the plugin parameters to the createPlugin(...) function. Although this flow can still be exercised, it is no longer necessary with the new additions to the Plugin API. For more information, see:

Caffe

For Caffe models, use the nvcaffeparser1::IPluginFactoryV2 class. The setPluginFactoryV2 method of the parser sets the factory in the parser to enable custom layers. While parsing a model description, for each layer, the parser invokes isPluginV2 to check with the factory if the layer name corresponds to a custom layer; if it does, the parser instantiates the plugin invoking createPlugin with the name of the layer (so that the factory can instantiate the corresponding plugin), a Weights array, and the number of weights as arguments. There is no restriction on the number of plugins that a single factory can support if they are associated with different layer names.

Note: For the Caffe parser, if setPluginFactoryV2 and IPluginFactoryV2 are used, the plugin object created during deserialization will be internally destroyed by the engine by calling IPluginExt::destroy(). You are only responsible for destroying the plugin object created during network creation step as shown in Adding Custom Layers Using The C++ API.

The Adding A Custom Layer To Your Network In TensorRT (samplePlugin) sample, located in the GitHub repository, illustrates how to extend nvcaffeparser1::IPluginFactoryExt to use custom layers, while Object Detection With A TensorFlow SSD Network (sampleUffSSD) uses the UFF Parser to use custom layers.

For the Python usage of custom layers with TensorRT, refer to the Adding A Custom Layer To Your Caffe Network In TensorRT In Python (fc_plugin_caffe_mnist) sample for Caffe networks, and Adding A Custom Layer To Your TensorFlow Network In TensorRT In Python (uff_custom_plugin) and Object Detection With SSD In Python (uff_ssd) samples for UFF networks.

4.3.1. Example 1: Adding A Custom Layer To A TensorFlow Model

In order to run a TensorFlow network with TensorRT, you must first convert it to the UFF format. During the conversion process, custom layers can be marked as plugin nodes using the graphsurgeon utility.

The UFF converter then converts the processed graph to the UFF format which is then run by the UFF Parser. The plugin nodes are then added to the TensorRT network by the UFF Parser.

For details using the C++ API, see Example 2: Adding A Custom Layer That Is Not Supported In UFF Using C++.

For details using the Python API, see Example 2: Adding A Custom Layer That Is Not Supported In UFF Using Python. Additionally, the Object Detection With SSD In Python (uff_ssd) sample demonstrates an end-to-end workflow in Python for running TensorFlow object detection networks using TensorRT.

4.4. Plugin API Description

All new plugins should derive classes from both IPluginCreator and one of the plugin base classes described in Adding Custom Layers Using The C++ API. In addition, new plugins should also call the REGISTER_TENSORRT_PLUGIN(...) macro to register the plugin with the TensorRT Plugin Registry or create an init function equivalent to initLibNvInferPlugins().

4.4.1. Migrating Plugins From TensorRT 5.x.x To TensorRT 6.x.x

The IPluginV2 interface is still supported, however, we recommend that you write new plugins with the IPluginV2Ext interface and migrate any existing plugin implementations to the IPluginV2Ext interface.

In order to use the most recent Plugin layer features, your custom plugin should implement the IPluginV2Ext interface. The new features are as follows:
virtual nvinfer1::DataType getOutputDataType(int index, const nvinfer1::DataType* inputTypes, int nbInputs) const = 0;


virtual bool isOutputBroadcastAcrossBatch(int outputIndex, const bool* inputIsBroadcasted, int nbInputs) const = 0;

virtual bool canBroadcastInputAcrossBatch(int inputIndex) const = 0;


virtual void configurePlugin(const Dims* inputDims, int nbInputs, const Dims* outputDims,
            int nbOutputs, const DataType* inputTypes, const DataType* outputTypes, const bool* inputIsBroadcast, const bool* outputIsBroadcast, PluginFormat floatFormat, int maxBatchSize) = 0;
For the simplest migration, follow these guidelines:
  • getOutputDataType can return the type of the input (from inputTypes) or DataType::kFLOAT if the layer has no inputs.
  • isOutputBroadcastAcrossBatch can return false if the plugin does not support output broadcast.
  • canBroadcastInputAcrossBatch can return false if the plugin cannot handle broadcasted inputs.
  • configurePlugin can mimic the behavior of configureWithFormat.

See the API description in IPluginV2 API Description for details about the API.

IPluginV2 API Description

The following section describes the functions of the IPluginV2 class.

To connect a plugin layer to neighboring layers and setup input and output data structures, the builder checks for the number of outputs and their dimensions by calling the following plugins methods:
getNbOutputs
Used to specify the number of output tensors.
getOutputDimensions
Used to specify the dimensions of an output as a function of the input dimensions.
supportsFormat
Used to check if a plugin supports a given data format.
getOutputDataType
Used to get the data type of the output at a given index. The returned data type must have a format that is supported by the plugin.
Plugin layers can support four data formats and layouts, for example:
  • NCHW single (FP32), half precision (FP16) and integer (INT32) tensors
  • NC/2HW2 and NHWC8 half precision (FP16) tensors
The formats are enumerated by PluginFormatType.

Plugins that do not compute all data in place and need memory space in addition to input and output tensors can specify the additional memory requirements with the getWorkspaceSize method, which is called by the builder to determine and pre-allocate scratch space.

During both build and inference time, the plugin layer is configured and executed, possibly multiple times. At build time, to discover optimal configurations, the layer is configured, initialized, executed, and terminated. Once the optimal format is selected for a plugin, the plugin is once again configured, and then it will be initialized once and executed as many times as needed for the lifetime of the inference application, and finally terminated when the engine is destroyed. These steps are controlled by the builder and the engine using the following plugin methods:
configurePlugin
Communicates the number of inputs and outputs, dimensions and datatypes of all inputs and outputs, broadcast information for all inputs and outputs, the chosen plugin format, and maximum batch size. At this point, the plugin sets up its internal state, and select the most appropriate algorithm and data structures for the given configuration.
initialize
The configuration is known at this time and the inference engine is being created, so the plugin can set up its internal data structures and prepare for execution.
enqueue
Encapsulates the actual algorithm and kernel calls of the plugin, and provides the runtime batch size, pointers to input, output, and scratch space, and the CUDA stream to be used for kernel execution.
terminate
The engine context is destroyed and all the resources held by the plugin should be released.
clone
This is called every time a new builder, network or engine is created which includes this plugin layer. It should return a new plugin object with the correct parameters.
destroy
Used to destroy the plugin object and/or other memory allocated each time a new plugin object is created. It is called whenever the builder or network or engine is destroyed.
set/getPluginNamespace
This method is used to set the library namespace that this plugin object belongs to (default can be ""). All plugin objects from the same plugin library should have the same namespace.
IPluginV2Ext supports plugins that can handle broadcast inputs and outputs. The following methods need to be implemented for this feature:
canBroadcastInputAcrossBatch
This method is called for each input whose tensor is semantically broadcast across a batch. If canBroadcastInputAcrossBatch returns true (meaning the plugin can support broadcast), TensorRT will not replicate the input tensor. There will be a single copy that the plugin should share across the batch. If it returns false, TensorRT will replicate the input tensor so that it appears like a non-broadcasted tensor.
isOutputBroadcastAcrossBatch
This is called for each output index. The plugin should return true the output at the given index is broadcast across batch.

4.4.3. IPluginCreator API Description

The following methods in the IPluginCreator class are used to find and create the appropriate plugin from the Plugin Registry:
getPluginName
This returns the plugin name and should match the return value of IPluginExt::getPluginType.
getPluginVersion
Returns the plugin version. For all internal TensorRT plugins, this defaults to 1.
getFieldNames
In order to successfully create a plugin, it is necessary to know all the field parameters of the plugin. This method returns the PluginFieldCollection struct with the PluginField entries populated to reflect the field name and PluginFieldType (the data should point to nullptr).
createPlugin
This method is used to create the plugin using the PluginFieldCollection argument. The data field of the PluginField entries should be populated to point to the actual data for each plugin field entry.
deserializePlugin
This method is called internally by the TensorRT engine based on the plugin name and version. It should return the plugin object to be used for inference.
set/getPluginNamespace
This method is used to set the namespace that this creator instance belongs to (default can be "").

4.4.4. Persistent LSTM Plugin

The following section describes the new Persistent LSTM plugin. The Persistent LSTM plugin supports half-precision persistent LSTM. To create a Persistent LSTM plugin in the network, you need to call:
auto creator = getPluginRegistry()->getPluginCreator("CgPersistentLSTMPlugin_TRT", "1") 

IPluginV2* cgPersistentLSTMPlugin = creator->createPlugin("CgPersistentLSTMPlugin_TRT", &fc);
fc is a PluginField array that consists of 4 parameters:
  • hiddenSize: This is a INT32 parameter that specifies the hidden size of LSTM.
  • numLayers: This is a INT32 parameter that specifies the number of layers in LSTM.
  • bidirectionFactor: This is a INT32 parameter that indicates whether LSTM is bidiretional. If LSTM is bidiretional, the value should be set to 2, otherwise, the value is set to 1.
  • setInitialStates: This is a INT32 parameter that indicates whether LSTM has initial state and cell values as inputs. If it is set to 0, the initial state and cell values will be zero. It is recommended to use this flag instead of providing zero state and cell values as inputs for better performance.
The plugin can be added to the network by calling:
auto lstmLayer = network->addPluginV2(&inputs[0], 6, *cgPersistentLSTMPlugin);
inputs is a vector of ITensor pointers with 6 elements in the following order:
  1. input: These are the input sequences to the LSTM.
  2. seqLenTensor: This is the sequence length vector that stores the effective length of each sequence.
  3. weight:This tensor consists of all weights needed for LSTM. Even though the this tensor is 1D, it can be viewed with the following 3D indexing [isW, layerNb, gateType]. isW starts from false to true suggesting that the first half of weight is recurrent weight and the second half is input weight. layerNb starts from 0 to numLayers*bidirectionFactor such that the first layer is the forward direction of the actual layer and second layer is the backward direction. The gateType follows this order: input, cell, forget and output.
  4. bias:Similar to weight, this tensor consists of all biases needed for LSTM. Even though the this tensor is 1D, it can be viewed with the following 3D indexing [layerNb, isW, gateType]. Notice the slight difference between bias and weight.
  5. initial hidden state: The pointer should be set to null if setInitialStates is 0. Otherwise, the tensor should consist of the initial hidden state values with the following coordinates [batch index, layerNb, hidden index]. batch index indicates the index within a batch and hidden index is the index to vectors of hiddenSize length.
  6. initial cell state: The pointer should be set to null if setInitialStates is 0. Otherwise, the tensor should consist of the initial hidden state values with the following coordinates [batch index, layerNb, hidden index].

4.5. Best Practices For Custom Layers Plugin

Converting User-Defined Layers

To create a custom layer implementation as a TensorRT plugin, you need to implement the IPluginV2Ext class and the IPluginCreator class for your plugin.

For more information about both API classes, see Plugin API Description.

For Caffe networks, see Example 1: Adding A Custom Layer Using C++ For Caffe. For TensorFlow (UFF) networks, see Example 2: Adding A Custom Layer That Is Not Supported In UFF Using C++.

Using The UFF Plugin API

For an example of how to use plugins with UFF in both C++ and Python, see Example 1: Adding A Custom Layer Using C++ For Caffe and Example 2: Adding A Custom Layer That Is Not Supported In UFF Using Python.

Debugging Custom Layer Issues

Memory allocated in the plugin must be freed to ensure no memory leak. If resources are acquired in the initialize() function, they need to be released in the terminate() function. All other memory allocations should be freed preferably in the plugin class destructor or in the destroy() method. Adding Custom Layers Using The C++ API outlines this in detail and also provides some notes for best practices when using plugins.

5. Working With Mixed Precision

Mixed precision is the combined use of different numerical precisions in a computational method. TensorRT can store weights and activations, and execute layers, in 32-bit floating point, 16-bit floating point, or quantized 8-bit integer.

Using precision lower than FP32 reduces memory usage, allowing deployment of larger networks. Data transfers take less time, and compute performance increases, especially on GPUs with Tensor Core support for that precision.

By default, TensorRT uses FP32 inference, but it also supports FP16 and INT8. While running FP16 inference, it automatically converts FP32 weights to FP16 weights.

You can check the supported precision on a platform using the following APIs:
if (builder->platformHasFastFp16()) { … }; 
if (builder->platformHasFastInt8()) { … };

Specifying the precision for a network defines the minimum acceptable precision for the application. Higher precision kernels may be chosen if they are faster for some particular set of kernel parameters, or if no lower-precision kernel exists. You can set the builder config flag BuilderFlag::kSTRICT_TYPES to force the network or layer precision, which may not have optimal performance. Usage of this flag is only recommended for debugging purposes.

You can also choose to set both INT8 and FP16 mode if the platform supports it. TensorRT will choose the most performance optimal kernel to perform inference.

5.1. Mixed Precision Using The C++ API

5.1.1. Setting The Layer Precision Using C++

If you want to run certain layers a specific precision, you can set the precision per layer using the following API:
layer->setPrecision(nvinfer1::DataType::kINT8)
This gives the layer’s inputs and outputs a preferred type (for example, DataType::kINT8). You can choose a different preferred type for an output of a layer using:
layer->setOutputType(out_tensor_index, nvinfer1::DataType::kFLOAT)
Note: This method cannot be used to set the data type of the second output tensor of the TopK layer. The data type of the second output tensor of the TopK layer is always INT32. For more information, see TopK Layer.

TensorRT has very few implementations that run in heterogeneous precision: in TensorRT 5.x.x the only ones are INT8 implementations for Convolution, Deconvolution, and FullyConnected layers that produce FP32 output.

Setting the precision, requests TensorRT to use a layer implementation whose inputs and outputs matches the preferred types, inserting reformat operations if necessary. By default, TensorRT will choose such an implementation only if it results in a higher-performance network. If an implementation at a higher precision is faster, TensorRT will use it, and issue a warning. Thus, you can detect whether using lower precision would result in unexpected performance loss.

You can override this behavior by making the type constraints strict.
IBuilderConfig * config = builder->createBuilderConfig();config->setFlag(BuilderFlag::kSTRICT_TYPES)

If the constraints are strict, TensorRT will obey them unless there is no implementation with the preferred precision constraints, in which case it will issue a warning and use the fastest available implementation.

If the precision is not explicitly set, TensorRT will select the computational precision based on performance considerations and the flags specified to the builder.
Note: When running INT8 precision on a GPU, the dimension of the layer tensor should be greater than or equal to 3. A layer tensor dimension less than 3 is not supported in TensorRT 6.0.1.

See Performing Inference In INT8 Precision (sampleINT8API) located in the GitHub repository for an example of running mixed precision inference with these APIs.

5.1.2. Enabling FP16 Inference Using C++

Setting the builder’s Fp16Mode flag indicates that 16-bit precision is acceptable.
config->setFlag(BuilderFlag::kFP16);
This flag allows, but does not guarantee, that 16-bit kernels will be used when building the engine. You can choose to force 16-bit precision by setting the following builder flag:
config->setFlag(BuilderFlag::kSTRICT_TYPES)

Weights can be specified in FP16 or FP32, and they will be converted automatically to the appropriate precision for the computation.

See Building And Running GoogleNet In TensorRT (sampleGoogleNet) and "Hello World" For TensorRT (sampleMNIST) located in the GitHub repository for running FP16 inference.

5.1.3. Enabling INT8 Inference Using C++

Setting the builder flag enables INT8 precision inference.
config->setFlag(BuilderFlag::kINT8);

In order to perform INT8 inference, FP32 activation tensors and weights need to be quantized. In order to represent 32-bit floating point values and INT 8-bit quantized values, TensorRT needs to understand the dynamic range of each activation tensor. The dynamic range is used to determine the appropriate quantization scale.

TensorRT supports symmetric quantization with quantization scale calculated using absolute maximum dynamic range values.

TensorRT needs the dynamic range for each tensor in the network. There are two ways in which the dynamic range can be provided to the network:
  • manually set the dynamic range for each network tensor using setDynamicRange API
Or
  • use INT8 calibration to generate per tensor dynamic range using the calibration dataset.

The dynamic range API can also be used along with INT8 calibration, such that manually setting the range will take precedence over the calibration generated dynamic range. Such a scenario is possible if INT8 calibration does not generate a satisfactory dynamic range for certain tensors.

For more information, see Performing Inference In INT8 Precision (sampleINT8API) located in the GitHub repository.

5.1.3.1. Setting Per-Tensor Dynamic Range Using C++

You can generate per tensor the dynamic range using various techniques. The basic technique includes recording per tensor the min and max values during the last epoch of training, or using quantization aware training. TensorRT expects you to set the dynamic range for each network tensor to perform INT8 inference. After you have the dynamic range information, you can set the dynamic range as follows:
ITensor* tensor = network->getLayer(layer_index)->getOutput(output_index);
tensor->setDynamicRange(min_float, max_float);
You also need to set the dynamic range for the network input:
ITensor* input_tensor = network->getInput(input_index);
input_tensor->setDynamicRange(min_float, max_float);

One way to achieve this is to iterate through the network layers and tensors and set per tensor the dynamic range. TensorRT only supports symmetric range currently, therefore, only abs(min_float) and abs(max_float) is used for quantization. For more information, see Performing Inference In INT8 Precision (sampleINT8API) located in the GitHub repository.

5.1.3.2. INT8 Calibration Using C++

INT8 calibration provides an alternative to generate per activation tensor the dynamic range. This methods can be categorized as post training technique to generate the appropriate quantization scale. The process of determining these scale factors is called calibration, and requires the application to pass batches of representative input for the network (typically batches from the training set.) Experiments indicate that about 500 images is sufficient for calibrating ImageNet classification networks.
To provide calibration data to TensorRT, implement the IInt8Calibrator interface. TensorRT provides multiple variants of IInt8Calibrator:
IEntropyCalibratorV2
This is the preferred calibrator and is required for DLA as it supports per activation tensor scaling.
IMinMaxCalibrator
This is the preferred calibrator for NLP tasks for all backends. It supports per activation tensor scaling.
IEntropyCalibrator
This is the legacy entropy calibrator which supports per channel scaling. This is less complicated than legacy calibrator and produces better results.
ILegacyCalibrator
This calibrator is for compatibility with 2.0EA. It is deprecated and should not be used.
The builder invokes the calibrator as follows:
  • First, it calls getBatchSize() to determine the size of the input batch to expect
  • Then, it repeatedly calls getBatch() to obtain batches of input. Batches should be exactly the batch size by getBatchSize(). When there are no more batches, getBatch() should return false.

Calibration can be slow, therefore, the IInt8Calibrator interface provides methods for caching intermediate data. Using these methods effectively requires a more detailed understanding of calibration.

When building an INT8 engine, the builder performs the following steps:
  1. Builds a 32-bit engine, runs it on the calibration set, and records a histogram for each tensor of the distribution of activation values.
  2. Builds a calibration table from the histograms.
  3. Builds the INT8 engine from the calibration table and the network definition.

The calibration table can be cached. Caching is useful when building the same network multiple times, for example, on multiple platforms. It captures data derived from the network and the calibration set. The parameters are recorded in the table. If the network or calibration set changes, it is the application’s responsibility to invalidate the cache.

The cache is used as follows:
  • if a calibration table is found, calibration is skipped, otherwise:
    • the calibration table is built from the histograms and parameters
  • then the INT8 network is built from the network definition and the calibration table.
Cached data is passed as a pointer and length.
After you have implemented the calibrator, you can configure the builder to use it:
builder->setInt8Calibrator(calibrator);

It is possible to cache the output of calibration using the writeCalibrationCache() and readCalibrationCache() methods. The builder checks the cache prior to performing calibration, and if data is found, calibration is skipped.

For more information about configuring INT8 Calibrator objects, see Performing Inference In INT8 Using Custom Calibration (sampleINT8) located in the GitHub repository.

5.1.4. Working With Explicit Precision Using C++

TensorRT 6.x.x supports explicit precision networks in which you can explicitly specify the precisions of all layers and tensors in the network. This feature enables the import of pre-quantized models with explicit quantizing and dequantizing scale layers into TensorRT.

To create an explicit precision network, the INetworkDefinition has to be created with createNetworkV2 as follows:
builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_PRECISION)

Setting the network to be an explicit precision network implies that the precision of all the network input tensors and layer output tensors in the network are specified. See Setting The Layer Precision Using C++ for more information on setting the precision.

TensorRT will not quantize the weights of any layer, including those running in lower precision, if the network is marked as an explicit precision network - weights of low precision layers will simply be rounded-to-the-nearest and cast to the required precision.

You must not set dynamic ranges of tensors in an explicit precision network. Dynamic ranges of all tensors are [-127,127].

Conversion of activation values between higher and lower precision is performed using scale layers. TensorRT identifies special quantizing and dequantizing scale layers for explicit precision networks. A quantizing scale layer has FP32 input, INT8 output, per channel or per tensor scales and no shift weights. A dequantizing scale layer has INT8 input, FP32 output, per tensor scales and no shift weights. No shift weights are allowed for quantizing and dequantizing scale layers as only symmetric quantization is supported. Such mixed precision scale layers are only enabled for explicit precision networks.

For best performance, the special quantizing scale layers can be inserted immediately following Convolution and FullyConnected layers. In these cases, the scale layer is fused with the preceding layer.

Mixed Precision Using The Python API

5.2.1. Setting The Layer Precision Using Python

In Python, you can specify the layer precision using the precision flag:
layer.precision = trt.int8
You can set the output tensor data type to conform with the layer implementation:
layer.set_output_type(out_tensor_index, trt.int8)
Ensure that the builder understands to force the precision:
builder.strict_type_constraints = true

For more information, see the INT8 Calibration In Python (int8_caffe_mnist) sample.

5.2.2. Enabling FP16 Inference Using Python

In Python, set the fp16_mode flag as follows:
builder.fp16_mode = True
Force 16-bit precision by setting the builder flag:
builder.strict_type_constraints = True

5.2.3. Enabling INT8 Inference Using Python

Enable INT8 mode by setting the builder flag:
builder.int8_mode = True

Similar to the C++ API, you can choose per activation tensor the dynamic range either using dynamic_range or using INT8 calibration.

INT8 calibration can be used along with the dynamic range APIs. Setting the dynamic range manually will override the dynamic range generated from INT8 calibration.

5.2.3.1. Setting Per-Tensor Dynamic Range Using Python

In order to perform INT8 inference, you must set the dynamic range for each network tensor. You can derive the dynamic range values using various methods including quantization aware training or simply recording per tensor the min and max values during the last training epoch. To set the dynamic range use:
layer = network[layer_index]
tensor = layer.get_output(output_index)
tensor.dynamic_range = (min_float, max_float)
You also need to set the dynamic range for the network input:
input_tensor = network.get_input(input_index)
input_tensor.dynamic_range = (min_float, max_float)

5.2.3.2. INT8 Calibration Using Python

INT8 calibration provides an alternative approach to generate per activation tensor the dynamic range. This method can be categorized as a post training technique to generate the appropriate quantization scale.

The following steps illustrate how to create an INT8 calibrator object using the Python API. By default, TensorRT supports INT8 calibration.

  1. Import TensorRT:
    import tensorrt as trt
  2. Similar to test/validation files, use set of input files as calibration files dataset. Make sure the calibration files are representative of the overall inference data files. For TensorRT to use the calibration files, we need to create batchstream object. Batchstream object will be used to configure the calibrator.
    NUM_IMAGES_PER_BATCH = 5
    batchstream = ImageBatchStream(NUM_IMAGES_PER_BATCH, calibration_files)
    
  3. Create an Int8_calibrator object with input nodes names and batch stream:
    Int8_calibrator = EntropyCalibrator(["input_node_name"], batchstream)
  4. Set INT8 mode and INT8 calibrator:
    trt_builder.int8_calibrator = Int8_calibrator
    

    The rest of the logic for engine creation and inference is similar to Importing From ONNX Using Python.

5.2.4. Working With Explicit Precision Using Python

To create an explicit precision network using the Python API, pass the EXPLICIT_PRECISION flag to the builder:
network_creation_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_PRECISION)
self.network = self.builder.create_network(network_creation_flag)

6. Working With Reformat-Free Network I/O Tensors

Requirements from Automotive Safety Integrity Level (ASIL) for safety flows requires that accessing GPU address spaces should be removed from the NvMedia DLA safety path. To achieve this objective, reformat-free network I/O tensors are introduced to let you specify I/O formats that is supported by NvMedia tensor before passing the data to TensorRT.

On the other hand, the potential overhead of tensor reformatting can cause performance issues because TensorRT less than 6.0.1 assumes that network I/O tensors are FP32. In the case of multiple TensorRT sub-networks embedded into a large network, (for example, TensorFlow), with a precision of INT8 or FP16, the unavoidable I/O reformatting from and to FP32 could waste considerable memory traffic time. The same issue may also happen on the user-defined plugins. Now you can explicitly specify network I/O tensors to INT8 or FP16 formats to eliminate those unnecessary reformatting.

6.1. Building An Engine With Reformat-Free Network I/O Tensors

You can use the following API to specify formats of network I/O tensors. For example:

C++ API:
network->getInput(i)->setAllowedFormats(formats);
network->getOutput(i)->setAllowedFormats(formats);
Where i is the index of network I/O tensors.
Python API:1
network.get_input(0).allowed_formats = formats
network.get_output(0).allowed_formats = formats
Where formats is the mask form of the dense enum TensorFormat which sets the memory layout of the tensor. For example, 1U << TensorFormat::kLINEAR or, in Python, 1 << int(tensorrt.TensorFormat.LINEAR).

BuilderFlag::kSTRICT_TYPES or tensorrt.BuilderFlag.STRICT_TYPES in Python can be explicitly set to generate an engine without reformatting rather than getting the fastest path. For example:

C++ API:
builderConfig->setFlag(BuilderFlag::kSTRICT_TYPES);
Python API:
builder_config.set_flag(tensorrt.BuilderFlag.STRICT_TYPES)

The flag can also be set via a bit mask manner as follows.

C++ API:
builderConfig->setFlags(flags);
Python API:
builder_config.flags = flags
Where flags is the bit mask combination of the dense enum BuilderFlags. For example, 1 << BuilderFlag::kFP16 | 1<< BuilderFlag::kSTRICT_TYPES or, in Python,
1 << int(tensorrt.BuilderFlag.FP16) | 1 <<
        int(tensorrt.BuilderFlag.STRICT_TYPES)
.
If TensorRT doesn’t find any implementation of the reformat-free path, the following warning message displays:
‘[W] [TRT] Warning: no implementation obeys reformatting-free rules ...’

As a result, the fastest path will be picked instead.

If the combination of data type and memory layout of I/O tensors is well chosen, the overall performance of reformat-free path should be very close to the fastest path for most of normal cases or else it is recommended to disable reformat-free path.

For more information, see Specifying I/O Formats Using The Reformat Free I/O APIs (sampleReformatFreeIO) located in the <version>/samples/opensource/sampleReformatFreeIO directory in the GitHub repository for an example of setting reformat-free network I/O tensors with these APIs using C++.

6.2. Supported Combination Of Data Type And Memory Layout of I/O Tensors

The supported settings of I/O tensors are listed in the following table.
Table 2. Supported combination of data types and memory layout.
Memory Layout \ Data Type kINT32 kFLOAT kHALF kINT8
kLINEAR Supported Supported Supported Supported
kCHW2 N/A N/A Supported N/A
kCHW4 N/A N/A Supported Supported
kHWC8 N/A N/A Supported N/A
kCHW16 N/A N/A Only for DLA N/A
kCHW32 N/A Supported N/A Supported (including DLA)

6.3. Calibration For A Network With INT8 I/O Tensors

INT8 auto calibration is supported with INT8 I/O tensors. In this case, you will need to provide FP32 data for calibration and INT8 I/O tensors for inference.

With an INT8 I/O network, TensorRT will expect calibration data to be in FP32 precision to generate calibration cache. Calibration cache data will then be internally used by the builder during inference with INT8 I/O tensors.

This limitation of INT8 I/O networks requiring FP32 calibration data will be relaxed in future releases. For now, you can create FP32 calibration data by simply casting INT8 I/O calibration data to FP32 precision. You should also ensure that FP32 casted calibration data should be in range [-128.0f, 127.0f] and can be converted to INT8 data without any precision loss.

Setting up calibrator for a network with INT8 I/O tensors remain exactly the same as network with FP32 I/O tensors.

7. Working With Dynamic Shapes

Dynamic shapes is the ability to defer specifying some or all tensor dimensions until runtime. Dynamic shapes can be used via both the C++ and Python interfaces. They require the extended runtime.

The following sections provide greater detail, however, here’s an overview of the steps for building an engine with dynamic shapes:
  1. The network definition must not have an implicit batch dimension.
    C++
    Create the INetworkDefinition by calling
    IBuilder::createNetworkV2(1U <<
            static_cast<int>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH))
    .
    Python
    Create the tensorrt.INetworkDefinition by calling
    create_network(1 <<
            int(tensorrt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    .
    These calls request that the network not have an implicit batch dimension.
  2. Specify each runtime dimension of an input tensor by using -1 as a placeholder for the dimension.
  3. Specify one or more optimization profiles at build time that specify the permitted range of dimensions for inputs with runtime dimensions, and the dimensions for which the auto-tuner should optimize. For more information, see Optimization Profiles.
  4. To use the engine:
    1. Create an execution context from the engine, same as without dynamic shapes.
    2. Specify one of the optimization profiles from step 3 that covers the input dimensions.
    3. Specify the input dimensions for the execution context. After setting input dimensions, you can get the output dimensions that TensorRT computes for the given input dimensions.
    4. Enqueue work.
To change the runtime dimensions, repeat steps 4b and 4c, which do not have to be repeated until the input dimensions change.

7.1. Specifying Runtime Dimensions

When building a network, use -1 to denote a runtime dimension for an input tensor. For example, to create a 3D input tensor named foo where the last two dimensions will be specified at runtime, and the first dimension is fixed at build time, call:
C++
networkDefinition.addInput("foo", DataType::kFLOAT, Dims3(3, -1, -1))
Python
networkDefinition.add_input("foo", trt.float32,(3, -1, -1))
At run time, you’ll need to set the input dimensions after choosing an optimization profile (see Optimization Profiles). Let the bindingIndex of input foo be 0, and the input have dimensions [3,150,250]. After setting an optimization profile for the example above, you would call:
C++
context.setBindingDimensions(0, Dims3(3, 150, 250))
Python
context.set_binding_shape(0, (3, 150, 250))
At runtime, asking the engine for binding dimensions returns the same dimensions used to build the network, meaning, you will get a -1 for each runtime dimension. For example:
C++
engine.getBindingDimensions(0) returns a Dims with dimensions {3, -1, -1}.
Python
engine.get_binding_shape(0) returns (3, -1, -1).
To get the actual dimensions, which are specific to each execution context, query the execution context:
C++
context.getBindingDimensions(0) returns a Dims with dimensions {3, 150, 250}.
Python
context.get_binding_shape(0) returns (3, 150, 250).
After setting input dimensions, you can also use similar calls to query dimensions of the network outputs.

7.1.1. Optimization Profiles

An optimization profile describes a range of dimensions for each network input, and the dimensions that the auto-tuner should use for optimization. When using runtime dimensions, you must create at least one optimizations profile at build time. Two profiles can specify disjoint or overlapping ranges.

For example, one profile might specify a minimum size of [3,100,200], a maximum size of [3,200,300], and optimization dimensions of [3,150,250] while another profile might specify min, max and optimization dimensions of [3,200,100], [3,300,400], and [3,250,250].

To create an optimization profile, first construct an IOptimizationProfile. Then set the min, optimization, and max dimensions, and add it to the network configuration. Here are the calls for the first profile mentions above for an input foo:
C++
IOptimizationProfile* profile = builder.createOptimizationProfile();
profile->setDimensions(“foo”, OptProfileSelector::kMIN, Dims3(3,100,200);
profile->setDimensions(“foo”, OptProfileSelector::kOPT, Dims3(3,150,250);
profile->setDimensions(“foo”, OptProfileSelector::kMAX, Dims3(3,200,300);

config.addOptimizationProfile(profile)
Python
profile = builder.create_optimization_profile();
profile.set_shape(“foo”, (3, 100, 200), (3, 150, 250), (3, 200, 3)) 
config.add_optimization_profile(profile)
At runtime you need to set an optimization profile before setting input dimensions. Profiles are numbered in the order they were added, starting at 0. To choose the first optimization profile in the example, use:
C++
call context.setOptimizationProfile(0)
Python
set context.active_optimization_profile = 0

7.1.2. Layer Extensions For Dynamic Shapes

Some layers have optional inputs that allow specifying dynamic shape information, and there is a new layer IShapeLayer for accessing the shape of a tensor at runtime. Furthermore, some layers allow calculating new shapes. The next section goes into semantic details and restrictions. Here is a summary of what you might find useful in conjunction with dynamic shapes.

IShapeLayer outputs a 1D tensor containing the dimensions of the input tensor. For example, if the input tensor has dimensions [2,3,5,7], the output tensor will be a four-element 1D tensor containing {2,3,5,7}.

IResizeLayer accepts an optional second input containing the desired dimensions of the output.

IShuffleLayer accepts an optional second input containing the reshape dimensions, before the second transpose is applied. For example, the following network reshapes a tensor Y to have the same dimensions as X. Shuffle operations that are equivalent to identify operations on the underlying data will be omitted, if the input tensor is only used in the shuffle layer and the input and output tensors of this layer are not input and output tensors of the network. TensorRT no longer executes additional kernels or memory copies for such operations.

ISliceLayer accepts optional second, third, and fourth inputs containing the start, size, and stride.

IConcatenationLayer, IElementWiseLayer, IGatherLayer, IIdentityLayer, and
        IReduceLayer
can be used to do calculations on shapes and create new shapes.

7.1.3. Restrictions For Dynamic Shapes

The following layer restrictions arise from the fact that the layer’s weights have a fixed size:
  • IConvolutionLayer and IDeconvolutionLayer require that the channel dimension be a build-time constant.
  • IFullyConnectedLayer requires that the last three dimensions be build-time constants.
  • Int8 requires that the channel dimension be a build-time constant.

Values that must be build-time constants don’t have to be constants at the API level. TensorRT’s shape analyzer does element-by-element constant propagation through layers that do shape calculations. It’s sufficient that the constant propagation discover that a value is a build-time constant.

7.1.4. Execution Tensors vs. Shape Tensors

Engines using dynamic shapes employ a two-phase execution strategy:
  1. Compute the shapes of all tensors
  2. Stream work to the GPU.

Phase 1 is implicit and driven by demand, such as when output dimensions are requested. Phase 2 is the same as in prior versions of TensorRT. The two-phase execution puts some limits on dynamism that are important to understand.

The key limits are:
  • The rank of a tensor must be determinable at build time.
  • A tensor is either an execution tensor, shape tensor, or both. Tensors classified as shape tensors are subject to limits discussed below.

An execution tensor is a traditional TensorRT tensor. A shape tensor is a tensor that is related to shape calculations. It must be 0D or 1D, have type Int32, and its shape must be determinable at build time. For example, there is an IShapeLayer whose output is a 1D tensor containing the dimensions of the input tensor. The output is a shape tensor. IShuffleLayer accepts an optional second input that can specify reshaping dimensions. The second input must be a shape tensor.

Some layers are “polymorphic” with respect to the kinds of tensors they handle. For example, IElementWiseLayer can sum two INT32 execution tensors or sum two INT32 shape tensors. Which kind the tensors are depends on their ultimate use. If the sum is used to reshape another tensor, then it is a “shape tensor”.

7.1.4.1. Formal Inference Rules

The formal inference rules used by TensorRT for classifying tensors are based on a type-inference algebra. Let E denote an execution tensor and S denote a shape tensor. IShapeLayer has signature: IShapeLayer: E → S

since it takes an execution tensor as an input and a shape tensor as an output. IElementWiseLayer is polymorphic in this respect, with two signatures: IElementWiseLayer: S × S → S, E × E → E

For brevity, let’s adopt the convention that tis a variable denoting either class of tensor, and all tin a signature refer to the same class of tensor. Then the two signatures above can be written as a single polymorphic signature: IElementWiseLayer: t × tt

The two-input IShuffleLayer has a shape tensor as the second input, and is polymorphic with respect to the first input: IShuffleLayer (two inputs): t × S → t

IConstantLayer has no inputs, but can produce a tensor of either kind, so its signature is: IConstantLayer: → t

Here is the complete set of rules, which also serves as a reference for which layers can be used to manipulate shape tensors:
IConcatenationLayer: t × t × ...→ t
IConstantLayer: → t
IElementWiseLayer: t × t → t
IGatherLayer: t × t → t
IIdentityLayer: t → t
IReduceLayer: t → t
IResizeLayer (one input): E → E
IResizeLayer (two inputs): E × S → E
IShapeLayer: E → S
IShuffleLayer (one input): t → t
IShuffleLayer (two inputs): t × S → t
ISliceLayer (one input): t → t
ISliceLayer (two inputs): t × S → t
ISliceLayer (three inputs): t × S × S → t
ISliceLayer (four inputs): t × S × S × S → t
all other layers: E × ... → E × ...

Because an output can be the input of more than one subsequent layer, the inferred “types” are not exclusive. For example, an IConstantLayer might feed into a one use that requires an execution tensor and another use that requires a shape tensor. The output is of the IConstantLayer will be classified as both, and be used in both phase 1 and phase 2 of the two-phase execution.

The requirement that the rank of a shape tensor be known at build time limits how ISliceLayer can be used to manipulate a shape tensor. Specifically, if the third parameter, which specifies the size of the result, is not a build-time constant, the length of the resulting shape tensor would no longer be known at build time, breaking the restriction of shape tensors to build-time shapes. Worse, it might be used to reshape another tensor, breaking the restriction that tensor ranks must be known at build time.

TensorRT’s inferences can be inspected via methods ITensor::isShapeTensor(), which returns true for a shape tensor, and ITensor::isExecutionTensor(), which returns true for an execution tensor. Build the entire network first before calling these methods, because their answer can change depending on what uses have been added.

7.1.5. Shape Tensor I/O (Advanced)

Sometimes the need arises to do shape tensor I/O for a network. For example, consider a network consisting of solely an IShuffleLayer. TensorRT will infer that the second input is a shape tensor. ITensor::isShapeTensor will return true for it. Because it is an input shape tensor, TensorRT will and require two things for it:
  • At build time: the optimization profile values of the shape tensor.
  • At run time: the values of the shape tensor.

The shape of an input shape tensor is always known at build time. It’s the values that need to be described, since they can be used to specify the dimensions of execution tensors.

The optimization profile values can be set using IOptimizationProfile::setShapeValues. Analogous to how min, max, and optimization dimensions must be supplied for execution tensors with runtime dimensions, min, max and optimization values must be provided for shape tensors at build time.

The corresponding runtime method is IExecutionContext::setInputShapeBinding, which sets the values of the shape tensor at runtime.

Because inference of “execution tensor” vs “shape tensor” is based on ultimate use, TensorRT cannot infer whether a network output is a shape tensor. You must to tell it via method INetworkDefinition::markOutputForShapes.

Besides letting you output shape information for debugging, this feature is useful for composing engines. For example, consider building three engines, one each for subnetworks A, B, C where a connection from A to B, or B to C might involve a shape tensor. Build the networks in reverse order: C, B, and A. After constructing network C, you can use ITensor::isShapeTensor to determine if an input is a shape tensor, and use INetworkDefinition::markOutputForShapes to mark the corresponding output tensor in network B. Then check which inputs of B are shape tensors and mark the corresponding output tensor in network A.

8. Working With DLA

NVIDIA DLA (Deep Learning Accelerator) is a fixed function accelerator engine targeted for deep learning operations. DLA is designed to do full hardware acceleration of convolutional neural networks. DLA supports various layers such as convolution, deconvolution, fully-connected, activation, pooling, batch normalization, etc.

For more information about DLA support in TensorRT layers, see DLA Supported Layers. The trtexec tool has additional arguments to run networks on DLA, see trtexec.

To run the AlexNet network on DLA using trtexec in FP16 mode, issue:
 ./trtexec --deploy=data/AlexNet/AlexNet_N2.prototxt --output=prob --useDLACore=1 --fp16 --allowGPUFallback
To run the AlexNet network on DLA using trtexec in INT8 mode, issue:
 ./trtexec --deploy=data/AlexNet/AlexNet_N2.prototxt --output=prob --useDLACore=1 --int8 --allowGPUFallback

8.1. Running On DLA During TensorRT Inference

The TensorRT builder can be configured to enable inference on DLA. DLA support is currently limited to networks running in either FP16 or INT8 mode. The DeviceType enumeration is used to specify the device that the network or layer will execute on. The following API functions in the IBuilder class can be used to configure the network to use DLA:
setDeviceType(ILayer* layer, DeviceType deviceType)
This function can be used to set the deviceType that the layer must execute on.
getDeviceType(const ILayer* layer)
This function can be used to return the deviceType that this layer will execute on. If the layer is executing on the GPU, this will return DeviceType::kGPU.
canRunOnDLA(const ILayer* layer)
This function can be used to check if a layer can run on DLA.
setDefaultDeviceType(DeviceType deviceType)
This function sets the default deviceType to be used by the builder. It ensures that all the layers that can run on DLA will run on DLA, unless setDeviceType is used to override the deviceType for a layer.
getDefaultDeviceType()
This function returns the default deviceType which was set by setDefaultDeviceType.
isDeviceTypeSet(const ILayer* layer)
This function checks whether the deviceType has been explicitly set for this layer.
resetDeviceType(ILayer* layer)
This function resets the deviceType for this layer. The value is reset to the deviceType that is specified by setDefaultDeviceType or DeviceType::kGPU if none specified.
getMaxDLABatchSize(DeviceType deviceType)
This function returns the maximum batch size DLA can support.
Note: For any tensor, the total volume of index dimensions combined with the requested batch size should not exceed the value returned by this function.
allowGPUFallback(bool setFallBackMode)
This function notifies the builder to use GPU if a layer that was supposed to run on DLA cannot run on DLA. For more information, see GPU Fallback Mode.
reset(nvinfer1::INetworkDefinition& network)
This function can be used to reset the builder state, which sets the deviceType for all layers to be DeviceType::kGPU. After reset, the builder can be re-used to build another network with a different DLA config.
CAUTION:
In TensorRT 5.x.x, this resets the state for all networks and not the current network.
If the builder is not accessible, such as in the case where a plan file is being loaded online in an inference application, then the DLA to be utilized can be specified differently by using DLA extensions to the IRuntime. The following API functions in the IRuntime class can be used to configure the network to use DLA:
getNbDLACores()
This function returns the number of DLA cores that are accessible to the user.
setDLACore(int dlaCore)
The DLA core to execute on. Where dlaCore is a value between 0 and getNbDLACores() - 1. The default value is 0.

8.1.1. Example 1: sampleMNIST With DLA

This section provides details on how to run a TensorRT sample with DLA enabled. The "Hello World" For TensorRT (sampleMNIST) located in the GitHub repository, sample demonstrates how to import a trained Caffe model, build the TensorRT engine, serialize and deserialize the engine and finally use the engine to perform inference.

The sample first creates the builder:
auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(gLogger));
if (!builder) return false;
builder->setMaxBatchSize(batchSize);
builder->setMaxWorkspaceSize(16_MB);
Then, enable GPUFallback mode:
config->setFlag(BuilderFlag::kGPU_FALLBACK);
config->setFlag(BuilderFlag::kFP16); or config->setFlag(BuilderFlag::kINT8);
Enable execution on DLA, where dlaCore specifies the DLA core to execute on:
config->setDefaultDeviceType(DeviceType::kDLA);
config->setDLACore(dlaCore);
With these additional changes, sampleMNIST is ready to execute on DLA. To run sampleMNIST with DLA Core 1, use the following command:
 ./sample_mnist --useDLACore=1 [--int8|--fp16]

8.1.2. Example 2: Enable DLA Mode For A Layer During Network Creation

In this example, let’s create a simple network with input, convolution and output.
  1. Create the builder and the network:
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
  2. Add the Input layer to the network, with the input dimensions.
    auto data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{1, INPUT_H, INPUT_W});
  3. Add the convolution layer with hidden layer input nodes, strides, and weights for filter and bias.
    auto conv1 = network->addConvolution(*data->getOutput(0), 20, DimsHW{5, 5}, weightMap["conv1filter"], weightMap["conv1bias"]);
    conv1->setStride(DimsHW{1, 1});
    
  4. Set the convolution layer to run on DLA:
    if(canRunOnDLA(conv1))
    {
    config->setFlag(BuilderFlag::kFP16); or config->setFlag(BuilderFlag::kINT8);
    builder->setDeviceType(conv1, DeviceType::kDLA); 
    
    }
    
  5. Mark the output:
    network->markOutput(*conv1->getOutput(0));
    
  6. Set the DLA engine to execute on:
    engine->setDLACore(0)
    

8.2. DLA Supported Layers

This section lists the layers supported by DLA along with the constraints associated with each layer.

Generic restrictions while running on DLA (applicable to all layers)

  • Max batch size supported is 32.
    Note: Batch size for DLA is the product of all index dimensions except the CHW dimensions. For example, if input dimensions are NPQRS, the effective batch size is N*P.

Layer specific restrictions

Convolution, Deconvolution, and FullyConnected Layers
Convolution and Deconvolution Layers
  • Width and height of kernel size must be in the range [1, 32] for Convolution and Deconvolution layers
  • Width and height of padding must be in the range [0, 31]
  • Width and height of stride must be in the range [1,8] for Convolution Layer
  • Number of output maps must be in the range [1, 8192]
  • Axis must be 1
  • Grouped and dilated convolution supported. Dilation values must be in the range [1,32]
  • Grouped convolutions and deconvolutions are not supported in INT8
Pooling Layer
  • Operations supported: kMAX, kAVERAGE
  • Width and height of the window size must be in the range [1, 8]
  • Width and height of padding must be in the range [0, 7]
  • Width and height of stride must be in the range [1, 16]
Activation Layer
  • Functions supported: ReLU, Sigmoid, Hyperbolic Tangent
    • Negative slope not supported for ReLU
  • Only ReLU operation is supported in INT8
ElementWise Layer
  • Operations supported: Sum, Sub, Product, Max, and Min
  • Only Sum operation is supported in INT8
Scale Layer
  • Mode supported: Uniform, Per-Channel, and Elementwise
LRN (Local Response Normalization) Layer
  • Window size is configurable to 3, 5, 7, or 9
  • Normalization region supported is: ACROSS_CHANNELS
  • LRN layer is not supported in INT8
Concatenation Layer
  • DLA supports concatenation only along the channel axis

8.3. GPU Fallback Mode

The GPUFallbackMode sets the builder to use GPU if a layer that was marked to run on DLA could not run on DLA. A layer may not run on DLA due to the following reasons:
  1. The layer operation is not supported on DLA.
  2. The parameters specified are out of supported range for DLA.
  3. The given batch size exceeds the maximum permissible DLA batch size. For more information, see DLA Supported Layers.
  4. A combination of layers in the network causes the internal state to exceed what the DLA is capable of supporting.
  5. There are no DLA engines available on the platform.

If the GPUFallbackMode is set to false, a layer set to execute on DLA, that couldn't run on DLA will result in an error. However, with GPUFallbackMode set to true, it will continue to execute on the GPU instead, after reporting a warning.

Similarly, if defaultDeviceType is set to DeviceType::kDLA and GPUFallbackMode is set to false, it will result in an error if any of the layers can't run on DLA. With GPUFallbackMode set to true, it will report a warning and continue executing on the GPU.

9. Deploying A TensorRT Optimized Model

After you’ve created a plan file containing your optimized inference model, you can deploy that file into your production environment. How you create and deploy the plan file will depend on your environment. For example, you may have a dedicated inference executable for your model that loads the plan file and then uses the TensorRT Execution API to pass inputs to the model, execute the model to perform inference, and finally read outputs from the model.

This section discusses how TensorRT can be deployed in some common deployment environments.

9.1. Deploying In The Cloud

One common cloud deployment strategy for inferencing is to expose a model through a server that implements an HTTP REST or gRPC endpoint for the model. A remote client can then perform inferencing by sending a properly formatted request to that endpoint. The request will select a model, provide the necessary input tensor values required by the model, and indicate which model outputs should be calculated.

To take advantage of TensorRT optimized models within this deployment strategy does not require any fundamental change. The inference server must be updated to accept models represented by TensorRT plan files and must use the TensorRT Execution APIs to load and executes those plans. An example of an inference server that provides a REST endpoint for inferencing can be found in the TensorRT Inference Server Container Release Notes and TensorRT Inference Server Guide.

9.2. Deploying To An Embedded System

TensorRT can also be used to deploy trained networks to embedded systems such as NVIDIA Drive PX. In this context, deployment means taking the network and using it in a software application running on the embedded device, such as an object detection or mapping service. Deploying a trained network to an embedded system involves the following steps:
  1. Export the trained network to a format such as UFF or ONNX which can be imported into TensorRT (see Working With Deep Learning Frameworks for more details).
  2. Write a program that uses the TensorRT C++ API to import, optimize, and serialize the trained network to a plan file (see sections Working With Deep Learning Frameworks, Working With Mixed Precision, and Performing Inference In C++). For the purpose of discussion, let’s call this program make_plan.
    1. Optionally, perform INT8 calibration and export a calibration cache (see Working With Mixed Precision).
  3. Build and run make_plan on the host system to validate the trained model before deployment to the target system.
  4. Copy the trained network (and INT8 calibration cache, if applicable) to the target system. Re-build and re-run the make_plan program on the target system to generate a plan file.
    Note: The make_plan program must run on the target system in order for the TensorRT engine to be optimized correctly for that system. However, if an INT8 calibration cache was produced on the host, the cache may be re-used by the builder on the target when generating the engine (in other words, there is no need to do INT8 calibration on the target system itself).

    After the plan file has been created on the embedded system, an embedded application can create an engine from the plan file and perform inferencing with the engine by using the TensorRT C++ API. For more information, see Performing Inference In C++.

    To walk through a typical use case where a TensorRT engine is deployed on an embedded system, see:

10. Working With Deep Learning Frameworks

With the Python API, an existing model built with TensorFlow, Caffe, or an ONNX compatible framework can be used to build a TensorRT engine using the provided parsers. The Python API also supports frameworks that store layer weights in a NumPy compatible format, for example PyTorch.

10.1. Working With TensorFlow

You can work with TensorFlow in the following ways:
TF-TRT
This method accelerates a TensorFlow graph with TensorRT even if there are TensorFlow operators in the graph that are not supported by TensorRT (or TF-TRT). The subgraphs that are supported by TensorRT and TF-TRT are accelerated and the resulting graph is still a TensorFlow graph that you can execute as usual. For step-by-step instructions on how to accelerate inference in TF-TRT, see the TF-TRT User Guide and Release Notes. For TF-TRT examples, see Examples for TensorRT in TensorFlow (TF-TRT).
UFF
This method works only if the whole graph can be converted to UFF and can be accelerated by TensorRT. For information on using TensorRT with a TensorFlow model, see the “Hello World” For TensorRT Using TensorFlow And Python (end_to_end_tensorflow_mnist) sample.

10.1.1. Freezing A TensorFlow Graph

In order to use the command-line UFF utility, TensorFlow graphs must be frozen and saved as .pb files. For more information, see:

10.1.2. Freezing A Keras Model

You can use the following sample code to freeze a Keras model.
from keras.models import load_model
import keras.backend as K
from tensorflow.python.framework import graph_io
from tensorflow.python.tools import freeze_graph
from tensorflow.core.protobuf import saver_pb2
from tensorflow.python.training import saver as saver_lib

def convert_keras_to_pb(keras_model, out_names, models_dir, model_filename):
	model = load_model(keras_model)
	K.set_learning_phase(0)
	sess = K.get_session()
	saver = saver_lib.Saver(write_version=saver_pb2.SaverDef.V2)
	checkpoint_path = saver.save(sess, 'saved_ckpt', global_step=0, latest_filename='checkpoint_state')
	graph_io.write_graph(sess.graph, '.', 'tmp.pb')
	freeze_graph.freeze_graph('./tmp.pb', '',
                          	False, checkpoint_path, out_names,
                          	"save/restore_all", "save/Const:0",
                          	models_dir+model_filename, False, "")

10.1.3. Converting A Frozen Graph To UFF

You can use the following sample code to convert the .pb frozen graph to .uff format file.
convert-to-uff input_file [-o output_file] [-O output_node]
You can list the TensorFlow layers:
convert-to-uff input_file -l

10.1.4. Working With TensorFlow RNN Weights

This section provides information about TensorFlow weights and their stored formats. Additionally, the following sections will guide you on how to approach and decrypt RNN weights from TensorFlow.

10.1.4.1. TensorFlow RNN Cells Supported In TensorRT

An RNN layer in TensorRT can be thought of as a MultiRNNCell from TensorFlow. One layer consists of sublayers with the same configurations, in other words, hidden and embedding size. This encapsulation is done so that the internal connections between the multiple sublayers can be abstracted away from the user. This allows for simpler code when deeper networks are involved.

TensorRT supports four different RNN layer types. These layer types are RNN relu, RNN tanh, LSTM, and GRU. The TensorFlow cells that match these types are:

TensorRT RNN Relu/Tanh Layer
  1. BasicRNNCell
    1. Permitted activation functions: tf.tanh and tf.nn.relu.
    2. This is a platform independent cell.
TensorRT LSTM Layer
  1. BasicLSTMCell
    1. forget_bias must be set to 0 when creating an instance of this cell in TensorFlow. To support a non-zero forget bias, you need to preprocess the bias by adding the parameterized forget bias to the dumped TensorFlow forget biases.
    2. This is a platform independent cell.
  2. CudnnCompatibleLSTMCell
    1. Same condition for the forget bias applies to this cell as it does to the BasicLSTMCell.
    2. TensorRT does not currently support peepholes so use_peepholes must be set to False.
    3. This is a cuDNN compatible cell.
TensorRT GRU Layer
  1. CudnnCompatibleGRUCell
    1. This is a cuDNN compatible cell.
    2. Differs in implementation from standard, platform independent GRU cells. Due to this, CudnnCompatiableGRUCell is the correct cell to use with TensorRT.

10.1.4.2. Maintaining Model Consistency Between TensorFlow And TensorRT

For any TensorFlow cell not listed in TensorFlow RNN Cells Supported In TensorRT, consult the TensorRT API and TensorFlow API to ensure the cell is mathematically equivalent to what TensorRT supports and the storage format is consistent with the format that you are expecting. One good way of doing this is to set up unit tests to validate the output from TensorRT by using TensorFlow as the ground truth.

10.1.4.3. Workflow

We will be using the following workflow to extract and use TensorFlow weights:

Figure 5. TensorFlow RNN Workflow

10.1.4.4. Dumping The TensorFlow Weights

Python script dumpTFWts.py can be used to dump all the variables and weights from a given TensorFlow checkpoint. The script is located in the /usr/src/tensorrt/samples/common/dumpTFWts.py directory. Issue dumpTFWts.py -h for more information on the usage of this script.

10.1.4.5. Loading Dumped Weights

Function loadWeights() loads from the dump of the dumpTFWts.py script. It has been provided as an example in Building An RNN Network Layer By Layer (sampleCharRNN) located in the GitHub repository. The function signature is:
std::map<std::string, Weights> loadWeights(const std::string file, std::unordered_set<std::string> names);

This function loads the weights specified by the names set from the specified file and returns them in a std::map<std::string, Weights>.

10.1.4.6. Converting The Weights To A TensorRT Format

At this point, we are ready to convert the weights. To do this, the following steps are required:
  1. Understanding and using the TensorFlow checkpoint to get the tensor.
  2. Understanding and using the tensors to extract and reformat relevant weights and set them to the corresponding layers in TensorRT.

10.1.4.6.1. TensorFlow Checkpoint Storage Format

There are two possible TensorFlow checkpoint storage formats:
  1. Platform independent format - separated by layer
    1. Cell_i_kernel <Weights>
    2. Cell_i_bias <Weights>
  2. cuDNN compatible format - separated by input and recurrent
    1. Cell_i_Candidate_Input_kernel <Weights>
    2. Cell_i_Candidate_Hidden_kernel <Weights>

In other words, 1.1 Cell_i_kernel <Weights> in the concatenation of 2.1 Cell_i_Candidate_Input_kernel <Weights> and 2.2 Cell_i_Candidate_Hidden_kernel <Weights>. Therefore, storage format 2 is simply a more fine-grain version of storage format 1.

10.1.4.6.2. TensorFlow Kernel Tensor Storage Format

Before storing the weights in the checkpoint, TensorFlow transposes and then interleaves the rows of transposed matrices. The order of the interleaving is described in the next section. A figure is provided in BasicLSTMCell Example to further illustrate this format.

Gate Order Based On Layer Operation Type The transposed weight matrices are interleaved in the following order:
  1. RNN RuLU/Tanh:
    1. input gate (i)
  2. LSTM:
    1. input gate (i), cell gate (c) , forget gate (f), output gate (o)
  3. GRU:
    1. reset (r), update (u)

10.1.4.6.3. Kernel Weights Conversion To A TensorRT Format

Converting the weights from TensorFlow format can be summarized in two steps.
  1. Reshape the weights to push the interleaving down to a lower dimension.
  2. Transpose the weights to get rid of the interleaving completely and have the weight matrices stored contiguously in memory.

Transformation Utilities To help perform these transformations correctly, reorderSubBuffers(), transposeSubBuffers(), and reshapeWeights() are functions that have been provided. For more information, see NvUtils.h.

10.1.4.6.4. TensorFlow Bias Weights Storage Format

The bias tensor is simply stored as contiguous vectors concatenated in the order specified in TensorFlow Kernel Tensor Storage Format. If the checkpoint storage is platform independent, then TensorFlow combines the recurrent and input biases into a single tensor by adding them together. Otherwise, the recurrent and input biases and stored in separate tensors.

10.1.4.6.5. Bias Tensor Conversion To TensorRT Format

Since the biases are stored as contiguous vectors, there aren’t any transformations that need to be applied to get the bias into the TensorRT format.

10.1.4.7. BasicLSTMCell Example

10.1.4.7.1. BasicLSTMCell Kernel Tensor

To understand the format in which these tensors are being stored, let us consider an example of a BasicLSTMCell. Figure 6 illustrates what the tensor looks like within the TensorFlow checkpoint.

Figure 6. Tensors within a TensorFlow checkpoint Tensors within a TensorFlow checkpoint
Note:DS/Data Size is distinct from Hidden Size for the first layer. For all the following sublayers Data Size is equal to Hidden Size.
In Figure 6, W represents the input weights, R represents the hidden weights, DS represents the data size, and HS represents hidden size.
Note: Since this is a platform independent cell, the input weights and hidden weights have been concatenated together. If we had used a CudnnCompatibleLSTMCell, then these weights would have been split into two separate tensors.

Applying the conversion process discussed earlier will result in the converted tensor shown in Figure 7.

Figure 7. Converted tensors Converted tensors
Note:Data Size is distinct from Hidden Size for the first layer in the sequence of RNN sublayers. For all the following sublayers Data Size is equal to Hidden Size.

10.1.4.7.2. BasicLSTMCell Bias Tensor

Figure 8 illustrates the format in which the bias tensor is stored.

Figure 8. Bias tensor stored format Bias tensor stored format
Because this is a platform independent cell, W in the image above represents the result of ElementWise adding the input and recurrent biases together. TensorFlow does this addition internally to save memory before it stores the tensor.
Note: This is already in the format we require, therefore, we do not need to apply any transformations.

10.1.4.8. Setting The Converted Weights And Biases

The converted tensors for both the weights and bias are now ready to use. You need to iterate over the tensors in the order specified in TensorFlow Kernel Tensor Storage Format and set the weights and bias using IRNNv2Layer::setWeightsForGate() and IRNNv2Layer::setBiasForGate() functions, respectively.
Note: If you are using a platform independent cell, you will need to set all the recurrent biases manually using zeroed out dummy weights.

A real-world example of the training, dumping, converting, and setting process is described in Building An RNN Network Layer By Layer (sampleCharRNN) located in the <version>/samples/opensource/sampleCharRNN directory in the GitHub repository. For more information, consult the code in this sample

10.1.5. Preprocessing A TensorFlow Graph Using the Graph Surgeon API

The graph surgeon API, also known as graphsurgeon, allows you to transform TensorFlow graphs. Its capabilities are broadly divided into two categories:
Search
The search functions allow you to find nodes in a TensorFlow graph.
Manipulation
The manipulation functions allow you to modify, add, or remove nodes.

Using graphsurgeon, you can mark certain nodes (or sets of nodes) as plugin nodes in the graph. These plugins can either be plugins shipped with TensorRT or plugins written by you. For more information, see Extending TensorRT With Custom Layers.

If you are writing a plugin, also refer to see Extending TensorRT With Custom Layers for details on how to implement the IPluginExt and IPluignCreator classes in addition to registering the plugin.

The following code snippet illustrates how to use graphsurgeon to map a TensorFlow Leaky ReLU operation to a TensorRT Leaky ReLU plugin node.
import graphsurgeon as gs
lrelu_node = gs.create_plugin_node(name=”trt_lrelu”, op=”LReLU_TRT”, negSlope=0.2)
namespace_plugin_map = { “tf_lrelu” : lrelu_node }  

# Transform TensorFlow graph using graphsurgeon and save to UFF
dynamic_graph = gs.DynamicGraph(tf_lrelu.graph)
dynamic_graph.collapse_namespaces(namespace_plugin_map)

# Run UFF converter using new graphdef
uff_model = uff.from_tensorflow(dynamic_graph.as_graph_def(), ["trt_lrelu"], output_filename=”test_lrelu.uff”, text=True)

In the above code, the op field in the create_plugin_node method should match the registered plugin name. This enables the UFF parser to look up the Plugin in the Plugin Registry using this field to insert the plugin node into the network.

For a working graphsurgeon example, see Object Detection With A TensorFlow SSD Network (sampleUffSSD) located in the GitHub repository.

For more details about the graphsurgeon API, see the graph surgeon API.

10.2. Working With PyTorch And Other Frameworks

Using TensorRT with PyTorch and other frameworks involves replicating the network architecture using the TensorRT API, and then copying the weights from PyTorch (or any other framework with NumPy compatible weights). For more information on using TensorRT with a PyTorch model, see the "Hello World" For TensorRT Using PyTorch And Python (network_api_pytorch_mnist) sample.

11. Working With DALI

TensorRT can be integrated with NVIDIA® Data Loading Library™ (DALI); a collection of highly optimized building blocks and an execution engine to accelerate input data pre-processing for deep learning applications.

DALI is an open sourced library available on GitHub. It is a highly optimized open source library for data preprocessing. It uses an execution engine for fast preprocessing pipeline. DALI accelerates blocks for image loading and augmentation and also provides GPU support for JPEG decoding and image manipulation.

For more information about DALI, see the DALI data loading documentation.

11.1. Benefits Of Integration

The benefits of integrating DALI with TensorRT include the following:
  • Running DNN models requires input data pre-processing
  • Computational complexity of the I/O pipeline has increased. Hence, the GPU starves for data. DALI helps accelerating the preprocessing pipeline.
  • DALI offsets the compute intensive data pre-processing to GPU.
  • Pre-processing involves decoding, resize, crop, spatial augmentation, format conversions (NCHW and NHWC)
  • Multi device DNN inference could be achieved via same I/O pipeline
DALI supports:
  • The feature to accelerate pre-processing on GPUs
  • Configurable graphs and custom operators
  • Multiple input formats (for example JPEG, LMDB, RecordIO, TFRecord)
  • Serializing a whole graph (portable graph)
  • Easily integrates with framework plugins and open source bindings

DALI supports a custom operator library which can be loaded in runtime. TensorRT inference can be configured as a custom operator to be a part of DALI pipeline. A working example of TensorRT inference integrated as a part of DALI is open sourced here.

For more information about integrating DALI with TensorRT on Xavier, see the GTC 2019 talk here.

12. Troubleshooting

The following sections help answer the most commonly asked questions regarding typical use cases.

12.1. FAQs

Q: How do I create an engine that is optimized for several different batch sizes?

A: While TensorRT allows an engine optimized for a given batch size to run at any smaller size, the performance for those smaller sizes may not be as well-optimized. To optimize for multiple different batch sizes, run the builder and serialize an engine for each batch size.

Q: Are engines and calibration tables portable across TensorRT versions?

A: No. Internal implementations and formats are continually optimized and may change between versions. For this reason, engines and calibration tables are not guaranteed to be binary compatible with different versions of TensorRT. Applications should build new engines and INT8 calibration tables when using a new version of TensorRT.

Q: How do I choose the optimal workspace size?

A: Some TensorRT algorithms require additional workspace on the GPU. The method IBuilder::setMaxWorkspaceSize() controls the maximum amount of workspace that may be allocated, and will prevent algorithms that require more workspace from being considered by the builder. At runtime, the space is allocated automatically when creating an IExecutionContext. The amount allocated will be no more than is required, even if the amount set in IBuilder::setMaxWorkspaceSize() is much higher. Applications should therefore allow the TensorRT builder as much workspace as they can afford; at runtime TensorRT will allocate no more than this, and typically less.

Q: How do I use TensorRT on multiple GPUs?

A: Each ICudaEngine object is bound to a specific GPU when it is instantiated, either by the builder or on deserialization. To select the GPU, use cudaSetDevice() before calling the builder or deserializing the engine. Each IExecutionContext is bound to the same GPU as the engine from which it was created. When calling execute() or enqueue(), ensure that the thread is associated with the correct device by calling cudaSetDevice() if necessary.

Q: How do I get the version of TensorRT from the library file?

A: There is a symbol in the symbol table named tensorrt_version_#_#_#_# which contains the TensorRT version number. One possible way to read this symbol on Linux is to use the nm command like in the example below:
$ nm -D libnvinfer.so.4.1.0 | grep tensorrt_version
000000000c18f78c B tensorrt_version_4_0_0_7

Q: What can I do if my network is producing the wrong answer?

A: There are several reasons why your network may be generating incorrect answers. Here are some troubleshooting approaches which may help diagnose the problem:
  • Turn on INFO level messages from the log stream and check what TensorRT is reporting.
  • Check that your input preprocessing is generating exactly the input format required by the network.
  • If you’re using reduced precision, run the network in FP32. If it produces the correct result, it is possible that lower precision has insufficient dynamic range for the network.
  • Try marking intermediate tensors in the network as outputs, and see if they match what you are expecting.
    Note: Marking tensors as outputs may inhibit optimizations, and therefore, may change the results.

Q: How do I determine how much device memory will be required by my network?

A: TensorRT engines use device memory for two purposes: to hold the weights required by the network, and to hold the intermediate activations required by IExecutionContext. The size of the weights can be closely approximated by the size of the serialized engine (in fact this will be a slight overestimate, as the serialized engine also includes the network definition). The size of the activation memory required can be determined by calling ICudaEngine::getDeviceMemorySize() . The sum of these will be the amount of device memory TensorRT allocates for the engine.
IBuilder may temporarily use more device memory than what the engine requires.
  • During a phase, it uses twice as much memory for the weights required by the engine. During that phase, no memory is allocated for activations.
  • The auto-tuner times each layer for FP32 operation. Timing a layer in FP32 consumes twice as much memory as an FP16 operation, and four times as much for an INT8 operation, both for the weights and its input/output activations. The additional memory consumption for timing is theoretically noticeable if a single layer dominates overall memory consumption of a network.
Note: The CUDA infrastructure and device code also consume device memory. The amount of memory will vary by platform, device, and TensorRT version. Use cudaGetMemInfo to determine the total amount of device memory in use.

Q: If I build the engine on one GPU and run the engine on another GPU, will this work?

A: We recommend that you don’t, however, if you do, you’ll need to follow these guidelines:
  1. The major, minor, and patch versions of TensorRT must match between systems. This ensures you are picking kernels that are still present and have not undergone certain optimizations or bug fixes that would change their behavior.
  2. The CUDA compute capability major and minor versions must match between systems. This ensures that the same hardware features are present so the kernel will not fail to execute. An example would be mixing cards with different precision capabilities.
  3. The following properties should match between systems:
    • Maximum GPU graphics clock speed
    • Maximum GPU memory clock speed
    • GPU memory bus width
    • Total GPU memory
    • GPU L2 cache size
    • SM processor count
    • Asynchronous engine count

    If any of the above properties do not match, you will receive the following warning: Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.

    If you still want to proceed, then you should build the engine on the smallest SKU in the family because autotuner choices made on smaller GPUs will generalize better.

Q: How do I implement batch normalization in TensorRT?

A: Batch normalization can be implemented using a sequence of IElementWiseLayer in TensorRT. More specifically:
adjustedScale = scale / sqrt(variance + epsilon) 
batchNorm = (input + bias - (adjustedScale * mean)) * adjustedScale

Q: Can I use multiple TensorRT builders to compile on different targets?

A: TensorRT assumes that all resources for the device it is building on are available for optimization purposes. Concurrent use of multiple TensorRT builders (for example, multiple trtexec instances) to compile on different targets (DLA0, DLA1 and GPU) may oversubscribe system resources causing undefined behavior (meaning, inefficient plans, builder failure, or system instability).

It is recommended to use trtexec with the --saveEngine argument to compile for different targets (DLA and GPU) separately and save their plan files. Such plan files can then be reused for loading (using trtexec with the --loadEngine argument) and submitting multiple inference jobs on the respective targets (DLA0, DLA1, GPU). This two step process alleviates over-subscription of system resources during the build phase while also allowing execution of the plan file to proceed without interference by the builder.

12.2. How Do I Report A Bug?

We appreciate all types of feedback. If you encounter any issues, please report them by following these steps:

  1. Register for the NVIDIA Developer website.
  2. Log into the developer site.
  3. Click on your name in the upper right corner.
  4. Click My account > My Bugs and select Submit a New Bug.
  5. Fill out the bug reporting page. Be descriptive and if possible, provide the steps that you are following to help reproduce the problem.
  6. Click Submit a bug.

12.3. Understanding Error Messages

If an error is encountered during execution, TensorRT will report an error message that is intended to help in debugging the problem. Some common error messages that may be encountered by developers are discussed in the following sections.

UFF Parser Error Messages

The following table captures common UFF parser error messages.
Error Message Description
The input to the Scale Layer is required to have a minimum of 3 dimensions.
This error message can occur due to incorrect input dimensions. In UFF, input dimensions should always be specified with the implicit batch dimension not included in the specification.
Invalid scale mode, nbWeights: <X>
kernel weights has count <X> but <Y> was expected
<NODE> Axis node has op <OP>, expected Const. The axis must be specified as a Const node.
As indicated by the error message, the axis must be a build-time constant in order for UFF to parse the node correctly.

ONNX Parser Error Messages

The parser may issue error messages if a constant input is used with a layer that does not support constant inputs. Consider using a tensor input instead.

TensorRT Core Library Error Messages

The following table captures common TensorRT core library error messages.
  Error Message Description
Installation Errors
Cuda initialization failure with error <code>. Please check cuda installation:  http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html.
This error message can occur if the CUDA or NVIDIA driver installation is corrupt. Refer to the URL for instructions on installing CUDA and the NVIDIA driver on your operating system.
Builder Errors
Internal error: could not find any implementation for node <name>, try increasing the workspace size with IBuilder::setMaxWorkspaceSize()
This error message occurs because there is no layer implementation for the given node in the network that can operate with the given workspace size. This usually occurs because the workspace size is insufficient, but could also indicate a bug. If increasing the workspace size as suggested doesn’t help, report a bug (see How Do I Report A Bug?).
<layer-name>: (kernel|bias) weights has non-zero count but null values
<layer-name>: (kernel|bias) weights has zero count but non-null values
This error message occurs when there is a mismatch between the values and count fields in a Weights data structure passed to the builder. If the count is 0, then the values field should contain a null pointer; otherwise the count must be non-zero and values should contain a device pointer.
Builder was created on device different from current device.
This error message may show up if you:
  1. Created an IBuilder targeting one GPU, then
  2. Called cudaSetDevice() to target a different GPU, then
  3. Attempted to use the IBuilder to create an engine.
Ensure you only use the IBuilder when targeting the GPU that was used to create the IBuilder.
You may encounter error messages indicating that the tensor dimensions do not match the semantics of the given layer. Carefully read the documentation on NvInfer.h on the usage of each layer and the expected dimensions of the tensor inputs and outputs to the layer.
INT8 Calibration Errors
Tensor <X> is uniformly zero; network calibration failed.
This warning occurs, and should be treated as an error, when data distribution for a tensor is uniformly zero. In a network, the output tensor distribution can be uniformly zero under the following scenarios:
  1. Constant tensor with all zero values; not an error.
  2. Activation (ReLU) output with all negative inputs: not an error.
  3. Data distribution is forced to all zero due to computation error in previous layer; emit a warning here.2
  4. User does not provide any calibration images; emit a warning here.1
Could not find scales for tensor <X>.
This error message indicates that a calibration failure occurred with no scaling factors detected. This could be due to no INT8 calibrator or insufficient custom scales for network layers. For more information, refer to the Performing Inference In INT8 Using Custom Calibration (sampleINT8) located in the /opensource/sampleINT8 directory in the GitHub repository to setup calibration correctly.
Engine Compatibility Errors
The engine plan file is not compatible with this version of TensorRT, expecting (format|library) version <X> got <Y>, please rebuild.
This error message can occur if you are running TensorRT using an engine PLAN file that is incompatible with the current version of TensorRT. Ensure you use the same version of TensorRT when generating the engine and running it.
The engine plan file is generated on an incompatible device, expecting compute <X> got compute <Y>, please rebuild.
This error message can occur if you build an engine on a device of a different compute capability than the device that is used to run the engine. If you build an engine on a device with the same compute capability but is not identical to the device that is used to run the engine, you will see the following warning:
Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
As indicated by the warning, it is highly recommended to use a device of the same model when generating the engine and deploying it to avoid compatibility issues.
Out Of Memory Errors
GPU memory allocation failed during initialization of (tensor|layer): <name>
GPU memory
These error messages can occur if there is insufficient GPU memory available to instantiate a given TensorRT engine. Verify that the GPU has sufficient available memory to contain the required layer weights and activation tensors.
Allocation failed during deserialization of weights.
GPU does not meet the minimum memory requirements to run this engine …
FP16 Errors
Network needs native FP16 and platform does not have native FP16
This error message can occur if you attempted to deserialize an engine that uses FP16 arithmetic on a GPU that does not support FP16 arithmetic. You will either need to rebuild the engine without FP16 precision inference or upgrade your GPU to a model that supports FP16 precision inference.
Plugin Errors
Custom layer <name> returned non-zero initialization
This error message can occur if the initialize() method of a given plugin layer returns a non-zero value. Refer to the implementation of that layer to debug this error further. For more information, see TensorRT Layers.

12.4. Support

Support, resources, and information about TensorRT can be found online at https://developer.nvidia.com/tensorrt. This includes blogs, samples, and more.

In addition, you can access the NVIDIA DevTalk TensorRT forum at https://devtalk.nvidia.com/default/board/304/tensorrt/ for all things related to TensorRT. This forum offers the possibility of finding answers, make connections, and to get involved in discussions with customers, developers, and TensorRT engineers.

A. Appendix

A.1. TensorRT Layers

In TensorRT, layers represent distinct flavours of mathematical and/or programmatic operations. The following sections describe every layer that is supported by TensorRT. To view a list of the specific attributes that are supported by each layer, refer to the TensorRT API documentation.

TensorRT has the ability to optimize performance by fusing layers. For information about how to enable layer fusion optimizations, see Types Of Fusions. For information about how to optimize layer performance, see How Do I Optimize My Layer Performance? from the TensorRT Best Practices guide.

A.1.1. IActivationLayer

The IActivationLayer implements element-wise activation functions.

Layer Description

Apply an activation function on an input tensor A, and produce an output tensor B with the same dimensions.

The Activation layer supports the following operations:
rectified Linear Unit (ReLU): B = ReLU(A)
Hyperbolic tangent: B = tanh(A)
“s” shaped curve (sigmoid): B = σ(A)

Conditions And Limitations

None

A.1.2. IConcatenationLayer

The IConcatenationLayer links together multiple tensors of the same non-channel sizes along the channel dimension.

Layer Description

The concatenation layer is passed in an array of m input tensors Ai and a channel axis c.

All dimensions of all input tensors must match in every axis except axis c. Let each input tensor have dimensions ai. The concatenated output tensor will have dimensions b such that

Conditions And Limitations

The default channel axis is assumed to be the third from last axis, or the first non-batch axis if there are fewer than 3 non-batch axes. Concatenation cannot be done along the batch axis. All input tensors must either be non-INT32 type or all must be INT32 type.

A.1.3. IConstantLayer

The IConstantLayer outputs a tensor with values provided as parameters to this layer, enabling the convenient use of constants in computations.

Layer Description

Given dimensions d and weight vector w, the constant layer will output a tensor B of dimensions d with the constant values in w. This layer takes no input tensor. The number of elements in the weight vector w is equal to the volume of d.

Conditions And Limitations

The output can be a tensor of zero to seven dimensions.

See the C++ IConstantLayer method or the Python IConstantLayer method for further details.

A.1.4. IConvolutionLayer

The IConvolutionLayer computes a 2D (channel, height, and width) convolution or 3D (channel, depth, height and width) convolution, with or without bias.
Note: The operation that the IConvolutionLayer performs is actually a correlation. Therefore, it is a consideration if you are formatting weights to import via an API, rather than via the NvCaffeParser library.

Layer Description: 2D convolution

Compute a cross-correlation with 2D filters on a 4D tensor A, of dimensions a, to produce a 4D tensor B, of dimensions b. The dimensions of B depend on the dimensions of A, the number of output maps m, kernel size r, symmetric padding p, stride s, dilation d, and dilated kernel size t = r+d(r-1), such that height and width are adjusted accordingly as follows:
  • b = [a0mb2 b3]
  • b2 = (a2+2p0-t0)/s0+1
  • b3 = (a3+2p1-t1)/s1+1
The kernel weights w and bias weights x (optional) for the number of groups g, are such that:
  • w is ordered according to shape [ma1/gr0r1]
  • x has length m
Let tensor K with dimensions k = [ma1/gt0t1] be defined as the zero-filled tensor, such that:
  • ki,j,hh,ll = wi,j,h,l
  • hh = {0 if h = 0, h + d0(h-1) otherwise}, and ll = {0 if l = 0, l + d1(l-1) otherwise}.
and tensor C the zero-padded copy of A with dimensions [a0a1a2+p1], then tensor B is defined as: where kk = k+t0-1, and ll = l+t1-1.

Layer Description: 3D convolution

Compute a cross-correlation with 3D filters on a 5D tensor A, of dimensions a, to produce a 5D tensor B, of dimensions b. The dimensions of B depend on the dimensions of A, the number of output maps m, kernel size r, symmetric padding p, stride s, dilation d, and dilated kernel size t = r+d(r-1), such that height and width are adjusted accordingly as follows:
  • b = [a0mb2 b3 b4]
  • b2 = (a2+2p0-t0)/s0+1
  • b3 = (a3+2p1-t1)/s1+1
  • b4 = (a4+2p2-t2)/s2+1
The kernel weights w and bias weights x (optional) for the number of groups g, are such that:
  • w is ordered according to shape [ma1/gr0r1r2]
  • x has length m
Let tensor K with dimensions k = [ma1/gt0t1t2] be defined as the zero-filled tensor, such that:
  • ki,j,dd,hh,ll = wi,j,d,h,l
  • dd = {0 if d = 0, d + d0(d-1) otherwise}, hh = {0 if h = 0, h + d1(h-1) otherwise}, and ll = {0 if l = 0, l + d2(l-1) otherwise}.
and tensor C the zero-padded copy of A with dimensions [a0a1a2+p0a3+p1a4+p2], then tensor B is defined as: where dd = d+t0-1, kk = k+t1-1, and ll = l+t2-1.

Conditions And Limitations

2D or 3D is determined by number of input kernel dimensions. For 2D convolution, input and output may have more than 4 dimensions; beyond 4, all dimensions are treated as multipliers on the batch size, and input and output are treated as 4D tensors. For 3D convolution, similar with 2D convolution, if input or output has more than 5 dimensions, all dimensions beyond 5 are treated as multipliers on the batch size. If groups are specified and INT8 data type is used, then the size of the groups must be a multiple of 4 for both input and output.

A.1.5. IDeconvolutionLayer

The IDeconvolutionLayer computes a 2D (channel, height, and width) or 3D (channel, depth, height and width) deconvolution, with or without bias.
Note: This layer actually applies a 2D/3D transposed convolution operator over a 2D/3D input. It is also known as fractionally-strided convolution or transposed convolution.

Layer Description: 2D deconvolution

Compute a cross-correlation with 2D filters on a 4D tensor A, of dimensions a, to produce a 4D tensor B, of dimensions b. The dimensions of B depend on the dimensions of A, the number of output maps m, kernel size r, symmetric padding p, stride s, dilation d, and dilated kernel size t = r+d(r-1), such that height and width are adjusted accordingly as follows:
  • b = [a0mb2 b3]
  • b2 = ( a2-1)*s0 + t0 - 2p0
  • b3 = ( a3-1)*s1 + t1 - 2p1
The kernel weights w and bias weights x (optional) for the number of groups g, are such that:
  • w is ordered according to shape [a1/gmr0r1]
  • x has length m
Let tensor K with dimensions k = [mb1/gt0t1] be defined as the zero-filled tensor, such that:
  • ki,j,hh,ll = wi,j,h,l
  • hh = {0 if h = 0, h + d0(h-1) otherwise}, and ll = {0 if l = 0, l + d1(l-1) otherwise}.
and tensor C the zero-padded copy of A with dimensions [a0a1a2+p1], then tensor B is defined as: where u ranges from 0 to min(t0-1, k), and v ranges from 0 to min(t1-1, l).

Layer Description: 3D deconvolution

Compute a cross-correlation with 3D filters on a 5D tensor A, of dimensions a, to produce a 5D tensor B, of dimensions b. The dimensions of B depend on the dimensions of A, the number of output maps m, kernel size r, symmetric padding p, stride s, dilation d, and dilated kernel size t = r+d(r-1), such that height and width are adjusted accordingly as follows:
  • b = [a0mb2 b3]
  • b2 = ( a2-1)*s0 + t0 - 2p0
  • b3 = ( a3-1)*s1 + t1 - 2p1
  • b4 = ( a4-1)*s2 + t2 - 2p2
The kernel weights w and bias weights x (optional) for the number of groups g, are such that:
  • w is ordered according to shape [a1/gmr0r1r2]
  • x has length m
Let tensor K with dimensions k = [mb1/gt0t1t2] be defined as the zero-filled tensor, such that:
  • ki,j,dd,hh,ll = wi,j,d,h,l
  • dd = {0 if d = 0, d + d0(d-1) otherwise}, hh = {0 if h = 0, h + d1(h-1) otherwise}, and ll = {0 if l = 0, l + d2(l-1) otherwise}.
and tensor C the zero-padded copy of A with dimensions [a0a1a2+p0a3+p1a4+p2], then tensor B is defined as: where u ranges from 0 to min(t0-1, k), and v ranges from 0 to min(t1-1, l), and w ranges from 0 to min(t2-1, m).

Conditions And Limitations

2D or 3D is determined by number of input kernel dimensions. For 2D deconvolution, input and output may have more than 4 dimensions; beyond 4, all dimensions are treated as multipliers on the batch size, and input and output are treated as 4D tensors. For 3D deconvolution, similar with 2D deconvolution, dimensions beyond 5 are treated as multipliers on the batch size. If groups are specified and INT8 data type is used, then the size of the groups must be a multiple of 4 for both input and output.
Note: Dilated deconvolution is not supported in TensorRT.

A.1.6. IElementWiseLayer

The IElementWiseLayer, also known as the Eltwise layer, implements per-element operations.

Layer Description

This layer computes a per-element binary operation between input tensor A and input tensor B to produce an output tensor C. For each dimension, their lengths must match, or one of them must be one. In the latter case, the tensor is broadcast along that axis. The output tensor has the same number of dimensions as the inputs. For each dimension, its length is the maximum of the lengths of the corresponding input dimension.

The IElementWiseLayer supports the following operations:
Sum: C = A+B
Product: C = A*B
Minimum: C = min(A, B)
Maximum: C = max(A, B)
Subtraction: C = A-B
Division: C = A/B
Power: C = A^B

Conditions And Limitations

The length of each dimension of the two input tensors A and B must be equal or equal to one.

A.1.6.1. ElementWise Layer Setup

The ElementWise layer is used to execute the second step of the functionality provided by a FullyConnected layer. The output of the fcbias Constant layer and Matrix Multiplication layer are used as inputs to the ElementWise layer. The output from this layer is then supplied to the TopK layer. The code below demonstrates how to setup the layer:
C++ code snippet
auto fcbias = network->addConstant(Dims2(VOCAB_SIZE, 1), weightMap[FCB_NAME]);
auto addBiasLayer = network->addElementWise(
*matrixMultLayer->getOutput(0), 
*fcbias->getOutput(0), ElementWiseOperation::kSUM);
assert(addBiasLayer != nullptr);
addBiasLayer->getOutput(0)->setName("Add Bias output");
Python code snippet
fc_bias = network.add_constant((VOCAB_SIZE, 1), weightMap[FCB_NAME])
add_bias_layer = network.add_elementwise(
matrix_mult_layer.get_output(0), 
fc_bias.get_output(0), trt.ElementWiseOperation.SUM)
assert add_bias_layer != None
add_bias_layer.get_output(0).name = "Add Bias output"

For more information, see the TensorRT API documentation.

A.1.7. IFullyConnectedLayer

The IFullyConnectedLayer implements a matrix-vector product, with or without bias.

Layer Description

The IFullyConnectedLayer expects an input tensor A of three or more dimensions. Given an input tensor A of dimensions a=[a0 ... an-1], it is first reshaped into a tensor A’ of dimensions a’=[a0 ... an-4 (an-3*an-2*an-1)] by squeezing the last three dimensions into one dimension.

Then, the layer performs the operation B’=WA’+X where W is the weight tensor of dimensions w=[(an-3*an-2*an-1) k], X is the bias tensor of dimensions x=[k] broadcasted along the other dimensions, and k is the number of output channels, configurable via setNbOutputChannels(). If X is not specified, the value of the bias is implicitly 0. The resulting B’ is a tensor of dimensions b’=[a0 ... an-4k].

Finally, B’ is reshaped again into the output tensor B of dimensions b=[a0 ... an-4k 1 1] by inserting two lower dimensions each of size 1.

In summary, for input tensor A of dimensions a=[a0 ... an-1], the output tensor B will have dimensions b=[a0 ... an-4k 1 1].

Conditions And Limitations

A must have three dimensions or more.

A.1.8. IGatherLayer

The IGatherLayer implements the gather operation on a given axis.

Layer Description

The IGatherLayer gathers elements of each data tensor A along the specified axisxusing indices tensor B of zero dimensions or more dimensions, to produce output tensor C of dimensions c.

If B has zero dimensions and it is a scalar b, then ck={ak if k<x, and ak+1 if k>x} and c has length equal to one less than the length of a. In this case, Ci=Aj where jk={b if k=x, ik if k<x, and ik-1 if k>x}.

If B is a tensor of dimensions b (with length b), then ck={ak if k<x, bk-x if k≥x and k<x+b, and ak-b+1 otherwise}. In this case, Ci=Aj where jk={BX(i) if k=x, ik if k<x, and ik-b if k>x} and X(i)=ix,..,x+b-1.

Conditions And Limitations

Elements cannot be gathered along the batch size dimension. The data tensor A must contain at least one non-batch dimension. The data tensor A must contain at least axis + 1 non-batch dimensions. The indices tensor B must contain only INT32 values. The parameter axis is zero-indexed and starts at the first non-batch dimension of data tensor A. If there are any invalid indices elements in the indices tensor, then zeros will be stored at the appropriate locations in the output tensor.

See the C++ IGatherLayer method or the Python IGatherLayer method for further details.

A.1.9. IIdentityLayer

The IIdentityLayer implements the identity operation.

Layer Description

The output of the layer is mathematically identical to the input. This layer allows you to precisely control the precision of tensors and transform from one precision to another. If the input is at a different precision than the output, the layer will convert the input tensor into the output precision.

Conditions And Limitations

None

See the C++ IIdentityLayer method or the Python IIdentityLayer method for further details.

A.1.10. ILRNLayer

The ILRNLayer implements cross-channel Local Response Normalization (LRN).

Layer Description

Given an input A, the LRN layer performs a cross-channel Local Response Normalization to produce output Bof the same dimensions.. The operation of this layer depends on 4 constant values: w is the size of the cross-channel window over which the normalization will occur, α, β, and k are normalization parameters. The formula below describes the operation performed by the layer:

Where I represents the indexes of tensor elements, and j(I) the indices where the channel dimension is replaced by j. For channel index c of Cchannels, index j ranges from max(0, c-w) and min(C-1, c+w).

Conditions And Limitations

A must have 3 or more dimensions. The following list shows the possible values for the parameters:
  • w ∈{1, 3, 5, 7, 9, 11, 13, 15}
  • α ∈[-1 x 1020, 1 x 1020]
  • β ∈[0.01, 1 x 105]
  • k ∈[1 x 10-5, 1 x 1010]

See the C++ ILRNLayer method or the Python ILRNLayer method for further details.

A.1.11. IMatrixMultiplyLayer

The IMatrixMultiplyLayer implements matrix multiplication for a collection of matrices.

Layer Description

The IMatrixMultiplyLayer computes the matrix multiplication of input tensors A, of dimensions a, and B, of dimensions b, and produces output tensor C, of dimensions c. A, B, and C all have the same rank n≥2. If n>2, then A, B, and C are treated as collections of matrices; A and B may be optionally transposed (the transpose is applied to the last two dimensions). Let AI and BI be the input tensors after the optional transpose, then Ci0,..,in-3,:,:=AIi0,..,in-3,:,:*BIi0,..,in-3,:,:.

Given the corresponding dimensions aI and bI of AI and BI, then ci={max(ai,bi) if i<n-2,aIi if i=n-2, and bIi if i=n-1}; that is the resulting collection has the same number of matrices as the input collections, and the rows and columns correspond to the rows in AI and the columns in BI. Notice also the use of max in the lengths, for the case of broadcast on a dimension.

Conditions And Limitations

Tensors A and B must have at least two dimensions, and agree on the number of dimensions. The length of each dimension must be the same, assuming that dimensions of length one are broadcast to match the corresponding length.

A.1.11.1. MatrixMultiply Layer Setup

The Matrix Multiplication layer is used to execute the first step of the functionality provided by a FullyConnected layer. As shown in the code below, a Constant layer will need to be used so that the FullyConnected weights can be stored in the engine. The output of the Constant and RNN layers are then used as inputs to the Matrix Multiplication layer. The RNN output is transposed so that the dimensions for the MatrixMultiply are valid.
C++ code snippet
weightMap["trt_fcw"] = transposeFCWeights(weightMap[FCW_NAME]);
auto fcwts = network->addConstant(Dims2(VOCAB_SIZE, HIDDEN_SIZE), weightMap["trt_fcw"]);
auto matrixMultLayer = network->addMatrixMultiply(
*fcwts->getOutput(0), false, *rnn->getOutput(0), true);
assert(matrixMultLayer != nullptr);
matrixMultLayer->getOutput(0)->setName("Matrix Multiplication output");
Python code snippet
weight_map["trt_fcw"] = transpose_fc_weights(weight_map[FCW_NAME])
fc_wts = network.add_constant((VOCAB_SIZE, HIDDEN_SIZE), weight_map["trt_fcw"])
matrix_mult_layer = network.add_matrix_multiply(
fc_wts.get_output(0), trt.MatrixOperation.NONE,  rnn.get_output(0), trt.MatrixOperation.TRANSPOSE)
assert matrix_mult_layer != None
matrix_mult_layer.get_output(0).name = 
"Matrix Multiplication output"

For more information, see the TensorRT API documentation.

A.1.12. IPaddingLayer

The IPaddingLayer implements spatial zero-padding of tensors along the two innermost dimensions.

Layer Description

The IPaddingLayer pads zeros to (or trims edges from) an input tensor A along each of the two innermost dimensions and gives the output tensor B. Padding can be different on each dimension, asymmetric, and can be either positive (resulting in expansion of the tensor) or negative (resulting in trimming). Padding at the beginning and end of the two dimensions is specified by 2D vectors x and y, for pre and post padding respectively.

For input tensor A of n dimensions a, the output B will have n dimensions b such that bi={x0+an-2+y0 if i=n-2; x1+an-1+y1 if i=n-1; and ai otherwise}. Accordingly, the values of Bw are zeros if wn-2<x0 or x0+an-2 wn-2 or wn-1<x1 or x1+an-2wn-1 . Otherwise, Bw=Az where zn-2=wn-2+x0, zn-1=wn-1+x1, and zi=wi for all other dimensions i.

Conditions And Limitations

  • A must have three dimensions or more.
  • The padding can only be applied along the two innermost dimensions.
  • Only zero-padding is supported.

See the C++ IPaddingLayer method or the Python IPaddingLayer method for further details.

A.1.13. IPluginLayer

The IPluginLayer is user-defined and provides the ability to extend the functionalities of TensorRT. See Extending TensorRT With Custom Layers for more details.

See the C++ IPluginLayer method or the Python IPluginLayer method for further details.

A.1.14. IPoolingLayer

The IPoolingLayer implements pooling within a channel. Supported pooling types are maximum, average and maximum-average blend.

Layer Description: 2D pooling

Compute a pooling with 2D filters on a tensor A, of dimensions a, to produce a tensor B, of dimensions b. The dimensions of B depend on the dimensions of A, window size r, symmetric padding p and stride s such that:
  • b = [a0a1...an-3bn-2bn-1]
  • bn-2 = (an-2+2p0-r0)/s0+1
  • bn-1 = (an-1+2p1-r1)/s1+1

Let tensor C be the zero-padded copy of A with dimensions [a0a1... an-2+2p0an-1+2p1]then, Bj......kl= func(Cj…. k:kk l:ll) where kk = k+r0-1, and ll = l+r1-1.

Where func is defined by one of the pooling types t:
PoolingType::kMAX
Maximum over elements in window.
PoolingType::kAVERAGE
Average over elements in the window.
PoolingType::kMAX_AVERAGE_BLEND
Hybrid of maximum and average pooling. The results of the maximum pooling and the average pooling are combined with the blending factor as (1-blendFactor)*maximumPoolingResult + blendFactor*averagePoolingResult to yield the result. The blendFactor can be set to a value between 0 and 1.

By default, average pooling is performed on the overlap between the pooling window and the padded input. If the exclusive parameter is set to true, the average pooling is performed on the overlap area between the pooling window and unpadded input.

Layer Description: 3D pooling

Compute a pooling with 3D filters on a tensor A, of dimensions a, to produce a tensor B, of dimensions b. The dimensions of B depend on the dimensions of A, window size r, symmetric padding p and stride s such that:
  • b = [a0a1...an-4bn-3bn-2bn-1]
  • bn-3 = (an-3+2p0-r0)/s0+1
  • bn-2 = (an-2+2p1-r1)/s1+1
  • bn-1 = (an-1+2p2-r2)/s2+1

Let tensor C be the zero-padded copy of A with dimensions [a0a1... an-3+2p0an-2+2p1an-1+2p2]then, Bj......klm= func(Cj…. k:kk l:ll m:mm) where kk = k+r0-1, ll = l+r1-1, and mm = m+r2-1.

Where func is defined by one of the pooling types t:
PoolingType::kMAX
Maximum over elements in window.
PoolingType::kAVERAGE
Average over elements in the window.
PoolingType::kMAX_AVERAGE_BLEND
Hybrid of maximum and average pooling. The results of the maximum pooling and the average pooling are combined with the blending factor as (1-blendFactor)*maximumPoolingResult + blendFactor*averagePoolingResult to yield the result. The blendFactor can be set to a value between 0 and 1.

By default, average pooling is performed on the overlap between the pooling window and the padded input. If the exclusive parameter is set to true, the average pooling is performed on the overlap area between the pooling window and unpadded input.

Conditions And Limitations

2D or 3D is determined by number of input kernel dimensions. For 2D pooling, input and output tensors should have 3 or more dimensions. For 3D pooling, input and output tensors should have 4 or more dimensions.

See the C++ IPoolingLayer method or the Python IPoolingLayer method for further details.

A.1.15. IRaggedSoftMaxLayer

The IRaggedSoftMaxLayer applies the SoftMax function on an input tensor of sequences across the sequence lengths specified by the user.

Layer Description

This layer has two inputs: a 2D input tensor A of shape zs containing z sequences of data and a 1D bounds tensor B of shape z containing the lengths of each of the z sequences in A. The resulting output tensor C has the same dimensions as the input tensor A.

The SoftMax function S is defined on every i of the z sequences of data values Ai,0:Bi just like in the SoftMax layer.

Conditions And Limitations

None

A.1.16. IReduceLayer

The IReduceLayer implements dimension reduction of tensors using reduce operators.

Layer Description

The IReduceLayer computes a reduction of input tensor A, of dimensions a, to produce an output tensor B, of dimensions b, over the set of reduction dimensions r. The reduction operator op is one of max, min, product, sum, and average. The reduction can preserve the number of dimensions of A or not. If the dimensions are kept, then bi={1 if i∈r, and ai otherwise}; if the dimensions are not kept, then bj-m(j)=aj where j∈r and m(j) is the number of reduction indexes in r less than or equal to j.

With the sequence of indexes i, Bi=op(Aj), where the sequence of indexes j is such that jk={: if k∈r, and ik otherwise}.

Conditions And Limitations

The input must have at least one non-batch dimension. The batch size dimension cannot be reduced.

See the C++ IReduceLayer method or the Python IReduceLayer method for further details.

A.1.17. IResizeLayer

The IResizeLayer implements the resize operation on an input tensor.

Layer Description

The IResizeLayer resizes input tensor A, of dimension a, to produce an output tensor B, of dimensions b, using a given resize mode m. Output dimension b can either be provided directly or can be computed using resize scales s. If resize scales s are provided, bi={floor(ai * si)}.

Interpolation mode such as Nearest and Linear are supported for the resize operation. The nearest mode resizes innermost d dimensions of N-D tensors, where d∈ (0, min(8, N) and N > 0. The linear mode resizes innermost d dimensions of N-D tensors, where d ∈ (0, min(3, N) and N > 0. The resize operation can also be configured to align corners while interpolating.

Conditions And Limitations

Either output dimension b or resize scales s must be known and valid. Number of scales must be equal to the number of input dimensions. Number of output dimensions must be equal to the number of input dimensions.

See the C++ IResizeLayer method or the Python IResizeLayer method for further details.

A.1.18. IRNNLayer

The IRNNLayer is deprecated in favor of IRNNv2Layer, however, it is still available for backwards compatibility.

Layer Description

This layer is identical to the IRNNv2Layer in functionality, but contains additional limitations as described in the Conditions and Limitations section.

Conditions And Limitations

Unlike the IRNNv2Layer, the legacy IRNNLayer does not support specifying sequence lengths via an input tensor.

The legacy IRNNLayer does not support arbitrary batch dimensions, and requires that input tensor data be specified using the dimension ordering: sequence length T, batch size N, embedding size E. In contrast, the IRNNv2Layer requires that tensor data be specified using the dimension ordering: batch size N, sequence length T, embedding size E.

All limitations that apply to the IRNNv2Layer also apply to the legacy RNN layer.

See the C++ IRNNLayer method or the Python IRNNLayer method for further details.

A.1.19. IRNNv2Layer

The IRNNv2Layer implements recurrent layers such as Recurrent Neural Network (RNN), Gated Recurrent Units (GRU), and Long Short-Term Memory (LSTM). Supported types are RNN, GRU, and LSTM. It performs a recurrent operation, where the operation is defined by one of several well-known recurrent neural network (RNN) "cells".

Layer Description

This layer accepts an input sequence X,initial hidden state H0, and if the cell is a long short-term memory (LSTM) cell, initial cell state C0, and produces an output Y which represents the output of the final RNN "sub-layer" computed across T timesteps (see below). Optionally, the layer can also produce an output hT representing the final hidden state, and, if the cell is an LSTM cell, an output cT representing the final cell state.

Let the operation of the cell be defined as the function G(x, h, c). This function takes vector inputs x, h, and c, and produces up to two vector outputs, h and c, representing the hidden and cell state after the cell operation has been performed.

In the default (unidirectional) configuration, the RNNv2 layer applies Gas shown in the following diagram:

G’ is a variant of G,.

Arrows leading into boxes are function inputs, and arrows leading away from boxes are function outputs. X = [x0, x1, …, xT], Y = [y0, y1, …, yT], Hi= [hi,0, hi,1, …, hi,L], and Ci= [ci,0, ci,1, …, ci,L].

The gray c edges are only present if the RNN is using LSTM cells for G and G’.
Note: The above construction has L "sub-layers" (horizontal rows of G), and the matrices Hi and Ci have dimensionality L.

Optionally, the sequence length Tmay be specified as an input to the RNNv2 layer, allowing the client to specify a batch of input sequences with different lengths.

Bidirectional RNNs (BiRNNs): The RNN can be configured to be bidirectional. In that case, each sub-layer consists of a "forward" layer and "backward" layer. The forward layer iteratively applies G using xi from 0 to T, and the backward layer iteratively applies G using xi from T to 0, as shown in the diagram below:

Black bars in the diagram above represent concatenation. The full hidden state ht is defined by the concatenation of the forward hidden state htf and the backward hidden state htb:
  • ht,i = [ htf,i ,htb,i]
  • ht= [ ht,0,, ht,1, …,ht,L].

Similarly, for the cell state (not shown). Each ht,i is used as input to the next sub-layer, as shown above.

RNN operations: The RNNv2 layer supports the following cell operations:
  • ReLU: G(x, h, c) := max(Wix + Rih + Wb + Rb, 0) (c not used)
  • tanh: G(x, h, c) := tanh(Wix + Rih + Wb + Rb) (c not used)
  • GRU:
    • Z := sigmoid(Wzx + Rzh + Wbz + Rbz)
    • M := sigmoid(Wrx + Rrh + Wbr + Rbr)
    • G(x, h, c) := tanh(Whx + M(h + Rbh) + Wbh) (c not used)
  • LSTM:
    • I := sigmoid(WIx + RIh + Wbi + Rbi)
    • F := sigmoid(Wfx + Rfh + Wbf + Rbf)
    • O := sigmoid(Wox + Roh + Wbo + Rbo)
    • C := tanh(WCx + RCh + Wbc + Rbc)
    • C’ := F × C
    • H := O x tanh(C’)
    • G(x, h, c) := { H, C’ }
For GRU and LSTM, we refer to the intermediate computations for Z, M, I, F, etc. as "gates".

In the unidirectional case, the dimensionality of the W matrices is HxE for the first layer and HxH for subsequent layers (unless skip mode is set, see below). In the bidirectional case, the dimensionality of the W matrices is HxE for the first forward/backward layer, and Hx2H for subsequent layers.

The dimensionality of the R matrices is always HxH. The biases Wbx and Rbx have dimensionality H.

Skip mode: The default mode used by RNNv2 is "linear mode". In this mode, the first sub-layer of the RNNv2 layer uses the cell G’(x, h, c), which accepts a vector x of size E (embedding size), and vectors h and c of size H (hidden state size), and is defined by the cell operation formula. Subsequent layers use the cell G(x, h, c), where x, h, and c are all vectors of size H, and is also defined by the cell operation formula.

Optionally, the RNN can be configured to run in "skip mode", which means the input weight matrices for the first layer are implicitly identity matrices, and x Is expected to be size H.

Conditions And Limitations

The data (X) input and initial hidden/cell state (H0 and C0) tensors have at least 2 non-batch dimensions. Additional dimensions are considered batch dimensions.

The optional sequence length input T is 0-dimensional (scalar) when excluding batch dimensions.

The data (Y) output and final hidden/cell state (HT and CT) tensors have at least 2 non-batch dimensions. Additional dimensions are considered batch dimensions. If the sequence length input is provided, each output in the batch is padded to the maximum sequence length Tmax.

The IRNNv2Layer supports:
  • FP32 and FP16 data type for input and output, hidden, and cell tensors.
  • INT32 data type only for the sequence length tensor.
After the network is defined, you can mark the required outputs. RNNv2 output tensors that are not marked as network outputs or used as inputs to another layer are dropped.
network->markOutput(*pred->getOutput(1));
pred->getOutput(1)->setType(DataType::kINT32);
rnn->getOutput(1)->setName(HIDDEN_OUT_BLOB_NAME);
network->markOutput(*rnn->getOutput(1));
if (rnn->getOperation() == RNNOperation::kLSTM)
{
rnn->getOutput(2)->setName(CELL_OUT_BLOB_NAME);
network->markOutput(*rnn->getOutput(2));
};

See the C++ IRNNv2Layer method or the Python IRNNv2Layer method for further details.

A.1.19.1. RNNv2 Layer Setup

The first layer in the network is an RNN layer. This is added and configured in the addRNNv2Layer() function. This layer consists of the following configuration parameters:
Operation
This defines the operation of the RNN cell. Supported operations are currently relu, LSTM, GRU, and tanh.
Direction
This defines whether the RNN is unidirectional or bidirectional (BiRNN).
Input mode
This defines whether the first layer of the RNN carries out a matrix multiply (linear mode), or the matrix multiply is skipped (skip mode).
For example, in the network used in sampleCharRNN, we used a linear, unidirectional LSTM cell containing LAYER_COUNT number of stacked layers. The code below shows how to create this RNNv2 layer.
auto rnn = network->addRNNv2(*data, LAYER_COUNT, HIDDEN_SIZE, SEQ_SIZE, RNNOperation::kLSTM);
Note: For the RNNv2 layer, weights and bias need to be set separately. For more information, see RNNv2 Layer - Optional Inputs.

For more information, see the TensorRT API documentation.

A.1.19.2. RNNv2 Layer - Optional Inputs

If there are cases where the hidden and cell states need to be pre-initialized to a non-zero value, then you can pre-initialize them via the setHiddenState and setCellState calls. These are optional inputs to the RNN.
C++ code snippet
rnn->setHiddenState(*hiddenIn);
if (rnn->getOperation() == RNNOperation::kLSTM)
    rnn->setCellState(*cellIn);
Python code snippet
rnn.hidden_state = hidden_in
if rnn.op == trt.RNNOperation.LSTM:
rnn.cell_state = cell_in

A.1.20. IScaleLayer

The IScaleLayer implements a per-tensor, per-channel, or per-element affine transformation and/or exponentiation by constant values.

Layer Description

Given an input tensor A, the IScaleLayer performs a per-tensor, per-channel or per-element transformation to produce an output tensor B of the same dimensions. The transformations corresponding to each mode are:
ScaleMode::kUNIFORM tensor-wise transformation
B = (A * scale + shift)power
ScaleMode::kCHANNEL channel-wise transformation
BI = (AI * scalec(I) + shiftc(I))powerc(I)
ScaleMode::kELEMENTWISE element-wise transformation
BI = (AI * scale1 + shift1)power1
Where I represents the indexes of tensor elements and c(I) is the channel dimension in I.

Conditions And Limitations

A must have 3 or more dimensions.

If an empty weight object is provided for scale, shift, or power, then a default value is used. By default, scale has a value of 1.0, shift has a value of 0.0, and power has a value of 1.0.

See the C++ IScaleLayer method or the Python IScaleLayer method for further details.

A.1.21. IShapeLayer

The IShapeLayer gets the shape of a tensor.

Layer Description

The IShapeLayer outputs the dimensions of its input tensor. The output is a 1D tensor of type INT32.

Conditions And Limitations

The input tensor must have at least one dimension. The output tensor is a “shape tensor”, which can be used as an input only for layers that handle shape tensors. See Execution Tensors vs. Shape Tensors in the TensorRT Developer Guide for more information.

See the C++ IShapeLayer method or the Python IShapeLayer method for further details.

A.1.22. IShuffleLayer

The IShuffleLayer implements a reshape and transpose operator for tensors.

Layer Description

The IShuffleLayer implements reshuffling of tensors to permute the tensor and/or reshape it. An input tensor A of dimensions a is transformed by applying a transpose, followed by a reshape operation with reshape dimensions r, and then followed by another transpose operation to produce an output data tensor B of dimensions b.

To apply the transpose operation to A, the permutation order must be specified. The specified permutation p1 is used to permute the elements of A in the following manner to produce output C of dimensions c, such that ci=ap1(i) and CI=Ap1(I) for a sequence of indexes I. By default, the permutation is assumed to be an identity (no change to the input tensor).

The reshape operation does not alter the order of the elements, and reshapes tensor C into tensor R of shape rI, such that rIi={ri if ri>0, ci if ri=0, inferred if ri=-1}. Only one dimension can be inferred, such that rIi=ai.

The reshape dimensions can be specified as build-time constants in the layer or as runtime values by supplying a second input to the layer, which must be a 1D tensor of type INT32.

The second transpose operation is applied after the reshape operation. It follows the same rules as the first transpose operation and requires a permutation (say p2) to be specified. This permutation produces an output tensor B of dimensions b, such that bi=rp2(i) and Bp2(I)=RI for a sequence of indexes I.

Conditions And Limitations

Product of dimensions rI must be equal to the product of input dimensions a.

See the C++ IShuffleLayer method or the Python IShuffleLayer method for further details.

A.1.23. ISliceLayer

The ISliceLayer implements a slice operator for tensors.

Layer Description

Giving an input n-dimension (excluding batch dimension) tensor A, the Slice layer generates an output tensor B with elements extracted from A. The correspondence between element coordinates in A and B is given by: ai = bi*si+oi (0 ≤ i < n), where a, b, s, o are element coordinates in A, element coordinates in B, stride and starting offset, respectively. The stride can be positive, negative or zero.

Conditions And Limitations

The corresponding A coordinates for every element in B must not be out-of-bounds.

See the C++ ISliceLayer method or the Python ISliceLayer method for further details.

A.1.24. ISoftMaxLayer

The ISoftMaxLayer applies the SoftMax function on the input tensor along an input dimension specified by the user.

Layer Description

Given an input tensor A of shape a and an input dimension i, this layer applies the SoftMax function on every slice Aa0, …, ai-1, :, ai+1, …, an-1 along dimension i of A. The resulting output tensor C has the same dimensions as the input tensor A.

The SoftMax function S for a slice x is defined as:

The SoftMax function rescales the input such that every value in the output lies in the range [0, 1] and the values of every slice Ca0, …, ai-1, :, ai+1, …, an-1 along dimension i of C sum up to 1.

Conditions And Limitations

For n being the length of a, the input dimension i should be i∈[0,n-1]. If the user does not provide an input dimension, then i=max(0,n-3).

See the C++ ISoftMaxLayer method or the Python ISoftMaxLayer method for further details.

A.1.25. ITopKLayer

The ITopKLayer finds the top K maximum (or minimum) elements along a dimension, returning a reduced tensor and a tensor of index positions.

Layer Description

For an input tensor A of dimensions a, given an axis i, an operator that is either max or min, and a value for k, produces a tensor of values V and a tensor of indices I of dimensions v such that vj={k if i≠j, and ai otherwise}.

The output values are:
  • Va0, …, ai-1, :, ai+1, …,an = sort(Aa0, …, ai-1, :, ai+1, …,an):K
  • Ia0, …, ai-1, :, ai+1, …,an = argsort(Aa0, …, ai-1, :, ai+1, …,an):K
where sort is in descending order for operator max and ascending order for operator min.

Ties are broken during sorting with lower index considered to be larger for operator max, and lower index considered to be smaller for operator min.

Conditions And Limitations

The K value must be 3840 or less. Only one axis can be searched to find the top K minimum or maximum values; this axis cannot be the batch dimension.

See the C++ ITopKLayer method or the Python ITopKLayer method for further details.

A.1.25.1. TopK Layer Setup

The TopK layer is used to identify the character that has the maximum probability of appearing next.
Note: The layer has two outputs. The first output is an array of the top K values. The second, which is of more interest to us, is the index at which these maximum values appear.
The code below sets up the TopK layer and assigns the OUTPUT_BLOB_NAME to the second output of the layer.
C++ code snippet
auto pred =  network->addTopK(*addBiasLayer->getOutput(0),                                     	 	nvinfer1::TopKOperation::kMAX, 1, reduceAxis);
assert(pred != nullptr);
pred->getOutput(1)->setName(OUTPUT_BLOB_NAME);
Python code snippet
pred = network.add_topk(add_bias_layer.get_output(0),                                     	 	trt.TopKOperation.MAX, 1, reduce_axis)
assert pred != None
pred.get_output(1).name = OUTPUT_BLOB_NAME

For more information, see the TensorRT API documentation.

A.1.26. IUnaryLayer

The IUnaryLayer supports pointwise unary operations.

Layer Description

The IUnaryLayer performs pointwise operations on input tensor Aresulting in output tensor Bof the same dimensions. The following functions are supported:
  • exp: B = e^A
  • abs: B = |A|
  • log: B = ln(A)
  • sqrt: B = √A (rounded to nearest even mode)
  • neg: B = -A
  • recip: B = 1 / A (reciprocal) in rounded to nearest even mode

Conditions And Limitations

Input and output can be zero to 7 dimensional tensors.

See the C++ IUnaryLayer method or the Python IUnaryLayer method for further details.

A.2. Data Format Descriptions

TensorRT supports different data formats. There are two aspects to consider: data type and layout.

Data type format

The data type is the representation of each individual value. Its size determines the range of values and the precision of the representation; which are FP32 (32-bit floating point, or single precision), FP16 (16-bit floating point, or half precision), INT32 (32-bit integer representation) and INT8 (8-bit representation).

Layout format

The layout format determines the ordering in which values are stored. Typically, batch dimensions are the leftmost dimensions, and the other dimensions refer to aspects of each data item such as C is channel, H is height, and W is width, in images. Ignoring batch sizes, which are always preceding these, C, H, and W are typically sorted as CHWindex.html#data-format-desc__fig1 or HWCindex.html#data-format-desc__fig2.

To enable faster computations, more formats are defined to pack together channel values and use reduced precision. For this reason, TensorRT also supports formats NC/2HW2 and NHWC8.

In NC/2HW2, pairs of channel values are packed together in each HxW matrix (with an empty value in the case of an odd number of channels). The result is a format in which the values of ⌈C/2⌉HxW matrices are pairs of values of two consecutive channels index.html#data-format-desc__fig3; notice that this ordering interleaves dimensions as values of channels that have stride 1 if they are in the same pair and stride 2xHxW otherwise.

In NHWC8, the entries of an HxW matrix include the values of all the channels index.html#data-format-desc__fig4. In addition, these values are packed together in ⌈C/8⌉ 8-tuples and C is rounded up to the nearest multiple of 8.

A.3. Command-Line Programs

A.3.1. trtexec

Included in the samples directory is a command line wrapper tool, called trtexec. trtexec is a tool to quickly utilize TensorRT without having to develop your own application. The trtexec tool has two main purposes:
  • It’s useful for benchmarking networks on random data.
  • It’s useful for generating serialized engines from models.

Benchmarking network: If you have a model saved as a UFF file, ONNX file, or if you have a network description in a Caffe prototxt format, you can use the trtexec tool to test the performance of running inference on your network using TensorRT. The trtexec tool has many options for specifying inputs and outputs, iterations for performance timing, precision allowed, and other options.

Serialized engine generation: If you generate a saved serialized engine file, you can pull it into another application that runs inference. For example, you can use the TensorRT Laboratory to run the engine with multiple execution contexts from multiple threads in a fully pipelined asynchronous way to test parallel inference performance. There are some caveats, for example, if you used a Caffe prototxt file and a model is not supplied, random weights are generated. Also, in INT8 mode, random weights are used, meaning trtexec does not provide calibration capability.

Refer to GitHub: trtexec/README.md file for detailed information about how to build this tool and examples of its usage.

A.4. ACKNOWLEDGEMENTS

TensorRT uses elements from the following software, whose licenses are reproduced below:

Google Protobuf

This license applies to all parts of Protocol Buffers except the following:
  • Atomicops support for generic gcc, located in src/google/protobuf/stubs/atomicops_internals_generic_gcc.h. This file is copyrighted by Red Hat Inc.
  • Atomicops support for AIX/POWER, located in src/google/protobuf/stubs/atomicops_internals_power.h. This file is copyrighted by Bloomberg Finance LP.
Copyright 2014, Google Inc. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  • Neither the name of Google Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Code generated by the Protocol Buffer compiler is owned by the owner of the input file used when generating it. This code is not standalone and requires a support library to be linked with it. This support library is itself covered by the above license.

Google Flatbuffers

Apache License Version 2.0, January 2004 http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
  1. Definitions.

    "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

    "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

    "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

    "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.

    "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

    "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.

    "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).

    "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.

    "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."

    "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.

  2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
  3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
  4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
    1. You must give any other recipients of the Work or Derivative Works a copy of this License; and
    2. You must cause any modified files to carry prominent notices stating that You changed the files; and
    3. You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
    4. If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.

      You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

  5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
  6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
  7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
  8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
  9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work.

To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.

Copyright 2014 Google Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

BVLC Caffe

COPYRIGHT

All contributions by the University of California:

Copyright (c) 2014, 2015, The Regents of the University of California (Regents) All rights reserved.

All other contributions:

Copyright (c) 2014, 2015, the respective contributors All rights reserved.

Caffe uses a shared copyright model: each contributor holds copyright over their contributions to Caffe. The project versioning records all such contribution and copyright details. If a contributor wants to further mark their specific copyright on a particular contribution, they should indicate their copyright solely in the commit message of the change when it is committed.

LICENSE

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

CONTRIBUTION AGREEMENT

By contributing to the BVLC/Caffe repository through pull-request, comment, or otherwise, the contributor releases their content to the license and copyright terms herein.

half.h

Copyright (c) 2012-2017 Christian Rau <rauy@users.sourceforge.net>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

jQuery.js

jQuery.js is generated automatically under doxygen.

In all cases TensorRT uses the functions under the MIT license.

CRC

policies, either expressed or implied, of the Regents of the University of California.
Note: The copyright of UC Berkeley's Berkeley Software Distribution ("BSD") source has been updated. The copyright addendum may be found at ftp://ftp.cs.berkeley.edu/pub/4bsd/README.Impt.License.Change and is

William Hoskins

Director, Office of Technology Licensing

University of California, Berkeley

getopt.c

Copyright (c) 2002 Todd C. Miller <Todd.Miller@courtesan.com>

Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

Sponsored in part by the Defense Advanced Research Projects Agency (DARPA) and Air Force Research Laboratory, Air Force Materiel Command, USAF, under agreement number F39502-99-1-0512.

Copyright (c) 2000 The NetBSD Foundation, Inc.

All rights reserved.

This code is derived from software contributed to The NetBSD Foundation by Dieter Baron and Thomas Klausner.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Notices

Notice

THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION REFERENCED IN THIS GUIDE IS PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the product described in this guide shall be limited in accordance with the NVIDIA terms and conditions of sale for the product.

THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE, AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE (INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.

NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.

Trademarks

NVIDIA, the NVIDIA logo, and cuBLAS, CUDA, cuDNN, DALI, DIGITS, DGX, DGX-1, DGX-2, DGX Station, DLProf, Jetson, Kepler, Maxwell, NCCL, Nsight Compute, Nsight Systems, NvCaffe, PerfWorks, Pascal, SDK Manager, Tegra, TensorRT, TensorRT Inference Server, Tesla, TF-TRT, and Volta are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

1 The TensorRT Safety 6.0.0 Release does not support the TensorRT 6.0.1 Python API.
2It is recommended to evaluate the calibration input or validate the previous layer outputs.