C++ API Documentation#
Attention
This is the TensorRT C++ API for the NVIDIA TensorRT library. The NVIDIA TensorRT C++ API allows developers to import, calibrate, generate and deploy networks using C++. Networks can be imported directly from ONNX. They may also be created programmatically by instantiating individual layers and setting parameters and weights directly.
This section illustrates the basic usage of the C++ API, assuming you start with an ONNX model. The sampleOnnxMNIST sample illustrates this use case in more detail.
The C++ API can be accessed through the header NvInfer.h and is in the nvinfer1 namespace. For example, a simple application might begin with:
#include "NvInfer.h"
using namespace nvinfer1;
Interface classes in the TensorRT C++ API begin with the prefix I, such as ILogger and IBuilder.
A CUDA context is created automatically the first time TensorRT calls CUDA if none exists at that point. However, it is generally preferable to create and configure the CUDA context yourself before the first call to TensorRT.
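For example, one common idiom is to select the device and force creation of its primary CUDA context before any TensorRT call; a minimal sketch (the device index is illustrative):
#include <cuda_runtime_api.h>

// Select the GPU and trigger primary context creation; cudaFree(nullptr) is a
// no-op that initializes the CUDA runtime context if it does not exist yet.
cudaSetDevice(0);
cudaFree(nullptr);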
The code in this chapter does not use smart pointers to illustrate object lifetimes; however, their use is recommended with TensorRT interfaces.
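For example, the TensorRT samples wrap interface pointers in a std::unique_ptr with a deleter that calls delete (the supported way to destroy TensorRT objects since TensorRT 8.0); a minimal sketch, where the alias name is illustrative:
#include <memory>

// Deleter that destroys any TensorRT object via delete.
struct InferDeleter
{
    template <typename T>
    void operator()(T* obj) const
    {
        delete obj;
    }
};

template <typename T>
using TrtUniquePtr = std::unique_ptr<T, InferDeleter>;

// Example usage (logger as defined in the next section); the builder is
// released automatically when the pointer goes out of scope.
TrtUniquePtr<IBuilder> builder{createInferBuilder(logger)};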
The Build Phase#
To create a builder, you must first implement the ILogger interface. This example captures all warning messages but ignores informational messages:
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        // suppress info-level messages
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} logger;
You can then create an instance of the builder:
IBuilder* builder = createInferBuilder(logger);
Creating a Network Definition#
After the builder has been created, the first step in optimizing a model is to create a network definition. The network creation options are specified using a combination of flags OR'd together.
You can specify that the network should be considered strongly typed using the NetworkDefinitionCreationFlag::kSTRONGLY_TYPED flag. For more information, refer to the Strongly Typed Networks section.
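For example, the flag value used below can be assembled by OR'ing the desired creation flags into a bitmask; a minimal sketch:
// Request a strongly typed network definition.
auto flag = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kSTRONGLY_TYPED);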
Finally, create a network:
INetworkDefinition* network = builder->createNetworkV2(flag);
Creating a Network Definition from Scratch (Advanced)#
Instead of using a parser, you can define the network directly to TensorRT via the Network Definition API. This scenario assumes that the per-layer weights are ready in host memory to pass to TensorRT during the network creation.
This example creates a simple network with Input, Convolution, Pooling, MatrixMultiply, Shuffle, Activation, and Softmax layers. It also loads the weights into a weightMap data structure, which is used in the following code.
First, create the builder and network objects. Note that the logger is initialized using the logger.cpp file common to all C++ samples. The C++ sample helper classes and functions can be found in the common.h header file.
auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger()));
auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(0));
Add the Input layer to the network by specifying the input tensor’s name, datatype, and full dimensions. A network can have multiple inputs, although in this sample, there is only one:
auto data = network->addInput(INPUT_BLOB_NAME, datatype, Dims4{1, 1, INPUT_H, INPUT_W});
Add the Convolution layer, specifying the number of output feature maps, the kernel size, and the filter and bias weights, then set its stride:
auto conv1 = network->addConvolutionNd(
    *data, 20, DimsHW{5, 5}, weightMap["conv1filter"], weightMap["conv1bias"]);
conv1->setStrideNd(DimsHW{1, 1});
Note
Weights passed to TensorRT layers are in host memory.
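For reference, each entry in weightMap is an nvinfer1::Weights struct that simply points at host data; a minimal sketch with illustrative buffer contents:
// Weights is a lightweight view over host memory: data type, pointer, and element count.
// The host buffer must remain valid until the engine has been built.
static float conv1FilterData[20 * 1 * 5 * 5]; // fill with the trained filter values
weightMap["conv1filter"] = Weights{DataType::kFLOAT, conv1FilterData, 20 * 1 * 5 * 5};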
Add the Pooling layer; note that the output from the previous layer is passed as input.
auto pool1 = network->addPoolingNd(*conv1->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});
pool1->setStrideNd(DimsHW{2, 2});
Add a Shuffle layer to reshape the input in preparation for matrix multiplication:
// The flattened output of the pooling layer feeds the matrix multiplication.
ITensor* input = pool1->getOutput(0);
int32_t const batch = input->getDimensions().d[0];
int32_t const mmInputs = input->getDimensions().d[1] * input->getDimensions().d[2] * input->getDimensions().d[3];
auto inputReshape = network->addShuffle(*input);
inputReshape->setReshapeDimensions(Dims{2, {batch, mmInputs}});
Now, add a MatrixMultiply layer. The model exporter provided transposed weights, so the kTRANSPOSE option is specified.
// nbOutputs is the number of output units of the fully connected computation.
IConstantLayer* filterConst = network->addConstant(Dims{2, {nbOutputs, mmInputs}}, weightMap["ip1filter"]);
auto mm = network->addMatrixMultiply(*inputReshape->getOutput(0), MatrixOperation::kNONE, *filterConst->getOutput(0), MatrixOperation::kTRANSPOSE);
Add the bias, which will broadcast across the batch dimension.
auto biasConst = network->addConstant(Dims{2, {1, nbOutputs}}, weightMap["ip1bias"]);
auto biasAdd = network->addElementWise(*mm->getOutput(0), *biasConst->getOutput(0), ElementWiseOperation::kSUM);
Add the ReLU Activation layer:
auto relu1 = network->addActivation(*biasAdd->getOutput(0), ActivationType::kRELU);
Add the SoftMax layer to calculate the final probabilities:
auto prob = network->addSoftMax(*relu1->getOutput(0));
Add a name for the output of the SoftMax layer so that the tensor can be bound to a memory buffer at inference time:
prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);
Mark it as the output of the entire network:
network->markOutput(*prob->getOutput(0));
The network representing the MNIST model has now been fully constructed. For instructions on how to build an engine and run an inference with this network, refer to the Building an Engine and Performing Inference sections.
For more information regarding layers, refer to the TensorRT Operator documentation.
Importing a Model Using the ONNX Parser#
Now, the network definition must be populated from the ONNX representation. The ONNX parser API is in the file NvOnnxParser.h, and the parser is in the nvonnxparser C++ namespace.
#include "NvOnnxParser.h"
using namespace nvonnxparser;
You can create an ONNX parser to populate the network as follows:
IParser* parser = createParser(*network, logger);
Then, read the model file and process any errors.
parser->parseFromFile(modelFile,
    static_cast<int32_t>(ILogger::Severity::kWARNING));
for (int32_t i = 0; i < parser->getNbErrors(); ++i)
{
    std::cout << parser->getError(i)->desc() << std::endl;
}
An important aspect of a TensorRT network definition is that it contains pointers to model weights, which the builder copies into the optimized engine. Since the network was created using the parser, the parser owns the memory occupied by the weights, so the parser object should not be deleted until after the builder has run.
Building an Engine#
The next step is to create a build configuration specifying how TensorRT should optimize the model.
IBuilderConfig* config = builder->createBuilderConfig();
This interface has many properties that you can set to control how TensorRT optimizes the network. One important property is the maximum workspace size. Layer implementations often require a temporary workspace, and this parameter limits the maximum size that any layer in the network can use. If insufficient workspace is provided, TensorRT may not be able to find an implementation for a layer. By default, the workspace is set to the total global memory size of the given device; restrict it when necessary, for example, when multiple engines are to be built on a single device.
config->setMemoryPoolLimit(MemoryPoolType::kWORKSPACE, 1U << 20);
Another significant consideration is the maximum shared memory allocation for the CUDA backend implementation. This limit matters when TensorRT must coexist with other applications on the GPU, such as when TensorRT and DirectX use the GPU concurrently.
config->setMemoryPoolLimit(MemoryPoolType::kTACTIC_SHARED_MEMORY, 48 << 10);
Once the configuration has been specified, the engine can be built.
IHostMemory* serializedModel = builder->buildSerializedNetwork(*network, *config);
Since the serialized engine contains the necessary copies of the weights, the parser, network definition, builder configuration, and builder are no longer necessary and may be safely deleted:
delete parser;
delete network;
delete config;
delete builder;
The engine can then be saved to disk, and the buffer into which it was serialized can be deleted.
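For example, a minimal sketch of writing the serialized engine to a file (the file name is illustrative):
#include <fstream>

// Persist the serialized plan so it can be deserialized later at runtime.
std::ofstream planFile("model.plan", std::ios::binary);
planFile.write(static_cast<char const*>(serializedModel->data()), static_cast<std::streamsize>(serializedModel->size()));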
delete serializedModel;
Note
Serialized engines are not portable across platforms. They are specific to the exact GPU model on which they were built (in addition to the platform).
Building engines is intended as an offline process, so it can take significant time. The Optimizing Builder Performance section has tips on making the builder run faster.
Deserializing a Plan#
When you have a previously serialized optimized model and want to perform inference, you must first create an instance of the Runtime interface. Like the builder, the runtime requires an instance of the logger:
IRuntime* runtime = createInferRuntime(logger);
TensorRT provides multiple ways to deserialize an engine; the two most common approaches, each with its own use case and benefits, are described below.
In-memory Deserialization
This method is straightforward and suitable for smaller models or when memory isn’t a constraint.
std::vector<char> modelData = readModelFromFile("model.plan");
ICudaEngine* engine = runtime->deserializeCudaEngine(modelData.data(), modelData.size());
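readModelFromFile is not a TensorRT API; a minimal sketch of such a helper might look like this:
#include <fstream>
#include <string>
#include <vector>

// Read an entire serialized engine file into a host buffer.
std::vector<char> readModelFromFile(std::string const& path)
{
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    auto const size = file.tellg();
    std::vector<char> buffer(static_cast<std::size_t>(size));
    file.seekg(0);
    file.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    return buffer;
}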
IStreamReaderV2 Deserialization
This method allows for more controlled reading of the engine file and is useful for custom file handling or weight streaming. It supports reading from both host and device pointers and enables potential performance improvements. With this approach, it is unnecessary to read the entire plan file into a buffer before deserializing, because IStreamReaderV2 allows the file to be read in chunks as needed, possibly bypassing the CPU, thereby reducing peak CPU memory usage.
class MyStreamReaderV2 : public IStreamReaderV2 {
// Custom implementation with support for device memory reading
};
MyStreamReaderV2 readerV2("model.plan");
ICudaEngine* engine = runtime->deserializeCudaEngine(readerV2);
The IStreamReaderV2 approach is particularly beneficial for large models or when using advanced features like GPUDirect or weight streaming. It can significantly reduce engine load time and memory usage.
When choosing a deserialization method, consider your specific requirements:
- For small models or simple use cases, in-memory deserialization is often sufficient.
- For large models or when memory efficiency is crucial, consider using IStreamReaderV2.
- If you need custom file handling or weight streaming capabilities, IStreamReaderV2 provides the necessary flexibility.
Performing Inference#
The engine holds the optimized model, but you must manage additional state for intermediate activations to perform inference. This is done using the IExecutionContext interface:
IExecutionContext *context = engine->createExecutionContext();
An engine can have multiple execution contexts, allowing one set of weights to be used for multiple overlapping inference tasks. (A current exception is dynamic shapes, where each optimization profile can have only one execution context unless the preview feature kPROFILE_SHARING_0806 is specified.)
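For example, two contexts created from the same engine can serve overlapping requests; a minimal sketch:
// Both contexts share the engine's weights but each holds its own activation memory,
// so they can run concurrently (for example, on separate CUDA streams).
IExecutionContext* contextA = engine->createExecutionContext();
IExecutionContext* contextB = engine->createExecutionContext();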
To perform inference, you must pass TensorRT buffers for input and output, which you specify with calls to setTensorAddress, taking the tensor's name and the buffer's address. Using the names you provided for the input and output tensors, you can bind each buffer to the correct tensor:
context->setTensorAddress(INPUT_NAME, inputBuffer);
context->setTensorAddress(OUTPUT_NAME, outputBuffer);
If the engine was built with dynamic shapes, you must also specify the input shapes:
context->setInputShape(INPUT_NAME, inputDims);
You can then call TensorRT’s method enqueueV3 to start inference using a CUDA stream:
context->enqueueV3(stream);
A network is executed asynchronously or synchronously depending on its structure and features. A non-exhaustive list of features that can force synchronous behavior includes data-dependent shapes, DLA usage, loops, and synchronous plugins. It is common to enqueue data transfers with cudaMemcpyAsync() before and after the kernels to move data to and from the GPU if it is not already there.
To determine when the kernels (and possibly cudaMemcpyAsync()) are complete, use standard CUDA synchronization mechanisms such as events or waiting on the stream.
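Putting these pieces together, a typical asynchronous iteration might look like the following sketch (hostInput, hostOutput, and the byte sizes are illustrative; the tensor addresses are assumed to have been set as shown above):
cudaStream_t stream;
cudaStreamCreate(&stream);

// Stage the input on the GPU, run the network, and copy the result back,
// all asynchronously on the same stream.
cudaMemcpyAsync(inputBuffer, hostInput, inputSizeBytes, cudaMemcpyHostToDevice, stream);
context->enqueueV3(stream);
cudaMemcpyAsync(hostOutput, outputBuffer, outputSizeBytes, cudaMemcpyDeviceToHost, stream);

// Block until the kernels and both copies have finished before reading hostOutput.
cudaStreamSynchronize(stream);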