C++ API Documentation#
Attention
This is the TensorRT-RTX C++ API for the NVIDIA TensorRT library. The NVIDIA TensorRT-RTX C++ API allows developers to import, generate, and deploy networks using C++. Networks can be imported directly from ONNX. They may also be created programmatically by instantiating individual layers and setting parameters and weights directly.
This section illustrates the basic usage of the C++ API, assuming you start with an ONNX model. Refer to the samples in the TensorRT-RTX open source repository for more information.
The C++ API can be accessed through the header NvInfer.h and is in the nvinfer1 namespace. For example, a simple application might begin with:
#include "NvInfer.h"
using namespace nvinfer1;
Interface classes in the TensorRT-RTX C++ API begin with the prefix I, such as ILogger and IBuilder.
A CUDA context is automatically created the first time TensorRT-RTX calls CUDA if none exists before that point. However, it is generally preferable to create and configure the CUDA context yourself before the first call to TensorRT-RTX.
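For example, a minimal sketch of selecting a device and initializing its primary CUDA context before the first TensorRT-RTX call (the device index 0 is illustrative):
#include <cuda_runtime_api.h>

// Select the device and trigger creation of its primary CUDA context.
cudaSetDevice(0);
cudaFree(nullptr); // harmless runtime call that initializes the context if needed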
To illustrate object lifetimes, the code in this chapter does not use smart pointers; however, their use is recommended with TensorRT-RTX interfaces.
The Build Phase#
To create a builder, you must first instantiate the ILogger interface. This example captures all warning messages but ignores informational messages:
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        // suppress info-level messages
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} logger;
You can then create an instance of the builder:
IBuilder* builder = createInferBuilder(logger);
Creating a Network Definition#
After the builder has been created, the first step in optimizing a model is to create a network definition. The network creation options are specified using a combination of flags OR'd together.
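For example, a sketch of assembling such a bitmask; the kSTRONGLY_TYPED creation flag is used here purely for illustration, so choose the flags appropriate for your model:
// OR together the desired NetworkDefinitionCreationFlag bits.
uint32_t flag = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kSTRONGLY_TYPED);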
Finally, create a network:
INetworkDefinition* network = builder->createNetworkV2(flag);
Creating a Network Definition from Scratch (Advanced)#
Instead of using a parser, you can define the network directly through the TensorRT-RTX Network Definition API. This scenario assumes that the per-layer weights are ready in host memory to pass to TensorRT-RTX during network creation.
This example creates a simple network with Input, Convolution, Pooling, MatrixMultiply, Shuffle, Activation, and Softmax layers. It also loads the weights into a weightMap data structure, which is used in the following code.
First, create the builder and network objects.
Logger myLogger;
auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(myLogger));
auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(0));
Add the Input layer to the network by specifying the input tensor’s name, datatype, and full dimensions. A network can have multiple inputs, although in this sample, there is only one:
auto data = network->addInput(INPUT_BLOB_NAME, datatype, Dims4{1, 1, INPUT_H, INPUT_W});
Add the Convolution layer with hidden layer input nodes, strides, and weights for filter and bias.
auto conv1 = network->addConvolutionNd(
    *data, 20, DimsHW{5, 5}, weightMap["conv1filter"], weightMap["conv1bias"]);
conv1->setStrideNd(DimsHW{1, 1});
Note
Weights passed to TensorRT-RTX layers are in host memory.
Add the Pooling layer; note that the output from the previous layer is passed as input.
auto pool1 = network->addPoolingNd(*conv1->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});
pool1->setStrideNd(DimsHW{2, 2});
Add a Shuffle layer to reshape the input in preparation for matrix multiplication:
auto* input = pool1->getOutput(0); // output of the previous (pooling) layer
int32_t const batch = input->getDimensions().d[0];
int32_t const mmInputs = input->getDimensions().d[1] * input->getDimensions().d[2] * input->getDimensions().d[3];
auto inputReshape = network->addShuffle(*input);
inputReshape->setReshapeDimensions(Dims{2, {batch, mmInputs}});
Now, add a MatrixMultiply layer. The model exporter provided transposed weights, so the kTRANSPOSE option is specified.
IConstantLayer* filterConst = network->addConstant(Dims{2, {nbOutputs, mmInputs}}, weightMap["ip1filter"]);
auto mm = network->addMatrixMultiply(*inputReshape->getOutput(0), MatrixOperation::kNONE, *filterConst->getOutput(0), MatrixOperation::kTRANSPOSE);
Add the bias, which will broadcast across the batch dimension.
auto biasConst = network->addConstant(Dims{2, {1, nbOutputs}}, weightMap["ip1bias"]);
auto biasAdd = network->addElementWise(*mm->getOutput(0), *biasConst->getOutput(0), ElementWiseOperation::kSUM);
Add the ReLU Activation layer:
auto relu1 = network->addActivation(*biasAdd->getOutput(0), ActivationType::kRELU);
Add the SoftMax layer to calculate the final probabilities:
auto prob = network->addSoftMax(*relu1->getOutput(0));
Add a name for the output of the SoftMax layer so that the tensor can be bound to a memory buffer at inference time:
prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);
Mark it as the output of the entire network:
network->markOutput(*prob->getOutput(0));
The network representing the MNIST model has now been fully constructed. For instructions on how to build a TensorRT-RTX engine and run an inference with this network, refer to the Building a TensorRT-RTX Engine and Performing Inference sections.
For more information regarding layers, refer to the TensorRT-RTX Operator documentation.
Importing a Model Using the ONNX Parser#
Now, the network definition must be populated from the ONNX representation. The ONNX parser API is in the file NvOnnxParser.h, and the parser is in the nvonnxparser C++ namespace.
#include "NvOnnxParser.h"
using namespace nvonnxparser;
You can create an ONNX parser to populate the network as follows:
IParser* parser = createParser(*network, logger);
Then, read the model file and process any errors.
parser->parseFromFile(modelFile,
static_cast<int32_t>(ILogger::Severity::kWARNING));
for (int32_t i = 0; i < parser->getNbErrors(); ++i)
{
std::cout << parser->getError(i)->desc() << std::endl;
}
An important aspect of a TensorRT-RTX network definition is that it contains pointers to model weights, which the builder copies into the optimized engine. Since the network was created using the parser, the parser owns the memory occupied by the weights, so the parser object should not be deleted until after the builder has run.
Building a TensorRT-RTX Engine#
The next step is to create a build configuration specifying how TensorRT-RTX should optimize the model.
IBuilderConfig* config = builder->createBuilderConfig();
This interface has many properties that you can set to control how TensorRT-RTX optimizes the network. One important property is the maximum workspace size. Layer implementations often require a temporary workspace, and this parameter limits the maximum size that any layer in the network can use. If insufficient workspace is provided, TensorRT-RTX may not be able to find an implementation for a layer. By default, the workspace is set to the total global memory size of the given device; restrict it when necessary, for example, when multiple engines are to be built on a single device.
config->setMemoryPoolLimit(MemoryPoolType::kWORKSPACE, 1U << 20);
Another significant consideration is the maximum shared memory allocation for the CUDA backend implementation. This allocation becomes pivotal in scenarios where TensorRT-RTX needs to coexist with other applications, such as when both TensorRT-RTX and DirectX concurrently utilize the GPU.
config->setMemoryPoolLimit(MemoryPoolType::kTACTIC_SHARED_MEMORY, 48 << 10);
Once the configuration has been specified, the engine can be built.
IHostMemory* serializedModel = builder->buildSerializedNetwork(*network, *config);
Since the serialized engine contains the necessary copies of the weights, the parser, network definition, builder configuration, and builder are no longer necessary and may be safely deleted:
delete parser;
delete network;
delete config;
delete builder;
The engine can then be saved to disk, and the buffer into which it was serialized can be deleted.
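A minimal sketch of writing the serialized engine to disk, using the same file name that the deserialization examples below expect:
#include <fstream>

// Write the serialized engine bytes to a file for later deserialization.
std::ofstream engineFile("model.engine", std::ios::binary);
engineFile.write(static_cast<char const*>(serializedModel->data()), serializedModel->size());
Once the file has been written, release the serialization buffer: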
delete serializedModel;
The above steps for building an engine are called ahead-of-time (AOT) compilation in TensorRT-RTX. AOT compilation takes more time than the just-in-time (JIT) compilation step, typically around 15 seconds for many networks. Therefore, you should compile and save your engine in advance of the application’s use of it. Depending on your needs and the compilation time for your specific network, you may choose to build the engine before deployment, during application install, or on the first run of the application.
Deserializing a TensorRT-RTX Engine#
When you have a previously serialized optimized model and want to perform inference, you must first create an instance of the Runtime interface. Like the builder, the runtime requires an instance of the logger:
IRuntime* runtime = createInferRuntime(logger);
TensorRT-RTX provides two main methods to deserialize an engine, each with its own use cases and benefits.
In-memory Deserialization
This method is straightforward and suitable for smaller models or when memory isn’t a constraint.
std::vector<char> modelData = readModelFromFile("model.engine");
ICudaEngine* engine = runtime->deserializeCudaEngine(modelData.data(), modelData.size());
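The readModelFromFile helper above is not part of the TensorRT-RTX API; a minimal sketch of one possible implementation using a standard file stream:
#include <fstream>
#include <vector>

// Read the entire engine file into a host buffer.
std::vector<char> readModelFromFile(char const* path)
{
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    std::vector<char> buffer(static_cast<size_t>(file.tellg()));
    file.seekg(0, std::ios::beg);
    file.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    return buffer;
}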
IStreamReaderV2 Deserialization
This method allows for more controlled reading of the engine file and is useful for custom file handling or weight streaming. It supports reading from both host and device pointers and enables potential performance improvements. With this approach, it is unnecessary to read the entire TensorRT-RTX engine file into a buffer before deserializing it: IStreamReaderV2 allows reading the file in chunks as needed, possibly bypassing the CPU, thus reducing peak CPU memory usage.
class MyStreamReaderV2 : public IStreamReaderV2 {
// Custom implementation with support for device memory reading
};
MyStreamReaderV2 readerV2("model.engine");
ICudaEngine* engine = runtime->deserializeCudaEngine(readerV2);
The IStreamReaderV2 approach is particularly beneficial for large models or when using advanced features like GPUDirect or weight streaming. It can significantly reduce engine load time and memory usage.
When choosing a deserialization method, consider your specific requirements:
For small models or simple use cases, in-memory deserialization is often sufficient.
For large models or when memory efficiency is crucial, consider using IStreamReaderV2.
If you need custom file handling or weight streaming capabilities, IStreamReaderV2 provides the necessary flexibility.
Performing Inference#
The engine holds the optimized model, but you must manage additional state for intermediate activations to perform inference. This is done using the IExecutionContext interface:
IExecutionContext *context = engine->createExecutionContext();
An engine can have multiple execution contexts, allowing one set of weights to be used for multiple overlapping inference tasks. Creating the execution context triggers JIT compilation of portions of your network’s engine. This step is fairly quick, but you can speed up subsequent runs by using IRuntimeCache. For more information, refer to the Working with Runtime Cache section.
To perform inference, you must pass TensorRT-RTX buffers for input and output, which you specify with calls to setTensorAddress, passing the tensor’s name and the buffer’s address. Use the names you assigned to the input and output tensors when defining the network:
context->setTensorAddress(INPUT_NAME, inputBuffer);
context->setTensorAddress(OUTPUT_NAME, outputBuffer);
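The inputBuffer and outputBuffer pointers above are typically device allocations sized for the corresponding tensors. A minimal sketch, assuming static shapes and float tensors:
#include <cuda_runtime_api.h>

// Compute the element count of a tensor from its dimensions.
auto volume = [](nvinfer1::Dims const& dims) {
    int64_t count = 1;
    for (int32_t i = 0; i < dims.nbDims; ++i)
        count *= dims.d[i];
    return count;
};

void* inputBuffer{nullptr};
void* outputBuffer{nullptr};
cudaMalloc(&inputBuffer, volume(engine->getTensorShape(INPUT_NAME)) * sizeof(float));
cudaMalloc(&outputBuffer, volume(engine->getTensorShape(OUTPUT_NAME)) * sizeof(float));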
If the engine was built with dynamic shapes, you must also specify the input shapes:
context->setInputShape(INPUT_NAME, inputDims);
You can then call TensorRT-RTX’s enqueueV3 method to start inference using a CUDA stream:
context->enqueueV3(stream);
To determine when the kernels (and possibly cudaMemcpyAsync()) are complete, use standard CUDA synchronization mechanisms such as events or waiting on the stream.
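For example, a simple way to wait for the enqueued work to finish is to synchronize on the stream (an event-based wait works equally well):
// Block the host until all work enqueued on the stream has completed.
cudaStreamSynchronize(stream);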