C++ API Documentation#
Attention
This is the TensorRT-RTX C++ API for the NVIDIA TensorRT library. The NVIDIA TensorRT-RTX C++ API allows developers to import, generate, and deploy networks using C++. Networks can be imported directly from ONNX. They may also be created programmatically by instantiating individual layers and setting parameters and weights directly.
This section illustrates the basic usage of the C++ API, assuming you start with an ONNX model. Refer to the samples in the TensorRT-RTX open source repository for more information.
The C++ API can be accessed through the header NvInfer.h and is in the nvinfer1 namespace. For example, a simple application might begin with:
#include "NvInfer.h"
using namespace nvinfer1;
Interface classes in the TensorRT-RTX C++ API begin with the prefix I, such as ILogger and IBuilder.
A CUDA context is automatically created the first time TensorRT-RTX calls CUDA if none exists before that point. However, it is generally preferable to create and configure the CUDA context yourself before the first call to TensorRT-RTX.
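For example, a minimal sketch of selecting a device and initializing its primary CUDA context before the first TensorRT-RTX call (the device index 0 is illustrative):
#include <cuda_runtime_api.h>

// Select the device and trigger creation of its primary CUDA context.
cudaSetDevice(0);
cudaFree(nullptr); // harmless runtime call that initializes the context if needed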
To illustrate object lifetimes, the code in this chapter does not use smart pointers; however, their use is recommended with TensorRT-RTX interfaces.
The Build Phase#
To create a builder, you must first instantiate the ILogger interface. This example captures all warning messages but ignores informational messages:
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        // suppress info-level messages
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} logger;
You can then create an instance of the builder:
IBuilder* builder = createInferBuilder(logger);
Creating a Network Definition#
After the builder has been created, the first step in optimizing a model is to create a network definition. The network creation options are specified using a combination of flags OR'd together.
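For example, a sketch of assembling such a bitmask; the kSTRONGLY_TYPED creation flag is used here purely for illustration, so choose the flags appropriate for your model:
// OR together the desired NetworkDefinitionCreationFlag bits.
uint32_t flag = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kSTRONGLY_TYPED);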
Finally, create a network:
INetworkDefinition* network = builder->createNetworkV2(flag);
Creating a Network Definition from Scratch (Advanced)#
Instead of using a parser, you can define the network directly through the TensorRT-RTX Network Definition API. This scenario assumes that the per-layer weights are ready in host memory to pass to TensorRT-RTX during network creation.
This example creates a simple network with Input, Convolution, Pooling, MatrixMultiply, Shuffle, Activation, and Softmax layers. It also loads the weights into a weightMap data structure, which is used in the following code.
First, create the builder and network objects.
Logger myLogger;
auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(myLogger));
auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(0));
Add the Input layer to the network by specifying the input tensor’s name, datatype, and full dimensions. A network can have multiple inputs, although in this sample, there is only one:
auto data = network->addInput(INPUT_BLOB_NAME, datatype, Dims4{1, 1, INPUT_H, INPUT_W});
Add the Convolution layer with hidden layer input nodes, strides, and weights for filter and bias.
auto conv1 = network->addConvolutionNd(
    *data, 20, DimsHW{5, 5}, weightMap["conv1filter"], weightMap["conv1bias"]);
conv1->setStrideNd(DimsHW{1, 1});
Note
Weights passed to TensorRT-RTX layers are in host memory.
Add the Pooling layer; note that the output from the previous layer is passed as input.
auto pool1 = network->addPoolingNd(*conv1->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});
pool1->setStrideNd(DimsHW{2, 2});
Add a Shuffle layer to reshape the input in preparation for matrix multiplication:
auto* input = pool1->getOutput(0); // output of the previous (pooling) layer
int32_t const batch = input->getDimensions().d[0];
int32_t const mmInputs = input->getDimensions().d[1] * input->getDimensions().d[2] * input->getDimensions().d[3];
auto inputReshape = network->addShuffle(*input);
inputReshape->setReshapeDimensions(Dims{2, {batch, mmInputs}});
Now, add a MatrixMultiply layer. The model exporter provided transposed weights, so the kTRANSPOSE option is specified.
IConstantLayer* filterConst = network->addConstant(Dims{2, {nbOutputs, mmInputs}}, weightMap["ip1filter"]);
auto mm = network->addMatrixMultiply(*inputReshape->getOutput(0), MatrixOperation::kNONE, *filterConst->getOutput(0), MatrixOperation::kTRANSPOSE);
Add the bias, which will broadcast across the batch dimension.
auto biasConst = network->addConstant(Dims{2, {1, nbOutputs}}, weightMap["ip1bias"]);
auto biasAdd = network->addElementWise(*mm->getOutput(0), *biasConst->getOutput(0), ElementWiseOperation::kSUM);
Add the ReLU Activation layer:
auto relu1 = network->addActivation(*biasAdd->getOutput(0), ActivationType::kRELU);
Add the SoftMax layer to calculate the final probabilities:
auto prob = network->addSoftMax(*relu1->getOutput(0));
Add a name for the output of the SoftMax layer so that the tensor can be bound to a memory buffer at inference time:
prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);
Mark it as the output of the entire network:
network->markOutput(*prob->getOutput(0));
The network representing the MNIST model has now been fully constructed. For instructions on how to build a TensorRT-RTX engine and run an inference with this network, refer to the Building a TensorRT-RTX Engine and Performing Inference sections.
For more information regarding layers, refer to the TensorRT-RTX Operator documentation.
Importing a Model Using the ONNX Parser#
Now, the network definition must be populated from the ONNX representation. The ONNX parser API is in the file NvOnnxParser.h, and the parser is in the nvonnxparser C++ namespace.
#include "NvOnnxParser.h"
using namespace nvonnxparser;
You can create an ONNX parser to populate the network as follows:
IParser* parser = createParser(*network, logger);
Then, read the model file and process any errors.
parser->parseFromFile(modelFile,
static_cast<int32_t>(ILogger::Severity::kWARNING));
for (int32_t i = 0; i < parser->getNbErrors(); ++i)
{
std::cout << parser->getError(i)->desc() << std::endl;
}
An important aspect of a TensorRT-RTX network definition is that it contains pointers to model weights, which the builder copies into the optimized engine. Since the network was created using the parser, the parser owns the memory occupied by the weights, so the parser object should not be deleted until after the builder has run.
Building a TensorRT-RTX Engine#
The next step is to create a build configuration specifying how TensorRT-RTX should optimize the model.
IBuilderConfig* config = builder->createBuilderConfig();
This interface has many properties that you can set to control how TensorRT-RTX optimizes the network. One important property is the maximum workspace size. Layer implementations often require a temporary workspace, and this parameter limits the maximum size that any layer in the network can use. If insufficient workspace is provided, TensorRT-RTX may not be able to find an implementation for a layer. By default, the workspace is set to the total global memory size of the given device; restrict it when necessary, for example, when multiple engines are to be built on a single device.
config->setMemoryPoolLimit(MemoryPoolType::kWORKSPACE, 1U << 20);
Another significant consideration is the maximum shared memory allocation for the CUDA backend implementation. This allocation becomes pivotal in scenarios where TensorRT-RTX needs to coexist with other applications, such as when both TensorRT-RTX and DirectX concurrently utilize the GPU.
config->setMemoryPoolLimit(MemoryPoolType::kTACTIC_SHARED_MEMORY, 48 << 10);
Once the configuration has been specified, the engine can be built.
IHostMemory* serializedModel = builder->buildSerializedNetwork(*network, *config);
Since the serialized engine contains the necessary copies of the weights, the parser, network definition, builder configuration, and builder are no longer necessary and may be safely deleted:
delete parser;
delete network;
delete config;
delete builder;
The engine can then be saved to disk, and the buffer into which it was serialized can be deleted.
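A minimal sketch of writing the serialized engine to disk, using the same file name that the deserialization examples below expect:
#include <fstream>

// Write the serialized engine bytes to a file for later deserialization.
std::ofstream engineFile("model.engine", std::ios::binary);
engineFile.write(static_cast<char const*>(serializedModel->data()), serializedModel->size());
Once the file has been written, release the serialization buffer: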
delete serializedModel;
The above steps for building an engine are called ahead-of-time (AOT) compilation in TensorRT-RTX. AOT compilation takes more time than the just-in-time (JIT) compilation step, typically around 15 seconds for many networks. Therefore, you should compile and save your engine in advance of the application’s use of it. Depending on your needs and the compilation time for your specific network, you may choose to build the engine before deployment, during application install, or on the first run of the application.
Deserializing a TensorRT-RTX Engine#
When you have a previously serialized optimized model and want to perform inference, you must first create an instance of the Runtime interface. Like the builder, the runtime requires an instance of the logger:
IRuntime* runtime = createInferRuntime(logger);
TensorRT-RTX provides two main methods to deserialize an engine, each with its own use cases and benefits.
In-memory Deserialization
This method is straightforward and suitable for smaller models or when memory isn’t a constraint.
std::vector<char> modelData = readModelFromFile("model.engine");
ICudaEngine* engine = runtime->deserializeCudaEngine(modelData.data(), modelData.size());
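The readModelFromFile helper above is not part of the TensorRT-RTX API; a minimal sketch of one possible implementation using a standard file stream:
#include <fstream>
#include <vector>

// Read the entire engine file into a host buffer.
std::vector<char> readModelFromFile(char const* path)
{
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    std::vector<char> buffer(static_cast<size_t>(file.tellg()));
    file.seekg(0, std::ios::beg);
    file.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    return buffer;
}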
IStreamReaderV2 Deserialization
This method allows for more controlled reading of the engine file and is useful for custom file handling or weight streaming. It supports reading from both host and device pointers and enables potential performance improvements. With this approach, it is unnecessary to read the entire TensorRT-RTX engine file into a buffer before deserializing it: IStreamReaderV2 allows reading the file in chunks as needed, possibly bypassing the CPU, thus reducing peak CPU memory usage.
class MyStreamReaderV2 : public IStreamReaderV2 {
// Custom implementation with support for device memory reading
};
MyStreamReaderV2 readerV2("model.engine");
ICudaEngine* engine = runtime->deserializeCudaEngine(readerV2);
The IStreamReaderV2 approach is particularly beneficial for large models or when using advanced features like GPUDirect or weight streaming. It can significantly reduce engine load time and memory usage.
When choosing a deserialization method, consider your specific requirements:
For small models or simple use cases, in-memory deserialization is often sufficient.
For large models or when memory efficiency is crucial, consider using IStreamReaderV2.
If you need custom file handling or weight streaming capabilities, IStreamReaderV2 provides the necessary flexibility.
Performing Inference#
The engine holds the optimized model, but you must manage additional state for intermediate activations to perform inference. This is done using the IExecutionContext interface:
IExecutionContext *context = engine->createExecutionContext();
An engine can have multiple execution contexts, allowing one set of weights to be used for multiple overlapping inference tasks. Creating the execution context triggers JIT compilation of portions of your network’s engine. This step is fairly quick, but you can speed up subsequent runs by using IRuntimeCache. For more information, refer to the Working with Runtime Cache section.
To perform inference, you must pass TensorRT-RTX buffers for input and output, which you specify with calls to setTensorAddress, passing the tensor’s name and the buffer’s address. Use the names you assigned to the input and output tensors when defining the network:
context->setTensorAddress(INPUT_NAME, inputBuffer);
context->setTensorAddress(OUTPUT_NAME, outputBuffer);
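The inputBuffer and outputBuffer pointers above are typically device allocations sized for the corresponding tensors. A minimal sketch, assuming static shapes and float tensors:
#include <cuda_runtime_api.h>

// Compute the element count of a tensor from its dimensions.
auto volume = [](nvinfer1::Dims const& dims) {
    int64_t count = 1;
    for (int32_t i = 0; i < dims.nbDims; ++i)
        count *= dims.d[i];
    return count;
};

void* inputBuffer{nullptr};
void* outputBuffer{nullptr};
cudaMalloc(&inputBuffer, volume(engine->getTensorShape(INPUT_NAME)) * sizeof(float));
cudaMalloc(&outputBuffer, volume(engine->getTensorShape(OUTPUT_NAME)) * sizeof(float));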
If the engine was built with dynamic shapes, you must also specify the input shapes:
context->setInputShape(INPUT_NAME, inputDims);
You can then call TensorRT-RTX’s enqueueV3 method to start inference using a CUDA stream:
context->enqueueV3(stream);
To determine when the kernels (and possibly cudaMemcpyAsync()) are complete, use standard CUDA synchronization mechanisms such as events or waiting on the stream.
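For example, a simple way to wait for the enqueued work to finish is to synchronize on the stream (an event-based wait works equally well):
// Block the host until all work enqueued on the stream has completed.
cudaStreamSynchronize(stream);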