Porting Guide for TensorRT Applications#

TensorRT-RTX is a framework for optimizing the inference performance of your AI models, and is designed specifically to simplify deployment of your applications to end-user PCs with NVIDIA RTX GPUs running Windows or Linux. This section will walk you through:

  • Deciding whether TensorRT-RTX is the right inference framework for your application.

  • Porting an existing TensorRT application to use TensorRT-RTX.

Choosing an Inference Solution#

Both TensorRT and TensorRT-RTX excel at optimizing non-LLM models such as CNNs, diffusion models, transformers, and more. At first glance, TensorRT-RTX appears quite similar to TensorRT, as we aimed to make porting TensorRT applications straightforward. However, there are key differences in their capabilities that make TensorRT-RTX more suitable for certain applications, while TensorRT is better for others.

Refer to the Getting Started with TensorRT section for a high-level overview of NVIDIA inference solutions. Let’s explore the key differences between TensorRT and TensorRT-RTX in more detail.

NVIDIA TensorRT

TensorRT is ideal for running models in a data center, particularly when deploying to one or a few specific GPU models, such as GB100 or H100. When you build a TensorRT engine for your model, TensorRT performs autotuning directly on the target device, necessitating access to a machine with the target GPU model. The engine produced by TensorRT will then run on that GPU with high performance.

For optimal performance across multiple GPU models, you need to build and deploy an engine for each one. Additionally, you must include the TensorRT runtime libraries, which are quite large (around 1 GB, depending on the OS).

While TensorRT often delivers the best performance, deploying TensorRT-optimized models to Windows and Linux PC users with various NVIDIA GPUs can be challenging.

NVIDIA TensorRT-RTX

To address deployment challenges, TensorRT-RTX:

  • Uses “Just-In-Time” (JIT) compilation on the end-user device.

  • Quickly produces high-performance inference engines for any NVIDIA RTX GPU starting from the Turing family.

  • Is deployed with a library smaller than 200 MB.

These features allow you to deploy the same application, and even the same engine, to end-users while still achieving high-performance inference across various NVIDIA RTX GPUs.

If you are targeting data center GPUs and require the highest possible inference throughput, TensorRT is the best choice. However, if you are deploying applications to end-users on Windows and Linux PCs with RTX GPUs, TensorRT-RTX significantly improves the inference performance of your models with a lighter-weight and simpler deployment process.

Benchmarking#

To determine whether each solution can meet your performance requirements, compare the performance of TensorRT-RTX and TensorRT. The command-line tool provided by each product can assist with the comparison; follow these steps:

  1. Produce a serialized engine for your model, either by using your own application code to store the result of IBuilder::buildSerializedNetwork() in a file, or by using the appropriate command-line tool to process an ONNX file. For example:

    # TensorRT:
    trtexec --onnx=myModel.onnx --saveEngine=myModel.plan
    # TensorRT-RTX:
    tensorrt_rtx --onnx=myModel.onnx --saveEngine=myModel.rtxplan
    
  2. Use the command-line tools to measure performance.

    # TensorRT:
    trtexec --loadEngine=myModel.plan
    # TensorRT-RTX:
    tensorrt_rtx --loadEngine=myModel.rtxplan
    

For more information on best practices and flags for performance measurement of TensorRT and TensorRT-RTX, refer to the Performance Benchmarking with TensorRT Plan File section.

Engines#

Similar to TensorRT, TensorRT-RTX requires you to compile your model into an “engine” before using it for inference. In TensorRT-RTX, these are occasionally referred to as “JIT-able engines” because they can be Just-In-Time (JIT) compiled on the end-user machine. For the remainder of this document we will refer to them as “TensorRT-RTX engines” or simply “engines” when the meaning is unambiguous.

TensorRT-RTX produces an engine when you invoke IBuilder::buildSerializedNetwork(). This Ahead-of-Time (AOT) compilation step typically completes in under 15 seconds. You can save the contents of the resulting IHostBuffer for later use, and optionally exclude the weights.
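
The following is a minimal sketch of that AOT step for an ONNX model. It assumes the TensorRT-style headers and entry points (NvInfer.h, NvOnnxParser.h, createInferBuilder, and the IHostMemory return type) that TensorRT-RTX largely shares with TensorRT; error handling is omitted, and the file names are only examples.

#include <NvInfer.h>       // header names assumed to match TensorRT's
#include <NvOnnxParser.h>
#include <fstream>
#include <iostream>
#include <memory>

// Minimal logger required by the builder.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, char const* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
        {
            std::cout << msg << std::endl;
        }
    }
};

int main()
{
    Logger logger;

    // Create the builder and an empty network definition.
    auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
    auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(0));

    // Populate the network from an ONNX file.
    auto parser = std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger));
    parser->parseFromFile("myModel.onnx", static_cast<int32_t>(nvinfer1::ILogger::Severity::kWARNING));

    // AOT compilation: produce the serialized engine.
    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    auto serialized = std::unique_ptr<nvinfer1::IHostMemory>(builder->buildSerializedNetwork(*network, *config));

    // Save the engine bytes for later JIT compilation and inference.
    std::ofstream out("myModel.rtxplan", std::ios::binary);
    out.write(static_cast<char const*>(serialized->data()), serialized->size());
    return 0;
}

The resulting myModel.rtxplan file can then be loaded by your application at run time, or measured with the tensorrt_rtx command-line tool as shown in the benchmarking steps above.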

Although TensorRT-RTX engines appear similar to TensorRT engines in the APIs, the two engine formats are not compatible. Therefore, when porting your application from TensorRT to TensorRT-RTX, you must build new engines. Next, we will discuss your options for building these engines.

Deployment Options#

Every model you compile with TensorRT-RTX will result in an engine that your application later loads and executes on the end-user machine. However, you have a couple of major options for when and where you perform AOT compilation, depending on your application’s needs and goals.

CPU-Only AOT#

By default, AOT compilation uses only the CPU and produces an engine compatible with all Ampere and later RTX GPUs. This allows you to compile your TensorRT-RTX engine in advance and then include the engine bytes as part of your application download.

If you also want to support Turing GPUs, such as the RTX 20 series, create a second engine specifically targeted at CUDA Compute Capability 7.5 and deploy both engines: one for Ampere and later GPUs, and one for Turing GPUs. For more information, refer to the CPU-Only AOT and TensorRT-RTX Engines and APIs sections.
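
If you ship both engines, your application can choose which one to load at run time by querying the GPU's compute capability. The following is a minimal sketch using the CUDA runtime API; the engine file names are hypothetical.

#include <cuda_runtime_api.h>
#include <string>

// Choose which pre-built engine file to load for the end user's GPU.
// The file names below are hypothetical examples.
std::string selectEngineFile(int deviceIndex)
{
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, deviceIndex);

    // Compute capability 7.5 is Turing (for example, the RTX 20 series);
    // 8.0 and later covers Ampere and newer RTX GPUs.
    if (prop.major == 7 && prop.minor == 5)
    {
        return "myModel_turing.rtxplan";
    }
    return "myModel_ampere_and_later.rtxplan";
}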

Note that TensorRT-RTX engines are not currently compatible between different releases of TensorRT-RTX. Therefore, if you deploy a new version of the TensorRT-RTX runtime library, you must also provide a new engine compiled with the same version of TensorRT-RTX.

On-Device AOT#

As mentioned earlier, TensorRT-RTX engines can typically be compiled in 15 seconds or less. Given this speed, you may want your application to perform AOT compilation directly on the end-user’s machine. For example, you could perform AOT compilation during installation or upon the first run of the application, targeting only the end-user’s specific GPU. You can then save that engine to persistent storage on the user’s machine for later use.

This approach results in a smaller engine and, in some cases (especially multi-head attention), can yield better performance. You should measure the performance of AOT compilation and inference for your model to determine the best deployment approach.

Note that TensorRT-RTX engines are not currently compatible between different releases of TensorRT-RTX. If you choose the On-Device AOT strategy, ensure that your application rebuilds the engine if you update the version of the TensorRT-RTX runtime library. Additionally, check for changes to the user’s installed GPU; if they upgrade to a newer NVIDIA GPU, you will need to rebuild the engine.
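
One way to implement this strategy is to name the cached engine file after the TensorRT-RTX library version and the installed GPU, so that a library update or a GPU upgrade automatically triggers a rebuild. The sketch below makes two assumptions: buildAndSaveEngine() is a hypothetical helper that performs the AOT compilation shown earlier, and getInferLibVersion() is assumed to be available in TensorRT-RTX as it is in TensorRT.

#include <NvInfer.h>            // header name assumed to match TensorRT's
#include <cuda_runtime_api.h>
#include <filesystem>
#include <string>

// Hypothetical helper: performs the AOT compilation shown earlier and writes
// the serialized engine to the given path.
void buildAndSaveEngine(std::string const& enginePath);

// Return the path of a cached engine for the current GPU and library version,
// rebuilding it on first run or whenever either changes.
std::string getOrBuildEngine(std::string const& cacheDir)
{
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);

    // Encode the library version and GPU name in the file name. In a real
    // application, you may want to sanitize the GPU name for use in a path.
    std::string key = std::to_string(getInferLibVersion()) + "_" + prop.name;
    std::string enginePath = cacheDir + "/myModel_" + key + ".rtxplan";

    if (!std::filesystem::exists(enginePath))
    {
        buildAndSaveEngine(enginePath);
    }
    return enginePath;
}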

Libraries#

TensorRT-RTX requires two libraries:

  • libtensorrt_rtx.so (Linux) or tensorrt_rtx_1_0.dll (Windows)

    The tensorrt_rtx library provides all the core TensorRT-RTX C++ APIs for AOT compilation, JIT compilation, and inference. Every TensorRT-RTX application will use this library. It is analogous to the nvinfer library from TensorRT.

  • libtensorrt_onnxparser_rtx.so (Linux) or tensorrt_onnxparser_rtx_1_0.dll (Windows)

    The ONNX parser library allows you to read ONNX files to produce a network that can be compiled into a TensorRT-RTX engine. TensorRT-RTX applications that use ONNX files will require this library. It is analogous to the nvonnxparser library from TensorRT.

For example, if your application uses a CMake build, you can link to TensorRT and the ONNX parser as follows:

target_link_libraries(helloWorld PRIVATE nvinfer nvonnxparser)

Change this line in your CMakeLists.txt file to:

target_link_libraries(helloWorld PRIVATE tensorrt_rtx tensorrt_onnxparser_rtx)

Using distinct library names allows you to have TensorRT and TensorRT-RTX installed simultaneously.

Python Module#

TensorRT-RTX provides Python bindings under the module name tensorrt_rtx, while TensorRT's Python module is named tensorrt. Therefore, if you are using TensorRT's Python module and want to migrate to TensorRT-RTX, you need to change the module name that you import. For example, your Python code might contain an import statement like this:

import tensorrt as trt

To switch entirely to TensorRT-RTX, change this statement to:

import tensorrt_rtx as trt

The APIs are mostly compatible, so your existing Python code will generally work. If you prefer to use both TensorRT and TensorRT-RTX in your application, you can import and use both:

import tensorrt as trt
import tensorrt_rtx as trtrtx

APIs#

We aim to make it easy for you to port your TensorRT application to TensorRT-RTX if it suits your needs. Therefore, the C++ and Python APIs in TensorRT-RTX are almost identical to those in TensorRT, meaning your application will likely work with just the above library or module changes.

However, you should be aware of some key differences in TensorRT-RTX:

  • Nearly all deprecated APIs have been removed from TensorRT-RTX. If you were using an API that was deprecated in TensorRT and is now removed in TensorRT-RTX, you can refer to the original TensorRT API documentation for suggestions on alternative APIs to use.

    An exception is the set of APIs related to TensorRT’s “weak typing” feature. These APIs are retained in TensorRT-RTX but are unsupported, and calling them causes TensorRT-RTX to log a warning. They were kept because they are common in application code, so you can transition to TensorRT-RTX without immediately removing those calls.

  • Plugins are not supported in TensorRT-RTX. Although the headers still include some of the classes to allow most of your plugin-related code to compile, you will be unable to add plugin layers.

  • TensorRT-RTX introduces new APIs that help you further optimize inference behavior for your specific application on end-user PCs. You can learn more at the following links: