Example Deployment Using ONNX#

ONNX is a framework-agnostic model format that can be exported from most major frameworks, including TensorFlow and PyTorch. TensorRT-RTX provides a parser for directly converting ONNX into a TensorRT-RTX engine.

Specify the Model#

Different TensorRT-RTX conversion paths require different model formats to convert a model successfully. The ONNX path requires that models be saved in the ONNX format.

We will use ResNet-50, a widely used backbone vision model that serves a variety of tasks. Here, we will perform image classification using a pre-trained ResNet-50 ONNX model from the ONNX model zoo.

Download a pre-trained ResNet-50 model from the ONNX model zoo using wget and extract it:

wget https://download.onnxruntime.ai/onnx/models/resnet50.tar.gz
tar xzf resnet50.tar.gz

This will unpack a pre-trained ResNet-50 .onnx file to the path resnet50/model.onnx.
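
If you want to sanity-check the downloaded file before conversion, the onnx Python package can load and validate it. This is an optional step and a minimal sketch; the onnx package is not required by TensorRT-RTX.

import onnx

# Load the extracted model and run ONNX's structural checker.
model = onnx.load("resnet50/model.onnx")
onnx.checker.check_model(model)

# Print input/output tensor names and shapes for reference.
for tensor in list(model.graph.input) + list(model.graph.output):
    dims = [d.dim_value or d.dim_param for d in tensor.type.tensor_type.shape.dim]
    print(tensor.name, dims)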

ONNX models can be exported from most popular deep learning training frameworks such as PyTorch or TensorFlow. When using transformer models from Hugging Face, consider the Optimum library.
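
For instance, exporting a ResNet-50 from torchvision to ONNX looks roughly like this. This is a minimal sketch, assuming torchvision weights, a fixed 1x3x224x224 input, and the output file name resnet50_exported.onnx, none of which come from the model zoo download above.

import torch
import torchvision

# Load a pre-trained ResNet-50 and put it in inference mode.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model.eval()

# Export with a fixed batch-1 input; add dynamic axes if you need variable shapes.
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "resnet50_exported.onnx", opset_version=17)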

Ahead-of-Time (AOT) Build#

You can use the tensorrt_rtx command-line tool to build a TensorRT-RTX engine file. When bundling TensorRT-RTX within an application, you will usually perform this step during the installation process. It does not require access to a GPU and typically completes within 60 seconds for most models and systems.

Run this conversion as follows:

tensorrt_rtx --onnx=resnet50/model.onnx --saveEngine=resnet_engine.trt

This will convert our resnet50/model.onnx to a TensorRT-RTX engine named resnet_engine.trt.

Note

This engine is expected to run on all supported operating systems (Linux and Windows), as well as on all supported RTX GPUs with compute capability 8.0 and above.
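
Because the AOT build usually happens at install time, an application installer can simply invoke the command-line tool. A minimal Python sketch using the same flags as above (the script structure is illustrative, not part of TensorRT-RTX):

import subprocess

# Build the engine once during installation; check=True raises on failure.
subprocess.run(
    ["tensorrt_rtx",
     "--onnx=resnet50/model.onnx",
     "--saveEngine=resnet_engine.trt"],
    check=True,
)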

Just-in-Time (JIT) Compilation and Inference#

After you build the TensorRT-RTX engine file, you can use it to perform inference on input data. During the first inference run, just-in-time compilation selects the exact GPU kernels to run on the target machine. This step is fast (under 5 seconds) for most models and systems. You can optionally cache the compiled kernels to reduce the time to first inference in subsequent invocations.

tensorrt_rtx --loadEngine=resnet_engine.trt \
             --runtimeCachePath=resnet.cache
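
Inside an application, the equivalent step goes through the runtime API rather than the CLI. The following is a rough Python sketch that assumes the TensorRT-RTX Python bindings follow the familiar TensorRT runtime API; the module name tensorrt_rtx, the API parity, and the cache handling are assumptions, so consult the API reference for the exact calls, including how to attach the runtime cache.

import tensorrt_rtx as trt  # assumption: module name and API parity with TensorRT

# Deserialize the AOT-built engine; JIT kernel selection happens on first use.
logger = trt.Logger(trt.Logger.WARNING)
with open("resnet_engine.trt", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

# Create an execution context and bind input/output buffers to run inference.
context = engine.create_execution_context()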

Both the engine file and the runtime cache are typically stored in your application's AppData path.
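
A small sketch of resolving such a per-user location with the standard library (the application folder name MyRTXApp and the ~/.cache fallback on non-Windows systems are assumptions):

import os
from pathlib import Path

# Per-user data directory: %LOCALAPPDATA% on Windows, ~/.cache elsewhere.
base = os.environ.get("LOCALAPPDATA", str(Path.home() / ".cache"))
app_dir = Path(base) / "MyRTXApp"
app_dir.mkdir(parents=True, exist_ok=True)

engine_path = app_dir / "resnet_engine.trt"
runtime_cache_path = app_dir / "resnet.cache"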