Example Deployment Using ONNX#
ONNX is a framework-agnostic model format that can be exported from most major frameworks, including TensorFlow and PyTorch. TensorRT-RTX provides a parser for directly converting ONNX into a TensorRT-RTX engine.
Specify the Model#
TensorRT-RTX requires models to be saved in the ONNX format in order to convert them successfully.
We will use ResNet-50, a widely used backbone vision model for tasks such as classification, detection, and segmentation. Here, we will perform classification using a pre-trained ResNet-50 ONNX model included with the ONNX model zoo.
Download a pre-trained ResNet-50 model from the ONNX model zoo using wget and untar it.
wget https://download.onnxruntime.ai/onnx/models/resnet50.tar.gz
tar xzf resnet50.tar.gz
This will unpack a pretrained ResNet-50 .onnx file to the path resnet50/model.onnx.
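Before building an engine, it can be useful to sanity-check the downloaded file and inspect its input and output tensors. The following is a minimal sketch using the onnx Python package (assumed to be installed separately, for example with pip install onnx); it is optional and not part of the TensorRT-RTX workflow itself.

import onnx

# Load the unpacked model and run the ONNX checker as a quick sanity check.
model = onnx.load("resnet50/model.onnx")
onnx.checker.check_model(model)

# Print the graph inputs and outputs so we know the tensor names
# to feed at inference time.
for tensor in model.graph.input:
    print("input:", tensor.name)
for tensor in model.graph.output:
    print("output:", tensor.name)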
ONNX models can be exported from most popular deep learning training frameworks such as PyTorch or TensorFlow. When using transformer models from Hugging Face, consider the Optimum library.
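For reference, a PyTorch export of the same architecture might look like the minimal sketch below. It assumes torch and torchvision are installed; the weights tag, output file name, and dynamic-axes settings are illustrative only and are not part of the model-zoo download used in this example.

import torch
import torchvision

# Load a pretrained ResNet-50 from torchvision and switch to inference mode.
model = torchvision.models.resnet50(weights="IMAGENET1K_V2").eval()

# A dummy input defines the input shape recorded in the exported graph.
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "resnet50_exported.onnx",
    input_names=["input"],
    output_names=["output"],
    # Mark the batch dimension as dynamic so the exported model can
    # accept different batch sizes (optional).
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)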
Ahead-of-Time (AOT) Build#
The tensorrt_rtx command-line tool can be used to build a TensorRT-RTX engine file. When bundling TensorRT-RTX within an application, this step is usually performed during installation. It does not require access to a GPU and is expected to complete within 60 seconds for most models and systems.
Run this conversion as follows:
tensorrt_rtx --onnx=resnet50/model.onnx --saveEngine=resnet_engine.trt
This will convert our resnet50/model.onnx to a TensorRT-RTX engine named resnet_engine.trt.
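The same AOT step can also be performed programmatically. The sketch below assumes Python bindings that mirror the familiar TensorRT builder API (Builder, OnnxParser, build_serialized_network) and are importable as tensorrt_rtx; the exact module and call names may differ in your installed version, so treat this as an outline rather than a definitive implementation.

import tensorrt_rtx as trt  # assumption: bindings exposed under this name

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()
parser = trt.OnnxParser(network, logger)

# Parse the ONNX file into the network definition.
with open("resnet50/model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

# Build a serialized engine and write it to disk.
config = builder.create_builder_config()
engine_bytes = builder.build_serialized_network(network, config)

with open("resnet_engine.trt", "wb") as f:
    f.write(engine_bytes)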
Note
This engine is expected to run on all supported operating systems (Linux and Windows), as well as on all supported RTX GPUs with compute capability 8.0 and above.
Just-in-Time (JIT) Compilation and Inference#
After the TensorRT-RTX engine file is built, it can be used to perform inference on input data. On the first inference run, just-in-time compilation selects the exact GPU kernels for the target machine. This step should complete very quickly (< 5 s) for most models and systems. An optional runtime cache can store the compiled kernels to speed up the time to first inference in subsequent invocations.
tensorrt_rtx --loadEngine=resnet_engine.trt \
--runtimeCachePath=resnet.cache
Both the engine file and the runtime cache will typically be stored in the AppData path for your application.
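For applications that embed the runtime rather than invoke the command-line tool, deserializing the engine and running one inference pass might look roughly like the sketch below. It again assumes TensorRT-style Python bindings importable as tensorrt_rtx, uses PyTorch only as a convenient way to allocate GPU buffers, and assumes static input shapes with FP32 I/O; the API equivalent of --runtimeCachePath is omitted here.

import tensorrt_rtx as trt  # assumption: module name
import torch

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Deserialize the AOT-built engine from disk.
with open("resnet_engine.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# The first execution on this machine triggers JIT kernel selection.
context = engine.create_execution_context()

# Allocate one GPU buffer per I/O tensor and bind its address.
# If a shape contains a dynamic dimension (-1), it would need to be
# resolved first, for example with context.set_input_shape(...).
buffers = {}
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    shape = tuple(context.get_tensor_shape(name))
    buffers[name] = torch.empty(shape, dtype=torch.float32, device="cuda")
    context.set_tensor_address(name, buffers[name].data_ptr())

# Run inference on a CUDA stream and wait for completion.
stream = torch.cuda.Stream()
context.execute_async_v3(stream.cuda_stream)
stream.synchronize()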