Example Deployment Using ONNX#
ONNX is a framework-agnostic model format that can be exported from most major frameworks, including TensorFlow and PyTorch. TensorRT-RTX provides a parser for directly converting ONNX into a TensorRT-RTX engine.
Specify the Model#
TensorRT-RTX requires models to be saved in the ONNX format in order to convert them successfully.
We will use ResNet-50, a widely used backbone vision model for tasks such as classification, detection, and segmentation. Here, we will perform classification using a pre-trained ResNet-50 ONNX model included with the ONNX model zoo.
Download a pre-trained ResNet-50 model from the ONNX model zoo using wget and untar it.
wget https://download.onnxruntime.ai/onnx/models/resnet50.tar.gz
tar xzf resnet50.tar.gz
This will unpack a pretrained ResNet-50 .onnx file to the path resnet50/model.onnx.
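Before building an engine, it can be useful to sanity-check the downloaded file and inspect its input and output tensors. The following is a minimal sketch using the onnx Python package (assumed to be installed separately, for example with pip install onnx); it is optional and not part of the TensorRT-RTX workflow itself.

import onnx

# Load the unpacked model and run the ONNX checker as a quick sanity check.
model = onnx.load("resnet50/model.onnx")
onnx.checker.check_model(model)

# Print the graph inputs and outputs so we know the tensor names
# to feed at inference time.
for tensor in model.graph.input:
    print("input:", tensor.name)
for tensor in model.graph.output:
    print("output:", tensor.name)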
ONNX models can be exported from most popular deep learning training frameworks such as PyTorch or TensorFlow. When using transformer models from Hugging Face, consider the Optimum library.
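For reference, a PyTorch export of the same architecture might look like the minimal sketch below. It assumes torch and torchvision are installed; the weights tag, output file name, and dynamic-axes settings are illustrative only and are not part of the model-zoo download used in this example.

import torch
import torchvision

# Load a pretrained ResNet-50 from torchvision and switch to inference mode.
model = torchvision.models.resnet50(weights="IMAGENET1K_V2").eval()

# A dummy input defines the input shape recorded in the exported graph.
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "resnet50_exported.onnx",
    input_names=["input"],
    output_names=["output"],
    # Mark the batch dimension as dynamic so the exported model can
    # accept different batch sizes (optional).
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)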
Ahead-of-Time (AOT) Build#
The tensorrt_rtx command-line tool can be used to build a TensorRT-RTX engine file. When bundling TensorRT-RTX within an application, this step is usually performed during installation. It does not require access to a GPU and is expected to complete within 60 seconds for most models and systems.
Run this conversion as follows:
tensorrt_rtx --onnx=resnet50/model.onnx --saveEngine=resnet_engine.trt
This will convert our resnet50/model.onnx to a TensorRT-RTX engine named resnet_engine.trt.
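The same AOT step can also be performed programmatically. The sketch below assumes Python bindings that mirror the familiar TensorRT builder API (Builder, OnnxParser, build_serialized_network) and are importable as tensorrt_rtx; the exact module and call names may differ in your installed version, so treat this as an outline rather than a definitive implementation.

import tensorrt_rtx as trt  # assumption: bindings exposed under this name

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()
parser = trt.OnnxParser(network, logger)

# Parse the ONNX file into the network definition.
with open("resnet50/model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

# Build a serialized engine and write it to disk.
config = builder.create_builder_config()
engine_bytes = builder.build_serialized_network(network, config)

with open("resnet_engine.trt", "wb") as f:
    f.write(engine_bytes)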
Note
This engine is expected to run on all supported operating systems (Linux and Windows), as well as on all supported RTX GPUs with compute capability 8.0 and above.
Just-in-Time (JIT) Compilation and Inference#
After the TensorRT-RTX engine file is built, it can be used to perform inference on input data. On the first inference run, just-in-time compilation selects the exact GPU kernels for the target machine. This step should complete very quickly (< 5 s) for most models and systems. An optional runtime cache can store the compiled kernels to speed up the time to first inference in subsequent invocations.
tensorrt_rtx --loadEngine=resnet_engine.trt \
--runtimeCachePath=resnet.cache
Both the engine file and the runtime cache will typically be stored in the AppData path for your application.
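For applications that embed the runtime rather than invoke the command-line tool, deserializing the engine and running one inference pass might look roughly like the sketch below. It again assumes TensorRT-style Python bindings importable as tensorrt_rtx, uses PyTorch only as a convenient way to allocate GPU buffers, and assumes static input shapes with FP32 I/O; the API equivalent of --runtimeCachePath is omitted here.

import tensorrt_rtx as trt  # assumption: module name
import torch

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Deserialize the AOT-built engine from disk.
with open("resnet_engine.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# The first execution on this machine triggers JIT kernel selection.
context = engine.create_execution_context()

# Allocate one GPU buffer per I/O tensor and bind its address.
# If a shape contains a dynamic dimension (-1), it would need to be
# resolved first, for example with context.set_input_shape(...).
buffers = {}
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    shape = tuple(context.get_tensor_shape(name))
    buffers[name] = torch.empty(shape, dtype=torch.float32, device="cuda")
    context.set_tensor_address(name, buffers[name].data_ptr())

# Run inference on a CUDA stream and wait for completion.
stream = torch.cuda.Stream()
context.execute_async_v3(stream.cuda_stream)
stream.synchronize()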