Deploy Your First Model#
This walkthrough demonstrates the complete TensorRT-RTX deployment pipeline — from downloading an ONNX model to running optimized inference — using the tensorrt_rtx command-line tool. By the end, you will have:
Built a portable TensorRT-RTX engine from an ONNX model (AOT phase)
Run inference with JIT compilation on your GPU
Enabled runtime caching for faster subsequent runs
For background on how AOT and JIT compilation work together, refer to the Architecture Overview.
Step 1: Download a Sample Model#
This example uses a pre-trained ResNet-50 model from the ONNX Model Zoo:
wget https://download.onnxruntime.ai/onnx/models/resnet50.tar.gz
tar xzf resnet50.tar.gz
This unpacks the model to resnet50/model.onnx.
Using your own model
ONNX models can be exported from most deep learning frameworks:
PyTorch — torch.onnx.export()
TensorFlow — tf2onnx
Hugging Face — Optimum library for transformer models
For a broader overview of ONNX exports, refer to the ONNX Conversion Guide.
Step 2: Build the Engine (AOT)#
Use the tensorrt_rtx CLI to convert the ONNX model into a TensorRT-RTX engine file. This step does not require a GPU. It typically takes 20-30 seconds, up to approximately 60 seconds for complex models.
tensorrt_rtx --onnx=resnet50/model.onnx --saveEngine=resnet_engine.trt
The resulting resnet_engine.trt is portable across all supported operating systems (Linux and Windows) and across RTX GPUs with compute capability 7.5 and above (Turing architecture and later). Engines built on Ampere (compute capability 8.0) or later GPUs can additionally include optimizations specific to those newer architectures, improving performance on them.
Tip
When bundling TensorRT-RTX within an application, perform this step during your application’s install or first-launch flow.
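One way to wire this into a first-launch flow is to build the engine only if it is not already present, so installation and every later start stay fast. This is a sketch under the assumption that tensorrt_rtx is on the PATH; the ensure_engine helper name and file paths are illustrative:

```python
import subprocess
from pathlib import Path

def ensure_engine(onnx_path: str, engine_path: str) -> None:
    """Build the TensorRT-RTX engine once; skip the build if the
    engine file already exists from a previous launch."""
    if not Path(engine_path).exists():
        # AOT build step: no GPU required, runs only on first launch.
        subprocess.run(
            ["tensorrt_rtx", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"],
            check=True,  # raise if the build fails
        )
```

A call such as ensure_engine("resnet50/model.onnx", "resnet_engine.trt") at startup then makes the build a one-time cost.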
Step 3: Run Inference (JIT)#
Load the engine to run inference. On the first invocation, TensorRT-RTX JIT-compiles the engine for the specific GPU — selecting optimal kernels automatically. This typically takes under 5 seconds for most models.
tensorrt_rtx --loadEngine=resnet_engine.trt \
--runtimeCacheFile=resnet.cache
The --runtimeCacheFile flag caches the compiled kernels so subsequent runs skip JIT compilation entirely. In a deployed application, store both the engine file and runtime cache in your application’s data directory (for example, AppData on Windows).
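A small sketch of that storage layout, resolving a per-user data directory on either OS (the "myapp" name and the app_data_dir helper are illustrative, not mandated by TensorRT-RTX):

```python
import os
import sys
from pathlib import Path

def app_data_dir(app_name: str) -> Path:
    """Return (and create) a per-user data directory for the app:
    %APPDATA% on Windows, XDG data dir elsewhere."""
    if sys.platform == "win32":
        base = Path(os.environ.get("APPDATA", Path.home()))
    else:
        base = Path(os.environ.get("XDG_DATA_HOME", Path.home() / ".local" / "share"))
    target = base / app_name
    target.mkdir(parents=True, exist_ok=True)
    return target

data_dir = app_data_dir("myapp")
engine_path = data_dir / "resnet_engine.trt"
cache_path = data_dir / "resnet.cache"
# Then run, e.g.:
#   tensorrt_rtx --loadEngine=<engine_path> --runtimeCacheFile=<cache_path>
```

Keeping the engine and cache side by side means a single directory holds everything needed to skip both the AOT build and the JIT compilation on later runs.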
Next Steps#
ONNX Conversion Guide — Export models from PyTorch, TensorFlow, and other frameworks
Using the TensorRT-RTX Runtime API — Run inference programmatically with the C++ or Python API
Working with Runtime Cache — Advanced caching strategies for production deployments