Build Your First Engine#
This tutorial walks you through building and running your first NVIDIA TensorRT engine end-to-end in about 10 minutes. It is intentionally narrow: it picks one model, one build command, and one inference command. For the full menu of workflows (PyTorch source models, ONNX models, multiple runtimes, dynamic shapes, quantization), refer to the Quick Start Guide after you finish this tutorial.
This is a tutorial, not a how-to guide. The goal is to give you a working engine on disk and a successful inference run, not to teach you the TensorRT API. After you finish, you will know that your install works and what a successful build looks like end-to-end.
Prerequisites#
You should already have:
NVIDIA TensorRT 11.1.0 installed. Refer to Installing TensorRT.
A supported NVIDIA GPU (refer to the Support Matrix).
CUDA 13.3 on your PATH.
You do not need to know the TensorRT API. You do need a working Python environment if you plan to run the optional Step 4.
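The prerequisites above can be spot-checked from Python before you start. The following is a minimal sketch, not an official installer check: it assumes trtexec is expected on your PATH (some installs place it elsewhere), and the helper name check_prerequisites is made up for this illustration. It only tests whether the tool and modules can be found, not whether their versions match.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Return a dict mapping each prerequisite to True (found) or False."""
    return {
        # trtexec is used in Steps 2 and 3; assumed here to be on PATH.
        "trtexec on PATH": shutil.which("trtexec") is not None,
        # The Python modules below are only needed for the optional Step 4.
        "tensorrt Python module": importlib.util.find_spec("tensorrt") is not None,
        "numpy Python module": importlib.util.find_spec("numpy") is not None,
    }


for name, found in check_prerequisites().items():
    print(f"{'OK     ' if found else 'MISSING'}  {name}")
```

A MISSING line for one of the Python modules is not a blocker for Steps 1 to 3.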
What You Will Build#
By the end of this tutorial, you will have:
A ResNet-50 ONNX model on disk.
A TensorRT engine compiled from that model, saved as resnet50.engine.
A single inference run printed to your terminal.
Total time: about 10 minutes. Total commands: 5.
Step 1: Get the Model#
Download a ResNet-50 ONNX model:
wget https://github.com/onnx/models/raw/main/validated/vision/classification/resnet/model/resnet50-v2-7.onnx \
-O resnet50.onnx
If wget is not available on your system, download the file with your
browser or with curl -L -o resnet50.onnx <url>.
After this step, ls should show resnet50.onnx in the current directory.
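Seeing the file in ls does not guarantee the download is the actual model: a failed fetch can leave an HTML error page or a small Git LFS pointer stub under the same name. The sketch below is one way to check for those cases. The helper name check_download and the 1 MB size threshold are assumptions chosen for illustration, not values from the TensorRT documentation.

```python
from pathlib import Path

# First bytes of a Git LFS pointer file (a small text stub, not the model).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def check_download(path, min_bytes=1_000_000):
    """Return a short problem description, or None if the file looks plausible."""
    p = Path(path)
    if not p.is_file():
        return "file not found"
    with p.open("rb") as f:
        head = f.read(512)
    if head.startswith(LFS_POINTER_PREFIX):
        return "Git LFS pointer, not the model itself"
    if head.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "HTML page, probably an error or redirect page"
    size = p.stat().st_size
    if size < min_bytes:  # min_bytes is an illustrative threshold
        return f"only {size} bytes, download may be truncated"
    return None


problem = check_download("resnet50.onnx")
print("resnet50.onnx looks plausible" if problem is None else f"Problem: {problem}")
```

This only rules out a few common failure modes. It does not validate that the file is a well-formed ONNX model; the build in Step 2 is the real test.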
Step 2: Build the Engine#
Build an engine from the ONNX file using the trtexec command-line
tool that ships with TensorRT. The command below passes no precision
flag, so trtexec uses its default precision settings:
trtexec --onnx=resnet50.onnx --saveEngine=resnet50.engine
The build typically takes 30 to 90 seconds on a modern NVIDIA GPU. You will
see TensorRT log lines describing layer fusions, kernel selection, and
profiling. When the build finishes, trtexec reports the engine size on
disk and the time taken to build.
After this step, ls should show both resnet50.onnx and
resnet50.engine.
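If you prefer to launch the build from a script, the same command can be wrapped with the standard subprocess module. This is a sketch under two assumptions: trtexec is on your PATH, and a zero exit code indicates a successful build. The function names are hypothetical; only the --onnx and --saveEngine flags from the command above are used.

```python
import shutil
import subprocess


def trtexec_build_cmd(onnx_path, engine_path):
    """Return the Step 2 build command as an argument list."""
    return ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]


def build_engine(onnx_path, engine_path):
    """Run trtexec; return True if it exits with code 0, False otherwise."""
    if shutil.which("trtexec") is None:
        print("trtexec not found on PATH")
        return False
    # Output is not captured, so the TensorRT log streams to the terminal.
    result = subprocess.run(trtexec_build_cmd(onnx_path, engine_path))
    return result.returncode == 0


if __name__ == "__main__":
    ok = build_engine("resnet50.onnx", "resnet50.engine")
    print("Build succeeded" if ok else "Build did not succeed")
```

Passing the arguments as a list avoids shell quoting issues if your paths contain spaces.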
Step 3: Verify the Engine Runs#
Run inference with random inputs to confirm the engine loads and executes:
trtexec --loadEngine=resnet50.engine --shapes=data:1x3x224x224
Expected inference summary (representative)
[I] Average on 10 runs - GPU latency: 1.2 ms (end to end)
[I] Throughput: 833.3 qps
[I] PASSED
You should see latency numbers (mean, median, and percentiles) and a
PASSED summary at the end of the output. If you see FAILED or any
error, refer to Troubleshooting.
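When you run this check in a script or CI job, you may want to extract the result instead of reading the log by eye. The sketch below parses the representative summary lines shown above. It assumes the log contains a "Throughput: <number> qps" line and the word PASSED or FAILED; the exact log layout can vary between TensorRT versions, so treat the patterns as a starting point. The function name is hypothetical.

```python
import re


def summarize_trtexec_log(text):
    """Extract throughput (qps) and pass/fail status from trtexec output."""
    match = re.search(r"Throughput:\s*([0-9]+(?:\.[0-9]+)?)\s*qps", text)
    throughput = float(match.group(1)) if match else None
    # Treat the run as passed only if PASSED appears and FAILED does not.
    passed = "PASSED" in text and "FAILED" not in text
    return {"throughput_qps": throughput, "passed": passed}


sample = """\
[I] Average on 10 runs - GPU latency: 1.2 ms (end to end)
[I] Throughput: 833.3 qps
[I] PASSED
"""
print(summarize_trtexec_log(sample))
# prints {'throughput_qps': 833.3, 'passed': True}
```

In the sample, 833.3 queries per second works out to about 1000 / 833.3 ≈ 1.2 ms per query, which agrees with the latency line for a batch of one.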
What You Just Did#
You proved three things in 5 commands:
Your TensorRT install can read an ONNX file.
The build phase can compile a model into a GPU-specific engine.
The runtime can load and execute that engine.
This is the minimum viable inference loop. Everything else in the docs builds on this foundation.
Step 4 (Optional): Run Inference from Python#
If you want to drive the engine from your own code rather than from
trtexec, here is the smallest Python runtime program that loads the
engine, runs one inference with random inputs, and prints the output shape
and predicted class index.
First install the cuda-python package if you do not have it:
pip install cuda-python
Then save the following as run_engine.py and run it with
python run_engine.py:
import numpy as np
import tensorrt as trt
from cuda.bindings import runtime as cudart

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Load the serialized engine built in Step 2 and create an execution context.
with open("resnet50.engine", "rb") as f:
    engine_bytes = f.read()
runtime = trt.Runtime(TRT_LOGGER)
engine = runtime.deserialize_cuda_engine(engine_bytes)
context = engine.create_execution_context()

# Host buffers: a random input and an empty array for the 1000 class scores.
data = np.random.randn(1, 3, 224, 224).astype(np.float32)
output = np.empty((1, 1000), dtype=np.float32)

# Device buffers. Each cudart call returns a tuple that starts with an error
# code; this minimal program ignores it.
_, d_in = cudart.cudaMalloc(data.nbytes)
_, d_out = cudart.cudaMalloc(output.nbytes)
cudart.cudaMemcpy(
    d_in, data.ctypes.data, data.nbytes,
    cudart.cudaMemcpyKind.cudaMemcpyHostToDevice,
)

# Bind the device buffers to the engine's input and output tensors by name.
context.set_tensor_address("data", d_in)
context.set_tensor_address("resnetv24_dense0_fwd", d_out)

# Run on the default stream (handle 0) and wait for it to finish.
context.execute_async_v3(stream_handle=0)
cudart.cudaStreamSynchronize(0)

# Copy the result back to the host, then release the device buffers.
cudart.cudaMemcpy(
    output.ctypes.data, d_out, output.nbytes,
    cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost,
)
cudart.cudaFree(d_in)
cudart.cudaFree(d_out)

print(f"Output shape: {output.shape}")
print(f"Predicted class index: {int(output.argmax())}")
This program is intentionally barebones. It does not preprocess a real image, does not use a real label map, and does not manage CUDA streams or contexts beyond the minimum needed to get one inference through. The same logic with proper resource management lives in the Quick Start Guide and in the TensorRT samples.
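The single argmax printed by run_engine.py can be extended to a ranked list. The sketch below assumes the (1, 1000) output holds raw, unnormalized class scores, so it applies a softmax before ranking; if your model already ends in a softmax layer, skip that step. The function name top_k is hypothetical, and random values stand in for real engine output so the snippet runs without a GPU.

```python
import numpy as np


def top_k(logits, k=5):
    """Return (class_index, probability) pairs for the k highest scores."""
    scores = np.asarray(logits, dtype=np.float64).ravel()
    # Subtract the maximum before exponentiating for numerical stability.
    exp = np.exp(scores - scores.max())
    probs = exp / exp.sum()
    # Sort ascending, reverse to descending, and keep the first k indices.
    order = np.argsort(probs)[::-1][:k]
    return [(int(i), float(probs[i])) for i in order]


# Stand-in for the (1, 1000) output array produced by run_engine.py.
rng = np.random.default_rng(0)
fake_output = rng.standard_normal((1, 1000)).astype(np.float32)
for idx, p in top_k(fake_output):
    print(f"class {idx}: {p:.4f}")
```

With random inputs the ranking carries no meaning. It becomes useful once you feed a preprocessed image and map the indices to labels.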
Where to Go Next#
You have a working engine. From here:
TensorRT Capabilities summarizes what the library supports and links to deeper how-to pages.
C++ API Documentation and Python API Documentation walk through the full build and run workflow.
Quick Start Guide covers the same build and run flow with C++ runtime examples, PyTorch source models, and more options.
Working with Dynamic Shapes removes the fixed batch size and resolution you used above.
Performance Benchmarking shows how to measure your engine’s real latency and throughput on your hardware.
Working with Quantized Types covers INT8 and FP8 paths once you want lower precision than FP16.