Build Your First Engine#

This tutorial walks you through building and running your first NVIDIA TensorRT engine end-to-end in about 10 minutes. It is intentionally narrow: it picks one model, one build command, and one inference command. For the full menu of workflows (PyTorch source models, ONNX models, multiple runtimes, dynamic shapes, quantization), refer to the Quick Start Guide after you finish this tutorial.

This is a tutorial, not a how-to guide. The goal is to give you a working engine on disk and a successful inference run, not to teach you the TensorRT API. After you finish, you will know that your install works and what a successful build looks like end-to-end.

Prerequisites#

You should already have:

You do not need to know the TensorRT API. You do need a working Python environment if you plan to run the optional Step 4.

What You Will Build#

By the end of this tutorial, you will have:

  • A ResNet-50 ONNX model on disk.

  • A TensorRT engine compiled from that model, saved as resnet50.engine.

  • A single inference run printed to your terminal.

Total time: about 10 minutes. Total commands: 5.

Step 1: Get the Model#

Download a ResNet-50 ONNX model:

wget https://github.com/onnx/models/raw/main/validated/vision/classification/resnet/model/resnet50-v2-7.onnx \
    -O resnet50.onnx

If wget is not available on your system, download the file with your browser or with curl -L -o resnet50.onnx <url>.

After this step, ls should show resnet50.onnx in the current directory.
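If neither wget nor curl is available, the same download can be done with Python's standard library. The following is a minimal sketch rather than part of the official workflow: the helper name fetch is hypothetical, and MODEL_URL is the URL from the wget command above.

```python
import shutil
import urllib.request
from pathlib import Path

# Same URL as the wget command in this step.
MODEL_URL = (
    "https://github.com/onnx/models/raw/main/validated/vision/"
    "classification/resnet/model/resnet50-v2-7.onnx"
)


def fetch(url: str, dest: str) -> int:
    """Download url to dest (redirects are followed) and return the size in bytes."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # stream to disk without loading it all in memory
    return Path(dest).stat().st_size


# Hypothetical usage, run from the directory where you want the model:
#   size = fetch(MODEL_URL, "resnet50.onnx")
#   print(f"Saved resnet50.onnx ({size} bytes)")
```

The function returns the number of bytes written, which gives a quick sanity check that the file is not empty.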

Step 2: Build the Engine#

Build an FP16 engine from the ONNX file using the trtexec command-line tool that ships with TensorRT. The --fp16 flag allows FP16 kernels; without it, trtexec builds at its default precision:

trtexec --onnx=resnet50.onnx --saveEngine=resnet50.engine --fp16

The build typically takes 30 to 90 seconds on a modern NVIDIA GPU. You will see TensorRT log lines describing layer fusions, kernel selection, and profiling. When the build finishes, trtexec reports the engine size on disk and the time taken to build.

After this step, ls should show both resnet50.onnx and resnet50.engine.
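A file that exists but is empty still shows up in ls, so a slightly stricter check can be useful. The sketch below is an illustration only: check_artifacts is a hypothetical helper, and the file names are the ones used in this tutorial.

```python
from pathlib import Path


def check_artifacts(directory=".", names=("resnet50.onnx", "resnet50.engine")):
    """Return {name: size_in_bytes}; raise if a file is missing or empty."""
    sizes = {}
    for name in names:
        path = Path(directory) / name
        if not path.is_file():
            raise FileNotFoundError(f"{name} not found in {Path(directory).resolve()}")
        size = path.stat().st_size
        if size == 0:
            raise ValueError(f"{name} exists but is empty")
        sizes[name] = size
    return sizes


# Hypothetical usage from the tutorial directory:
#   for name, size in check_artifacts().items():
#       print(f"{name}: {size / 1e6:.1f} MB")
```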

Step 3: Verify the Engine Runs#

Run inference with random inputs to confirm the engine loads and executes:

trtexec --loadEngine=resnet50.engine --shapes=data:1x3x224x224

Expected inference summary (representative):

[I] Average on 10 runs - GPU latency: 1.2 ms (end to end)
[I] Throughput: 833.3 qps
[I] PASSED

You should see latency numbers (mean, median, and percentiles) and a PASSED summary at the end of the output. If you see FAILED or any error, refer to Troubleshooting.
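If you capture the trtexec output to a file or a string, the pass/fail check can be automated. This is a minimal sketch that assumes the output contains a "Throughput: ... qps" line and the word PASSED or FAILED, as in the representative summary above. The summarize helper is hypothetical and is not part of TensorRT.

```python
import re


def summarize(log_text: str) -> dict:
    """Extract pass/fail status and throughput (qps) from trtexec output text."""
    match = re.search(r"Throughput:\s*([0-9.]+)\s*qps", log_text)
    throughput = float(match.group(1)) if match else None
    if "FAILED" in log_text:
        status = "FAILED"
    elif "PASSED" in log_text:
        status = "PASSED"
    else:
        status = "UNKNOWN"
    return {"status": status, "throughput_qps": throughput}


# The representative summary lines from this step.
sample = """\
[I] Average on 10 runs - GPU latency: 1.2 ms (end to end)
[I] Throughput: 833.3 qps
[I] PASSED
"""
print(summarize(sample))  # {'status': 'PASSED', 'throughput_qps': 833.3}
```

The representative numbers are consistent with each other: 1000 ms divided by a 1.2 ms latency is about 833.3 inferences per second.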

What you just did#

You proved three things in 5 commands:

  1. Your TensorRT install can read an ONNX file.

  2. The build phase can compile a model into a GPU-specific engine.

  3. The runtime can load and execute that engine.

This is the minimum viable inference loop. Everything else in the docs builds on this foundation.

Step 4 (optional): Run inference from Python#

If you want to drive the engine from your own code rather than from trtexec, here is the smallest Python runtime program that loads the engine, runs one inference with random inputs, and prints the output shape and predicted class index.

First install the cuda-python package if you do not have it:

pip install cuda-python

Then save the following as run_engine.py and run it with python run_engine.py:

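import numpy as np
import tensorrt as trt
from cuda.bindings import runtime as cudart

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Read the serialized engine that Step 2 wrote to disk.
with open("resnet50.engine", "rb") as f:
    engine_bytes = f.read()

# Deserialize the engine and create an execution context for it.
runtime = trt.Runtime(TRT_LOGGER)
engine = runtime.deserialize_cuda_engine(engine_bytes)
if engine is None:
    raise SystemExit("Failed to deserialize resnet50.engine")
context = engine.create_execution_context()

# Host buffers: one random input image and space for 1000 class scores.
data = np.random.randn(1, 3, 224, 224).astype(np.float32)
output = np.empty((1, 1000), dtype=np.float32)

# Device buffers. Each cuda-python call returns a tuple that starts with an
# error code; this minimal program ignores it.
_, d_in = cudart.cudaMalloc(data.nbytes)
_, d_out = cudart.cudaMalloc(output.nbytes)
cudart.cudaMemcpy(
    d_in, data.ctypes.data, data.nbytes,
    cudart.cudaMemcpyKind.cudaMemcpyHostToDevice,
)

# The ONNX model may declare a dynamic batch dimension, so give the context
# the concrete input shape before running.
context.set_input_shape("data", data.shape)

# Bind the device buffers to the model's input and output tensor names,
# then run on the default stream (0) and wait for it to finish.
context.set_tensor_address("data", d_in)
context.set_tensor_address("resnetv24_dense0_fwd", d_out)
context.execute_async_v3(stream_handle=0)
cudart.cudaStreamSynchronize(0)

# Copy the scores back to the host and release device memory.
cudart.cudaMemcpy(
    output.ctypes.data, d_out, output.nbytes,
    cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost,
)
cudart.cudaFree(d_in)
cudart.cudaFree(d_out)

print(f"Output shape: {output.shape}")
print(f"Predicted class index: {int(output.argmax())}")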

This program is intentionally bare-bones. It does not preprocess a real image, does not use a real label map, does not check CUDA return codes, and does not manage CUDA streams or contexts beyond the minimum needed to get one inference through. The same logic with proper resource management lives in the Quick Start Guide and in the TensorRT samples.
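One small step toward more robust code is to stop discarding the error code that each cuda-python runtime call returns as the first element of its result tuple. The helper below is a sketch under stated assumptions: check is a hypothetical name, and it assumes the error code converts to an integer with 0 meaning success (cudaSuccess).

```python
def check(result):
    """Unwrap a cuda-python style return tuple, raising on a non-zero error code.

    Runtime calls return (error_code, value1, ...). This returns None when
    there are no values, the single value when there is one, or a tuple.
    """
    err, *values = result
    if int(err) != 0:  # assumption: 0 is cudaSuccess
        raise RuntimeError(f"CUDA call failed: {err!r}")
    if not values:
        return None
    return values[0] if len(values) == 1 else tuple(values)


# Intended use with the program above (requires a GPU and cuda-python):
#   d_in = check(cudart.cudaMalloc(data.nbytes))
#   check(cudart.cudaMemcpy(d_in, data.ctypes.data, data.nbytes,
#                           cudart.cudaMemcpyKind.cudaMemcpyHostToDevice))
```

Wrapping each call this way turns a silently ignored failure, such as an allocation that did not succeed, into an exception at the line where it happened.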

Where to go next#

You have a working engine. From here: