Best Practices
This guide presents the two pillars of TensorRT performance work: benchmarking (measuring what your model actually does) and optimization (changing what your model does so it runs faster). Treat them as a feedback loop: measure first, optimize, then measure again to confirm the change had the impact you expected.
Benchmarking
Benchmarking is how you turn a TensorRT model into trustworthy numbers (latency, throughput, per-layer cost) that you can compare across builds, hardware, and configurations. The Performance Benchmarking chapter walks through:
Running trtexec against ONNX models, quantized ONNX models, and serialized engines
Running Torch-TRT against trained PyTorch models, quantizing PyTorch models, and measuring performance
Measuring latency and throughput with wall-clock timers and CUDA events, and tracking memory usage
Profiling per-layer performance with TensorRT’s built-in profiler, Nsight Deep Learning Designer, NVIDIA Nsight Systems, and DLA-specific tooling
Controlling the hardware/software environment (GPU clocks, power and thermal throttling, PCIe transfers, driver mode, sync mode) so your numbers are stable and reproducible
Without a stable measurement baseline, every optimization you try is a guess. Start here.
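As a rough illustration of the wall-clock measurement mentioned above, the following is a minimal sketch rather than the method used in the Performance Benchmarking chapter. The names `benchmark`, `run_inference`, and `fake_inference` are hypothetical, and the sketch assumes the callable blocks until its work is complete. GPU work is launched asynchronously, so a real inference call would need to synchronize before the timer stops, or be timed with CUDA events instead.

```python
import math
import statistics
import time


def benchmark(run_inference, warmup=10, iterations=100, batch_size=1):
    """Time a blocking callable with a wall-clock timer and summarize the results."""
    # Warm-up runs absorb one-time costs so they do not skew the measurements.
    for _ in range(warmup):
        run_inference()

    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference()  # assumed to return only after the work has finished
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    total_s = sum(latencies_ms) / 1000.0
    # Nearest-rank 99th percentile over the sorted samples.
    p99_index = max(0, math.ceil(0.99 * len(latencies_ms)) - 1)
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "p99_ms": latencies_ms[p99_index],
        "throughput_per_s": batch_size * iterations / total_s,
    }


def fake_inference():
    # Hypothetical stand-in for a real, synchronous inference call.
    time.sleep(0.001)


if __name__ == "__main__":
    print(benchmark(fake_inference, warmup=5, iterations=50))
```

Reporting the median and a tail percentile alongside the mean makes it easier to spot unstable runs, for example when clocks or thermal throttling change during the measurement.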
Optimization
Once you trust your numbers, the Optimizing TensorRT Performance chapter covers the techniques you can apply to push them further:
Increasing parallelism with batching, CUDA graphs, and within-/cross-inference multi-streaming
Letting the builder do more work for you via layer fusion, pointwise fusion, and Q/DQ fusion
Tuning specific layer types and aligning tensors for Tensor Core acceleration
Advanced topics: deterministic tactic selection, Python performance, and the accuracy ↔ performance tradeoff
Cutting engine build time with timing caches and builder optimization levels
Each section is independent, so you can jump straight to the techniques most relevant to the bottlenecks your benchmarks surfaced.
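To make one of the techniques listed above more concrete, tensor alignment usually comes down to padding a dimension up to the next multiple of some granularity. The sketch below shows only that arithmetic. The helper names `round_up` and `padding_needed` are hypothetical, and the default alignment of 8 is an illustrative assumption: the granularity that actually applies depends on the data type and hardware, so take the real value from the Optimizing TensorRT Performance chapter.

```python
def round_up(value, multiple):
    """Return the smallest multiple of `multiple` that is >= `value`."""
    if value < 0 or multiple <= 0:
        raise ValueError("value must be >= 0 and multiple must be > 0")
    return ((value + multiple - 1) // multiple) * multiple


def padding_needed(channels, alignment=8):
    """Return how many extra channels are needed to reach the alignment.

    The default of 8 is an assumed, illustrative granularity.
    """
    return round_up(channels, alignment) - channels


if __name__ == "__main__":
    for channels in (3, 64, 100):
        print(channels, "->", round_up(channels, 8),
              "(pad by", padding_needed(channels), ")")
```

A dimension that is already aligned needs no padding, so a check like this can be run over a model's layer shapes to find the few that are worth adjusting.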
See also
- How TensorRT Works
Architecture concepts (builder, runtime, I/O pipeline) that underpin the techniques in both chapters.