Best Practices#
This guide presents the TensorRT performance workflow as a measure-then-optimize loop: benchmarking (measuring what your model actually does), optimization (changing what your model does so it runs faster), and optional global performance tuning (automated build-route search when you need another engine-level pass). Treat benchmarking and optimization as the core feedback loop: measure first, optimize, then measure again to confirm the change had the impact you expected.
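The "measure again to confirm" step can be sketched as a small comparison between two sets of latency samples. This is a minimal illustration, not part of TensorRT: the function name `compare_runs`, the 2% noise threshold, and the sample values are all assumptions chosen for the example.

```python
import statistics


def compare_runs(baseline_ms, candidate_ms, noise_threshold=0.02):
    """Compare median latencies of two runs (hypothetical helper).

    noise_threshold is an assumed run-to-run noise margin; tune it to the
    variance you actually observe on your own system.
    """
    base = statistics.median(baseline_ms)
    cand = statistics.median(candidate_ms)
    speedup = base / cand
    return {
        "baseline_median_ms": base,
        "candidate_median_ms": cand,
        "speedup": speedup,
        # Only treat the change as confirmed if it clears the noise margin.
        "confirmed": speedup > 1.0 + noise_threshold,
    }


# Made-up latency samples in milliseconds, for illustration only.
result = compare_runs(
    baseline_ms=[4.1, 4.0, 4.2, 4.0, 4.1],
    candidate_ms=[3.1, 3.0, 3.2, 3.0, 3.1],
)
print(result)
```

Comparing medians rather than single runs keeps one outlier from deciding whether an optimization is kept.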
Benchmarking#
Benchmarking is how you turn a TensorRT model into trustworthy numbers (latency, throughput, per-layer cost) that you can compare across builds, hardware, and configurations. The Performance Benchmarking chapter walks through:
Running trtexec against ONNX models, quantized ONNX models, and serialized engines
Running Torch-TRT against trained PyTorch models, quantized PyTorch models, and measuring performance
Measuring latency and throughput with wall-clock timers and CUDA events, and tracking memory usage
Profiling per-layer performance with TensorRT’s built-in profiler, Nsight Deep Learning Designer, NVIDIA Nsight Systems, and DLA-specific tooling
Controlling the hardware/software environment (GPU clocks, power and thermal throttling, PCIe transfers, driver mode, sync mode) so your numbers are stable and reproducible
Without a stable measurement baseline, every optimization you try is a guess. Start here.
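As a rough sketch of the wall-clock measurement mentioned in the list above, the following harness discards warm-up runs and then reports mean, median, and 99th-percentile latency. It uses only the Python standard library. The `benchmark` function and the `run_inference` callable are hypothetical names, and the sketch assumes `run_inference` blocks until the inference has fully completed (for GPU work, that means it synchronizes before returning); otherwise the timer captures only the launch cost.

```python
import math
import statistics
import time


def benchmark(run_inference, warmup=10, iterations=100):
    """Time a blocking callable; return latency statistics in milliseconds."""
    # Warm-up runs are discarded so one-time costs (lazy initialization,
    # clock ramp-up) do not skew the measured samples.
    for _ in range(warmup):
        run_inference()

    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference()
        samples_ms.append((time.perf_counter() - start) * 1e3)

    samples_ms.sort()
    # Nearest-rank 99th percentile of the sorted samples.
    p99_index = max(0, math.ceil(0.99 * len(samples_ms)) - 1)
    mean_ms = statistics.fmean(samples_ms)
    return {
        "mean_ms": mean_ms,
        "median_ms": statistics.median(samples_ms),
        "p99_ms": samples_ms[p99_index],
        # Assumes one inference per call, executed back to back.
        "throughput_qps": 1e3 / mean_ms if mean_ms > 0 else float("inf"),
    }


def fake_inference():
    # Stand-in workload (assumption): sleep roughly 1 ms instead of
    # running a real engine.
    time.sleep(0.001)


stats = benchmark(fake_inference, warmup=5, iterations=50)
print(stats)
```

Reporting the median and a tail percentile alongside the mean makes it easier to spot unstable runs caused by the environmental factors listed above.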
Optimization#
Once you trust your numbers, the Optimizing TensorRT Performance chapter covers the techniques you can apply to push them further:
Increasing parallelism with batching, CUDA graphs, and within-/cross-inference multi-streaming
Letting the builder do more work for you via layer fusion, pointwise fusion, and Q/DQ fusion
Tuning specific layer types and aligning tensors for Tensor Core acceleration
Advanced topics: deterministic tactic selection, Python performance, and the accuracy ↔ performance tradeoff (see Accuracy Considerations)
Cutting engine build time with timing caches and builder optimization levels
Each section is independent, so you can jump straight to the techniques most relevant to the bottlenecks your benchmarks surfaced.
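Several of the techniques above map onto trtexec options. The sketch below assembles a trtexec command line that reuses a timing cache, sets a builder optimization level, and enables CUDA graphs. The helper `trtexec_command` and the file names are hypothetical, and the command is only printed, not executed; confirm the flags against `trtexec --help` for your TensorRT version before relying on them.

```python
import shlex


def trtexec_command(onnx_path, engine_path, timing_cache=None,
                    optimization_level=None, use_cuda_graph=False):
    """Build a trtexec argument list (hypothetical helper)."""
    args = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if timing_cache is not None:
        # Reusing a timing cache across builds can shorten build time.
        args.append(f"--timingCacheFile={timing_cache}")
    if optimization_level is not None:
        # Lower levels tend to build faster; higher levels search more tactics.
        args.append(f"--builderOptimizationLevel={optimization_level}")
    if use_cuda_graph:
        # Capture inference in a CUDA graph to reduce launch overhead.
        args.append("--useCudaGraph")
    return args


# File names here are placeholders, not paths from the guide.
cmd = trtexec_command(
    "model.onnx",
    "model.engine",
    timing_cache="timing.cache",
    optimization_level=3,
    use_cuda_graph=True,
)
print(shlex.join(cmd))
```

Keeping the command as an argument list makes it straightforward to pass to `subprocess.run` later and to vary one option at a time between benchmark runs.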
Global Performance Tuning#
Global performance tuning is orthogonal to the techniques above. The Global Performance Tuning chapter covers automated build-route search for when you need another engine-level performance pass and can accept additional build time:
Querying and setting internal builder knobs and build routes through the C++/Python APIs
Running accuracy-aware tuning sweeps with trtexec flags such as --tuneBuildRoutes, --tuningSearch, and --tuningCacheFile
Interpreting tuning results and the caveats that apply to non-default build routes
See also
- How TensorRT Works
Architecture concepts (builder, runtime, I/O pipeline) that underpin the techniques in these chapters.