Best Practices

This guide presents the TensorRT performance workflow as a measure-then-optimize loop with three parts: benchmarking (measuring what your model actually does), optimization (changing what your model does so it runs faster), and optional global performance tuning (an automated build-route search for when you need another engine-level pass). Benchmarking and optimization form the core feedback loop: measure first, optimize, then measure again to confirm that the change had the impact you expected.

Benchmarking

Benchmarking is how you turn a TensorRT model into trustworthy numbers (latency, throughput, per-layer cost) that you can compare across builds, hardware, and configurations. The Performance Benchmarking chapter walks through:

  • Running trtexec against ONNX models, quantized ONNX models, and serialized engines

  • Running Torch-TRT against trained and quantized PyTorch models and measuring their performance

  • Measuring latency and throughput with wall-clock timers and CUDA events, and tracking memory usage

  • Profiling per-layer performance with TensorRT’s built-in profiler, Nsight Deep Learning Designer, NVIDIA Nsight Systems, and DLA-specific tooling

  • Controlling the hardware/software environment (GPU clocks, power and thermal throttling, PCIe transfers, driver mode, sync mode) so your numbers are stable and reproducible

Without a stable measurement baseline, every optimization you try is a guess. Start here.
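As a rough illustration of the wall-clock measurement idea in the list above, the following minimal sketch times a callable after a warm-up phase and reports latency and throughput. It is not taken from the Performance Benchmarking chapter: the `benchmark` helper, its parameters, and the `fake_inference` stand-in are all hypothetical names chosen for this example. It assumes that `run_once` blocks until the work is finished; for asynchronous GPU work that means synchronizing the stream before the timer stops, or using CUDA events instead.

```python
import statistics
import time


def benchmark(run_once, warmup=10, iterations=100, batch_size=1):
    """Time run_once() with a wall-clock timer and summarize the results.

    Hypothetical helper for illustration; run_once must be synchronous.
    """
    # Warm-up runs are excluded so one-time costs do not skew the numbers.
    for _ in range(warmup):
        run_once()

    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    total_s = sum(latencies_ms) / 1000.0
    # Nearest-rank style index for the 99th percentile, clamped to the list.
    p99_index = min(len(latencies_ms) - 1, int(0.99 * len(latencies_ms)))
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "p99_ms": latencies_ms[p99_index],
        # Samples processed per second of measured time.
        "throughput_per_s": batch_size * iterations / total_s,
    }


def fake_inference():
    # Hypothetical stand-in for a synchronous inference call.
    time.sleep(0.001)


if __name__ == "__main__":
    print(benchmark(fake_inference, warmup=5, iterations=50))
```

Reporting the median and a tail percentile alongside the mean makes run-to-run noise visible, which is what lets you tell a real improvement from measurement jitter when you repeat the measurement after an optimization.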

Optimization

Once you trust your numbers, the Optimizing TensorRT Performance chapter covers the techniques you can apply to push them further:

  • Increasing parallelism with batching, CUDA graphs, and within-/cross-inference multi-streaming

  • Letting the builder do more work for you via layer fusion, pointwise fusion, and Q/DQ fusion

  • Tuning specific layer types and aligning tensors for Tensor Core acceleration

  • Advanced topics: deterministic tactic selection, Python performance, and the accuracy ↔ performance tradeoff (see Accuracy Considerations)

  • Cutting engine build time with timing caches and builder optimization levels

Each section is independent, so you can jump straight to the techniques most relevant to the bottlenecks your benchmarks surfaced.
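To make the build-time item in the list above more concrete, here is a small sketch that assembles a trtexec command line combining a timing cache with a builder optimization level. The helper name `trtexec_build_command` and the file names are hypothetical, and the sketch assumes the `--onnx`, `--saveEngine`, `--timingCacheFile`, and `--builderOptimizationLevel` flags with a level range of 0 to 5; check `trtexec --help` for your TensorRT version before relying on them. The code only builds the argument list and does not run trtexec.

```python
import shlex


def trtexec_build_command(onnx_path, engine_path, timing_cache_path,
                          optimization_level=3):
    """Return a trtexec argument list for an engine build.

    Hypothetical helper; flag names and the 0-5 range are assumptions
    to verify against the installed trtexec.
    """
    if not 0 <= optimization_level <= 5:
        raise ValueError("optimization_level must be between 0 and 5")
    return [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        # Reusing the same cache file across builds avoids re-timing tactics.
        f"--timingCacheFile={timing_cache_path}",
        # Lower levels generally shorten the build at some cost in engine speed.
        f"--builderOptimizationLevel={optimization_level}",
    ]


if __name__ == "__main__":
    # File names are placeholders for illustration only.
    command = trtexec_build_command("model.onnx", "model.engine",
                                    "timing.cache", optimization_level=2)
    print(shlex.join(command))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting concerns.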

Global Performance Tuning

Global performance tuning is orthogonal to the techniques above. The Global Performance Tuning chapter covers automated build-route search for cases where you need another engine-level performance pass and can accept additional build time:

  • Querying and setting internal builder knobs and build routes through the C++/Python APIs

  • Running accuracy-aware tuning sweeps with trtexec flags such as --tuneBuildRoutes, --tuningSearch, and --tuningCacheFile

  • Interpreting tuning results and the caveats that apply to non-default build routes

See also

How TensorRT Works

Architecture concepts (builder, runtime, I/O pipeline) that underpin the techniques in these chapters.