Glossary
This glossary defines key terms and concepts used throughout the TensorRT documentation. Terms are organized alphabetically for quick reference.
How to Use This Glossary:
Use Ctrl+F (or Cmd+F on Mac) to search for specific terms
Terms are organized by their first letter
Related terms are cross-referenced with italics
Technical API references are linked where applicable
A
Activation - The output tensor of a layer in a neural network. Activations are intermediate results that flow between layers during inference.
B
Batch - A collection of inputs that can all be processed uniformly. Each instance in the batch has the same shape and flows through the network in the same way, so all instances can be computed in parallel.
Builder - TensorRT’s model optimizer. The builder takes a network definition as input, performs device-independent and device-specific optimizations, and creates an engine. For more information, refer to the Builder API.
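A minimal sketch of the build flow using the TensorRT Python API; the one-layer identity network, tensor name, and shape are illustrative only:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # default flags; explicit-batch in recent TensorRT

# A trivial network for illustration: input -> identity -> output.
inp = network.add_input("input", trt.float32, (1, 3, 224, 224))
identity = network.add_identity(inp)
network.mark_output(identity.get_output(0))

config = builder.create_builder_config()
plan = builder.build_serialized_network(network, config)  # serialized engine bytes
```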
C
Calibration - The process of determining optimal quantization scaling factors for INT8 inference. Calibration uses representative input data to minimize accuracy loss when converting from higher precision to INT8.
CUDA Graph - A CUDA feature that allows the GPU to launch a sequence of kernels without CPU involvement, reducing launch overhead and improving performance for inference workloads.
D
Data-Dependent Shape - A tensor shape with a dynamic dimension not calculated solely from network input dimensions and shape tensors.
Device - A specific GPU. Two GPUs are considered identical devices if they have the same model name and configuration.
DLA - Deep Learning Accelerator. A dedicated inference processor on NVIDIA embedded platforms designed for efficient execution of neural networks with lower power consumption than GPU execution.
Dynamic Batch - A mode of inference deployment where the batch size is unknown until runtime. Historically, TensorRT treated batch size as a special dimension and the only configurable dimension at runtime. TensorRT 6 and later allow engines to be built such that all dimensions of inputs can be adjusted at runtime.
Dynamic Shape - A tensor dimension whose value is not known until runtime. Dynamic shapes allow a single engine to handle inputs of varying sizes. See also Optimization Profile.
E
Engine - A representation of a model optimized by the TensorRT builder. An engine contains the optimized network, kernel selections, and memory management strategy. Engines can be serialized to a plan file for later use. For more information, refer to the Execution API.
Execution Context - A runtime instance that maintains state for executing inference with a TensorRT engine. Multiple contexts can share the same engine, enabling concurrent inference on the same model.
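A sketch of sharing one engine across two contexts (Python API), assuming `plan` holds serialized engine bytes such as the output of the Builder sketch above:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(plan)  # `plan`: serialized engine bytes

context_a = engine.create_execution_context()
context_b = engine.create_execution_context()
# Each context keeps its own input shapes and internal state, so the two can
# serve concurrent requests on separate CUDA streams against one set of weights.
```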
Explicitly Data-Dependent Shape - A tensor shape that depends on the dimensions of an output of INonZeroLayer or INMSLayer.
F
Framework Integration - Integration of TensorRT into deep learning frameworks such as PyTorch or TensorFlow, allowing model optimization and inference within the framework. Examples include Torch-TensorRT for PyTorch.
Fusion - An optimization technique where TensorRT combines multiple layers into a single operation to reduce memory bandwidth and kernel launch overhead. For example, convolution, bias, and activation layers might be fused into a single kernel.
I
Implicitly Data-Dependent Shape - A tensor shape with a dynamic dimension calculated from data other than network input dimensions, network input shape tensors, or an explicitly data-dependent shape (see Explicitly Data-Dependent Shape). For example, a shape with a dimension calculated from data output by a convolution.
INT8 - An 8-bit integer data type used for quantized inference. INT8 operations can significantly improve inference performance while maintaining acceptable accuracy when properly calibrated.
K
Kernel - A GPU function that performs a specific computation. TensorRT selects and tunes kernels for each layer during the build phase to optimize performance for the target hardware.
L
Layer - A single operation in a neural network, such as convolution, activation, or pooling. In TensorRT, layers are represented in the network definition and optimized into efficient GPU kernels.
N
Network Definition - A representation of a model in TensorRT before optimization. A network definition is a graph of tensors and operators that describes the model’s structure and connectivity.
O
ONNX - Open Neural Network Exchange. A framework-independent standard for representing machine learning models. TensorRT uses ONNX as the primary format for importing models from training frameworks. For more information, refer to onnx.ai.
ONNX Parser - A parser for creating a TensorRT network definition from an ONNX model. For more details on the C++ ONNX Parser, refer to the NvONNXParser or the Python ONNX Parser.
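A minimal parsing sketch with the Python API; "model.onnx" is a placeholder path:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder path
    ok = parser.parse(f.read())
if not ok:
    for i in range(parser.num_errors):
        print(parser.get_error(i))  # report what the parser rejected
```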
Optimization Profile - A set of dimensions (minimum, optimal, and maximum) specified for each dynamic input dimension. TensorRT uses profiles to optimize engines for specific ranges of input shapes.
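A sketch of one profile for a hypothetical input named "images" whose batch dimension is dynamic:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

profile = builder.create_optimization_profile()
profile.set_shape("images",
                  (1, 3, 224, 224),    # min: smallest shape the engine must accept
                  (8, 3, 224, 224),    # opt: shape TensorRT tunes kernels for
                  (32, 3, 224, 224))   # max: largest shape the engine must accept
config.add_optimization_profile(profile)
```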
P
Plan - An optimized inference engine in a serialized format. Applications deserialize the model from the plan file to initialize the inference engine. A typical workflow builds an engine once and then serializes it as a plan file for later use.
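A sketch of the save/load round trip, assuming `plan` holds the output of `build_serialized_network` and using "model.plan" as a placeholder file name:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# Build once, then persist the plan.
with open("model.plan", "wb") as f:
    f.write(plan)  # `plan`: serialized engine from build_serialized_network

# Later, possibly in another process on the same platform: deserialize.
with open("model.plan", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
```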
Platform - A combination of hardware architecture and operating system. Examples include Linux on x86-64 and QNX on AArch64. Platforms with different architectures or operating systems are considered distinct platforms. Plan files are generally platform-specific.
Plugin - A custom layer implementation that extends TensorRT’s built-in layer support. Plugins allow you to implement operations not natively supported by TensorRT.
Precision - The numerical format used to represent values during computation. TensorRT supports mixed-precision inference with FP32, TF32, FP16, BF16, FP8, INT8, and INT4 data types. The builder selects precisions to balance accuracy and performance based on specified constraints.
PTQ - Post-Training Quantization. A method to quantize a trained model to lower precision (typically INT8) without retraining. PTQ uses calibration data to determine quantization parameters.
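A sketch of a PTQ calibrator following the common pattern from NVIDIA samples; the class name, batch size, and the use of PyCUDA for device buffers are assumptions, not part of this documentation:

```python
import numpy as np
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

class NpyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds representative batches to the builder during INT8 calibration."""

    def __init__(self, batches):
        super().__init__()
        self.batches = iter(batches)  # iterable of np.float32 arrays, one per batch
        self.dev = None               # lazily allocated device buffer

    def get_batch_size(self):
        return 8  # must match the leading dimension of the batches you feed

    def get_batch(self, names):
        batch = next(self.batches, None)
        if batch is None:
            return None  # no more data: calibration ends
        if self.dev is None:
            self.dev = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.dev, np.ascontiguousarray(batch))
        return [int(self.dev)]  # one device pointer per network input

    def read_calibration_cache(self):
        return None  # or return a previously saved cache to skip recalibration

    def write_calibration_cache(self, cache):
        pass  # or persist `cache` to disk for reuse

# Attach to a build:
#   config.set_flag(trt.BuilderFlag.INT8)
#   config.int8_calibrator = NpyCalibrator(my_batches)
```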
Q
QAT - Quantization-Aware Training. A training technique that simulates quantization during training, allowing the model to learn parameters that are more robust to quantization. QAT typically achieves better accuracy than PTQ for INT8 inference.
Quantization - The process of converting a model from higher precision (FP32/FP16) to lower precision (INT8/INT4) to reduce memory usage and improve performance. See also PTQ and QAT.
R
Refit - The process of updating weights in an existing TensorRT engine without rebuilding the entire engine. Refitting is faster than rebuilding and useful for updating model parameters without changing the network structure.
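A refit sketch with the Python API; the weight name and .npy file are hypothetical, and the engine must have been built with `trt.BuilderFlag.REFIT` set:

```python
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
refitter = trt.Refitter(engine, logger)  # `engine` built with BuilderFlag.REFIT

# "conv1.weight" is a hypothetical weight name; the new array must match the
# original weight count and dtype, and must stay alive until refitting finishes.
new_weights = np.load("conv1_weight.npy")
refitter.set_named_weights("conv1.weight", trt.Weights(new_weights))
assert refitter.refit_cuda_engine()  # True on success
```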
Runtime - The component of TensorRT that executes inference on a TensorRT engine. The runtime API supports synchronous and asynchronous execution, profiling, and management of execution contexts.
S
Safety Runtime - A version of the TensorRT runtime certified for use in safety-critical applications, particularly in automotive systems with ISO 26262 requirements.
Shape Tensor - A tensor whose values represent the dimensions of another tensor. Shape tensors enable dynamic control of tensor shapes during inference.
Strongly Typed Network - A network where tensor types are explicitly specified and strictly enforced, rather than being automatically selected by TensorRT for performance.
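A creation sketch; in the Python API, strong typing is requested with a network-creation flag:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# STRONGLY_TYPED makes tensor dtypes part of the network definition; the
# builder then honors them instead of choosing precisions for performance.
flags = 1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
network = builder.create_network(flags)
```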
T
Tactic - A specific implementation choice for a layer, including kernel selection, tile sizes, and algorithm variants. The builder evaluates tactics and selects the fastest option for the target hardware.
Tensor - A multi-dimensional array of data. In TensorRT, tensors represent inputs, outputs, and intermediate activations in a neural network.
Timing Cache - A cache of measured kernel performance data. Timing caches can be serialized and reused to speed up subsequent engine builds with similar network structures.
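A reuse sketch; "timing.cache" is a placeholder path:

```python
import os
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Seed the cache from disk if a previous build saved one; otherwise start empty.
blob = open("timing.cache", "rb").read() if os.path.exists("timing.cache") else b""
cache = config.create_timing_cache(blob)
config.set_timing_cache(cache, ignore_mismatch=False)

# ... builder.build_serialized_network(...) runs here, then persist the timings:
with open("timing.cache", "wb") as f:
    f.write(cache.serialize())
```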
V
Version Compatibility - A TensorRT feature that allows engines built with one TensorRT version to run on later versions, providing forward compatibility across TensorRT releases.
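Enabled at build time with a builder flag (Python API sketch):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()
# The resulting plan can be loaded by later TensorRT releases.
config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)
```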
W
Weight - Learned parameters in a neural network, such as convolution filters or fully connected layer matrices. Weights are optimized during training and remain constant during inference.
Workspace - Temporary GPU memory used by TensorRT during engine building and inference. The workspace size can be controlled to balance performance and memory usage.
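A sketch of capping the workspace with the Python API; the 1 GiB limit is an arbitrary example value:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()
# Tactics that need more scratch memory than this are skipped during tuning.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB
```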