Glossary

B

Batch - A collection of inputs that can all be processed uniformly. Each instance in the batch has the same shape and flows through the network in the same way, so all instances can be computed in parallel.
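
As a rough illustration of this definition, the NumPy sketch below stacks several same-shaped inputs into one batch; the shapes and instance count are arbitrary examples, not anything prescribed by TensorRT.

```python
import numpy as np

# Three hypothetical inputs, each with the same shape (3 channels, 224x224).
instances = [np.random.rand(3, 224, 224).astype(np.float32) for _ in range(3)]

# Stacking them produces one batch of shape (3, 3, 224, 224). Because every
# instance has the same shape, a single pass can process the whole batch
# in parallel.
batch = np.stack(instances)
print(batch.shape)  # (3, 3, 224, 224)
```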

Builder - TensorRT’s model optimizer. The builder takes a network definition as input, performs device-independent and device-specific optimizations, and creates an engine. For more information about the builder, refer to the Builder API.
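
A minimal sketch of the build flow in the TensorRT Python API; the toy network here is an illustrative assumption, and the network-creation flag varies by TensorRT version.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Create an empty network definition and populate it with a trivial graph.
# (Older TensorRT versions require an explicit-batch flag instead of 0.)
network = builder.create_network(0)
x = network.add_input("x", trt.float32, (1, 8))
relu = network.add_activation(x, trt.ActivationType.RELU)
network.mark_output(relu.get_output(0))

# The builder applies its optimizations and emits a serialized engine.
config = builder.create_builder_config()
serialized_engine = builder.build_serialized_network(network, config)
```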

D

Data-Dependent Shape - A tensor shape with a dynamic dimension not calculated solely from network input dimensions and shape tensors.

Device - A specific GPU. Two GPUs are considered identical devices if they have the same model name and configuration.

Dynamic batch - A mode of inference deployment where the batch size is unknown until runtime. Historically, TensorRT treated batch size as a special dimension and the only configurable dimension at runtime. TensorRT 6 and later allow engines to be built such that all dimensions of inputs can be adjusted at runtime.
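
A sketch of building with a dynamic batch dimension in the Python API, assuming a recent TensorRT version with explicit-batch networks; the input name, shapes, and profile values are illustrative assumptions.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)

# -1 marks the batch dimension as dynamic (unknown until runtime).
x = network.add_input("input", trt.float32, (-1, 3, 224, 224))
relu = network.add_activation(x, trt.ActivationType.RELU)
network.mark_output(relu.get_output(0))

# An optimization profile bounds the runtime range of each dynamic dimension:
# minimum, optimal, and maximum shapes for the input.
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224))

config = builder.create_builder_config()
config.add_optimization_profile(profile)
serialized_engine = builder.build_serialized_network(network, config)
```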

E

Engine - A representation of a model optimized by the TensorRT builder. For more information about the engine, refer to the Execution API.
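
For example, a deserialized engine can be inspected through the Python API; the plan file name below is a placeholder, and the tensor-enumeration calls assume TensorRT 8.5 or later.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# "model.plan" is a placeholder for an engine serialized earlier.
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Enumerate the engine's I/O tensors, their direction, and their shapes.
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(name, engine.get_tensor_mode(name), engine.get_tensor_shape(name))
```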

Explicitly Data-Dependent Shape - A tensor shape that depends on the dimensions of an output of INonZeroLayer or INMSLayer.
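
A sketch in the Python API, assuming TensorRT 8.5 or later (where INonZeroLayer was introduced); the input shape is arbitrary.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)

x = network.add_input("x", trt.float32, (4, 4))

# INonZeroLayer returns the indices of non-zero elements, so the size of its
# output's last dimension is known only once the input data is seen.
nonzero = network.add_non_zero(x)
network.mark_output(nonzero.get_output(0))
```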

F

Framework integration - Integration of TensorRT into a framework such as PyTorch, allowing model optimization and inference within the framework.
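
As one example, Torch-TensorRT compiles a PyTorch module to TensorRT without leaving PyTorch. This is a hedged sketch: it assumes the torch_tensorrt package is installed, and the exact compile arguments vary across Torch-TensorRT versions.

```python
import torch
import torch_tensorrt  # Torch-TensorRT: TensorRT integrated into PyTorch

model = torch.nn.Sequential(torch.nn.Linear(16, 4)).eval().cuda()

# Compile the module with TensorRT; inference then runs inside PyTorch.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 16))],
    enabled_precisions={torch.float16},
)
output = trt_model(torch.randn(1, 16).cuda())
```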

I

Implicitly Data-Dependent Shape - A tensor shape with a dynamic dimension calculated from data other than network input dimensions, network input shape tensors, and INonZeroLayer or INMSLayer. For example, a shape with a dimension calculated from data output by a convolution.

N

Network definition - A representation of a model in TensorRT. A network definition is a graph of tensors and operators.
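
A network definition can be built layer by layer through the Python API and then traversed like any graph; a minimal sketch with an arbitrary input shape:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)

# A tiny graph: one input tensor feeding a single ReLU activation.
x = network.add_input("x", trt.float32, (1, 8))
relu = network.add_activation(x, trt.ActivationType.RELU)
network.mark_output(relu.get_output(0))

# Walk the graph: every layer is visible through the definition.
for i in range(network.num_layers):
    layer = network.get_layer(i)
    print(layer.name, layer.type)
```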

O

ONNX - Open Neural Network eXchange. A framework-independent standard for representing machine learning models. For more information about ONNX, refer to onnx.ai.

ONNX parser - A parser for creating a TensorRT network definition from an ONNX model. For more details, refer to the C++ NvONNXParser or the Python ONNX Parser.
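
A minimal sketch with the Python ONNX parser; the model file name is a placeholder.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)

# "model.onnx" is a placeholder for an exported ONNX model.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        # On failure, the parser records diagnostics that can be enumerated.
        for i in range(parser.num_errors):
            print(parser.get_error(i))
```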

P

Plan - An optimized inference engine in a serialized format. A typical application builds an engine once and then serializes it as a plan file for later use; at deployment time, the application first deserializes the plan file to re-create the engine before running inference.
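
A sketch of the round trip in the Python API: build once, write the plan, then deserialize it later. The file name and toy network are illustrative assumptions.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
x = network.add_input("x", trt.float32, (1, 8))
relu = network.add_activation(x, trt.ActivationType.RELU)
network.mark_output(relu.get_output(0))
config = builder.create_builder_config()

# Build once and persist the serialized engine as a plan file.
plan = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(plan)

# Later, typically in a separate process, deserialize the plan into an engine.
runtime = trt.Runtime(logger)
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
```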

Platform - A combination of architecture and operating system. Examples of these platforms are Linux on x86 and QNX Standard on AArch64. Platforms with different architectures or different operating systems are considered different platforms.

Precision - The numerical format used to represent values during computation. Precision is specified as part of the TensorRT build step. TensorRT supports mixed-precision inference with FP32, TF32, FP16, or INT8. Devices before the NVIDIA Ampere Architecture default to FP32; NVIDIA Ampere Architecture and later devices default to TF32, a fast format that uses FP32 storage with lower-precision math.
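
Precision is requested on the builder configuration at build time; a minimal sketch:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Allow the builder to choose FP16 kernels where profitable (mixed precision).
config.set_flag(trt.BuilderFlag.FP16)

# TF32 is on by default on NVIDIA Ampere and later; it can be turned off.
config.clear_flag(trt.BuilderFlag.TF32)
```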

R

Runtime - The component of TensorRT that performs inference on a TensorRT engine. The runtime API supports synchronous and asynchronous execution, profiling, enumeration, and querying of the bindings for engine inputs and outputs.
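
A minimal sketch of the runtime side; the plan file name is a placeholder, and the binding and launch calls shown in comments assume TensorRT 8.5 or later.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

with open("model.plan", "rb") as f:  # placeholder plan file
    engine = runtime.deserialize_cuda_engine(f.read())

# An execution context holds the per-inference state for one engine.
context = engine.create_execution_context()

# To run inference, each I/O tensor is first bound to a device buffer:
#     context.set_tensor_address(name, device_ptr)
# and execution is then enqueued asynchronously on a CUDA stream:
#     context.execute_async_v3(stream_handle)
```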