Command-Line Programs#

trtexec#

Included in the samples directory is a command-line wrapper tool called trtexec. trtexec is a tool that lets you use TensorRT quickly without having to develop your own application. The trtexec tool has three main purposes:

  • It’s useful for benchmarking networks on random or user-provided input data.

  • It’s useful for generating serialized engines from models.

  • It’s useful for generating a serialized timing cache from the builder.
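
For example, a single invocation can cover all three purposes at once: it builds an engine from an ONNX model, benchmarks it on random input data, and saves both the serialized engine and the builder's timing cache. The file names below (model.onnx, model.plan, timing.cache) are placeholders; substitute your own paths.

    trtexec --onnx=model.onnx --saveEngine=model.plan --timingCacheFile=timing.cache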

Benchmarking Network#

If you have a model saved as an ONNX file, you can use the trtexec tool to test the performance of running inference on your network using TensorRT. The trtexec tool has many options for specifying inputs and outputs, iterations for performance timing, precision allowed, and other options.

To maximize GPU utilization, trtexec enqueues the inferences one batch ahead of time. In other words, it does the following:

enqueue batch 0 -> enqueue batch 1 -> wait until batch 0 is done -> enqueue batch 2 -> wait until batch 1 is done -> enqueue batch 3 -> wait until batch 2 is done -> enqueue batch 4 -> ...

If Cross-Inference Multi-Streaming (--infStreams=N flag) is used, trtexec follows this pattern on each stream separately.

The trtexec tool prints the following performance metrics. The figure below shows an example Nsight Systems profile of a trtexec run, with markers showing each performance metric.

  • Throughput: The observed throughput is computed by dividing the number of inferences by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be underutilized because of host-side overheads or data transfers. Using CUDA graphs (with --useCudaGraph) or disabling H2D/D2H transfers (with --noDataTransfers) may improve GPU utilization, as illustrated in the example below. The output log provides guidance on which flag to use when trtexec detects that the GPU is underutilized.

  • Host Latency: The summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency of a single inference.

  • Enqueue Time: The host latency to enqueue an inference, including calling H2D/D2H CUDA APIs, running host-side heuristics, and launching CUDA kernels. If this is longer than the GPU Compute Time, the GPU may be underutilized, and the throughput may be dominated by host-side overhead. Using CUDA graphs (with --useCudaGraph) may reduce Enqueue Time.

  • H2D Latency: The latency for host-to-device data transfers for input tensors of a single inference. Add --noDataTransfers to disable H2D/D2H data transfers.

  • D2H Latency: The latency for device-to-host data transfers for output tensors of a single inference. Add --noDataTransfers to disable H2D/D2H data transfers.

  • GPU Compute Time: The GPU latency to execute the CUDA kernels for an inference.

  • Total Host Walltime: The host walltime from when the first inference (after warm-ups) is enqueued to when the last inference completes.

  • Total GPU Compute Time: The summation of the GPU Compute Time of all the inferences. If this is significantly shorter than the Total Host Walltime, the GPU may be underutilized because of host-side overheads or data transfers.

    Note

    In the latest Nsight Systems, the GPU rows appear above the CPU rows rather than beneath them.

Performance Metrics in a Normal trtexec Run under Nsight Systems
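
As an illustrative sketch (the engine file name is a placeholder), the following command loads a prebuilt engine and applies the two flags mentioned above, which typically narrows the gap between Total Host Walltime and Total GPU Compute Time when the GPU is underutilized:

    trtexec --loadEngine=model.plan --useCudaGraph --noDataTransfers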

Add the --dumpProfile flag to trtexec to show per-layer performance profiles, which allows users to understand which layers in the network take the most time in GPU execution. The per-layer performance profiling also works with launching inference as a CUDA graph. In addition, build the engine with the --profilingVerbosity=detailed flag and add the --dumpLayerInfo flag to show detailed engine information, including per-layer detail and binding information. This allows you to understand which operation each layer in the engine corresponds to and their parameters.
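
A typical per-layer profiling workflow might therefore look like the following sketch, with placeholder file names: build the engine with detailed profiling verbosity, then run inference with the profile and layer-information dumps enabled.

    trtexec --onnx=model.onnx --profilingVerbosity=detailed --saveEngine=model.plan
    trtexec --loadEngine=model.plan --dumpProfile --dumpLayerInfo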

Serialized Engine Generation#

If you generate a saved serialized engine file, you can pull it into another inference application. For example, you can use the NVIDIA Triton Inference Server to run the engine with multiple execution contexts from multiple threads in a fully pipelined asynchronous way to test parallel inference performance. There are some caveats; for example, in INT8 mode, trtexec sets random dynamic ranges for tensors unless the calibration cache file is provided with the --calib=<file> flag, so the resulting accuracy will not be as expected.
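
For example, the following sketch (file names are placeholders) builds an INT8 engine with a calibration cache so that the dynamic ranges are not random, saves it for use in another application, and skips the inference pass because only the engine file is needed:

    trtexec --onnx=model.onnx --int8 --calib=calibration.cache --saveEngine=model_int8.plan --skipInference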

Serialized Timing Cache Generation#

If you provide a timing cache file to the --timingCacheFile option, the builder can load existing profiling data from it and add new profiling data entries during layer profiling. The timing cache file can be reused in other builder instances to improve the execution time. It is suggested that this cache be reused only in the same hardware/software configurations (for example, CUDA/cuDNN/TensorRT versions, device model, and clock frequency); otherwise, functional or performance issues may occur.
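
A minimal sketch of reusing a timing cache across builds on the same hardware/software configuration (file names are placeholders): the first build populates the cache, and subsequent builds load it to shorten the build time.

    trtexec --onnx=model_a.onnx --saveEngine=model_a.plan --timingCacheFile=timing.cache
    trtexec --onnx=model_b.onnx --saveEngine=model_b.plan --timingCacheFile=timing.cache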

Commonly Used Command-Line Flags#

This section lists the commonly used trtexec command-line flags.

Flags for the Build Phase

  • --onnx=<model>: Specify the input ONNX model.

  • --minShapes=<shapes>, --optShapes=<shapes>, and --maxShapes=<shapes>: Specify the range of the input shapes (including batch size) to build the engine with. Only required if the input model is in ONNX format. See the combined build example after this list.

  • --memPoolSize=<pool_spec>: Specify the maximum workspace size that tactics are allowed to use, as well as the sizes of the memory pools that DLA will allocate per loadable. Supported pool types include workspace, dlaSRAM, dlaLocalDRAM, dlaGlobalDRAM, and tacticSharedMem.

  • --saveEngine=<file>: Specify the path to save the engine.

  • --fp16, --bf16, --int8, --fp8, --noTF32, and --best: Specify network-level precision.

  • --stronglyTyped: Create a strongly typed network.

  • --sparsity=[disable|enable|force]: Specify whether to use tactics that support structured sparsity.

    • disable: Disable all tactics using structured sparsity. This is the default.

    • enable: Enable tactics using structured sparsity. Tactics will only be used if the ONNX file weights meet the structured sparsity requirements.

    • force: Enable tactics using structured sparsity and allow trtexec to overwrite the weights in the ONNX file to enforce them to have structured sparsity patterns. Note that the accuracy is not preserved, so this is only to get inference performance.

    Note

    This has been deprecated. Use Polygraphy (polygraphy surgeon prune) to rewrite the weights of ONNX models to a structured-sparsity pattern and then run with --sparsity=enable.

  • --timingCacheFile=<file>: Specify the timing cache to load from and save to.

  • --noCompilationCache: Disable the compilation cache in the builder, which is part of the timing cache (the default is to enable the compilation cache).

  • --verbose: Turn on verbose logging.

  • --skipInference: Build and save the engine without running inference.

  • --profilingVerbosity=[layer_names_only|detailed|none]: Specify the profiling verbosity to build the engine.

  • --dumpLayerInfo, --exportLayerInfo=<file>: Print/Save the layer information of the engine.

  • --precisionConstraints=spec: Control precision constraint setting.

    • none: No constraints.

    • prefer: Meet precision constraints set by --layerPrecisions or --layerOutputTypes if possible.

    • obey: Meet precision constraints set by --layerPrecisions or --layerOutputTypes or fail otherwise.

  • --layerPrecisions=spec: Control per-layer precision constraints. Effective only when --precisionConstraints is set to obey or prefer. The specs are read left to right, and later ones override earlier ones. “*” can be used as a layerName to specify the default precision for all the unspecified layers.

    • For example: --layerPrecisions=*:fp16,layer_1:fp32 sets the precision of all layers to FP16 except for layer_1, which will be set to FP32.

  • --layerOutputTypes=spec: Control per-layer output type constraints. Effective only when --precisionConstraints is set to obey or prefer. The specs are read left to right, and later ones override earlier ones. “*” can be used as a layerName to specify the default precision for all the unspecified layers. If a layer has more than one output, then multiple types separated by “+” can be provided for this layer.

    • For example: --layerOutputTypes=*:fp16,layer_1:fp32+fp16 sets the precision of all layer outputs to FP16 except for layer_1, whose first output will be set to FP32 and whose second output will be set to FP16.

  • --layerDeviceTypes=spec: Explicitly set per-layer device type to GPU or DLA. The specs are read left to right and later override earlier ones.

  • --useDLACore=N: Use the specified DLA core for layers that support DLA.

  • --allowGPUFallback: Allow layers unsupported on DLA to run on the GPU instead.

  • --versionCompatible, --vc: Enable version-compatible mode for engine build and inference. Any engine built with this flag enabled is compatible with newer versions of TensorRT on the same host OS when run with TensorRT’s dispatch and lean runtimes. Only supported with explicit batch mode.

  • --excludeLeanRuntime: When --versionCompatible is enabled, this flag indicates that the generated engine should not include an embedded lean runtime. If this is set, you must explicitly specify a valid lean runtime when loading the engine. Only supported with explicit batch and weights within the engine.

  • --tempdir=<dir>: Overrides the default temporary directory TensorRT will use when creating temporary files. Refer to the IRuntime::setTemporaryDirectory API documentation for more information.

  • --tempfileControls=controls: Controls what TensorRT can use when creating temporary executable files. It should be a comma-separated list with entries in the format [in_memory|temporary]:[allow|deny].

    • Options include:

      • in_memory: Controls whether TensorRT can create temporary in-memory executable files.

      • temporary: Controls whether TensorRT can create temporary executable files in the filesystem (in the directory given by --tempdir).

    • Example usage:

      --tempfileControls=in_memory:allow,temporary:deny
      
  • --dynamicPlugins=<file>: Load the plugin library dynamically and serialize it with the engine when included in --setPluginsToSerialize (can be specified multiple times).

  • --setPluginsToSerialize=<file>: Set the plugin library to be serialized with the engine (can be specified multiple times).

  • --builderOptimizationLevel=N: Set the builder optimization level to build the engine with. A higher level allows TensorRT to spend more building time for more optimization options.

  • --maxAuxStreams=N: Set the maximum number of auxiliary streams per inference stream that TensorRT can use to run kernels in parallel if the network contains ops that can run in parallel, with the cost of more memory usage. Set this to 0 for optimal memory usage. Refer to the Within-Inference Multi-Streaming section for more information.

  • --stripWeights: Strip weights from the plan. This flag works with either refit or refit with identical weights. It defaults to refit with identical weights; however, you can switch to refit by enabling both stripWeights and refit simultaneously.

  • --markDebug: Specify a list of tensor names to be marked as debug tensors. Separate names with a comma.

  • --allowWeightStreaming: Enable an engine that can stream its weights. Must be specified with --stronglyTyped. TensorRT will automatically choose the appropriate weight streaming budget at runtime to ensure model execution. A specific amount can be set with --weightStreamingBudget.
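
Putting several of the build-phase flags together, a representative build command for an ONNX model with a dynamic batch dimension might look like the following sketch. The model path, output file names, and the input tensor name (input) are placeholders for your own model.

    trtexec --onnx=model.onnx \
            --minShapes=input:1x3x224x224 \
            --optShapes=input:8x3x224x224 \
            --maxShapes=input:16x3x224x224 \
            --fp16 \
            --builderOptimizationLevel=3 \
            --timingCacheFile=timing.cache \
            --saveEngine=model_fp16.plan \
            --skipInference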

Flags for the Inference Phase

  • --loadEngine=<file>: Load the engine from a serialized plan file instead of building it from the input ONNX model.

  • --asyncFileReader=<file>: Load a serialized engine using an async stream reader. This method uses the IStreamReaderV2 interface.

  • --shapes=<shapes>: Specify the input shapes to run the inference with. Use this to set the shapes (including batch size) for ONNX models and engines built with explicit batch dimensions (see the example after this list).

  • --loadInputs=<specs>: Load input values from files. The default is to generate random inputs.

  • --warmUp=<duration in ms>, --duration=<duration in seconds>, --iterations=<N>: Specify the minimum duration of the warm-up runs, the minimum duration for the inference runs, and the minimum iterations of the inference runs. For example, setting --warmUp=0 --duration=0 --iterations=N allows you to control exactly how many iterations to run the inference for.

  • --useCudaGraph: Capture the inference to a CUDA Graph and run inference by launching the graph. This argument may be ignored when the built TensorRT engine contains operations not permitted under CUDA Graph capture mode.

  • --noDataTransfers: Turn off host-to-device and device-to-host data transfers.

  • --useSpinWait: Actively synchronize on GPU events. This option makes latency measurement more stable but increases CPU usage and power.

  • --infStreams=<N>: Run inference with multiple cross-inference streams in parallel. Refer to the Cross-Inference Multi-Streaming section for more information.

  • --verbose: Turn on verbose logging.

  • --dumpProfile, --exportProfile=<file>: Print/Save the per-layer performance profile.

  • --dumpLayerInfo, --exportLayerInfo=<file>: Print layer information of the engine.

  • --profilingVerbosity=[layer_names_only|detailed|none]: Specify the profiling verbosity to run the inference.

  • --useRuntime=[full|lean|dispatch]: The TensorRT runtime used to execute the engine. lean and dispatch require --versionCompatible to be enabled and are used to load a version-compatible engine. All engines (version compatible or not) must be built with the full runtime.

  • --leanDLLPath=<file>: External lean runtime DLL to use in version-compatible mode. Requires --useRuntime=[lean|dispatch].

  • --dynamicPlugins=<file>: Load the plugin library dynamically when the library is not included in the engine plan file (can be specified multiple times).

  • --getPlanVersionOnly: Print the TensorRT version with which the loaded plan was created. Works without deserializing the plan. Use together with --loadEngine. Supported only for engines created with TensorRT 8.6 and later.

  • --saveDebugTensors: Specify a list of tensor names to turn on the debug state and a filename to save raw outputs. These tensors must be specified as debug tensors during build time.

  • --allocationStrategy: Specify how the internal device memory for inference is allocated. You can choose from static, profile, and runtime. static is the default behavior and pre-allocates enough device memory to cover all profiles and input shapes. profile allocates only what is required for the selected profile. runtime allocates only what is required for the actual input shapes.

  • --weightStreamingBudget: Manually set the weight streaming budget. Base-2 unit suffixes are supported: B (Bytes), G (Gibibytes), K (Kibibytes), and M (Mebibytes). If the weights don’t fit on the device, a value of 0 will choose the minimum possible budget. A value of -1 will disable weight streaming at runtime.
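
As a sketch of the inference phase, the following command loads the engine built above (file and tensor names remain placeholders), fixes the input shape, uses CUDA graphs and spin-wait for more stable measurements, runs exactly 100 iterations, and exports a per-layer profile.

    trtexec --loadEngine=model_fp16.plan \
            --shapes=input:8x3x224x224 \
            --useCudaGraph \
            --useSpinWait \
            --warmUp=0 --duration=0 --iterations=100 \
            --dumpProfile --exportProfile=profile.json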

Refer to trtexec --help for all the supported flags and detailed explanations.

Refer to the GitHub: trtexec/README.md file for detailed information about building this tool and examples of its usage.