Command-Line Programs#

trtexec#

Included in the samples directory is a command-line wrapper tool called trtexec. trtexec is a tool that lets you use TensorRT quickly without having to develop your own application. The trtexec tool has three main purposes:

  • It’s useful for benchmarking networks on random or user-provided input data.

  • It’s useful for generating serialized engines from models.

  • It’s useful for generating a serialized timing cache from the builder.

Benchmarking Network#

If you have a model saved as an ONNX file, you can use the trtexec tool to test the performance of running inference on your network using TensorRT. The trtexec tool has many options for specifying inputs and outputs, iterations for performance timing, precision allowed, and other options.
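For example, a minimal benchmarking run on an ONNX model with static input shapes might look like the following sketch; model.onnx is a placeholder for your own model (for models with dynamic shapes, also pass the shape flags described under Commonly Used Command-Line Flags):

    trtexec --onnx=model.onnx --fp16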

To maximize GPU utilization, trtexec enqueues the inferences one batch ahead of time. In other words, it does the following:

enqueue batch 0 -> enqueue batch 1 -> wait until batch 0 is done -> enqueue batch 2 -> wait until batch 1 is done -> enqueue batch 3 -> wait until batch 2 is done -> enqueue batch 4 -> ...

If Cross-Inference Multi-Streaming (--infStreams=N flag) is used, trtexec follows this pattern on each stream separately.

The trtexec tool prints the following performance metrics. The following figure shows an example of an Nsight Systems profile of a trtexec run, with markers showing each performance metric.

  • Throughput: The observed throughput is computed by dividing the number of inferences by the Total Host Walltime (see the worked example after this list). If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be underutilized because of host-side overheads or data transfers. Using CUDA graphs (with --useCudaGraph) or disabling H2D/D2H transfers (with --noDataTransfers) may improve GPU utilization. The output log provides guidance on which flag to use when trtexec detects that the GPU is underutilized.

  • Host Latency: The summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency of a single inference.

  • Enqueue Time: The host latency to enqueue an inference, including calling H2D/D2H CUDA APIs, running host-side heuristics, and launching CUDA kernels. If this is longer than the GPU Compute Time, the GPU may be underutilized, and the throughput may be dominated by host-side overhead. Using CUDA graphs (with --useCudaGraph) may reduce Enqueue Time.

  • H2D Latency: The latency for host-to-device data transfers for input tensors of a single inference. Add --noDataTransfers to disable H2D/D2H data transfers.

  • D2H Latency: The latency for device-to-host data transfers for output tensors of a single inference. Add --noDataTransfers to disable H2D/D2H data transfers.

  • GPU Compute Time: The GPU latency to execute the CUDA kernels for an inference.

  • Total Host Walltime: The Host Walltime from when the first inference (after warm-ups) is enqueued to when the last inference is completed.

  • Total GPU Compute Time: The summation of the GPU Compute Time of all the inferences. If this is significantly shorter than the Total Host Walltime, the GPU may be underutilized because of host-side overheads or data transfers.
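As a worked example with hypothetical numbers (not taken from a real run):

    Throughput   = 1000 inferences / 2.0 s Total Host Walltime = 500 inferences/s
    Host Latency = 0.5 ms (H2D) + 2.0 ms (GPU Compute) + 0.5 ms (D2H) = 3.0 ms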

Note

In the latest Nsight Systems, the GPU rows appear above the CPU rows rather than beneath them.

Performance Metrics in a Normal trtexec Run under Nsight Systems

Add the --dumpProfile flag to trtexec to show per-layer performance profiles, which lets you see which layers in the network take the most GPU execution time. Per-layer performance profiling also works when inference is launched as a CUDA graph. In addition, build the engine with the --profilingVerbosity=detailed flag and add the --dumpLayerInfo flag to show detailed engine information, including per-layer detail and binding information. This lets you understand which operation each layer in the engine corresponds to and what its parameters are.
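For example, a profiling run might look like the following sketch; model.onnx, profile.json, and layers.json are placeholder file names:

    trtexec --onnx=model.onnx \
            --profilingVerbosity=detailed \
            --dumpProfile --exportProfile=profile.json \
            --dumpLayerInfo --exportLayerInfo=layers.json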

Serialized Engine Generation#

If you generate a saved serialized engine file, you can pull it into another inference application. For example, you can use the NVIDIA Triton Inference Server to run the engine with multiple execution contexts from multiple threads in a fully pipelined asynchronous way to test parallel inference performance. There are some caveats; for example, in INT8 mode, trtexec sets random dynamic ranges for tensors unless the calibration cache file is provided with the --calib=<file> flag, so the resulting accuracy will not be as expected.
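For example, a build-then-deploy flow might look like the following sketch; model.onnx and model.plan are placeholder file names:

    # Build and serialize the engine without running inference
    trtexec --onnx=model.onnx --saveEngine=model.plan --skipInference

    # Later, load the serialized engine and benchmark it
    trtexec --loadEngine=model.plan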

Serialized Timing Cache Generation#

If you provide a timing cache file to the --timingCacheFile option, the builder can load existing profiling data from it and add new profiling data entries during layer profiling. The timing cache file can be reused in other builder instances to reduce builder execution time. This cache should be reused only on the same hardware/software configuration (for example, CUDA/cuDNN/TensorRT versions, device model, and clock frequency); otherwise, functional or performance issues may occur.
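For example, the following two builds share a timing cache; the file names are placeholders, and both builds are assumed to run on the same hardware/software configuration:

    # First build creates or updates timing.cache
    trtexec --onnx=model.onnx --timingCacheFile=timing.cache --saveEngine=model_a.plan

    # A later build reuses the recorded tactic timings to reduce build time
    trtexec --onnx=model.onnx --timingCacheFile=timing.cache --saveEngine=model_b.plan --fp16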

Commonly Used Command-Line Flags#

This section lists the commonly used trtexec command-line flags. Example invocations that combine several of these flags follow the build-phase and inference-phase tables.

Flags for the Build Phase#

Flag for the Build Phase

Description

--allowGPUFallback

Allow layers unsupported on DLA to run on GPU instead.

--allowWeightStreaming

Enables an engine that can stream its weights. Must be specified with --stronglyTyped. TensorRT will automatically choose the appropriate weight streaming budget at runtime to ensure model execution. A specific amount can be set with --weightStreamingBudget.

--builderOptimizationLevel=N

Set the builder optimization level to use when building the engine. A higher level allows TensorRT to spend more building time on more optimization options.

--dumpLayerInfo, --exportLayerInfo=<file>

Print and save the engine layer information.

--dynamicPlugins=<file>

Load the plugin library dynamically and serialize it with the engine when included in --setPluginsToSerialize (can be specified multiple times).

--excludeLeanRuntime

When --versionCompatible is enabled, this flag indicates that the generated engine should not include an embedded lean runtime. If this is set, you must explicitly specify a valid lean runtime when loading the engine. Only supported with explicit batch and weights within the engine.

--fp16, --bf16, --int8, --fp8, --noTF32, and --best

Specify network-level precision.

--layerDeviceTypes=spec

Explicitly set the per-layer device type to GPU or DLA. The specs are read left to right and later override earlier ones.

--layerOutputTypes=spec

Control per-layer output type constraints. Effective only when precisionConstraints is set to obey or prefer. The specs are read left to right, and later ones override earlier ones. “*” can be used as the layerName to specify the default precision for all unspecified layers. If a layer has more than one output, multiple types separated by “+” can be provided for that layer. For example, --layerOutputTypes=*:fp16,layer_1:fp32+fp16 sets the precision of all layer outputs to FP16 except for layer_1, whose first output will be set to FP32 and whose second output will be set to FP16.

--layerPrecisions=spec

Control per-layer precision constraints. Effective only when precisionConstraints is set to obey or prefer. The specs are read left to right, and later ones override earlier ones. “*” can be used as the layerName to specify the default precision for all unspecified layers. For example, --layerPrecisions=*:fp16,layer_1:fp32 sets the precision of all layers to FP16 except for layer_1, which will be set to FP32.

--markDebug

Specify a list of tensor names to be marked as debug tensors. Separate names with a comma.

--maxAuxStreams=N

Set the maximum number of auxiliary streams per inference stream that TensorRT can use to run kernels in parallel if the network contains ops that can run in parallel, with the cost of more memory usage. Set this to 0 for optimal memory usage. Refer to the Within-Inference Multi-Streaming section for more information.

--memPoolSize=<pool_spec>

Specify the maximum workspace size that tactics are allowed to use, as well as the sizes of the memory pools that DLA will allocate per loadable. Supported pool types include workspace, dlaSRAM, dlaLocalDRAM, dlaGlobalDRAM, and tacticSharedMem.

--minShapes=<shapes>, --optShapes=<shapes>, and --maxShapes=<shapes>

Specify the range of input shapes with which to build the engine. This is only required if the input model is in ONNX format.

--noCompilationCache

Disable the compilation cache in the builder, which is part of the timing cache (the default is to enable the compilation cache).

--onnx=<model>

Specify the input ONNX model. If the input model is in ONNX format, use the --minShapes, --optShapes, and --maxShapes flags to control the range of input shapes, including batch size.

--precisionConstraints=spec

Control precision constraint setting.

  • none: No constraints.

  • prefer: Meet precision constraints set by --layerPrecisions or --layerOutputTypes if possible.

  • obey: Meet precision constraints set by --layerPrecisions or --layerOutputTypes or fail otherwise.

--profilingVerbosity=[layer_names_only|detailed|none]

Specify the profiling verbosity to build the engine.

--saveEngine=<file>

Specify the path to save the engine.

--setPluginsToSerialize=<file>

Set the plugin library to be serialized with the engine (can be specified multiple times).

--skipInference

Build and save the engine without running inference.

--sparsity=[disable|enable|force]

Specify whether to use tactics that support structured sparsity.

  • disable: Disable all tactics using structured sparsity. This is the default.

  • enable: Enable tactics using structured sparsity. Tactics will only be used if the ONNX file weights meet the structured sparsity requirements.

  • force: Enable tactics using structured sparsity and allow trtexec to overwrite the weights in the ONNX file to force them to have structured sparsity patterns. Note that accuracy is not preserved, so this should be used only to measure inference performance.

Note

This has been deprecated. Use Polygraphy (polygraphy surgeon prune) to rewrite the weights of ONNX models to a structured-sparsity pattern and then run with --sparsity=enable.

--stripWeights

Strip weights from the plan. This flag works with either refit or refit with identical weights. It defaults to refit with identical weights; however, you can switch to refit by enabling both --stripWeights and --refit simultaneously.

--stronglyTyped

Create a strongly typed network.

--tempdir=<dir>

This option overrides TensorRT’s default temporary directory when creating temporary files. For more information, refer to the IRuntime::setTemporaryDirectory API documentation.

--tempfileControls=controls

Controls what TensorRT can use when creating temporary executable files. It should be a comma-separated list with entries in the format [in_memory|temporary]:[allow|deny].

  • Options include:

    • in_memory: Controls whether TensorRT can create temporary in-memory executable files.

    • temporary: Controls whether TensorRT can create temporary executable files in the filesystem (in the directory given by --tempdir).

  • Example usage:

    --tempfileControls=in_memory:allow,temporary:deny
    

--timingCacheFile=<file>

Specify the timing cache to load from and save to.

--useDLACore=N

Use the specified DLA core for layers that support DLA.

--versionCompatible, --vc

Enable version-compatible mode for engine build and inference. Any engine built with this flag enabled is compatible with newer versions of TensorRT on the same host OS when run with TensorRT’s dispatch and lean runtimes. Only supported with explicit batch mode.
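As an illustration, the following build command combines several of the flags above; model.onnx, model.plan, timing.cache, and the input tensor name input are placeholders for your own network and files:

    trtexec --onnx=model.onnx \
            --saveEngine=model.plan \
            --fp16 \
            --minShapes=input:1x3x224x224 \
            --optShapes=input:8x3x224x224 \
            --maxShapes=input:16x3x224x224 \
            --timingCacheFile=timing.cache \
            --skipInference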

Flags for the Inference Phase#

Flag for the Inference Phase

Description

--allocationStrategy

Specify how the internal device memory for inference is allocated. You can choose from static, profile, and runtime. static, the default, pre-allocates enough memory for all profiles and input shapes; profile allocates only what the selected profile requires; runtime allocates only what the actual input shapes require.

--asyncFileReader=<file>

Load a serialized engine using an async stream reader. This method uses the IStreamReaderV2 interface.

--dumpLayerInfo, --exportLayerInfo=<file>

Print layer information of the engine.

--dumpProfile, --exportProfile=<file>

Print and save the per-layer performance profile.

--dynamicPlugins=<file>

Load the plugin library dynamically when not included in the engine plan file (it can be specified multiple times).

--getPlanVersionOnly

Print the TensorRT version with which the loaded plan was created. This works without deserializing the plan. Use it together with --loadEngine. It is supported only for engines created with TensorRT 8.6 and later.

--infStreams=<N>

Run inference with multiple cross-inference streams in parallel. Refer to the Cross-Inference Multi-Streaming section for more information.

--leanDLLPath=<file>

Specify the external lean runtime DLL to use in version-compatible mode. Requires --useRuntime=[lean|dispatch].

--loadEngine=<file>

Load the engine from a serialized plan file instead of building it from the input ONNX model.

--loadInputs=<specs>

Load input values from files. The default is to generate random inputs.

--noDataTransfers

Turn off host-to-device and device-to-host data transfers.

--profilingVerbosity=[layer_names_only|detailed|none]

Specify the profiling verbosity to run the inference.

--saveDebugTensors

Specify a list of tensor names to turn on the debug state and a filename to save raw outputs. These tensors must be specified as debug tensors during build time.

--shapes=<shapes>

Specify the input shapes with which to run the inference. For ONNX models and engines built with explicit batch dimensions, the batch size is included in these shapes.

--warmUp=<duration in ms>, --duration=<duration in seconds>, --iterations=<N>

Specify the minimum duration of the warm-up runs, the minimum duration for the inference runs, and the minimum iterations. For example, setting --warmUp=0 --duration=0 --iterations=N allows you to control exactly how many iterations to run the inference for.

--weightStreamingBudget

Manually set the weight streaming budget. Base-2 unit suffixes are supported: B (Bytes), G (Gibibytes), K (Kibibytes), and M (Mebibytes). If the weights don’t fit on the device, a value of 0 will choose the minimum possible budget. A value of -1 will disable weight streaming at runtime.

--useCudaGraph

Capture the inference to a CUDA Graph and run the inference by launching the graph. This argument may be ignored when the built TensorRT engine contains operations not permitted under CUDA Graph capture mode.

--useRuntime=[full|lean|dispatch]

Specify the TensorRT runtime used to execute the engine. lean and dispatch require --versionCompatible to be enabled and are used to load a version-compatible engine. All engines (version-compatible or not) must be built with the full runtime.

--useSpinWait

Actively synchronize on GPU events. This option makes latency measurement more stable but increases CPU usage and power.

--verbose

Turn on verbose logging.
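As an illustration, the following command benchmarks a previously saved engine for an exact number of iterations; model.plan and the iteration count are placeholders:

    trtexec --loadEngine=model.plan \
            --useCudaGraph \
            --useSpinWait \
            --noDataTransfers \
            --warmUp=0 --duration=0 --iterations=100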

Refer to trtexec --help for all the supported flags and detailed explanations.

Refer to the GitHub: trtexec/README.md file for detailed information about building this tool and examples of its usage.