Building an ONNX Engine

Ensure you can log in to TensorRT-Cloud.

Important

Building on-demand engines is provided as a closed Early Access (EA) product. Access is restricted and is provided upon request (refer to Getting TensorRT-Cloud Access). These features will not be functional unless access is granted.

ONNX support is provided through on-demand engine building. You must provide your ONNX file and the engine config you want to target, and TensorRT-Cloud will generate a corresponding engine.

Using trtexec args, you can generate fully customizable engines. Additionally, the TensorRT-Cloud CLI provides utility flags for building weight-stripped engines. In short, building weight-stripped engines reduces the engine binary size at a potential performance cost.

In the sections below, we provide examples for building different kinds of engines.

Currently, only the latest version of TensorRT 10.0 is supported.

Specifying an Engine Build Configuration

The TensorRT-Cloud CLI trt-cloud build command provides multiple arguments. To see the full list of arguments, run:

trt-cloud build onnx -h

Key arguments that allow for system and engine configuration are:

  • --gpu - Picks GPU target. Use trt-cloud info to get the list of available GPUs.

  • --os - Picks OS target (linux or windows)

  • --trtexec-args - Sets trtexec args. TensorRT-Cloud supports a subset of trtexec args through this flag. If a flag is not explicitly supported, TensorRT-Cloud will reject the build request.

    • If the model has dynamic input shapes, then minimum, optimal, and maximum values for the shapes must be provided in the --trtexec-args. Otherwise, static shapes will be assumed. This behavior is the same as trtexec. For more information, refer to the supported list of trtexec args.
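
As an illustration of the dynamic-shape requirement, the following Python sketch assembles the minimum, optimal, and maximum shape flags into a single string suitable for --trtexec-args. The helper names and the tensor name "input" are hypothetical; the flag format (name:AxBxC, comma-separated for multiple inputs) follows trtexec.

```python
def shape_flag(flag, shapes):
    """Format one trtexec shape flag, e.g. --minShapes=input:1x3x224x224."""
    specs = ",".join(
        f"{name}:{'x'.join(str(d) for d in dims)}" for name, dims in shapes.items()
    )
    return f"--{flag}={specs}"


def dynamic_shape_args(min_shapes, opt_shapes, max_shapes):
    """Join the three shape flags into one string for --trtexec-args."""
    for name in opt_shapes:
        lo, opt, hi = min_shapes[name], opt_shapes[name], max_shapes[name]
        # Each optimal dimension should lie between its minimum and maximum.
        if not all(a <= b <= c for a, b, c in zip(lo, opt, hi)):
            raise ValueError(f"shapes for {name!r} are not ordered min <= opt <= max")
    return " ".join(
        [
            shape_flag("minShapes", min_shapes),
            shape_flag("optShapes", opt_shapes),
            shape_flag("maxShapes", max_shapes),
        ]
    )


# "input" is a hypothetical tensor name; use the input names of your own model.
args = dynamic_shape_args(
    {"input": (1, 3, 224, 224)},
    {"input": (8, 3, 224, 224)},
    {"input": (16, 3, 224, 224)},
)
print(args)
```

The resulting string can be passed as the value of --trtexec-args, optionally combined with other supported flags such as --fp16.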

Specifying the ONNX Model

The input ONNX model is given to trt-cloud build onnx using the --model argument. It may be in one of three formats:

  • A local ONNX file.

  • A local Zip file that contains an ONNX model and external weights. For more information, refer to Building with Large ONNX Files.

  • A URL to a model hosted on AWS S3 or GitHub. The URL must not require authentication headers.

    • For ONNX models hosted on S3, it is recommended that a pre-signed GET URL with a limited Time to Live (TTL) be created for use with TensorRT-Cloud.

    • For ONNX models hosted on GitHub, use the URL of the raw GitHub file instead of the URL to the GitHub Web UI. This can be achieved by copying the URL linked by the “View Raw”, “Raw”, or “Download Raw File” links. If opening the URL in a new browser tab does not result in the ONNX file downloading to your browser, TensorRT-Cloud will not accept your URL. For example, https://github.com/onnx/models/blob/main/Computer_Vision/mobilenetv2_050_Opset18_timm/mobilenetv2_050_Opset18.onnx is not a valid ONNX model URL, but https://github.com/onnx/models/raw/main/Computer_Vision/mobilenetv2_050_Opset18_timm/mobilenetv2_050_Opset18.onnx is valid.
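
The GitHub URL rule above can be sketched in Python as a small rewrite from the Web UI ("blob") form to the raw form. The function name is hypothetical, and the sketch assumes the github.com/<owner>/<repo>/blob/<ref>/<path> layout seen in the example URLs; it does not verify that the file actually downloads.

```python
from urllib.parse import urlsplit, urlunsplit


def to_raw_github_url(url):
    """Rewrite a GitHub web UI ("blob") URL to its raw-file form.

    URLs that do not match the github.com/<owner>/<repo>/blob/... layout
    are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Path layout: ["", owner, repo, "blob", ref, ...file path]
    if parts.netloc == "github.com" and len(segments) > 4 and segments[3] == "blob":
        segments[3] = "raw"
        return urlunsplit(parts._replace(path="/".join(segments)))
    return url
```

Only the fourth path segment is changed, so a repository or file that happens to contain the word "blob" elsewhere in its path is left intact.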

Output Zip File

When the engine build is complete, TensorRT-Cloud will save the result as a zip file in the directory from which it was called. The zip file will also contain logs and engine metrics. Logs will still be returned if the build fails, for example, in the case of an invalid ONNX input file. The format and layout of the zip file are subject to change over time.

Weightful Engine Generation

A weightful engine is a traditional TensorRT engine that consists of both weights and NVIDIA CUDA kernels. It is self-contained and fully performant.

For more information about weight-stripped engines, refer to the NVIDIA TensorRT documentation.

Use Case

An application developer might want to generate an engine and deploy it as-is within the application.

Weightful engines are the default engine type generated by TensorRT. Note that the generated engine CANNOT be refitted with different weights. For more information about refittable engines, refer to Refit Support.

Example

In this example, mobilenet.onnx is a local ONNX file.

trt-cloud build onnx --model mobilenet.onnx --gpu A100 --os linux --trtexec-args="--fp16 --bf16"

CLI Output

The engine will be available as engine.trt inside the resulting zip file.

[I] Uploading mobilenet.onnx
[I] Uploaded new NVCF asset with ID 2d3ac738-87ff-4205-bd09-294ddb11635b
[I] Selected NVCF Function 55e7f2c8-788c-498a-9ce9-db414e3d48cd with version cf3df613-5371-4e4b-a0e9-d3c9367fdc4a
[I] NVCF Request ID: edea572d-dd63-4433-8759-2193fadcece0
[I] Latest poll status: 202 at 17:38:50. Position in queue: 0.
Downloading  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[I] Last 5 lines of build.log:
---
[I]     [04/30/2024-00:38:49] [W] * GPU compute time is unstable, with coefficient of variance = 4.1679%.
[I]     [04/30/2024-00:38:49] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[I]     [04/30/2024-00:38:49] [I] Explanations of the performance metrics are printed in the verbose logs.
[I]     [04/30/2024-00:38:49] [I]
[I]     &&&& PASSED TensorRT.trtexec [TensorRT v100001] # trtexec --onnx=/var/inf/inputAssets/edea572d-dd63-4433-8759-2193fadcece0/2d3ac738-87ff-4205-bd09-294ddb11635b --saveEngine=/tmp/tmps0vgq76p/out/build_result/engine.trt --timingCacheFile=/tmp/tmps0vgq76p/out/build_result/build/tactic_cache --exportLayerInfo=/tmp/tmps0vgq76p/out/build_result/build/layer_info.json --exportProfile=/tmp/tmps0vgq76p/out/build_result/profiling/session0/timing.json --separateProfileRun --fp16 --bf16
[I] Saved build result to build_result.zip
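
A minimal Python sketch for pulling engine.trt out of the resulting zip file is shown below. Because the layout of the zip file is subject to change, the sketch matches on the file name only rather than assuming a fixed directory inside the archive. The function name is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_engine(zip_path, out_dir="."):
    """Extract the first archive member named engine.trt and return its path."""
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Match by file name only, since the archive layout may change.
            if Path(member).name == "engine.trt":
                return Path(archive.extract(member, path=out_dir))
    raise FileNotFoundError(f"no engine.trt found in {zip_path}")
```

For example, extract_engine("build_result.zip") returns the path of the extracted engine. A FileNotFoundError here usually means the build failed and the zip file contains only logs.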

Weight-Stripped Engine Generation

Starting in version 10.0, TensorRT supports weight-stripped engines: traditional engines that consist of CUDA kernels without the weights.

Use Case

Applications with a small application footprint may build and ship weight-stripped engines for all the NVIDIA GPU SKUs in their installed base without bloating their application binary size. These weight-stripped engines can then be refitted with weights from an ONNX file directly on an end-user GPU.

A weight-stripped engine may be built by calling trt-cloud build with the --strip-weights flag.

This will automatically perform the following steps:

  1. [local] TensorRT-Cloud CLI creates a copy of the ONNX model with the weight values removed.

  2. [local] TensorRT-Cloud CLI uploads the weight-stripped ONNX model to the appropriate endpoint.

  3. [cloud] TensorRT-Cloud generates the weight-stripped engine from the weight-stripped ONNX model.

  4. [optional][local] TensorRT-Cloud CLI downloads and refits the engine with weights from the original ONNX model if --local-refit is specified.

To enable automatic refit by TensorRT-Cloud CLI post engine build, append --local-refit to the build command.

Example

trt-cloud build onnx --strip-weights --model model.onnx --gpu RTX3070 --os windows

Note

When using trtexec standalone, the --stripWeights argument is required to build a weight-stripped engine. However, this CLI will automatically append --stripWeights to the trtexec args.

URL Inputs with --strip-weights

If an ONNX model is specified as a URL, the TensorRT-Cloud server will download that URL directly. This means the weights will NOT be removed from the model before sending to the server. If you do not wish to give the TensorRT-Cloud server access to the model weights, download the model locally and specify the ONNX model as a local file rather than a URL to trt-cloud build.

If --strip-weights is specified with an ONNX model URL, the downloaded TensorRT engine will still be weight-stripped. However, local refitting (--local-refit) is not supported on models specified as a URL. Download the model locally and run the refit command to refit the engine with weights.
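
When builds are scripted, the URL restriction above can be checked before the CLI is invoked. The following Python sketch assembles a trt-cloud build onnx command from the flags described in this section and refuses --local-refit for URL models. The helper names are hypothetical, and the sketch treats only http and https locations as URLs.

```python
import shlex
from urllib.parse import urlsplit


def is_url(model):
    """Treat http(s) locations as URLs; everything else as a local path."""
    return urlsplit(model).scheme in ("http", "https")


def build_command(model, gpu, os_name, trtexec_args="",
                  strip_weights=False, local_refit=False):
    """Return the trt-cloud argument list for an ONNX build."""
    if local_refit and is_url(model):
        raise ValueError("--local-refit is not supported for models given as a URL")
    cmd = ["trt-cloud", "build", "onnx",
           "--model", model, "--gpu", gpu, "--os", os_name]
    if strip_weights:
        cmd.append("--strip-weights")
    if local_refit:
        cmd.append("--local-refit")
    if trtexec_args:
        cmd.append(f"--trtexec-args={trtexec_args}")
    return cmd


# Print the command for review; pass the list to subprocess.run to execute it.
print(shlex.join(build_command("model.onnx", "RTX3070", "windows",
                               strip_weights=True)))
```

Returning a list rather than a single string avoids quoting problems when the command is later passed to subprocess.run.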

Refit Support

To refit the weight-stripped engine with the original weights from the ONNX model, the CLI requires the following in the local environment:

  1. TensorRT Python package with the same version used to build the engine.

  2. GPU with the same SM version used to build the engine.

  3. The same weights that were used to build the weight-stripped engine (unless --refit was specified in the trtexec arguments).

If --local-refit is specified, the refit will be performed at the end of a weight-stripped trt-cloud build unless the model is provided via a URL. The refit may also be run manually using the CLI.

trt-cloud refit --onnx model.onnx -e weightless.trt -o final.trt

Where:

  • model.onnx is the original ONNX model file.

  • weightless.trt is the weight-stripped engine file produced by TensorRT-Cloud.

  • final.trt is the file name for the final refitted engine file.

Note

  • refit is not supported on macOS because TensorRT does not support macOS.

  • By default, weight-stripped engines are only refittable with the original ONNX weights to preserve performance with their weightful equivalent. To allow for general fine-tunable refitting at the cost of performance, refer to Refittable Engine Generation (Weightful or Weight-Stripped).

Refittable Engine Generation (Weightful or Weight-Stripped)

To be fully refittable with a new set of weights, an engine must be built with the --refit trtexec arg. To do this with the TensorRT-Cloud CLI, provide --trtexec-args="--refit" as part of the build command.

To build a weightful and fine-tunable engine, run the following:

trt-cloud build onnx --model model.onnx --gpu RTX4090 --os windows --trtexec-args="--fp16 --refit"

To build a weight-stripped and fine-tunable engine, run the following:

trt-cloud build onnx --strip-weights --model model.onnx --gpu RTX4090 --os windows --trtexec-args="--fp16 --refit"

Note

  • Weight-stripped engines are not fine-tunable by default. They may only be refitted with weights from the original ONNX model. Although weight-stripped engines are built without most original weights, TensorRT may optimize them using the remaining weights, which will break under different weights.

  • TensorRT cannot refit UINT8 and BOOLEAN weights, even if the engine was built with --refit.

Building with Large ONNX Files

ONNX files have a 2 GB size limit. Deep learning models that do not fit into a single ONNX file must be split into a main ONNX file and one or more external weight files. To use an ONNX model with external weight files, compress the ONNX model and weights into a single zip file to pass to TensorRT-Cloud.

zip model.zip onnx_model/*
trt-cloud build onnx --model model.zip --gpu A100 --os linux
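
On systems without the zip utility, the archive can be produced with Python's standard library instead. The sketch below mirrors zip model.zip onnx_model/* by adding each file directly inside the directory under the directory's name. The function name is hypothetical, and the check for a .onnx file is an added convenience, not a documented TensorRT-Cloud requirement.

```python
import zipfile
from pathlib import Path


def zip_onnx_dir(model_dir, zip_path):
    """Zip every file directly inside model_dir, keeping the directory prefix."""
    model_dir = Path(model_dir)
    files = sorted(p for p in model_dir.iterdir() if p.is_file())
    if not any(p.suffix == ".onnx" for p in files):
        raise ValueError(f"no .onnx file found in {model_dir}")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as archive:
        for path in files:
            # Store as <dir name>/<file name>, like `zip model.zip onnx_model/*`.
            archive.write(path, arcname=f"{model_dir.name}/{path.name}")
    return Path(zip_path)
```

For example, zip_onnx_dir("onnx_model", "model.zip") writes model.zip, which can then be passed to --model. allowZip64 is set explicitly because archives holding external weights can exceed the classic zip size limits.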

Note

TensorRT-Cloud currently limits the input file size to 5 GB. You can provide a pre-signed URL to a self-managed S3 bucket to work around this limit.

Resuming Interrupted Builds

TensorRT-Cloud prints a Request ID to the console for every build. If the TensorRT-Cloud CLI is interrupted while a build is in progress (for example, with a Ctrl+C keyboard interrupt), the TensorRT-Cloud server continues building the input model. To continue monitoring an interrupted build, call trt-cloud build request_id with that Request ID. Resume interrupted builds as soon as possible, since TensorRT-Cloud does not guarantee how long a Request ID is kept after the build finishes.

Builds interrupted shortly after invocation may not be assigned a Request ID.
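
If the CLI output is captured to a file or variable, the Request ID can be recovered from it later. The following Python sketch assumes the "[I] NVCF Request ID: ..." line format that appears in the sample output on this page; that format is not a documented interface and may change. The names are hypothetical.

```python
import re

# Matches the "[I] NVCF Request ID: <uuid>" line from the sample output.
REQUEST_ID_RE = re.compile(r"NVCF Request ID:\s*([0-9a-fA-F-]{36})")


def find_request_id(cli_output):
    """Return the Request ID found in captured CLI output, or None."""
    match = REQUEST_ID_RE.search(cli_output)
    return match.group(1) if match else None
```

A return value of None corresponds to the case described above, where the build was interrupted before a Request ID was assigned.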

Note

A TensorRT-Cloud build cannot be canceled once it has started.

To start a build and interrupt it with Ctrl+C once the Request ID is printed, run:

trt-cloud build onnx --model mobilenet.onnx --gpu RTX4090 --os windows
[I] Uploading mobilenet.onnx
[I] NVCF asset already exists with ID 2d3ac738-87ff-4205-bd09-294ddb11635b
[I] Selected NVCF Function 5e357e09-b6bd-4f4b-86a2-3cee7c76985c with version 1b917b1a-5634-4bf3-a719-2fe6dba8422f
[I] NVCF Request ID: b1f3eaa1-4ef2-436e-9bfd-b72daa9c7681
[I] Latest poll status: 202 at 23:15:34. Position in queue: 0.^C
[I]
Caught KeyboardInterrupt. Build status may be queried using Request ID b1f3eaa1-4ef2-436e-9bfd-b72daa9c7681.
Traceback (most recent call last):
...
KeyboardInterrupt

To continue monitoring the build later, run the following:

trt-cloud build request_id b1f3eaa1-4ef2-436e-9bfd-b72daa9c7681
[I] Latest poll status: 202 at 23:17:21. Position in queue: Unknown.
Downloading  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[I] Last 5 lines of build.log:
---
[I]     [04/30/2024-06:16:55] [W] * GPU compute time is unstable, with coefficient of variance = 35.204%.
[I]     [04/30/2024-06:16:55] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[I]     [04/30/2024-06:16:55] [I] Explanations of the performance metrics are printed in the verbose logs.
[I]     [04/30/2024-06:16:55] [I]
[I]     &&&& PASSED TensorRT.trtexec [TensorRT v100001] # trtexec --onnx=/workdir/nvcf_root/assets\2d3ac738-87ff-4205-bd09-294ddb11635b --saveEngine=C:\Windows\SERVIC~1\NETWOR~1\AppData\Local\Temp\tmplw0_paic\out\build_result\engine.trt --timingCacheFile=C:\Windows\SERVIC~1\NETWOR~1\AppData\Local\Temp\tmplw0_paic\out\build_result\build\tactic_cache --exportLayerInfo=C:\Windows\SERVIC~1\NETWOR~1\AppData\Local\Temp\tmplw0_paic\out\build_result\build\layer_info.json --exportProfile=C:\Windows\SERVIC~1\NETWOR~1\AppData\Local\Temp\tmplw0_paic\out\build_result\profiling\session0\timing.json --separateProfileRun
[I] Saved build result to build_result_4.zip

Supported trtexec Arguments

The following arguments can be passed to trtexec through TensorRT-Cloud.

Note

Some arguments related to saving and loading files are not allowed. TensorRT-Cloud will automatically load the ONNX file and save the TensorRT engine, which would otherwise be done with special trtexec arguments.

'allocationStrategy', 'allowGPUFallback', 'allowWeightStreaming', 'avgRuns', 'avgTiming', 'best', 'bf16', 'buildDLAStandalone', 'builderOptimizationLevel', 'calibProfile', 'consistency', 'directIO', 'dumpLayerInfo', 'dumpOptimizationProfile', 'dumpOutput', 'dumpProfile', 'dumpRawBindingsToFile', 'dumpRefit', 'duration', 'errorOnTimingCacheMiss', 'excludeLeanRuntime', 'exposeDMA', 'fp16', 'fp8', 'getPlanVersionOnly', 'hardwareCompatibilityLevel', 'idleTime', 'ignoreParsedPluginLibs', 'infStreams', 'inputIOFormats', 'int4', 'int8', 'iterations', 'layerDeviceTypes', 'layerOutputTypes', 'layerPrecisions', 'markDebug', 'maxAuxStreams', 'maxShapes', 'maxShapesCalib', 'memPoolSize', 'minShapes', 'minShapesCalib', 'noBuilderCache', 'noCompilationCache', 'noDataTransfers', 'noTF32', 'optShapes', 'optShapesCalib', 'outputIOFormats', 'persistentCacheRatio', 'pi', 'pluginInstanceNorm', 'precisionConstraints', 'preview', 'profile', 'profilingVerbosity', 'refit', 'restricted', 'runtimePlatform', 'safe', 'shapes', 'skipInference', 'sleepTime', 'sparsity', 'stripAllWeights', 'stripWeights', 'stronglyTyped', 'tacticSources', 'tempfileControls', 'threads', 'timeDeserialize', 'useCudaGraph', 'useManagedMemory', 'useProfile', 'useRuntime', 'useSpinWait', 'vc', 'verbose', 'versionCompatible', 'warmUp', 'weightStreamingBudget', 'weightless'
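
Because unsupported flags cause the build request to be rejected, it can be convenient to check a --trtexec-args string locally first. The Python sketch below copies the list above into a set and reports any flag names that are not in it. The names are hypothetical, and the check is only a client-side convenience: the server's own validation remains authoritative, and the supported list may change between releases.

```python
import shlex

# Copied from the supported-argument list above.
SUPPORTED_TRTEXEC_ARGS = frozenset("""
allocationStrategy allowGPUFallback allowWeightStreaming avgRuns avgTiming best
bf16 buildDLAStandalone builderOptimizationLevel calibProfile consistency
directIO dumpLayerInfo dumpOptimizationProfile dumpOutput dumpProfile
dumpRawBindingsToFile dumpRefit duration errorOnTimingCacheMiss
excludeLeanRuntime exposeDMA fp16 fp8 getPlanVersionOnly
hardwareCompatibilityLevel idleTime ignoreParsedPluginLibs infStreams
inputIOFormats int4 int8 iterations layerDeviceTypes layerOutputTypes
layerPrecisions markDebug maxAuxStreams maxShapes maxShapesCalib memPoolSize
minShapes minShapesCalib noBuilderCache noCompilationCache noDataTransfers
noTF32 optShapes optShapesCalib outputIOFormats persistentCacheRatio pi
pluginInstanceNorm precisionConstraints preview profile profilingVerbosity
refit restricted runtimePlatform safe shapes skipInference sleepTime sparsity
stripAllWeights stripWeights stronglyTyped tacticSources tempfileControls
threads timeDeserialize useCudaGraph useManagedMemory useProfile useRuntime
useSpinWait vc verbose versionCompatible warmUp weightStreamingBudget
weightless
""".split())


def unsupported_flags(trtexec_args):
    """Return flag names in trtexec_args that are absent from the list."""
    names = []
    for token in shlex.split(trtexec_args):
        if token.startswith("--"):
            # "--memPoolSize=workspace:1024" -> "memPoolSize"
            names.append(token[2:].split("=", 1)[0])
    return [name for name in names if name not in SUPPORTED_TRTEXEC_ARGS]
```

For example, unsupported_flags("--fp16 --saveEngine=out.trt") reports saveEngine, which is one of the file-saving arguments that TensorRT-Cloud handles itself.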