Building an ONNX Engine

Prerequisites

  1. Ensure you can log into TensorRT-Cloud.

    Important

    Building on-demand engines is provided as a closed Early Access (EA) product. Access is restricted and is provided upon request (refer to Getting TensorRT-Cloud Access). These features will not be functional unless access is granted.

ONNX support is provided through on-demand engine building. You must provide your ONNX file and the engine config you would like to target, and TensorRT-Cloud will generate a corresponding engine.

You can generate fully customizable engines through the use of trtexec args. Additionally, the TensorRT-Cloud CLI provides utility flags for building weightless engines. In short, building weightless engines reduces the engine binary size at a potential performance cost.

In the sections below, we provide examples for building different kinds of engines.

Currently, only the latest version of TensorRT 10.0 is supported.

Specifying an Engine Build Configuration

The TensorRT-Cloud CLI trt-cloud build command provides multiple arguments. To see the full list of arguments, run:

trt-cloud build -h

Key arguments that allow for system and engine configuration are:

  • --gpu - Selects the GPU target. Use trt-cloud info to get the list of available GPUs.

  • --os - Selects the OS target (linux or windows).

  • --trtexec-args - Sets trtexec args. TensorRT-Cloud supports a subset of trtexec args through this flag. If a new flag is not explicitly supported, TensorRT-Cloud will reject the build request.

    • If the model has dynamic input shapes, then minimum, optimal, and maximum values for the shapes must be provided in the --trtexec-args, as shown in the example below. Otherwise, static shapes will be assumed. This behavior is the same as in trtexec. For more information, refer to the supported list of trtexec args.
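
For example, a minimal sketch assuming a model with one dynamic input tensor named input of shape Nx3x224x224 (the tensor name and dimensions are illustrative):

trt-cloud build --onnx model.onnx --gpu A100 --os linux --trtexec-args="--minShapes=input:1x3x224x224 --optShapes=input:8x3x224x224 --maxShapes=input:32x3x224x224"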

Specifying the ONNX Model

The input ONNX model is given to trt-cloud build using the --onnx argument. It may be in one of three formats:

  • A local ONNX file.

  • A local Zip file which contains an ONNX model and external weights. For more information, refer to Building with Large ONNX Files.

  • A URL to a model hosted on AWS S3 or GitHub. The URL must not require authentication headers.

    • For ONNX models hosted on S3, it is recommended to create a presigned GET URL with a limited Time-to-Live (TTL) for use with TensorRT-Cloud (see the example after this list).

    • For ONNX models hosted on GitHub, use the URL of the raw GitHub file instead of the URL to the GitHub Web UI. This can be achieved by copying the URL linked by the “View Raw”, “Raw”, or “Download Raw File” links. If opening the URL in a new browser tab does not result in the ONNX file downloading to your browser, TensorRT-Cloud will not accept your URL. For example, https://github.com/onnx/models/blob/main/Computer_Vision/mobilenetv2_050_Opset18_timm/mobilenetv2_050_Opset18.onnx is not a valid ONNX model URL, but https://github.com/onnx/models/raw/main/Computer_Vision/mobilenetv2_050_Opset18_timm/mobilenetv2_050_Opset18.onnx is valid.
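
For example (a sketch; the bucket and object names are placeholders), a presigned GET URL for an S3-hosted model can be created with the AWS CLI and passed directly to trt-cloud build:

aws s3 presign s3://my-bucket/mobilenet.onnx --expires-in 3600
trt-cloud build --onnx "<presigned URL printed by the previous command>" --gpu A100 --os linux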

Output Zip File

When the engine build completes, TensorRT-Cloud will save the result as a zip file to the directory where it was called from. The zip file will also contain logs and engine metrics. Logs will still be returned if the build ends in a failure, for example, in the case of an invalid ONNX input file. The format and layout of the zip file is subject to change over time.
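
For example, the layout of the returned archive can be inspected with a standard zip tool (build_result.zip matches the file name shown in the CLI output below; the exact contents may vary between releases):

unzip -l build_result.zip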

Weightful Engine Generation

A weightful engine is a traditional TensorRT engine that consists of both weights and NVIDIA CUDA kernels. Such an engine is self-contained and fully performant.

For more information about weightless engines, refer to the NVIDIA TensorRT documentation.

Use Case

An application developer might want to generate an engine and deploy it as-is within the application.

Weightful engines are the default engine type generated by TensorRT. Note that the generated engine CANNOT be refitted with different weights. For more information about refittable engines, refer to Refit Support.

Example

In this example, mobilenet.onnx is a local ONNX file.

trt-cloud build --onnx mobilenet.onnx --gpu A100 --os linux --trtexec-args="--fp16 --bf16"

CLI Output

The engine will be available as engine.trt inside the resulting zip file.

[I] Uploading mobilenet.onnx
[I] Uploaded new NVCF asset with ID 2d3ac738-87ff-4205-bd09-294ddb11635b
[I] Selected NVCF Function 55e7f2c8-788c-498a-9ce9-db414e3d48cd with version cf3df613-5371-4e4b-a0e9-d3c9367fdc4a
[I] NVCF Request ID: edea572d-dd63-4433-8759-2193fadcece0
[I] Latest poll status: 202 at 17:38:50. Position in queue: 0.
Downloading  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[I] Last 5 lines of build.log:
---
[I]     [04/30/2024-00:38:49] [W] * GPU compute time is unstable, with coefficient of variance = 4.1679%.
[I]     [04/30/2024-00:38:49] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[I]     [04/30/2024-00:38:49] [I] Explanations of the performance metrics are printed in the verbose logs.
[I]     [04/30/2024-00:38:49] [I]
[I]     &&&& PASSED TensorRT.trtexec [TensorRT v100001] # trtexec --onnx=/var/inf/inputAssets/edea572d-dd63-4433-8759-2193fadcece0/2d3ac738-87ff-4205-bd09-294ddb11635b --saveEngine=/tmp/tmps0vgq76p/out/build_result/engine.trt --timingCacheFile=/tmp/tmps0vgq76p/out/build_result/build/tactic_cache --exportLayerInfo=/tmp/tmps0vgq76p/out/build_result/build/layer_info.json --exportProfile=/tmp/tmps0vgq76p/out/build_result/profiling/session0/timing.json --separateProfileRun --fp16 --bf16
[I] Saved build result to build_result.zip

Weight-Stripped Engine Generation

Starting with version 10.0, TensorRT supports weight-stripped engines, which are traditional engines consisting of CUDA kernels minus the weights.

Use Case

Applications that care about a small application footprint may build and ship weight-stripped engines for all the NVIDIA GPU SKUs in their installed base without bloating their application binary size. These weight-stripped engines can then be refitted with weights from an ONNX file directly on an end-user GPU.

A weight-stripped engine may be built by calling trt-cloud build with the --strip-weights flag.

This will automatically perform the following steps:

  1. [local] TensorRT-Cloud CLI creates a copy of the ONNX model with the weight values removed.

  2. [local] TensorRT-Cloud CLI uploads the weightless ONNX model to the appropriate endpoint.

  3. [cloud] TensorRT-Cloud generates the weightless engine from the weightless ONNX model.

  4. [optional][local] TensorRT-Cloud CLI downloads and refits the engine with weights from the original ONNX model, if --local-refit is specified.

To enable automatic refit by the TensorRT-Cloud CLI after the engine build, append --local-refit to the build command.

Example

trt-cloud build --strip-weights --onnx model.onnx --gpu RTX3070 --os windows
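
To also refit the weight-stripped engine locally once the build completes, append --local-refit (a sketch using the same illustrative model and targets):

trt-cloud build --strip-weights --local-refit --onnx model.onnx --gpu RTX3070 --os windows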

Note

When using trtexec standalone, the --stripWeights argument is required to build a weightless engine. However, the TensorRT-Cloud CLI automatically appends --stripWeights to the trtexec args when building with --strip-weights.

URL Inputs with --strip-weights

If an ONNX model is specified as a URL, the TensorRT-Cloud server will download that URL directly. This means that the weights will NOT be removed from the model before the model is sent to the server. If you do not wish to give the TensorRT-Cloud server access to the model weights, then download the model locally and specify the ONNX model as a local file rather than a URL to trt-cloud build.

If --strip-weights is specified in combination with an ONNX model URL, then the downloaded TensorRT engine will still be a weight-stripped engine. However, local refitting (--local-refit) is not supported on models that are specified as a URL. To refit the engine with weights, download the model locally, and run the refit command.
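
For example (a sketch; the URL and file names are placeholders), download the model and then refit the returned weight-stripped engine locally:

curl -L -o model.onnx https://example.com/model.onnx
trt-cloud refit --onnx model.onnx -e weightless.trt -o final.trt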

Refit Support

To refit the weight-stripped engine with the original weights from the ONNX model, the CLI requires the following in the local environment (a quick check is sketched after this list):

  1. TensorRT Python package with the same version used to build the engine.

  2. GPU with the same SM version used to build the engine.

  3. The same weights that were used to build the weightless engine (unless --refit was specified in the trtexec arguments).
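
As a quick local sanity check (a sketch, assuming the tensorrt Python package is installed and a reasonably recent NVIDIA driver), the TensorRT version and GPU compute capability can be queried with:

python3 -c "import tensorrt; print(tensorrt.__version__)"
nvidia-smi --query-gpu=compute_cap --format=csv,noheader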

Refit will be performed at the end of a weightless trt-cloud build if --local-refit is specified, unless the model is provided via a URL. Refit may also be run manually using the CLI.

trt-cloud refit --onnx model.onnx -e weightless.trt -o final.trt

Where:

  • model.onnx is the original ONNX model file.

  • weightless.trt is the weightless engine file produced by TensorRT-Cloud.

  • final.trt is the file name for the final refitted engine file.

Note

  • refit is not supported on macOS as TensorRT does not support macOS.

  • By default, weight-stripped engines are only refittable with the original ONNX weights in order to preserve performance parity with their weightful equivalents. To allow for general fine-tunable refitting at the cost of performance, refer to Refittable Engine Generation (Weightful or Weightless).

Refittable Engine Generation (Weightful or Weightless)

For an engine to be fully refittable with a new set of weights, it must be built with the --refit trtexec arg. To do this with the TensorRT-Cloud CLI, provide --trtexec-args="--refit" as part of the build command.

To build a weightful and fine-tunable engine, run:

trt-cloud build --onnx model.onnx --gpu RTX4090 --os windows --trtexec-args="--fp16 --refit"

To build a weightless and fine-tunable engine, run:

trt-cloud build --strip-weights --onnx model.onnx --gpu RTX4090 --os windows --trtexec-args="--fp16 --refit"

Note

  • Weightless engines are not fine-tunable by default. They may only be refitted with weights from the original ONNX model. Although weightless engines are built without most of their original weights, TensorRT may make optimizations using the remaining weights, which would break under a different set of weights.

  • TensorRT cannot refit UINT8 and BOOLEAN weights, even if the engine was built with --refit.

Building with Large ONNX Files

ONNX files have a 2 GB size limit. Deep learning models which do not fit into a single ONNX file must be split into a main ONNX file and one or more external weight files. To use an ONNX model with external weight files, compress the ONNX model and weights into a single zip file to pass to TensorRT-Cloud.

zip model.zip onnx_model/*
trt-cloud build --onnx model.zip --gpu A100 --os linux

Note

TensorRT-Cloud currently has a 5 GB limit on the input file size. To work around this limit, you can provide a pre-signed URL to a self-managed S3 bucket.

Resuming Interrupted Builds

TensorRT-Cloud prints a Request ID to the console for every build. If the TensorRT-Cloud CLI is interrupted while a build is in progress (for example, with a Ctrl+C keyboard interrupt), the TensorRT-Cloud server will still continue building the input model. It is possible to continue monitoring the build status of an interrupted build by calling trt-cloud build and passing the --request-id argument instead of --onnx. Interrupted builds should be resumed as soon as possible, since TensorRT-Cloud does not guarantee how long a Request ID is kept after the build finishes.

Builds which are interrupted very shortly after invocation may not be assigned a Request ID.

Note

TensorRT-Cloud builds that were previously started cannot be canceled partway.

To start a build and interrupt it with Ctrl+C once the Request ID is printed, run:

trt-cloud build --onnx mobilenet.onnx --gpu RTX4090 --os windows
[I] Uploading mobilenet.onnx
[I] NVCF asset already exists with ID 2d3ac738-87ff-4205-bd09-294ddb11635b
[I] Selected NVCF Function 5e357e09-b6bd-4f4b-86a2-3cee7c76985c with version 1b917b1a-5634-4bf3-a719-2fe6dba8422f
[I] NVCF Request ID: b1f3eaa1-4ef2-436e-9bfd-b72daa9c7681
[I] Latest poll status: 202 at 23:15:34. Position in queue: 0.^C
[I]
Caught KeyboardInterrupt. Build status may be queried using Request ID b1f3eaa1-4ef2-436e-9bfd-b72daa9c7681.
Traceback (most recent call last):
...
KeyboardInterrupt

To continue monitoring the build later, run:

trt-cloud build --request-id b1f3eaa1-4ef2-436e-9bfd-b72daa9c7681
[I] Latest poll status: 202 at 23:17:21. Position in queue: Unknown.
Downloading  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
[I] Last 5 lines of build.log:
---
[I]     [04/30/2024-06:16:55] [W] * GPU compute time is unstable, with coefficient of variance = 35.204%.
[I]     [04/30/2024-06:16:55] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[I]     [04/30/2024-06:16:55] [I] Explanations of the performance metrics are printed in the verbose logs.
[I]     [04/30/2024-06:16:55] [I]
[I]     &&&& PASSED TensorRT.trtexec [TensorRT v100001] # trtexec --onnx=/workdir/nvcf_root/assets\2d3ac738-87ff-4205-bd09-294ddb11635b --saveEngine=C:\Windows\SERVIC~1\NETWOR~1\AppData\Local\Temp\tmplw0_paic\out\build_result\engine.trt --timingCacheFile=C:\Windows\SERVIC~1\NETWOR~1\AppData\Local\Temp\tmplw0_paic\out\build_result\build\tactic_cache --exportLayerInfo=C:\Windows\SERVIC~1\NETWOR~1\AppData\Local\Temp\tmplw0_paic\out\build_result\build\layer_info.json --exportProfile=C:\Windows\SERVIC~1\NETWOR~1\AppData\Local\Temp\tmplw0_paic\out\build_result\profiling\session0\timing.json --separateProfileRun
[I] Saved build result to build_result_4.zip

Supported trtexec Arguments

The following arguments can be passed to trtexec through TensorRT-Cloud.

Note

Some arguments that relate to saving and loading files are not allowed. TensorRT-Cloud automatically handles loading the ONNX file and saving the TensorRT engine, which would otherwise be done with dedicated trtexec arguments.

--allocationStrategy, --allowGPUFallback, --allowWeightStreaming, --avgRuns, --avgTiming, --best, --bf16, --buildDLAStandalone, --builderOptimizationLevel, --calibProfile, --consistency, --directIO, --dumpLayerInfo, --dumpOptimizationProfile, --dumpOutput, --dumpProfile, --dumpRawBindingsToFile, --dumpRefit, --duration, --errorOnTimingCacheMiss, --excludeLeanRuntime, --exposeDMA, --fp16, --fp8, --getPlanVersionOnly, --hardwareCompatibilityLevel, --idleTime, --ignoreParsedPluginLibs, --infStreams, --inputIOFormats, --int8, --iterations, --layerDeviceTypes, --layerOutputTypes, --layerPrecisions, --markDebug, --maxAuxStreams, --maxShapes, --maxShapesCalib, --memPoolSize, --minShapes, --minShapesCalib, --noBuilderCache, --noCompilationCache, --noDataTransfers, --noTF32, --optShapes, --optShapesCalib, --outputIOFormats, --persistentCacheRatio, --pi, --pluginInstanceNorm, --precisionConstraints, --preview, --profile, --profilingVerbosity, --refit, --restricted, --safe, --shapes, --skipInference, --sleepTime, --sparsity, --stripWeights, --stronglyTyped, --tacticSources, --tempfileControls, --threads, --timeDeserialize, --useCudaGraph, --useManagedMemory, --useProfile, --useRuntime, --useSpinWait, --vc, --verbose, --versionCompatible, --warmUp, --weightStreamingBudget, --weightless
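
For illustration (a sketch; the model, GPU target, and chosen values are arbitrary), several of these arguments can be combined in a single build:

trt-cloud build --onnx model.onnx --gpu A100 --os linux --trtexec-args="--bf16 --builderOptimizationLevel=4 --sparsity=enable --useCudaGraph"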