Building an ONNX Engine#
ONNX supports a wide range of model types. Given an ONNX file and a target configuration, TensorRT-Cloud generates a corresponding engine. Because the build is driven by trtexec arguments, the resulting engines are fully customizable.
Specifying an Engine Build Configuration#
The TensorRT-Cloud CLI trt-cloud build command provides multiple arguments. To see the full list of arguments, run:
trt-cloud build onnx -h
Key arguments that allow for system and engine configuration are:
--gpu picks the GPU target. Use trt-cloud info to get the list of available GPUs.
--os picks the OS target (linux or windows).
--trtexec-args sets the trtexec arguments. TensorRT-Cloud supports a subset of trtexec arguments through this flag. If a flag is not explicitly supported, TensorRT-Cloud will reject the build request. If the model has dynamic input shapes, then minimum, optimal, and maximum values for the shapes must be provided in --trtexec-args (see the example after this list); otherwise, static shapes are assumed. This behavior is the same as trtexec. For more information, refer to the supported list of trtexec args.
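For example, a model with a dynamic batch dimension could be built by passing minimum, optimal, and maximum shapes through --trtexec-args using standard trtexec shape syntax. This is a sketch only: the input tensor name input and the shape values are placeholders, and the GPU and OS targets are examples.
trt-cloud build onnx --src-path model.onnx --gpu A100 --os linux --trtexec-args="--fp16 --minShapes=input:1x3x224x224 --optShapes=input:8x3x224x224 --maxShapes=input:16x3x224x224"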
Specifying the ONNX Model#
The input ONNX model is given to trt-cloud build onnx using one of --src-path, --src-ngc, or --src-url. Each argument expects a different type of input; examples of all three follow the lists below.
Where:
--src-path is a local path that contains the ONNX model.
--src-url is the URL to a model hosted on AWS S3 or GitHub.
Note
The URL must not require authentication headers.
For ONNX models hosted on S3, it is recommended that a pre-signed GET URL with a limited Time to Live (TTL) is created for use with TensorRT-Cloud.
For ONNX models hosted on GitHub, use the URL of the raw GitHub file instead of the URL to the GitHub Web UI. This can be achieved by copying the URL linked by the View Raw, Raw, or Download Raw File links. If opening the URL in a new browser tab does not result in the ONNX file downloading to your browser, TensorRT-Cloud will not accept your URL. For example:
Not a valid ONNX model URL: https://github.com/onnx/models/blob/main/Computer_Vision/mobilenetv2_050_Opset18_timm/mobilenetv2_050_Opset18.onnx
A valid ONNX model URL: https://github.com/onnx/models/raw/main/Computer_Vision/mobilenetv2_050_Opset18_timm/mobilenetv2_050_Opset18.onnx
--src-ngc is an NGC Private Registry model location in org/[team/]name[:version] format which contains the ONNX model. For example:
my-org/my-team/onnx-model:1.0
my-org/onnx-model:custom-version
For all of the above, the local path, URL, or NGC model must contain one of the following:
A single ONNX file with the .onnx file extension.
A directory containing an ONNX model at the top level.
A zip file containing an ONNX model at the top level.
A local zip file or directory that contains an ONNX model and external weights. For more information, refer to the Building with Large ONNX Files section.
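For illustration, the three source options might be invoked as follows. This is a sketch only: the local directory name is a placeholder, the URL is the raw GitHub example from above, the NGC location reuses the example format, and the GPU, OS, and trtexec arguments are examples.
trt-cloud build onnx --src-path ./onnx_model_dir --gpu A100 --os linux --trtexec-args="--fp16"
trt-cloud build onnx --src-url https://github.com/onnx/models/raw/main/Computer_Vision/mobilenetv2_050_Opset18_timm/mobilenetv2_050_Opset18.onnx --gpu A100 --os linux --trtexec-args="--fp16"
trt-cloud build onnx --src-ngc my-org/my-team/onnx-model:1.0 --gpu A100 --os linux --trtexec-args="--fp16"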
Querying the Status of a Build#
The status of a build can be queried via the trt-cloud build status command.
$ trt-cloud build status <build_id>
┌──────────────────────────────────────────────────────────────────────────────
│ In Progress - Running engine build
│ 0.6 min
├──────────────────────────────────────────────────────────────────────────────
│ Latest 5 lines of trial log:
│ <log lines>
└──────────────────────────────────────────────────────────────────────────────
In the event of an unsuccessful build, we do our best to report the error message encountered during the build process.
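A build can also be polled from a shell script until it leaves the in-progress state. The sketch below assumes only that the status output contains the "In Progress" text shown above and that a 30-second polling interval is acceptable.
# Poll until the status output no longer reports "In Progress", then print the final status.
BUILD_ID=<build_id>
while trt-cloud build status "$BUILD_ID" | grep -q "In Progress"; do
    sleep 30
done
trt-cloud build status "$BUILD_ID"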
Obtaining the Results of a Build#
The results of a build can be obtained via the trt-cloud build results command.
$ trt-cloud build results <BUILD_ID>
[I] Built engine was uploaded to NGC Private Registry: <link>
[I] Total size of built engine: 5.540 MB
Would you like to download the build result? [y/N]: y
[I] Downloading build result to './trt-cloud-build-result-<build_id>'...
[I] Download complete.
By default, built engines are uploaded to your organization’s NGC Private Registry under the model name trt-cloud-build-result-<build_id>. You can either inspect the build result on the NGC website via the displayed link, or download the build result directly.
Weightful Engine Generation#
A weightful engine is a traditional TensorRT engine that consists of both weights and NVIDIA CUDA kernels. It is self-contained and fully performant.
For more information about weight-stripped engines, refer to the NVIDIA TensorRT documentation.
Use Case#
An application developer might want to generate engines and deploy them as-is within the application.
Weightful engines are the default engine type generated by TensorRT. Note that the generated engine cannot be refitted with different weights. For more information about refittable engines, refer to Refit Support.
Example
In this example, mobilenetv2_050_Opset18.onnx is a local ONNX file.
$ trt-cloud build onnx --src-path mobilenetv2_050_Opset18.onnx --gpu A100 --os linux --trtexec-args="--fp16 --bf16"
[I] Local model was provided. Checking NGC upload cache for existing model...
[I] Configuring NGC client with org: <org>, team: <team>
[I] Validating configuration...
[I] Successfully validated configuration.
[I] NGC client configured successfully.
[I] Computing hash of local path 'mobilenetv2_050_Opset18.onnx' for cache lookup...
[I] Creating NGC Model 'local-model-4596b9cc' in Private Registry
[I] Uploading local path 'mobilenetv2_050_Opset18.onnx' to NGC Model 'local-model-4596b9cc'
Upload progress: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.9/7.9 MB 100% 0:00:00
[I] Successfully uploaded NGC Model 'local-model-4596b9cc'
[I] Build session with build_id: <build_id> started.
[I] To check the status of the build, run:
[I] trt-cloud build status <build_id>
Download results:
$ trt-cloud build results <build_id>
[I] Built engine was uploaded to NGC Private Registry: <link>
[I] Total size of built engine: 5.540 MB
Would you like to download the build result? [y/N]: y
[I] Downloading engine to './trt-cloud-build-result-<build_id>'...
[I] Download complete.
Weight-Stripped Engine Generation#
Starting in TensorRT version 10.0, TensorRT supports weight-stripped engines: traditional engines consisting of the CUDA kernels minus the weights.
Use Case#
Applications with a small application footprint may build and ship weight-stripped engines for all the NVIDIA GPU SKUs in their installed base without bloating their application binary size. These weight-stripped engines can then be refitted with weights from an ONNX file directly on an end-user GPU. For more information, refer to the TensorRT Developer Guide.
A weight-stripped engine can be created by using trt-cloud build with the --strip-weights flag.
This will automatically perform the following steps:
[local] TensorRT-Cloud CLI creates a copy of the ONNX model with the weight values removed.
[local] TensorRT-Cloud CLI uploads the weight-stripped ONNX model to the appropriate endpoint.
[cloud] TensorRT-Cloud generates the weight-stripped engine from the weight-stripped ONNX model.
For example:
trt-cloud build onnx --strip-weights --src-path model.onnx --gpu RTX3070 --os windows
Note
When using trtexec standalone, the --stripWeights argument is required to build a weight-stripped engine. However, the TensorRT-Cloud CLI will automatically append --stripWeights to the trtexec args.
URL Inputs with --strip-weights#
If an ONNX model is specified as a URL, the TensorRT-Cloud server will download that URL directly. This means the weights will NOT be removed from the model before sending it to the server. If you do not wish to give the TensorRT-Cloud server access to the model weights, download the model locally and specify the ONNX model as a local file rather than a URL to trt-cloud build.
If --strip-weights is specified with an ONNX model URL, the downloaded TensorRT engine will still be weight-stripped.
Refit Support#
To refit the weight-stripped engine with the original weights from the ONNX model, the CLI requires the following in the local environment (a quick way to check the first two is sketched after this list):
TensorRT Python package with the same version used to build the engine.
GPU with the same SM version used to build the engine.
The same weights that were used to build the weight-stripped engine (unless --refit was specified in the trtexec arguments).
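A quick way to check the first two requirements on the machine that will perform the refit is sketched below. It assumes python3 and nvidia-smi are on the PATH; the compute_cap query field requires a reasonably recent driver.
# Print the locally installed TensorRT Python package version.
python3 -c "import tensorrt; print(tensorrt.__version__)"
# Print the compute capability (SM version) of the local GPU(s).
nvidia-smi --query-gpu=compute_cap --format=csv,noheader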
To refit an ONNX model, use the trt-cloud refit CLI command:
trt-cloud refit --onnx model.onnx -e weightless.trt -o final.trt
Where:
model.onnx is the original ONNX model file.
weightless.trt is the weight-stripped engine file produced by TensorRT-Cloud.
final.trt is the file name for the final refitted engine file.
Note
refit is not supported on macOS, as TensorRT does not support macOS.
By default, weight-stripped engines are only refittable with the original ONNX weights to preserve performance parity with their weightful equivalent. To allow for general fine-tunable refitting at the cost of performance, refer to Refittable Engine Generation (Weightful or Weight-Stripped).
Refittable Engine Generation (Weightful or Weight-Stripped)#
To be fully refittable with a new set of weights, an engine must be built with the --refit trtexec argument. To do this with the TensorRT-Cloud CLI, provide --trtexec-args="--refit" as part of the build command.
To build a weightful and fine-tunable engine, run:
trt-cloud build onnx --src-path model.onnx --gpu RTX4090 --os windows --trtexec-args="--fp16 --refit"
To build a weight-stripped and fine-tunable engine, run:
trt-cloud build onnx --strip-weights --src-path model.onnx --gpu RTX4090 --os windows --trtexec-args="--fp16 --refit"
Note
Weight-stripped engines are not fine-tunable by default; they may only be refitted with weights from the original ONNX model. Although weight-stripped engines are built without most of the original weights, TensorRT may apply optimizations that depend on the remaining weights, and those optimizations would break if different weights were supplied.
TensorRT cannot refit UINT8 and BOOLEAN weights, even if the engine was built with --refit.
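Once a fine-tunable engine has been built this way, it can later be refitted with an updated set of weights using the same trt-cloud refit command described above. In this sketch, finetuned_model.onnx is a hypothetical ONNX file with the same graph structure but updated weights, and engine.trt is the downloaded fine-tunable engine.
trt-cloud refit --onnx finetuned_model.onnx -e engine.trt -o refitted_engine.trt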
Building with Large ONNX Files#
ONNX files have a 2 GB size limit. Deep learning models that do not fit into a single ONNX file must be split into a main ONNX file and one or more external weight files. To use an ONNX model with external weight files, compress the ONNX model and weights into a single zip file to pass to TensorRT-Cloud.
zip model.zip onnx_model/*
trt-cloud build onnx --src-path model.zip --gpu A100 --os linux
Supported trtexec Arguments#
The following list of arguments can be passed to trtexec using TensorRT-Cloud.
Note
Some arguments related to saving and loading files are not allowed. TensorRT-Cloud will automatically load the ONNX file and save the TensorRT engine, which would otherwise be done with special trtexec arguments.
'allocationStrategy', 'allowGPUFallback', 'allowWeightStreaming', 'avgRuns', 'avgTiming', 'best', 'bf16', 'buildDLAStandalone', 'builderOptimizationLevel', 'calibProfile', 'consistency', 'directIO', 'dumpLayerInfo', 'dumpOptimizationProfile', 'dumpOutput', 'dumpProfile', 'dumpRawBindingsToFile', 'dumpRefit', 'duration', 'errorOnTimingCacheMiss', 'excludeLeanRuntime', 'exposeDMA', 'fp16', 'fp8', 'getPlanVersionOnly', 'hardwareCompatibilityLevel', 'idleTime', 'ignoreParsedPluginLibs', 'infStreams', 'inputIOFormats', 'int4', 'int8', 'iterations', 'layerDeviceTypes', 'layerOutputTypes', 'layerPrecisions', 'markDebug', 'maxAuxStreams', 'maxShapes', 'maxShapesCalib', 'memPoolSize', 'minShapes', 'minShapesCalib', 'noBuilderCache', 'noCompilationCache', 'noDataTransfers', 'noTF32', 'optShapes', 'optShapesCalib', 'outputIOFormats', 'persistentCacheRatio', 'pi', 'pluginInstanceNorm', 'precisionConstraints', 'preview', 'profile', 'profilingVerbosity', 'refit', 'restricted', 'runtimePlatform', 'safe', 'shapes', 'skipInference', 'sleepTime', 'sparsity', 'stripAllWeights', 'stripWeights', 'stronglyTyped', 'tacticSources', 'tempfileControls', 'threads', 'timeDeserialize', 'useCudaGraph', 'useManagedMemory', 'useProfile', 'useRuntime', 'useSpinWait', 'vc', 'verbose', 'versionCompatible', 'warmUp', 'weightStreamingBudget', 'weightless'
For more information, refer to the TensorRT Developer Guide.
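For example, several of the supported arguments can be combined in a single build. This is a sketch only: the values follow standard trtexec syntax but are illustrative, and the model, GPU, and OS targets are placeholders.
trt-cloud build onnx --src-path model.onnx --gpu A100 --os linux --trtexec-args="--fp16 --builderOptimizationLevel=4 --sparsity=enable"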
Custom Build Result Location#
Engine build output management is shared between ONNX engine builds and TensorRT-LLM engine builds.
For more information on how to provide custom output locations, refer to the Custom Build Result Locations section.
Running a TensorRT Engine#
A successful build returns a TensorRT engine in the build result archive.
For production deployment of TensorRT engines, NVIDIA Triton Inference Server with the TensorRT backend can be used. For more information, refer to the Triton Inference Server Quickstart Guide.
Another option is to use the TensorRT APIs. For more information, refer to the TensorRT Quick Start Guide.
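For a quick local smoke test before production deployment, the engine can also be loaded and benchmarked with trtexec on a machine that matches the build target (same GPU model, OS, and TensorRT version). The sketch below assumes the downloaded engine file is named engine.trt; the actual file name inside the build result may differ.
# Deserialize the downloaded engine and run timed inference passes on it.
trtexec --loadEngine=engine.trt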