CPU-Only AOT and TensorRT-RTX Engines#

The CPU-only Ahead-of-Time (AOT) feature is enabled by default. When building an engine, TensorRT-RTX does not require a GPU device to be present. The resulting engine is portable across both Windows and Linux operating systems and runs on any supported NVIDIA Ampere or later device.

All model weights are stored inside the engine by default, and NVIDIA Turing devices are not supported without further configuration. This section describes the compute capability API and provides instructions for building a weightless engine for deployment with a minimal storage footprint.

Compute Capability#

Compute capability defines the hardware features and supported instructions for each NVIDIA GPU architecture. For information regarding the compute capability of your GPU, refer to NVIDIA CUDA GPU Compute Capability. By default, an engine built by TensorRT-RTX can be run on GPU devices with compute capability 8.0, 8.6, 8.9, and 12.0. To build engines for Turing devices, a compute capability of 7.5 needs to be specified through the API.

Specifying Target Compute Capabilities#

One or more compute capabilities can be specified through IBuilderConfig. The following example shows how to set a single target compute capability of 7.5 for Turing RTX devices.

C++

IBuilderConfig* config = builder->createBuilderConfig();
config->setNbComputeCapabilities(1);
config->setComputeCapability(ComputeCapability::kSM75, 0);

Python

builder_config = builder.create_builder_config()
builder_config.num_compute_capabilities = 1
builder_config.set_compute_capability(trt.ComputeCapability.SM75, 0)

A ComputeCapability::kCURRENT flag is provided to turn off CPU-only AOT and compile the engine for the GPU device present in the build environment. Currently, kCURRENT is only supported as the sole target, that is, when the number of compute capabilities is set to one.
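For example, a minimal sketch of targeting the current GPU, reusing the config object from the example above:

// Compile for the GPU present at build time; this disables CPU-only AOT.
config->setNbComputeCapabilities(1);
config->setComputeCapability(ComputeCapability::kCURRENT, 0);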

When one or more compute capabilities are set, the built engine can only run on the specified target devices. If Turing (compute capability 7.5) is specified as one of multiple targets, the resulting engine is not guaranteed to be performant on Ampere or later devices.
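The following sketch targets two architectures at once; the enumerator names kSM86 and kSM89 are assumptions made by analogy with kSM75:

// Build one engine for both Ampere (8.6) and Ada (8.9) targets.
// Enumerators kSM86 and kSM89 are assumed by analogy with kSM75.
config->setNbComputeCapabilities(2);
config->setComputeCapability(ComputeCapability::kSM86, 0);
config->setComputeCapability(ComputeCapability::kSM89, 1);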

The tensorrt_rtx tool provides the --useGpu flag, which is equivalent to setting ComputeCapability::kCURRENT, and the --computeCapabilities=<list_of_CCs> option, which accepts one or more compute capabilities.
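As an illustration, a hypothetical invocation targeting Turing and Ampere; the --onnx and --saveEngine options and the exact list syntax are assumptions based on trtexec-style tooling:

tensorrt_rtx --onnx=model.onnx --computeCapabilities=7.5,8.6 --saveEngine=model.engine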

Compile Models with Hardware-Specific Data Types#

When hardware-specific data types appear in the network definition, TensorRT-RTX emits either a warning message or an error, depending on how the API is called.

Commonly used hardware-specific data types include:

  • FP4 is only supported on NVIDIA Blackwell and later (compute capability 12.0 and later).

  • FP8 is only supported on NVIDIA Ada Lovelace and later (compute capability 8.9 and later).

  • BF16 and INT4 are only supported on Ampere and later (compute capability 8.0 and later).

  • FP16, FP32, and integer types are supported on all supported hardware.

The behavior of TensorRT-RTX depends on whether one or more compute capabilities are specified. For example:

  • If no compute capability is specified (default), then:

    • For a model containing FP8, the compiled engine can only be run on Ada and later (compute capability 8.9 or later).

    • For a model containing FP4, the compiled engine can only be run on Blackwell and later (compute capability 12.0 or later).

    • For all other cases, the compiled engine can be run on Ampere and later (compute capability 8.0 or later).

  • If one or more compute capabilities are specified, then:

    • If the model contains a data type that is not supported by one or more of the specified compute capabilities, an error is recorded through IErrorRecorder.

Weightless Engines#

An engine without weights helps minimize the storage footprint for deployment. When the weight-stripping build configuration is enabled, TensorRT-RTX enables refit only for those constant weights that do not impact the builder’s ability to optimize, so it still produces an engine with the same runtime performance as a non-refittable engine. Those weights are then omitted from the serialized engine, resulting in a small engine file that can be refitted at runtime using custom weights or the weights from the original ONNX model. When the weight-stripping build configuration is enabled, a GPU device is required for the AOT build, and at most one compute capability can be set through the builder config.

Building a Weightless Engine#

The tensorrt_rtx tool provides flags --stripWeights and --refit to enable the weight-stripping build configuration. The corresponding builder flags are kSTRIP_PLAN and kREFIT.

C++

...
config->setFlag(BuilderFlag::kSTRIP_PLAN);
config->setFlag(BuilderFlag::kREFIT);
IHostMemory* serializedEngine = builder->buildSerializedNetwork(network, config);

Python

...
config.flags |= 1 << int(trt.BuilderFlag.STRIP_PLAN)
config.flags |= 1 << int(trt.BuilderFlag.REFIT)
serialized_engine = builder.build_serialized_network(network, config)

After the engine is built, save the engine file and distribute it with the application installer.
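A minimal sketch of saving the serialized engine, assuming serializedEngine is the IHostMemory pointer returned by buildSerializedNetwork above and using a hypothetical file name:

#include <fstream>

// Write the weight-stripped engine bytes to disk for distribution.
std::ofstream engineFile("model_stripped.engine", std::ios::binary);
engineFile.write(static_cast<char const*>(serializedEngine->data()), serializedEngine->size());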

Refitting a Weightless Engine#

On the client side, when the network is launched for the first time, the user can refit all the weights back into the engine. Because all the weights were removed from the engine, each weight needs to be updated one by one. After all weights are updated, save the full TensorRT-RTX engine file so the application can use it for future inference.

C++

IRefitter* refitter = createInferRefitter(*engine, gLogger);
// The first call with a null name array returns the number of weights to refit.
int32_t const n = refitter->getAllWeights(0, nullptr);
std::vector<char const*> weightsNames(n);
refitter->getAllWeights(n, weightsNames.data());
for (int32_t i = 0; i < n; ++i)
{
    refitter->setNamedWeights(weightsNames[i], Weights{...});
}
refitter->refitCudaEngine();
auto serializationConfig = SampleUniquePtr<nvinfer1::ISerializationConfig>(engine->createSerializationConfig());
auto serializationFlags = serializationConfig->getFlags();
// Clear kEXCLUDE_WEIGHTS so the weights are serialized together with the engine.
serializationFlags &= ~(1 << static_cast<uint32_t>(nvinfer1::SerializationFlag::kEXCLUDE_WEIGHTS));
serializationConfig->setFlags(serializationFlags);
// hostMemory will contain the full engine.
auto hostMemory = SampleUniquePtr<nvinfer1::IHostMemory>(engine->serializeWithConfig(*serializationConfig));
Python

refitter = trt.Refitter(engine, TRT_LOGGER)
# get_all_weights() returns the names of all refittable weights in the engine.
all_weights = refitter.get_all_weights()
# weights is a user-provided mapping from weight name to new weight values.
for name in all_weights:
    refitter.set_named_weights(name, weights[name])
refitter.refit_cuda_engine()
serialization_config = engine.create_serialization_config()
# Clear EXCLUDE_WEIGHTS so the weights are serialized together with the engine.
serialization_config.flags &= ~(1 << int(trt.SerializationFlag.EXCLUDE_WEIGHTS))
binary = engine.serialize_with_config(serialization_config)
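On a later run, the saved engine can be deserialized with the standard runtime APIs. A sketch, assuming engineData and engineSize hold the bytes of the saved engine file:

// Deserialize the fully weighted engine for inference.
auto runtime = std::unique_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(gLogger));
auto fullEngine = std::unique_ptr<nvinfer1::ICudaEngine>(
    runtime->deserializeCudaEngine(engineData, engineSize));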

Refitting a Weightless Engine Directly with ONNX Models#

When working with weight-stripped engines created from ONNX models, the refit process can be done automatically with the IParserRefitter class from the ONNX parser library. The following code shows how to create the class and run the refit process.

C++

IRefitter* refitter = createInferRefitter(*engine, gLogger);
IParserRefitter* parserRefitter = createParserRefitter(*refitter, gLogger);
bool result = parserRefitter->refitFromFile("path_to_onnx_model");
bool refitSuccess = refitter->refitCudaEngine();
Python

refitter = trt.Refitter(engine, TRT_LOGGER)
parser_refitter = trt.OnnxParserRefitter(refitter, TRT_LOGGER)
result = parser_refitter.refit_from_file("path_to_onnx_model")
refit_success = refitter.refit_cuda_engine()
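Once refitFromFile() and refitCudaEngine() both report success, the fully weighted engine can be serialized and saved exactly as shown in the previous section.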