Migrating from TensorRT 10.x to 11.x

TensorRT 11.x introduces several breaking changes that affect how you build engines, run inference, and use plugins. The sections below summarize the key changes. For language-specific and tool-specific migration details, select the page that matches your integration:

Python Integration

Migrate Python inference and engine-build code, and review Python APIs removed in 11.x with their replacements.

C++ Integration

Migrate C++ inference and engine-build code, and review C++ APIs and plugins removed in 11.x with their replacements.

trtexec Command-Line Usage

Migrate trtexec command lines, and review removed or deprecated flag replacements.

Safety Runtime

Migrate safety runtime code paths, including build-time safety scope validation and kernel checker changes.

NVIDIA DriveOS

Platform-specific migration guidance for DRIVE Thor and DRIVE Orin, including version compatibility and DLA considerations.

Jetson / JetPack

Platform-specific migration guidance for Jetson Orin and Jetson Thor, including JetPack dependencies and cross-compilation.

IEngineInspector JSON Output

Migrate code that parses IEngineInspector JSON output, including the replacement of Bindings with I/O Tensors and the split of the Format/DataType field.

Strongly Typed Networks Are Now Required

TensorRT 11.x removes all weak-typing APIs. In TensorRT 10.x, you could enable reduced-precision types at the model level by setting builder flags such as BuilderFlag::kFP16 or BuilderFlag::kINT8, and TensorRT would automatically consider lower-precision kernels. This made it simple to get higher throughput, but it could compromise accuracy and gave model authors little control over which layers ran at reduced precision.

Similarly, implicit quantization is not supported in TensorRT 11.x. To quantize a model, you must add quantize and dequantize layers to the network.

The most straightforward way to get results similar to the trtexec --fp16 flag or BuilderFlag::kFP16 is to run ModelOpt’s AutoCast feature on your model. Refer to the ModelOpt AutoCast documentation for details. For a basic conversion of FP32 layers to FP16, run:

python -m modelopt.onnx.autocast --onnx_path model.onnx

For quantizing a network, run:

python -m modelopt.onnx.quantization --onnx_path model.onnx --calibration_data data.npz

For more control or other advanced quantization topics (such as quantization-aware training), refer to the ModelOpt Quantization documentation.

You can also manually control a layer’s I/O types by using INetworkDefinition::addCast or by adding Cast nodes to the ONNX graph.
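Placing a cast is a precision decision as well as a performance one. The following sketch uses only Python's standard library to show what rounding a value to FP16 does. It is not TensorRT code: the helper name `to_fp16` is hypothetical, and mapping out-of-range values to infinity is an assumption of this sketch about how an overflowing cast behaves.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        # "<e" is the struct format code for a 16-bit half-precision float.
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct rejects values outside the FP16 range (largest finite
        # value is 65504). Assumption: model the overflow as infinity.
        return float("inf") if x > 0 else float("-inf")


for value in (1.0, 0.1, 1000.3, 70000.0):
    print(f"{value!r:>8} -> {to_fp16(value)!r}")
```

Values that are exactly representable, such as 1.0, pass through unchanged. Others lose low-order bits, and large values fall outside the FP16 range entirely, which is why layers with wide dynamic range are often kept in FP32.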

Similarly, you can apply quantization to a layer in your network using the INetworkDefinition::addQuantize and INetworkDefinition::addDequantize APIs, or by adding QuantizeLinear and DequantizeLinear nodes to the ONNX graph manually. This approach works, but determining which layers must run at high precision and calibrating the quantization scales for optimal accuracy can be challenging. For most models, use ModelOpt to apply quantization effectively.
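The arithmetic that a quantize/dequantize pair performs can be sketched in plain Python. This is a minimal illustration rather than TensorRT code: it assumes symmetric INT8 quantization (zero point 0, range -128 to 127) following the ONNX QuantizeLinear and DequantizeLinear definitions. The helper names and the `scale` value are hypothetical and not the result of any calibration.

```python
def quantize(x: float, scale: float, qmin: int = -128, qmax: int = 127) -> int:
    """Scale, round to the nearest integer, then saturate to the INT8 range."""
    q = round(x / scale)  # Python's round() rounds halves to even
    return max(qmin, min(qmax, q))


def dequantize(q: int, scale: float) -> float:
    """Map the integer back to the floating-point domain."""
    return q * scale


scale = 0.25  # illustrative value only; real scales come from calibration

for x in (0.3, -1.0, 40.0):
    q = quantize(x, scale)
    print(f"x={x:>5}  q={q:>4}  dequantized={dequantize(q, scale)}")
```

With this scale, 0.3 comes back as 0.25 (rounding error), -1.0 round-trips exactly, and 40.0 saturates at 127 and comes back as 31.75. A scale that is too small clips large activations and one that is too large wastes resolution, which is the calibration problem that ModelOpt addresses.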

V2 API Methods Replaced with Updated Versions

Several API methods have been replaced with updated versions that use int64_t instead of size_t, accept additional parameters, or support asynchronous operation:

| Category | Removed | Replacement |
|---|---|---|
| Device memory size | getDeviceMemorySize() | getDeviceMemorySizeV2() (returns int64_t) |
| Device memory (profile) | getDeviceMemorySizeForProfile() | getDeviceMemorySizeForProfileV2() (returns int64_t) |
| Weight streaming budget | setWeightStreamingBudget() / getWeightStreamingBudget() | setWeightStreamingBudgetV2() / getWeightStreamingBudgetV2() |
| Profile tensor values | getProfileTensorValues() | getProfileTensorValuesV2() (returns int64_t values) |
| Deserialization | deserializeCudaEngine(IStreamReader&) | deserializeCudaEngine(IStreamReaderV2&) |
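To locate affected call sites in a C++ codebase, a simple text scan for the removed names can be a useful first pass. The sketch below is a hypothetical helper, not an NVIDIA tool: `REPLACEMENTS`, `find_removed_calls`, and `SAMPLE` are illustrative names, and the scan matches plain text, so it will also report hits inside comments and string literals.

```python
import re

# Removed C++ method names from the table above, mapped to their replacements.
REPLACEMENTS = {
    "getDeviceMemorySize": "getDeviceMemorySizeV2",
    "getDeviceMemorySizeForProfile": "getDeviceMemorySizeForProfileV2",
    "setWeightStreamingBudget": "setWeightStreamingBudgetV2",
    "getWeightStreamingBudget": "getWeightStreamingBudgetV2",
    "getProfileTensorValues": "getProfileTensorValuesV2",
}

# Match a removed name only as a whole identifier followed by "(",
# so a call such as getDeviceMemorySizeV2() is not reported.
CALL_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in REPLACEMENTS) + r")\s*\("
)
# The removed deserializeCudaEngine overload is recognized by its parameter type.
READER_PATTERN = re.compile(r"\bIStreamReader\b")


def find_removed_calls(source: str):
    """Return (line number, removed name, replacement) for each hit."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in CALL_PATTERN.finditer(line):
            name = match.group(1)
            findings.append((lineno, name, REPLACEMENTS[name]))
        if READER_PATTERN.search(line):
            findings.append((lineno, "IStreamReader", "IStreamReaderV2"))
    return findings


SAMPLE = """\
size_t bytes = engine->getDeviceMemorySize();
int64_t bytesV2 = engine->getDeviceMemorySizeV2();
void load(nvinfer1::IStreamReader& reader);
engine->setWeightStreamingBudget(budget);
"""

for lineno, old, new in find_removed_calls(SAMPLE):
    print(f"line {lineno}: {old} -> {new}")
```

Renaming the call is only part of the fix. Because the size-returning replacements return int64_t, variables declared as size_t (as on the first sample line) should be reviewed as well.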

Other Removals

  • Tactic sources: The kCUBLAS, kCUBLAS_LT, and kCUDNN values of the TacticSource enum have been removed. TensorRT no longer uses these external libraries for core operations.

  • Implicit batch: allInputShapesSpecified() has been removed.

  • Built-in plugins removed: 16 plugins deprecated before TensorRT 10 have been removed. These include the NMS family (BatchedNMS_TRT, BatchedNMSDynamic_TRT, EfficientNMS_ONNX_TRT, NMS_TRT, NMSDynamic_TRT), activation plugins (Clip_TRT, CustomGeluPluginDynamic, LReLU_TRT), as well as BatchTilePlugin_TRT, CoordConvAC, Normalize_TRT, Proposal, SingleStepLSTMPlugin, SpecialSlice_TRT, and Split. Refer to Removed C++ Plugins and Replacements for replacements.

  • BERT plugin family deprecated: The OSS BERT plugin classes (bertQKVToContextPlugin / CustomQKVToContextPluginDynamic versions 1 through 4, and CustomEmbLayerNormPluginDynamic) are deprecated in 11.0.0 and scheduled for removal in a future release. Use the native attention path (addAttentionV2) for QKV-to-context, and standard IGatherLayer + addNormalizationV2() for embedding + layer normalization. Refer to Deprecated BERT Plugins for per-plugin replacements.

  • ONNX parser: supportsModel() and parseWithWeightDescriptors() have been removed.

  • Layer overloads: Older overloads of addTopK, addNonZero, and addNMS that did not accept an indicesType parameter have been removed. Use the versions that accept a DataType parameter for the indices output type.

  • Multi-Device preview flag: The PreviewFeature::kMULTIDEVICE_RUNTIME_10_16 value has been removed. setPreviewFeature() is no longer required to access Multi-Device Inference features.

Note

If you are migrating from TensorRT 8.x, refer to the appendix first, then return here for the 10.x-to-11.x changes.