IBuilderConfig

tensorrt.QuantizationFlag

List of valid flags for quantizing the network to int8.

Members:

CALIBRATE_BEFORE_FUSION : Run the int8 calibration pass before layer fusion. Only valid for IInt8LegacyCalibrator and IInt8EntropyCalibrator. The int8 calibration pass is always run before layer fusion for IInt8MinMaxCalibrator and IInt8EntropyCalibrator2. Disabled by default.

tensorrt.DeviceType

Device types that TensorRT can execute on

Members:

GPU : GPU device

DLA : DLA core

tensorrt.ProfilingVerbosity

Profiling verbosity in NVTX annotations and the engine inspector

Members:

LAYER_NAMES_ONLY : Print only the layer names. This is the default setting.

DETAILED : Print detailed layer information including layer names and layer parameters.

NONE : Do not print any layer information.

tensorrt.TacticSource

Tactic sources that can provide tactics for TensorRT.

Members:

CUBLAS :

Enables cuBLAS tactics. Disabled by default. [DEPRECATED] Deprecated in TensorRT 10.0. NOTE: Disabling the CUBLAS tactic source will cause the cuBLAS handle passed to plugins in attachToContext to be null.

CUBLAS_LT :

Enables cuBLAS LT tactics. Disabled by default. [DEPRECATED] Deprecated in TensorRT 9.0.

CUDNN :

Enables cuDNN tactics. Disabled by default. [DEPRECATED] Deprecated in TensorRT 10.0. NOTE: Disabling the CUDNN tactic source will cause the cuDNN handle passed to plugins in attachToContext to be null.

EDGE_MASK_CONVOLUTIONS :

Enables convolution tactics implemented with edge mask tables. These tactics trade memory for performance, consuming additional memory space proportional to the input size. Enabled by default.

JIT_CONVOLUTIONS :

Enables convolution tactics implemented with source-code JIT fusion. The engine building time may increase when this is enabled. Enabled by default.

tensorrt.EngineCapability

List of supported engine capability flows.

The EngineCapability determines the restrictions of a network during build time and what runtime it targets. When BuilderFlag.SAFETY_SCOPE is not set (the default), EngineCapability.STANDARD does not provide any restrictions on functionality and the resulting serialized engine can be executed with TensorRT's standard runtime APIs in the nvinfer1 namespace. EngineCapability.SAFETY provides a restricted subset of network operations that are safety certified, and the resulting serialized engine can be executed with TensorRT's safe runtime APIs in the tensorrt.tensorrt.safe namespace. EngineCapability.DLA_STANDALONE provides a restricted subset of network operations that are DLA compatible, and the resulting serialized engine can be executed using standalone DLA runtime APIs. See sampleCudla for an example of integrating cuDLA APIs with TensorRT APIs.

Members:

STANDARD : Standard: TensorRT flow without targeting the safety runtime. This flow supports both DeviceType.GPU and DeviceType.DLA.

SAFETY : Safety: TensorRT flow with restrictions targeting the safety runtime. See safety documentation for list of supported layers and formats. This flow supports only DeviceType.GPU.

DLA_STANDALONE : DLA Standalone: TensorRT flow with restrictions targeting DLA runtimes external to TensorRT. See DLA documentation for list of supported layers and formats. This flow supports only DeviceType.DLA.

tensorrt.BuilderFlag

Valid modes that the builder can enable when creating an engine from a network definition.

Members:

FP16 : Enable FP16 layer selection

BF16 : Enable BF16 layer selection

INT8 : Enable Int8 layer selection

DEBUG : Enable debugging of layers via synchronizing after every layer

GPU_FALLBACK : Enable layers marked to execute on GPU if layer cannot execute on DLA

REFIT : Enable building a refittable engine

DISABLE_TIMING_CACHE : Disable reuse of timing information across identical layers.

TF32 : Allow (but not require) computations on tensors of type DataType.FLOAT to use TF32. TF32 computes inner products by rounding the inputs to 10-bit mantissas before multiplying, but accumulates the sum using 23-bit mantissas. Enabled by default.

SPARSE_WEIGHTS : Allow the builder to examine weights and use optimized functions when weights have suitable sparsity.

SAFETY_SCOPE : Change the allowed parameters in the EngineCapability.STANDARD flow to match the restrictions that EngineCapability.SAFETY checks against for DeviceType.GPU, and that EngineCapability.DLA_STANDALONE checks against for DeviceType.DLA. This flag is forced to true if EngineCapability.SAFETY is set at build time.

OBEY_PRECISION_CONSTRAINTS : Require that layers execute in specified precisions. Build fails otherwise.

PREFER_PRECISION_CONSTRAINTS : Prefer that layers execute in specified precisions. Fall back (with warning) to another precision if build would otherwise fail.

DIRECT_IO : Require that no reformats be inserted between a layer and a network I/O tensor for which ITensor.allowed_formats was set. Build fails if a reformat is required for functional correctness.

REJECT_EMPTY_ALGORITHMS : Fail if IAlgorithmSelector.select_algorithms returns an empty set of algorithms.

VERSION_COMPATIBLE : Restrict to lean runtime operators to provide version forward compatibility for the plan files.

EXCLUDE_LEAN_RUNTIME : Exclude lean runtime from the plan.

FP8 : Enable FP8 layer selection

ERROR_ON_TIMING_CACHE_MISS : Emit error when a tactic being timed is not present in the timing cache.

DISABLE_COMPILATION_CACHE : Disable caching JIT compilation results during engine build.

WEIGHTLESS : Strip the perf-irrelevant weights from the plan file; update them later using refitting for better file size.

[DEPRECATED] Deprecated in TensorRT 10.0.

STRIP_PLAN : Strip the refittable weights from the engine plan file.

REFIT_IDENTICAL : Create a refittable engine using identical weights. Different weights during refits yield unpredictable behavior.

WEIGHT_STREAMING : Enable building with the ability to stream varying amounts of weights during runtime. This decreases TensorRT's GPU memory usage at the expense of performance.
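As an illustration, here is a minimal sketch of setting, querying, and clearing build flags on a builder config (the flag choice is arbitrary):

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    # Enable FP16 layer selection and strip refittable weights from the plan.
    config.set_flag(trt.BuilderFlag.FP16)
    config.set_flag(trt.BuilderFlag.STRIP_PLAN)

    # Query and clear flags.
    assert config.get_flag(trt.BuilderFlag.FP16)
    config.clear_flag(trt.BuilderFlag.STRIP_PLAN)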

tensorrt.PreviewFeature

List of Preview Features that can be enabled. Preview Features have been fully tested but are not yet as stable as other features in TensorRT.

They are provided as opt-in features for at least one release. For example, to enable faster dynamic shapes, call set_preview_feature() with PreviewFeature.PROFILE_SHARING_0806.

Members:

PROFILE_SHARING_0806 :

[DEPRECATED] Allows optimization profiles to be shared across execution contexts. The default value for this flag is on in TensorRT 10.0. Turning it off is deprecated.

tensorrt.MemoryPoolType

The type for memory pools used by TensorRT.

Members:

WORKSPACE :

WORKSPACE is used by TensorRT to store intermediate buffers within an operation. This defaults to max device memory. Set to a smaller value to restrict tactics that use over the threshold en masse. For more targeted removal of tactics use the IAlgorithmSelector interface.

DLA_MANAGED_SRAM :

DLA_MANAGED_SRAM is a fast software managed RAM used by DLA to communicate within a layer. The size of this pool must be at least 4 KiB and must be a power of 2. This defaults to 1 MiB. Orin has a capacity of 1 MiB per core.

DLA_LOCAL_DRAM :

DLA_LOCAL_DRAM is host RAM used by DLA to share intermediate tensor data across operations. The size of this pool must be at least 4 KiB and must be a power of 2. This defaults to 1 GiB.

DLA_GLOBAL_DRAM :

DLA_GLOBAL_DRAM is host RAM used by DLA to store weights and metadata for execution. The size of this pool must be at least 4 KiB and must be a power of 2. This defaults to 512 MiB.

TACTIC_DRAM :

TACTIC_DRAM is the host DRAM used by the optimizer to run tactics. On embedded devices, where host and device memory are unified, this includes all device memory required by TensorRT to build the network up to the point of each memory allocation. This defaults to 75% of totalGlobalMem as reported by cudaGetDeviceProperties when cudaGetDeviceProperties.embedded is true, and 100% otherwise.

TACTIC_SHARED_MEMORY :

TACTIC_SHARED_MEMORY defines the maximum shared memory size utilized for executing the backend CUDA kernel implementation. Adjust this value to restrict tactics that exceed the specified threshold en masse. The default value is the device's maximum capability. This value must be less than 1 GiB.

Updating this flag will override the shared memory limit set by HardwareCompatibilityLevel, which defaults to 48 KiB.

tensorrt.HardwareCompatibilityLevel

Describes requirements of compatibility with GPU architectures other than that of the GPU on which the engine was built. Levels except NONE are only supported for engines built on NVIDIA Ampere and later GPUs. Note that compatibility with future hardware depends on CUDA forward compatibility support.

Members:

NONE :

Do not require hardware compatibility with GPU architectures other than that of the GPU on which the engine was built.

AMPERE_PLUS :

Require that the engine is compatible with Ampere and newer GPUs. This limits the maximum shared memory usage to 48 KiB, may reduce the number of available tactics for each layer, and may prevent some fusions from occurring. This can decrease performance, especially for TF32 models. This option disables cuDNN, cuBLAS, and cuBLAS LT as tactic sources.
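A short sketch of opting into hardware compatibility on an existing IBuilderConfig (created as in the sketch above; the level choice is illustrative):

    import tensorrt as trt

    # Allow the engine to run on Ampere and newer GPUs, accepting the
    # 48 KiB shared-memory cap and reduced tactic choices described above.
    config.hardware_compatibility_level = trt.HardwareCompatibilityLevel.AMPERE_PLUS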

class tensorrt.IBuilderConfig

Variables
  • avg_timing_iterations – int The number of averaging iterations used when timing layers. When timing layers, the builder minimizes over a set of average times for layer execution. This parameter controls the number of iterations used in averaging. By default the number of averaging iterations is 1.

  • int8_calibrator – IInt8Calibrator Int8 Calibration interface. The calibrator is used to minimize the information loss during the INT8 quantization process.

  • flags – int The build mode flags to turn on builder options for this network. The flags are listed in the BuilderFlag enum. The flags set configuration options to build the network. This should be an integer consisting of one or more BuilderFlag s, combined via bitwise OR. For example, 1 << int(BuilderFlag.FP16) | 1 << int(BuilderFlag.DEBUG).

  • profile_stream – int The handle for the CUDA stream that is used to profile this network.

  • num_optimization_profiles – int The number of optimization profiles.

  • default_device_type – tensorrt.DeviceType The default DeviceType to be used by the Builder.

  • DLA_core – int The DLA core that the engine executes on. Must be between 0 and N-1 where N is the number of available DLA cores.

  • profiling_verbosity – Profiling verbosity in NVTX annotations.

  • engine_capability – The desired engine capability. See EngineCapability for details.

  • algorithm_selector – The IAlgorithmSelector to use.

  • builder_optimization_level – The builder optimization level at which TensorRT should build the engine. Setting a higher optimization level allows TensorRT to spend longer engine building time searching for more optimization options. The resulting engine may have better performance compared to an engine built with a lower optimization level. The default optimization level is 3. Valid values include integers from 0 to the maximum optimization level, which is currently 5. Setting it to be greater than the maximum level results in identical behavior to the maximum level.

  • hardware_compatibility_level – Hardware compatibility allows an engine compatible with GPU architectures other than that of the GPU on which the engine was built.

  • plugins_to_serialize – The plugin libraries to be serialized with forward-compatible engines.

  • max_aux_streams – The maximum number of auxiliary streams that TRT is allowed to use. If the network contains operators that can run in parallel, TRT can execute them using auxiliary streams in addition to the one provided to the IExecutionContext::enqueueV3() call. The default maximum number of auxiliary streams is determined by the heuristics in TensorRT on whether enabling multi-stream would improve the performance. This behavior can be overridden by calling this API to set the maximum number of auxiliary streams explicitly. Set this to 0 to enforce single-stream inference. The resulting engine may use fewer auxiliary streams than the maximum if the network does not contain enough parallelism or if TensorRT determines that using more auxiliary streams does not help improve the performance. Allowing more auxiliary streams does not always give better performance since there will be synchronization overhead between streams. Using CUDA graphs at runtime can help reduce the overhead caused by cross-stream synchronizations. Using more auxiliary streams leads to more memory usage at runtime since some activation memory blocks will not be able to be reused.

  • progress_monitor – The IProgressMonitor to use.

Below are descriptions of each builder optimization level (a short end-to-end usage sketch follows the list):

  • Level 0: This enables the fastest compilation by disabling dynamic kernel generation and selecting the first tactic that succeeds in execution. This will also not respect a timing cache.

  • Level 1: Available tactics are sorted by heuristics, but only the top tactics are tested to select the best. If a dynamic kernel is generated, its compile optimization is low.

  • Level 2: Available tactics are sorted by heuristics, but only the fastest tactics are tested to select the best.

  • Level 3: Apply heuristics to see if a static precompiled kernel is applicable or if a new one has to be compiled dynamically.

  • Level 4: Always compiles a dynamic kernel.

  • Level 5: Always compiles a dynamic kernel and compares it to static kernels.
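For orientation, here is a hedged end-to-end sketch that exercises the config during a build (the ONNX file name network.onnx is an assumption for illustration; later sketches in this section reuse these builder, network, and config objects):

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(0)

    # Populate the network, e.g. from an ONNX model (the file name is hypothetical).
    parser = trt.OnnxParser(network, logger)
    with open("network.onnx", "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("failed to parse model")

    config = builder.create_builder_config()
    config.builder_optimization_level = 3  # the default; raise it for a longer, deeper search
    config.set_flag(trt.BuilderFlag.FP16)

    serialized_engine = builder.build_serialized_network(network, config)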

__del__(self: tensorrt.tensorrt.IBuilderConfig) → None
__exit__(exc_type, exc_value, traceback)

Context managers are deprecated and have no effect. Objects are automatically freed when the reference count reaches 0.

__init__(*args, **kwargs)
add_optimization_profile(self: tensorrt.tensorrt.IBuilderConfig, profile: tensorrt.tensorrt.IOptimizationProfile) → int

Add an optimization profile.

This function must be called at least once if the network has dynamic or shape input tensors.

Parameters

profile – The new optimization profile, which must satisfy bool(profile) == True

Returns

The index of the optimization profile (starting from 0) if the input is valid, or -1 if the input is not valid.
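A minimal sketch for a network with one dynamic input, continuing from the build sketch above (the tensor name "input" and the shapes are assumptions):

    profile = builder.create_optimization_profile()
    profile.set_shape("input",
                      (1, 3, 224, 224),   # min
                      (8, 3, 224, 224),   # opt
                      (32, 3, 224, 224))  # max

    index = config.add_optimization_profile(profile)
    assert index == 0  # -1 would signal an invalid profile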

can_run_on_DLA(self: tensorrt.tensorrt.IBuilderConfig, layer: tensorrt.tensorrt.ILayer) → bool

Check if the layer can run on DLA.

Parameters

layer – The layer to check

Returns

A bool indicating whether the layer can run on DLA

clear_flag(self: tensorrt.tensorrt.IBuilderConfig, flag: tensorrt.tensorrt.BuilderFlag) → None

Clears the builder mode flag from the enabled flags.

Parameters

flag – The flag to clear.

clear_quantization_flag(self: tensorrt.tensorrt.IBuilderConfig, flag: tensorrt.tensorrt.QuantizationFlag) → None

Clears the quantization flag from the enabled quantization flags.

Parameters

flag – The flag to clear.

create_timing_cache(self: tensorrt.tensorrt.IBuilderConfig, serialized_timing_cache: buffer) → tensorrt.tensorrt.ITimingCache

Create timing cache.

Create an ITimingCache instance from serialized raw data. The created timing cache does not belong to a specific builder config; it can be shared by multiple builder instances.

Parameters

serialized_timing_cache – The serialized timing cache. If an empty cache is provided (i.e. b""), a new cache will be created.

Returns

The created ITimingCache object.
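A sketch that deserializes a cache from disk when present and otherwise starts fresh (the file name timing.cache is an assumption; config as in the earlier build sketch):

    import os

    cache_data = b""
    if os.path.exists("timing.cache"):
        with open("timing.cache", "rb") as f:
            cache_data = f.read()

    # An empty buffer creates a fresh cache; otherwise the data is deserialized.
    timing_cache = config.create_timing_cache(cache_data)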

get_calibration_profile(self: tensorrt.tensorrt.IBuilderConfig) → tensorrt.tensorrt.IOptimizationProfile

Get the current calibration profile.

Returns

The current calibration profile or None if the calibration profile is unset.

get_device_type(self: tensorrt.tensorrt.IBuilderConfig, layer: tensorrt.tensorrt.ILayer) → tensorrt.tensorrt.DeviceType

Get the device that the layer executes on.

Parameters

layer – The layer to get the DeviceType for

Returns

The DeviceType of the layer

get_flag(self: tensorrt.tensorrt.IBuilderConfig, flag: tensorrt.tensorrt.BuilderFlag) → bool

Check if a build mode flag is set.

Parameters

flag – The flag to check.

Returns

A bool indicating whether the flag is set.

get_memory_pool_limit(self: tensorrt.tensorrt.IBuilderConfig, pool: tensorrt.tensorrt.MemoryPoolType) → int

Retrieve the memory size limit of the corresponding pool in bytes. If set_memory_pool_limit() for the pool has not been called, this returns the default value used by TensorRT. This default value is not necessarily the maximum possible value for that configuration.

Parameters

pool – The memory pool to get the limit for.

Returns

The size of the memory limit, in bytes, for the corresponding pool.

get_preview_feature(self: tensorrt.tensorrt.IBuilderConfig, feature: tensorrt.tensorrt.PreviewFeature) → bool

Check if a preview feature is enabled.

Parameters

feature – the feature to query

Returns

true if the feature is enabled, false otherwise

get_quantization_flag(self: tensorrt.tensorrt.IBuilderConfig, flag: tensorrt.tensorrt.QuantizationFlag) → bool

Check if a quantization flag is set.

Parameters

flag – The flag to check.

Returns

A bool indicating whether the flag is set.

get_tactic_sources(self: tensorrt.tensorrt.IBuilderConfig) → int

Get the tactic sources currently set in the engine build configuration.

get_timing_cache(self: tensorrt.tensorrt.IBuilderConfig) → tensorrt.tensorrt.ITimingCache

Get the timing cache from the current IBuilderConfig.

Returns

The timing cache used in current IBuilderConfig, or None if no timing cache is set.

is_device_type_set(self: tensorrt.tensorrt.IBuilderConfig, layer: tensorrt.tensorrt.ILayer) → bool

Check if the DeviceType for a layer is explicitly set.

Parameters

layer – The layer to check for DeviceType

Returns

True if DeviceType is not default, False otherwise

reset(self: tensorrt.tensorrt.IBuilderConfig) → None

Resets the builder configuration to defaults. Call this function to restore a builder config object to its initial state.

reset_device_type(self: tensorrt.tensorrt.IBuilderConfig, layer: tensorrt.tensorrt.ILayer) → None

Reset the DeviceType for the given layer.

Parameters

layer – The layer to reset the DeviceType for

set_calibration_profile(self: tensorrt.tensorrt.IBuilderConfig, profile: tensorrt.tensorrt.IOptimizationProfile) → bool

Set a calibration profile.

A calibration optimization profile must be set if int8 calibration is used to set scales for a network with runtime dimensions.

Parameters

profile – The new calibration profile, which must satisfy bool(profile) == True or be None. MIN and MAX values will be overwritten by OPT.

Returns

True if the calibration profile was set correctly.

set_device_type(self: tensorrt.tensorrt.IBuilderConfig, layer: tensorrt.tensorrt.ILayer, device_type: tensorrt.tensorrt.DeviceType) → None

Set the device that this layer must execute on. If DeviceType is not set or is reset, TensorRT will use the default DeviceType set in the builder.

The DeviceType for a layer must be compatible with the safety flow (if specified). For example, a layer cannot be marked for DLA execution while the builder is configured for EngineCapability.SAFETY.

Parameters
  • layer – The layer to set the DeviceType of

  • device_type – The DeviceType the layer must execute on
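A sketch of steering every DLA-capable layer to DLA with GPU fallback enabled (network and config as in the earlier build sketch):

    config.default_device_type = trt.DeviceType.DLA
    config.DLA_core = 0
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)  # let unsupported layers fall back to GPU

    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if config.can_run_on_DLA(layer):
            config.set_device_type(layer, trt.DeviceType.DLA)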

set_flag(self: tensorrt.tensorrt.IBuilderConfig, flag: tensorrt.tensorrt.BuilderFlag) → None

Add the input builder mode flag to the already enabled flags.

Parameters

flag – The flag to set.

set_memory_pool_limit(self: tensorrt.tensorrt.IBuilderConfig, pool: tensorrt.tensorrt.MemoryPoolType, pool_size: int) → None

Set the memory size for the memory pool.

TensorRT layers access different memory pools depending on the operation. This function sets in the IBuilderConfig the size limit, specified by pool_size, for the corresponding memory pool, specified by pool. TensorRT will build a plan file that is constrained by these limits or report which constraint caused the failure.

If the size of the pool, specified by pool_size, fails to meet the size requirements for the pool, this function does nothing and emits the recoverable error, ErrorCode.INVALID_ARGUMENT, to the registered IErrorRecorder .

If the size of the pool is larger than the maximum possible value for the configuration, this function does nothing and emits ErrorCode.UNSUPPORTED_STATE.

If the pool does not exist on the requested device type when building the network, a warning is emitted to the logger, and the memory pool value is ignored.

Refer to MemoryPoolType to see the size requirements for each pool.

Parameters
  • pool – The memory pool to limit the available memory for.

  • pool_size – The size of the pool in bytes.
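For example, a sketch capping two pools on an existing config (the sizes are illustrative; DLA_MANAGED_SRAM must be a power of 2):

    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)          # 1 GiB
    config.set_memory_pool_limit(trt.MemoryPoolType.DLA_MANAGED_SRAM, 1 << 20)   # 1 MiB

    # Read back the effective limit in bytes.
    print(config.get_memory_pool_limit(trt.MemoryPoolType.WORKSPACE))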

set_preview_feature(self: tensorrt.tensorrt.IBuilderConfig, feature: tensorrt.tensorrt.PreviewFeature, enable: bool) → None

Enable or disable a specific preview feature.

Allows enabling or disabling experimental features, which are not enabled by default in the current release. Preview Features have been fully tested but are not yet as stable as other features in TensorRT. They are provided as opt-in features for at least one release.

Refer to PreviewFeature for additional information, and a list of the available features.

Parameters
  • feature – the feature to enable

  • enable – whether to enable or disable
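A minimal sketch toggling and then querying a preview feature on an existing config:

    feature = trt.PreviewFeature.PROFILE_SHARING_0806
    config.set_preview_feature(feature, True)
    assert config.get_preview_feature(feature)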

set_quantization_flag(self: tensorrt.tensorrt.IBuilderConfig, flag: tensorrt.tensorrt.QuantizationFlag) → None

Add the input quantization flag to the already enabled quantization flags.

Parameters

flag – The flag to set.

set_tactic_sources(self: tensorrt.tensorrt.IBuilderConfig, tactic_sources: int) → bool

Set tactic sources.

This bitset controls which tactic sources TensorRT is allowed to use for tactic selection.

Multiple tactic sources may be combined with a bitwise OR operation. For example, to enable cuBLAS and cuBLAS LT as tactic sources, use a value of: 1 << int(trt.TacticSource.CUBLAS) | 1 << int(trt.TacticSource.CUBLAS_LT)

Parameters

tactic_sources – The tactic sources to set

Returns

A bool indicating whether the tactic sources in the build configuration were updated. The tactic sources in the build configuration will not be updated if the provided value is invalid.
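A sketch that reads the current tactic sources, clears one source bit, and writes the result back (the choice of JIT_CONVOLUTIONS is illustrative):

    sources = config.get_tactic_sources()
    sources &= ~(1 << int(trt.TacticSource.JIT_CONVOLUTIONS))  # clear one source bit
    ok = config.set_tactic_sources(sources)  # False if the combination is invalid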

set_timing_cache(self: tensorrt.tensorrt.IBuilderConfig, cache: tensorrt.tensorrt.ITimingCache, ignore_mismatch: bool) → bool

Attach a timing cache to IBuilderConfig.

The timing cache has a verification header to make sure the provided cache can be used in the current environment. A failure will be reported if the CUDA device property in the provided cache differs from the current environment. Passing ignore_mismatch=True skips strict verification and allows loading a cache created on a different device. The cache must not be destroyed until after the engine is built.

Parameters
  • cache – The timing cache to be used

  • ignore_mismatch – Whether to allow using a cache that contains a different CUDA device property.

Returns

A bool indicating whether the operation completed successfully.
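Finally, a round-trip sketch that attaches a fresh cache before building and persists the updated cache afterwards (builder, network, and config as in the earlier build sketch; the file name timing.cache is an assumption):

    timing_cache = config.create_timing_cache(b"")
    config.set_timing_cache(timing_cache, ignore_mismatch=False)

    serialized_engine = builder.build_serialized_network(network, config)

    # Persist the (possibly updated) cache for future builds.
    with open("timing.cache", "wb") as f:
        f.write(config.get_timing_cache().serialize())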