Runners

Module: polygraphy.backend.trt

class TrtRunner(engine, name: str | None = None, optimization_profile: int | None = None, allocation_strategy: str | None = None, weight_streaming_budget: int | None = None, weight_streaming_percent: float | None = None)[source]

Bases: BaseRunner

Runs inference using TensorRT.

Note that runners are not designed for production deployment and should generally be used only for prototyping, testing, and debugging.

Parameters:
  • engine (Union[Union[trt.ICudaEngine, trt.IExecutionContext], Callable() -> Union[trt.ICudaEngine, trt.IExecutionContext]]) – A TensorRT engine or execution context, or a callable that returns one. If an engine is provided, the runner will create a context automatically.

  • name (str) – The human-readable name prefix to use for this runner. A runner count and timestamp will be appended to this prefix.

  • optimization_profile (int) – The index of the optimization profile to set each time this runner is activated. When this is not provided, the profile is not set explicitly and will default to the 0th profile. You can also change the profile after the runner is active using the set_profile() method.

  • allocation_strategy (str) –

    The way device memory (internal activation and scratch memory) is allocated for the execution context. The value of this argument can be:
    • ”static”: The default value. The execution context will pre-allocate a block of memory that is sufficient for any possible input size across all profiles.

    • ”profile”: Allocate enough device memory for the current profile, based on the profile's maximum shapes.

    • ”runtime”: Allocate enough device memory for the current input shapes.

  • weight_streaming_budget (int) –

    The amount of GPU memory that TensorRT can use for weights at runtime. It can take on the following values:

    • None or 0: Disables weight streaming at runtime.

    • -1: TensorRT will decide the streaming budget automatically.

    • > 0: The maximum amount of GPU memory TensorRT is allowed to use for weights, in bytes.

  • weight_streaming_percent (float) –

    The percentage of weights that TRT will stream from CPU to GPU. It can take on the following values:

    • None or 0: Disables weight streaming at runtime.

    • [0 to 100]: The percentage of weights TRT will stream. 100 will stream the maximum number of weights.
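
For example, a runner might be constructed from a lazily built engine as follows. This is a minimal sketch: the model path, input name, and input shape are placeholders, and the allocation_strategy value simply demonstrates one of the options described above.

import numpy as np
from polygraphy.backend.trt import EngineFromNetwork, NetworkFromOnnxPath, TrtRunner

# Lazily build an engine from an ONNX model (the path is a placeholder).
build_engine = EngineFromNetwork(NetworkFromOnnxPath("model.onnx"))

# Allocate device memory based on the current input shapes on each inference.
runner = TrtRunner(build_engine, name="trt_demo", allocation_strategy="runtime")

with runner:
    # The input name and shape are assumptions; query runner.get_input_metadata() for the real ones.
    outputs = runner.infer(feed_dict={"x": np.ones((1, 3, 224, 224), dtype=np.float32)})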

set_profile(index: int)[source]

Sets the active optimization profile for this runner. The runner must already be active (see __enter__() or activate()).

This only applies if your engine was built with multiple optimization profiles.

In TensorRT 8.0 and newer, the profile will be set asynchronously using this runner’s CUDA stream (runner.stream).

By default, the runner uses the first profile (profile 0).

Parameters:

index (int) – The index of the optimization profile to use.
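
A minimal sketch of switching profiles on an active runner is shown below. It assumes the engine is built with two optimization profiles for a hypothetical input named "x" with a dynamic batch dimension; all shapes are placeholders.

import numpy as np
from polygraphy.backend.trt import (
    CreateConfig, EngineFromNetwork, NetworkFromOnnxPath, Profile, TrtRunner
)

# Two profiles covering small and large batch sizes (shapes are assumptions).
profiles = [
    Profile().add("x", min=(1, 3, 224, 224), opt=(4, 3, 224, 224), max=(8, 3, 224, 224)),
    Profile().add("x", min=(16, 3, 224, 224), opt=(32, 3, 224, 224), max=(64, 3, 224, 224)),
]
build_engine = EngineFromNetwork(
    NetworkFromOnnxPath("model.onnx"), config=CreateConfig(profiles=profiles)
)

with TrtRunner(build_engine) as runner:
    runner.set_profile(1)  # Switch from the default profile 0 to the large-batch profile.
    outputs = runner.infer(feed_dict={"x": np.ones((32, 3, 224, 224), dtype=np.float32)})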

infer_impl(feed_dict, copy_outputs_to_host=None)[source]

Implementation for running inference with TensorRT. Do not call this method directly - use infer() instead, which will forward unrecognized arguments to this method.

Parameters:
  • feed_dict (OrderedDict[str, Union[numpy.ndarray, DeviceView, torch.Tensor]]) – A mapping of input tensor names to corresponding input NumPy arrays, Polygraphy DeviceViews, or PyTorch tensors. If PyTorch tensors are provided in the feed_dict, the outputs will also be returned as PyTorch tensors. If the provided inputs already reside in GPU memory, no additional copies are made.

  • copy_outputs_to_host (bool) – Whether to copy inference outputs back to host memory. If this is False, PyTorch GPU tensors or Polygraphy DeviceViews are returned instead of PyTorch CPU tensors or NumPy arrays respectively. Defaults to True.

Returns:

A mapping of output tensor names to corresponding output NumPy arrays, Polygraphy DeviceViews, or PyTorch tensors.

Return type:

OrderedDict[str, Union[numpy.ndarray, DeviceView, torch.Tensor]]
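
Since infer() forwards unrecognized arguments to this method, GPU-resident inputs and outputs can be used as in the sketch below. This assumes PyTorch with CUDA support is installed; the input name and shape are placeholders.

import torch
from polygraphy.backend.trt import TrtRunner

with TrtRunner(build_engine) as runner:  # `build_engine` as in the earlier examples.
    # A GPU tensor in the feed_dict avoids an extra host-to-device copy.
    inp = torch.ones((1, 3, 224, 224), dtype=torch.float32, device="cuda")
    # copy_outputs_to_host=False keeps the outputs on the GPU as torch tensors.
    outputs = runner.infer(feed_dict={"x": inp}, copy_outputs_to_host=False)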

__enter__()

Activate the runner for inference. For example, this may involve allocating CPU or GPU memory.

__exit__(exc_type, exc_value, traceback)

Deactivate the runner. For example, this may involve freeing CPU or GPU memory.

activate()

Activate the runner for inference. For example, this may involve allocating CPU or GPU memory.

Generally, you should use a context manager instead of manually activating and deactivating. For example:

with RunnerType(...) as runner:
    runner.infer(...)

deactivate()

Deactivate the runner. For example, this may involve freeing CPU or GPU memory.

Generally, you should use a context manager instead of manually activating and deactivating. For example:

with RunnerType(...) as runner:
    runner.infer(...)

get_input_metadata(use_numpy_dtypes=None)

Returns information about the inputs of the model. Shapes here may include dynamic dimensions, represented by None. Must be called only after activate() and before deactivate().

Parameters:

use_numpy_dtypes (bool) – [DEPRECATED] Whether to return NumPy data types instead of Polygraphy DataTypes. This is provided to retain backwards compatibility. In the future, this parameter will be removed and Polygraphy DataTypes will always be returned. These can be converted to NumPy data types by calling the numpy() method. Defaults to True.

Returns:

Input names, shapes, and data types.

Return type:

TensorMetadata
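
For example, the returned metadata can be used to build a feed_dict with concrete shapes. This sketch passes use_numpy_dtypes=False and converts data types via numpy(), as recommended above; substituting 1 for dynamic dimensions is an arbitrary choice.

import numpy as np
from polygraphy.backend.trt import TrtRunner

with TrtRunner(build_engine) as runner:  # `build_engine` as in the earlier examples.
    feed_dict = {}
    for name, meta in runner.get_input_metadata(use_numpy_dtypes=False).items():
        # Dynamic dimensions are reported as None; substitute a concrete value for them.
        shape = [dim if isinstance(dim, int) and dim > 0 else 1 for dim in meta.shape]
        feed_dict[name] = np.zeros(shape, dtype=meta.dtype.numpy())
    outputs = runner.infer(feed_dict=feed_dict)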

infer(feed_dict, check_inputs=True, *args, **kwargs)

Runs inference using the provided feed_dict.

Must be called only after activate() and before deactivate().

NOTE: Some runners may accept additional parameters in infer(). For details on these, see the documentation for their infer_impl() methods.

Parameters:
  • feed_dict (OrderedDict[str, numpy.ndarray]) – A mapping of input tensor names to corresponding input NumPy arrays.

  • check_inputs (bool) – Whether to check that the provided feed_dict includes the expected inputs with the expected data types and shapes. Disabling this may improve performance. Defaults to True.

inference_time

The time required to run inference in seconds.

Type:

float

Returns:

A mapping of output tensor names to their corresponding NumPy arrays.

IMPORTANT: Runners may reuse these output buffers. Thus, if you need to save outputs from multiple inferences, you should make a copy with copy.deepcopy(outputs).

Return type:

OrderedDict[str, numpy.ndarray]
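
Because output buffers may be reused between calls, results that must survive later inferences should be deep-copied, for example:

import copy

with TrtRunner(build_engine) as runner:  # `build_engine` as in the earlier examples.
    saved = []
    for feed_dict in feed_dicts:  # `feed_dicts` is a placeholder list of inputs.
        outputs = runner.infer(feed_dict=feed_dict)
        # The runner may overwrite these buffers on the next call, so keep a copy.
        saved.append(copy.deepcopy(outputs))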

last_inference_time()

Returns the total inference time in seconds required during the last call to infer().

Must be called only after activate() and before deactivate().

Returns:

The time in seconds, or None if runtime was not measured by the runner.

Return type:

float
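
For instance, the measured time can be read immediately after a call to infer():

with TrtRunner(build_engine) as runner:  # `build_engine` and `feed_dict` as in the earlier examples.
    runner.infer(feed_dict=feed_dict)
    elapsed = runner.last_inference_time()
    if elapsed is not None:  # None if the runner did not measure runtime.
        print(f"Inference took {elapsed * 1000:.2f} ms")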

is_active

Whether this runner has been activated, either via context manager, or by calling activate().

Type:

bool