IExecutionContext
- class tensorrt.IOutputAllocator(self: tensorrt.tensorrt.IOutputAllocator) -> None
Application-implemented class for controlling output tensor allocation.
To implement a custom output allocator, ensure that you explicitly instantiate the base class in __init__():

```python
class MyOutputAllocator(trt.IOutputAllocator):
    def __init__(self):
        trt.IOutputAllocator.__init__(self)

    def reallocate_output(self, tensor_name, memory, size, alignment):
        ...  # Your implementation here

    def notify_shape(self, tensor_name, shape):
        ...  # Your implementation here
```
- __init__(self: tensorrt.tensorrt.IOutputAllocator) -> None
- notify_shape(self: tensorrt.tensorrt.IOutputAllocator, tensor_name: str, shape: tensorrt.tensorrt.Dims) -> None
Called by TensorRT when the shape of the output tensor is known.
- Parameters
tensor_name – The output tensor name.
shape – The output tensor shape.
- reallocate_output(self: tensorrt.tensorrt.IOutputAllocator, tensor_name: str, memory: capsule, size: int, alignment: int) -> capsule
A callback implemented by the application to handle acquisition of output tensor memory.
If an allocation request cannot be satisfied, None should be returned.
- Parameters
tensor_name – The output tensor name.
memory – The output tensor memory address.
size – The number of bytes required.
alignment – The required alignment of memory.
- Returns
The address of the output tensor memory.
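The two callbacks cooperate: reallocate_output supplies memory large enough for the request, and notify_shape later reports the final shape. A runnable sketch of that pattern, where a stub base class and host-side bytearrays stand in for trt.IOutputAllocator and real device memory (so it runs without TensorRT or a GPU; with the real API, reallocate_output must return a device address):

```python
# Sketch of an IOutputAllocator implementation that grows a per-tensor
# buffer on demand and records the shape TensorRT reports.

class StubOutputAllocatorBase:
    """Stand-in for trt.IOutputAllocator."""
    def __init__(self):
        pass

class GrowingOutputAllocator(StubOutputAllocatorBase):
    def __init__(self):
        # With the real base class this line would be
        # trt.IOutputAllocator.__init__(self).
        StubOutputAllocatorBase.__init__(self)
        self.buffers = {}  # tensor_name -> bytearray (stands in for device memory)
        self.shapes = {}   # tensor_name -> final shape reported by TensorRT

    def reallocate_output(self, tensor_name, memory, size, alignment):
        # Grow the buffer only when the current one is too small; reusing
        # the existing allocation is fine if it still fits.
        buf = self.buffers.get(tensor_name)
        if buf is None or len(buf) < size:
            buf = bytearray(size)
            self.buffers[tensor_name] = buf
        return buf  # the real callback returns a device address instead

    def notify_shape(self, tensor_name, shape):
        # Called by TensorRT once the output shape is known.
        self.shapes[tensor_name] = tuple(shape)

allocator = GrowingOutputAllocator()
allocator.reallocate_output("output0", None, 1024, 256)
allocator.notify_shape("output0", (1, 256))
```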
- class tensorrt.IExecutionContext
Context for executing inference using an ICudaEngine. Multiple IExecutionContext instances may exist for one ICudaEngine instance, allowing the same ICudaEngine to be used for the execution of multiple batches simultaneously.
- Variables
debug_sync – bool The debug sync flag. If this flag is set to true, the ICudaEngine will log the successful execution of each kernel during execute_v2(). It has no effect when using execute_async_v2().
profiler – IProfiler The profiler in use by this IExecutionContext.
engine – ICudaEngine The associated ICudaEngine.
name – str The name of the IExecutionContext.
device_memory – capsule The device memory for use by this execution context. The memory must be aligned on a 256-byte boundary, and its size must be at least engine.device_memory_size. If using execute_async_v2() to run the network, the memory is in use from the invocation of execute_async_v2() until network execution is complete. If using execute_v2(), it is in use until execute_v2() returns. Releasing or otherwise using the memory for other purposes during this time results in undefined behavior.
active_optimization_profile – int The active optimization profile for the context. The selected profile will be used in subsequent calls to execute_v2() or execute_async_v2(). Profile 0 is selected by default. Changing this value invalidates all dynamic bindings for the current execution context, so they must be set again using set_binding_shape() before calling either execute_v2() or execute_async_v2().
all_binding_shapes_specified – bool Whether all dynamic dimensions of input tensors have been specified by calling set_binding_shape(). Trivially true if the network has no dynamically shaped input tensors. Does not work with name-based interfaces, e.g. set_input_shape(); use infer_shapes() instead.
all_shape_inputs_specified – bool Whether values for all input shape tensors have been specified by calling set_shape_input(). Trivially true if the network has no input shape bindings. Does not work with name-based interfaces, e.g. set_input_shape(); use infer_shapes() instead.
error_recorder – IErrorRecorder Application-implemented error reporting interface for TensorRT objects.
enqueue_emits_profile – bool Whether enqueue emits layer timing to the profiler. The default value is True. If set to False, enqueue will be asynchronous when a profiler is attached, and an extra method, IExecutionContext.report_to_profiler(), must be called to obtain the profiling data and report it to the attached profiler.
persistent_cache_limit – The maximum size of persistent L2 cache that this execution context may use for activation caching. Activation caching is not supported on all architectures; see "How TensorRT Uses Memory" in the developer guide for details. The default is 0 bytes.
nvtx_verbosity – The NVTX verbosity of the execution context. Building with kDETAILED verbosity will generally increase latency in enqueueV2/V3(). Set this attribute to select the NVTX verbosity of this execution context at runtime. The default is the verbosity with which the engine was built, and the verbosity may not be raised above that level. This attribute does not affect how IEngineInspector interacts with the engine.
temporary_allocator – IGpuAllocator The GPU allocator used for internal temporary storage.
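The device_memory requirements above (256-byte alignment, at least engine.device_memory_size bytes) can be validated before handing a buffer to the context. A minimal sketch, using plain integers as stand-ins for device pointers; check_device_memory is a hypothetical helper, not part of the TensorRT API:

```python
# Validate the documented device_memory contract: the address must be
# 256-byte aligned and the buffer at least engine.device_memory_size bytes.
# Plain ints stand in for real device pointers.

def check_device_memory(address, size, required_size, alignment=256):
    if address % alignment != 0:
        raise ValueError(f"address must be {alignment}-byte aligned")
    if size < required_size:
        raise ValueError("buffer smaller than engine.device_memory_size")
    return True

# A 4096-byte buffer at a 256-byte-aligned address satisfies both checks.
ok = check_device_memory(0x1000, 4096, required_size=4096)
```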
- __del__(self: tensorrt.tensorrt.IExecutionContext) -> None
- __exit__(exc_type, exc_value, traceback)
Context managers are deprecated and have no effect. Objects are automatically freed when the reference count reaches 0.
- __init__(*args, **kwargs)
- execute(self: tensorrt.tensorrt.IExecutionContext, batch_size: int = 1, bindings: List[int]) -> bool
[DEPRECATED] Please use execute_v2() instead if the engine is built from a network with explicit batch dimension mode enabled.
Synchronously execute inference on a batch. This method requires an array of input and output buffers. The mapping from tensor names to indices can be queried using ICudaEngine.get_binding_index().
- Parameters
batch_size – The batch size. This is at most the value supplied when the ICudaEngine was built. This has no effect if the engine is built from a network with explicit batch dimension mode enabled.
bindings – A list of integers representing input and output buffer addresses for the network.
- Returns
True if execution succeeded.
- execute_async(self: tensorrt.tensorrt.IExecutionContext, batch_size: int = 1, bindings: List[int], stream_handle: int, input_consumed: capsule = None) -> bool
[DEPRECATED] Please use execute_async_v2() instead if the engine is built from a network with explicit batch dimension mode enabled.
Asynchronously execute inference on a batch. This method requires an array of input and output buffers. The mapping from tensor names to indices can be queried using ICudaEngine.get_binding_index().
- Parameters
batch_size – The batch size. This is at most the value supplied when the ICudaEngine was built. This has no effect if the engine is built from a network with explicit batch dimension mode enabled.
bindings – A list of integers representing input and output buffer addresses for the network.
stream_handle – A handle for a CUDA stream on which the inference kernels will be executed.
input_consumed – An optional event which will be signaled when the input buffers can be refilled with new data.
- Returns
True if the kernels were executed successfully.
- execute_async_v2(self: tensorrt.tensorrt.IExecutionContext, bindings: List[int], stream_handle: int, input_consumed: capsule = None) -> bool
Asynchronously execute inference on a batch. This method requires an array of input and output buffers. The mapping from tensor names to indices can be queried using ICudaEngine.get_binding_index(). This method only works for execution contexts built from networks with no implicit batch dimension.
- Parameters
bindings – A list of integers representing input and output buffer addresses for the network.
stream_handle – A handle for a CUDA stream on which the inference kernels will be executed.
input_consumed – An optional event which will be signaled when the input buffers can be refilled with new data.
- Returns
True if the kernels were executed successfully.
- execute_async_v3(self: tensorrt.tensorrt.IExecutionContext, stream_handle: int) -> bool
Asynchronously execute inference.
Modifying or releasing memory that has been registered for the tensors before stream synchronization, or before the event passed to set_input_consumed_event() has been triggered, results in undefined behavior.
Input tensors can be released after the event passed to set_input_consumed_event() has been triggered, whereas output tensors require stream synchronization.
- Parameters
stream_handle – The CUDA stream on which the inference kernels will be enqueued.
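Unlike the binding-index APIs, the v3 path is name-based: set input shapes, register an address for every I/O tensor, then enqueue. A runnable sketch of that call order, where a recording stand-in replaces the real IExecutionContext (the method names mirror this reference, but the class itself and the pointer values are illustrative):

```python
# Sketch of the execute_async_v3() workflow: set dynamic input shapes,
# set tensor addresses for every input and output, then enqueue on a
# stream. The stand-in context just records the call sequence.

class RecordingContext:
    def __init__(self):
        self.calls = []

    def set_input_shape(self, name, shape):
        self.calls.append(("set_input_shape", name, shape))
        return True

    def set_tensor_address(self, name, address):
        self.calls.append(("set_tensor_address", name, address))
        return True

    def execute_async_v3(self, stream_handle):
        self.calls.append(("execute_async_v3", stream_handle))
        return True

ctx = RecordingContext()
ctx.set_input_shape("input", (1, 3, 224, 224))     # resolve dynamic dims first
ctx.set_tensor_address("input", 0x7F00_0000_0000)  # device pointers as ints
ctx.set_tensor_address("output", 0x7F00_1000_0000)
ok = ctx.execute_async_v3(stream_handle=0)         # then synchronize the stream
```

With the real API, the stream must be synchronized (or the input-consumed event awaited) before the registered buffers are reused, as described above.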
- execute_v2(self: tensorrt.tensorrt.IExecutionContext, bindings: List[int]) -> bool
Synchronously execute inference on a batch. This method requires an array of input and output buffers. The mapping from tensor names to indices can be queried using ICudaEngine.get_binding_index(). This method only works for execution contexts built from networks with no implicit batch dimension.
- Parameters
bindings – A list of integers representing input and output buffer addresses for the network.
- Returns
True if execution succeeded.
- get_binding_shape(self: tensorrt.tensorrt.IExecutionContext, binding: int) -> tensorrt.tensorrt.Dims
Get the dynamic shape of a binding.
If set_binding_shape() has been called on this binding (or if there are no dynamic dimensions), all dimensions will be positive. Otherwise, it is necessary to call set_binding_shape() before execute_async_v2() or execute_v2() may be called.
If the binding is out of range, an invalid Dims with nbDims == -1 is returned.
If ICudaEngine.binding_is_input(binding) is False, then both all_binding_shapes_specified and all_shape_inputs_specified must be True before calling this method.
- Parameters
binding – The binding index.
- Returns
A Dims object representing the currently selected shape.
- get_input_consumed_event(self: tensorrt.tensorrt.IExecutionContext) -> int
Return the event associated with consuming the input tensors.
- get_max_output_size(self: tensorrt.tensorrt.IExecutionContext, name: str) -> int
Return the upper bound on an output tensor’s size, in bytes, based on the current optimization profile.
If the profile or input shapes are not yet set, or the provided name does not map to an output, returns -1.
- Parameters
name – The tensor name.
- get_output_allocator(self: tensorrt.tensorrt.IExecutionContext, name: str) -> nvinfer1::IOutputAllocator
Return the output allocator associated with the given output tensor, or None if the provided name does not map to an output tensor.
- Parameters
name – The tensor name.
- get_shape(self: tensorrt.tensorrt.IExecutionContext, binding: int) -> List[int]
Get the values of an input shape tensor required for shape calculations, or of an output tensor produced by shape calculations.
- Parameters
binding – The binding index of an input tensor for which ICudaEngine.is_shape_binding(binding) is true.
If ICudaEngine.binding_is_input(binding) == False, then both all_binding_shapes_specified and all_shape_inputs_specified must be True before calling this method.
- Returns
An iterable containing the values of the shape tensor.
- get_strides(self: tensorrt.tensorrt.IExecutionContext, binding: int) -> tensorrt.tensorrt.Dims
Return the strides of the buffer for the given binding.
Note that strides can be different for different execution contexts with dynamic shapes.
- Parameters
binding – The binding index.
- get_tensor_address(self: tensorrt.tensorrt.IExecutionContext, name: str) -> int
Get the memory address for the given input or output tensor.
- Parameters
name – The tensor name.
- get_tensor_shape(self: tensorrt.tensorrt.IExecutionContext, name: str) -> tensorrt.tensorrt.Dims
Return the shape of the given input or output tensor.
- Parameters
name – The tensor name.
- get_tensor_strides(self: tensorrt.tensorrt.IExecutionContext, name: str) -> tensorrt.tensorrt.Dims
Return the strides of the buffer for the given tensor name.
Note that strides can be different for different execution contexts with dynamic shapes.
- Parameters
name – The tensor name.
- infer_shapes(self: tensorrt.tensorrt.IExecutionContext) -> List[str]
Infer shapes and return the names of any tensors that are insufficiently specified.
An input tensor is insufficiently specified if either of the following is true:
It has dynamic dimensions and its runtime dimensions have not yet been specified via set_input_shape().
is_shape_inference_io(t) is True and the tensor's address has not yet been set.
- Returns
A List[str] indicating the names of any tensors which have not been sufficiently specified, or an empty list on success.
- Raises
RuntimeError if shape inference fails due to reasons other than insufficiently specified tensors.
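The usual pattern is to call infer_shapes() after setting input shapes and treat an empty list as the go-ahead to enqueue. A runnable sketch of that check, using a stand-in context that simulates the documented return convention (the class and tensor names are illustrative):

```python
# Sketch of the infer_shapes() usage pattern: an empty returned list
# means every tensor is sufficiently specified.

class ShapeCheckingContext:
    def __init__(self, required_inputs):
        self.required = set(required_inputs)
        self.specified = set()

    def set_input_shape(self, name, shape):
        self.specified.add(name)
        return True

    def infer_shapes(self):
        # Names of insufficiently specified tensors, or [] on success.
        return sorted(self.required - self.specified)

ctx = ShapeCheckingContext(["images", "scales"])
ctx.set_input_shape("images", (1, 3, 640, 640))
missing = ctx.infer_shapes()      # one input is still unspecified
ctx.set_input_shape("scales", (2,))
ready = ctx.infer_shapes() == []  # True: safe to enqueue inference
```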
- report_to_profiler(self: tensorrt.tensorrt.IExecutionContext) -> bool
Calculate layer timing info for the current optimization profile in IExecutionContext and update the profiler after one iteration of inference launch.
If the enqueue_emits_profile flag was set to true, the enqueue function will calculate layer timing implicitly if a profiler is provided, and there is no need to call this function. If the enqueue_emits_profile flag was set to false, the enqueue function will record the CUDA event timers if a profiler is provided, but it will not perform the layer timing calculation; this function must then be called explicitly to calculate layer timing for the previous inference launch.
In the CUDA graph launch scenario, it will record the same set of CUDA events as in regular enqueue functions if the graph is captured from an IExecutionContext with profiler enabled. This function needs to be called after graph launch to report the layer timing info to the profiler. Profiling CUDA graphs is only available from CUDA 11.1 onwards.
- Returns
True if the call succeeded, else False (e.g. profiler not provided, in CUDA graph capture mode, etc.)
- set_aux_streams(self: tensorrt.tensorrt.IExecutionContext, aux_streams: List[int]) -> None
Set the auxiliary streams that TensorRT should launch kernels on in the next execute_async_v3() call.
If set, TensorRT will launch the kernels that are supposed to run on the auxiliary streams using the streams provided by the user with this API. If this API is not called before the execute_async_v3() call, then TensorRT will use the auxiliary streams created by TensorRT internally.
- TensorRT will always insert event synchronizations between the main stream provided via the execute_async_v3() call and the auxiliary streams:
At the beginning of the execute_async_v3() call, TensorRT will make sure that all the auxiliary streams wait on the activities on the main stream.
At the end of the execute_async_v3() call, TensorRT will make sure that the main stream waits on the activities on all the auxiliary streams.
The provided auxiliary streams must not be the default stream and must all be different to avoid deadlocks.
- Parameters
aux_streams – A list of CUDA streams. If the length of the list is greater than engine.num_aux_streams, then only the first engine.num_aux_streams streams will be used. If the length is less than engine.num_aux_streams, such as an empty list, then TensorRT will use the provided streams for the first few auxiliary streams and create additional streams internally for the rest of the auxiliary streams.
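The truncate-or-pad rule in the aux_streams parameter can be sketched as a small pure function; select_aux_streams and make_internal_stream are hypothetical helpers illustrating the documented behavior, not TensorRT APIs:

```python
# Sketch of the documented aux-stream selection rule: only the first
# engine.num_aux_streams user streams are used, and internal streams
# fill in for any remainder.

def select_aux_streams(user_streams, num_aux_streams, make_internal_stream):
    used = list(user_streams[:num_aux_streams])   # extras are ignored
    while len(used) < num_aux_streams:
        used.append(make_internal_stream())       # remainder created internally
    return used

# Too many user streams: the surplus is dropped.
three_of_two = select_aux_streams([1, 2, 3], 2, lambda: -1)
# Too few (e.g. an empty list): internal streams fill the rest.
none_of_two = select_aux_streams([], 2, lambda: -1)
```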
- set_binding_shape(self: tensorrt.tensorrt.IExecutionContext, binding: int, shape: tensorrt.tensorrt.Dims) -> bool
Set the dynamic shape of a binding.
Requires the engine to be built without an implicit batch dimension. The binding must be an input tensor, and all dimensions must be compatible with the network definition (i.e. only the wildcard dimension -1 can be replaced with a new dimension > 0). Furthermore, the dimensions must be in the valid range for the currently selected optimization profile.
For all dynamic non-output bindings (which have at least one wildcard dimension of -1), this method needs to be called after setting active_optimization_profile and before either execute_async_v2() or execute_v2() may be called. When all input shapes have been specified, all_binding_shapes_specified is set to True.
- Parameters
binding – The binding index.
shape – The shape to set.
- Returns
False if an error occurs (e.g. the specified binding is out of range for the currently selected optimization profile, or the specified shape is inconsistent with the min-max range of the optimization profile), else True.
Note that the network can still be invalid for certain combinations of input shapes that lead to invalid output shapes. To confirm the correctness of the network input shapes, check whether the output binding has a valid shape using get_binding_shape() on the output binding.
- set_input_consumed_event(self: tensorrt.tensorrt.IExecutionContext, event: int) -> bool
Mark all input tensors as consumed.
- Parameters
event – The CUDA event that is triggered after all input tensors have been consumed.
- set_input_shape(self: tensorrt.tensorrt.IExecutionContext, name: str, shape: tensorrt.tensorrt.Dims) -> bool
Set the shape for the given input tensor.
- Parameters
name – The input tensor name.
shape – The input tensor shape.
- set_optimization_profile_async(self: tensorrt.tensorrt.IExecutionContext, profile_index: int, stream_handle: int) -> bool
Set the optimization profile with async semantics.
- Parameters
profile_index – The index of the optimization profile.
stream_handle – The CUDA stream on which the work to switch the optimization profile can be enqueued.
When an optimization profile is switched via this API, TensorRT may require that data is copied via cudaMemcpyAsync. It is the application's responsibility to guarantee that synchronization between the profile sync stream and the enqueue stream occurs.
- Returns
True if the optimization profile was set successfully.
- set_output_allocator(self: tensorrt.tensorrt.IExecutionContext, name: str, output_allocator: nvinfer1::IOutputAllocator) -> bool
Set the output allocator to use for the given output tensor.
Pass None to unset the output allocator.
The allocator is called by execute_async_v3().
- Parameters
name – The tensor name.
output_allocator – The output allocator.
- set_shape_input(self: tensorrt.tensorrt.IExecutionContext, binding: int, shape: List[int]) -> bool
Set values of an input shape tensor required by shape calculations.
- Parameters
binding – The binding index of an input tensor for which ICudaEngine.is_shape_binding(binding) and ICudaEngine.binding_is_input(binding) are both true.
shape – An iterable containing the values of the input shape tensor. The number of values should be the product of the dimensions returned by get_binding_shape(binding).
If ICudaEngine.is_shape_binding(binding) and ICudaEngine.binding_is_input(binding) are both true, this method must be called before execute_async_v2() or execute_v2() may be called. Additionally, this method must not be called if either ICudaEngine.is_shape_binding(binding) or ICudaEngine.binding_is_input(binding) is false.
- Returns
False if an error occurs (e.g. the specified binding is out of range for the currently selected optimization profile, or the specified shape values are inconsistent with the min-max range of the optimization profile), else True.
Note that the network can still be invalid for certain combinations of input shapes that lead to invalid output shapes. To confirm the correctness of the network input shapes, check whether the output binding has a valid shape using get_binding_shape() on the output binding.
- set_tensor_address(self: tensorrt.tensorrt.IExecutionContext, name: str, memory: int) -> bool
Set the memory address for the given input or output tensor.
- Parameters
name – The tensor name.
memory – The memory address.