Extending TensorRT with Custom Layers#

NVIDIA TensorRT supports many layers, and its functionality is continually extended; however, there can be cases in which the layers supported do not cater to a model’s specific needs. In such cases, TensorRT can be extended by implementing custom layers, often called plugins.

TensorRT contains standard plugins that can be loaded into your application. For a list of open-source plugins, refer to GitHub: TensorRT plugins.

To use standard TensorRT plugins in your application, the libnvinfer_plugin.so (nvinfer_plugin.dll on Windows) library must be loaded, and all plugins must be registered by calling initLibNvInferPlugins in your application code. For more information about these plugins, refer to the NvInferPlugin.h file.
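For illustration, a minimal sketch of such an application-side call (gLogger here is assumed to be an existing ILogger implementation in your application):

#include "NvInferPlugin.h"

// Registers all standard TensorRT plugins with the registry under the
// default namespace ("") so they can be found at build and runtime.
bool const pluginsRegistered = initLibNvInferPlugins(&gLogger, "");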

You can write and add your own if these plugins do not meet your needs.

Adding Custom Layers Using the C++ API#

There are four steps to ensure that TensorRT properly recognizes your plugin:

  1. Implement a plugin class from one of TensorRT’s plugin base classes. Currently, the only recommended one is IPluginV3.

  2. Implement a plugin creator class tied to your class by deriving from one of TensorRT’s plugin creator base classes. Currently, the only recommended one is IPluginCreatorV3One.

  3. Register an instance of the plugin creator class with TensorRT’s plugin registry.

  4. Add an instance of the plugin class to a TensorRT network by directly using TensorRT’s network APIs or loading an ONNX model using the TensorRT ONNX parser APIs.

The following sections explore each of these steps in detail.

Implementing a Plugin Class#

You can implement a custom layer by deriving from one of TensorRT’s plugin base classes. Starting in TensorRT 10.0, the only plugin interface recommended is IPluginV3, as others are deprecated. Therefore, this section mostly describes plugin implementation using IPluginV3. Refer to the Migrating V2 Plugins to IPluginV3 section for how plugins implementing V2 plugin interfaces can be migrated to IPluginV3.

IPluginV3 is a wrapper for a set of capability interfaces that define three capabilities: core, build, and runtime.

  • Core capability: Refers to plugin attributes and behaviors common to both the build and runtime phases of a plugin’s lifetime.

  • Build capability: Refers to plugin attributes and behaviors that the plugin must exhibit for the TensorRT builder.

  • Runtime capability: Refers to plugin attributes and behaviors that the plugin must exhibit for it to be executable, either during auto-tuning in the TensorRT build phase or inference in the TensorRT runtime phase.

IPluginV3OneCore (C++, Python), IPluginV3OneBuild (C++, Python), and IPluginV3OneRuntime (C++, Python) are the base classes that an IPluginV3 plugin must implement to display the core, build, and runtime capabilities, respectively. If I/O aliasing is required, IPluginV3OneBuildV2 (C++, Python) can be used as the build capability, which contains a superset of the functionalities in IPluginV3OneBuild.

Implementing a Plugin Creator Class#

To use a plugin in a network, you must first register it with TensorRT’s PluginRegistry (C++, Python). Rather than registering the plugin directly, you register an instance of a factory class for the plugin, derived from a child class of IPluginCreatorInterface (C++, Python). The plugin creator class also provides other information about the plugin: its name, version, and plugin field parameters.

IPluginCreatorV3One is the factory class for IPluginV3. Plugin instances are created through IPluginCreatorV3One::createPlugin(), which has the following signature:

IPluginV3* createPlugin(AsciiChar const *name, PluginFieldCollection const *fc, TensorRTPhase phase)
create_plugin(self: trt.IPluginCreatorV3One, name: str, field_collection: trt.PluginFieldCollection, phase: trt.TensorRTPhase) -> trt.IPluginV3

IPluginCreatorV3One::createPlugin() may be called to create a plugin instance in either the build phase of TensorRT or the runtime phase of TensorRT, which is communicated by the phase argument of type TensorRTPhase (C++, Python).

  • The returned IPluginV3 object must have a valid core capability in both phases.

  • In the build phase, the returned IPluginV3 object must have both a build and runtime capability.

  • In the runtime phase, the returned IPluginV3 object must have a runtime capability. A build capability is not required and is ignored.

Registering a Plugin Creator with the Plugin Registry#

There are two ways that you can register plugin creators with the registry:

  1. Statically register by calling REGISTER_TENSORRT_PLUGIN. REGISTER_TENSORRT_PLUGIN always registers the creator under the default namespace (“”).

  2. Dynamically register by creating an entry point similar to initLibNvInferPlugins and calling registerCreator on the plugin registry. This is preferred over static registration as it allows plugins to be registered under a unique namespace. This ensures no name collisions during build time across different plugin libraries.
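As a rough sketch of both registration styles (MyPluginCreator is a hypothetical IPluginCreatorV3One implementation):

// Static registration under the default namespace ("").
REGISTER_TENSORRT_PLUGIN(MyPluginCreator);

// Dynamic registration under a caller-chosen namespace, typically done in an
// entry point similar to initLibNvInferPlugins().
extern "C" bool initMyPluginLib(char const* libNamespace)
{
    static MyPluginCreator creator;
    return getPluginRegistry()->registerCreator(creator, libNamespace);
}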

During serialization, the TensorRT engine internally stores the plugin name, plugin version, and namespace (if it exists) for all plugins, along with any plugin fields in the PluginFieldCollection returned by IPluginV3OneRuntime::getFieldsToSerialize(). During deserialization, TensorRT looks up a plugin creator with the same plugin name, version, and namespace from the plugin registry and invokes IPluginCreatorV3One::createPlugin() on it—the PluginFieldCollection that was serialized is passed back as the fc argument.

Adding a Plugin Instance to a TensorRT Network#

You can add a plugin to the TensorRT network using addPluginV3(), which creates a network layer with the given plugin.

For example, you can add a plugin layer to your network as follows:

// Look up the plugin in the registry
// Cast to appropriate child class of IPluginCreatorInterface
auto creator = static_cast<IPluginCreatorV3One*>(getPluginRegistry()->getCreator(pluginName, pluginVersion, pluginNamespace));
PluginFieldCollection const* pluginFC = creator->getFieldNames();
// Populate the field parameters for the plugin layer
// PluginFieldCollection *pluginData = parseAndFillFields(pluginFC, layerFields);
// Create the plugin object using the layerName and the plugin metadata for use by the TensorRT builder
IPluginV3 *pluginObj = creator->createPlugin(layerName, pluginData, TensorRTPhase::kBUILD);
// Add the plugin to the TensorRT network
auto layer = network.addPluginV3(inputs.data(), int(inputs.size()), shapeInputs.data(), int(shapeInputs.size()), *pluginObj);
// ... build the rest of the network and serialize the engine ...
// Delete the plugin object
delete pluginObj;
// ... free the allocated pluginData ...

The createPlugin method described previously creates a new plugin object on the heap and returns a pointer. As shown previously, ensure you delete the pluginObj to avoid a memory leak.

When the engine is deleted, the engine destroys any clones of the plugin object created during the build. You are responsible for ensuring the plugin object you created is freed after it is added to the network.

Note

  • Do not serialize all plugin parameters, only those required to function correctly at runtime. Build time parameters can be omitted.

  • If you are an automotive safety user, you must call getSafePluginRegistry() instead of getPluginRegistry(). You must also use the macro REGISTER_SAFE_TENSORRT_PLUGIN instead of REGISTER_TENSORRT_PLUGIN.

Example: Adding a Custom Layer with Dynamic Shapes Using C++#

Imagine that a custom layer is needed for a padding-like operation where each image in an input batch must be reshaped to 32 x 32. The input tensor X would be of shape (B, C, H, W), and the output Y would be of shape (B, C, 32, 32). To accomplish this, a TensorRT plugin can be written using the IPluginV3 interface; let us call it PadPlugin.

Since an IPluginV3 plugin must possess multiple capabilities, each defined by a separate interface, you could implement a plugin using the principle of composition or multiple inheritance. However, a multiple inheritance approach is easier for most use cases, particularly when coupling build and runtime capabilities in a single class is tolerable.

Using multiple inheritance, PadPlugin can be implemented as follows:

class PadPlugin : public IPluginV3, public IPluginV3OneCore, public IPluginV3OneBuild, public IPluginV3OneRuntime
{
    // ...override inherited virtual methods.
};

The override of IPluginV3::getCapabilityInterface must return pointers to the individual capability interfaces. For each PluginCapabilityType, it is imperative to cast through the corresponding capability interface to remove ambiguity for the compiler.

IPluginCapability* PadPlugin::getCapabilityInterface(PluginCapabilityType type) noexcept override
{
    // All plugin interface methods are noexcept and care should be
    // taken not to throw exceptions across the API boundary. It is
    // recommended to catch any exceptions and return a value that
    // appropriately represents the error status.
    try
    {
        if (type == PluginCapabilityType::kBUILD)
        {
            return static_cast<IPluginV3OneBuild*>(this);
        }
        if (type == PluginCapabilityType::kRUNTIME)
        {
            return static_cast<IPluginV3OneRuntime*>(this);
        }
        ASSERT(type == PluginCapabilityType::kCORE);
        return static_cast<IPluginV3OneCore*>(this);
    }
    catch(...)
    {
        // log error
    }
    return nullptr;

}

The methods that are of importance in this particular example are:

  • INetworkDefinition::addPluginV3

  • IPluginV3OneBuild::getNbOutputs

  • IPluginV3OneBuild::getOutputDataTypes

  • IPluginV3OneBuild::getOutputShapes

  • IPluginV3OneBuild::supportsFormatCombination

  • IPluginV3OneBuild::configurePlugin

  • IPluginV3OneRuntime::onShapeChange

  • IPluginV3OneRuntime::enqueue

INetworkDefinition::addPluginV3 (C++, Python) can add the plugin to the network.

std::vector<ITensor*> inputs{X};

auto pluginLayer = network->addPluginV3(inputs.data(), inputs.size(), nullptr, 0, *plugin);

You can communicate that there is a single plugin output by overriding IPluginV3OneBuild::getNbOutputs.

int32_t PadPlugin::getNbOutputs() const noexcept override
{
    return 1;
}

The output will have the same data type as the input, which can be communicated in the override of IPluginV3OneBuild::getOutputDataTypes.

int32_t PadPlugin::getOutputDataTypes(
        DataType* outputTypes, int32_t nbOutputs, DataType const* inputTypes, int32_t nbInputs) const noexcept override
{
    outputTypes[0] = inputTypes[0];
    return 0;
}

The override for getOutputShapes returns symbolic expressions for the output dimensions in terms of the input dimensions, except in the case of data-dependent output shapes, which will be covered later in Example: Adding a Custom Layer with Data-Dependent and Shape Input-Dependent Shapes Using C++. In the current example, the first two dimensions of the output will equal the first two dimensions of the input, respectively, and the last two dimensions will be constants, each equal to 32. The IExprBuilder passed into getOutputShapes can be used to define constant symbolic expressions.

int32_t PadPlugin::getOutputShapes(DimsExprs const* inputs, int32_t nbInputs, DimsExprs const* shapeInputs, int32_t nbShapeInputs, DimsExprs* outputs, int32_t nbOutputs, IExprBuilder& exprBuilder) noexcept
{
    outputs[0].nbDims = 4;
    // first two output dims are equal to the first two input dims
    outputs[0].d[0] = inputs[0].d[0];
    outputs[0].d[1] = inputs[0].d[1];
    // The last two output dims are equal to 32
    outputs[0].d[2] = exprBuilder.constant(32);
    outputs[0].d[3] = exprBuilder.constant(32);
    return 0;
}

TensorRT uses supportsFormatCombination to ask whether the plugin accepts a given type and format combination for a connection at a given position pos and given formats/types for lesser-indexed connections. The interface indexes the inputs/outputs uniformly as connections, starting at 0 for the first input, then the rest of the inputs in order, followed by numbering the outputs. In the example, the input is connection 0, and the output is connection 1.

For the sake of simplicity, the example supports only linear formats and FP32 types.

bool PadPlugin::supportsFormatCombination(
        int32_t pos, DynamicPluginTensorDesc const* inOut, int32_t nbInputs, int32_t nbOutputs) noexcept override
{
    assert(0 <= pos && pos < 2);
    return inOut[pos].desc.format == PluginFormat::kLINEAR && inOut[pos].desc.type == DataType::kFLOAT;
}

TensorRT invokes two methods to allow the plugin to make any configuration choices before enqueue(), both during auto-tuning (in the engine build phase) and when the engine is being executed (in the runtime phase).

  1. IPluginV3OneBuild::configurePlugin: Called when a plugin is being prepared for profiling (auto-tuning) but not for any specific input size. The min, max, and opt values of the DynamicPluginTensorDesc correspond to the bounds on the tensor shape and its shape for auto-tuning. The desc.dims field corresponds to the dimensions of the plugin inputs specified at network creation, including any wildcards (-1) for dynamic dimensions.

  2. IPluginV3OneRuntime::onShapeChange: Called during both the build-phase and runtime phase before enqueue() to communicate the input and output shapes for the subsequent enqueue(). The output PluginTensorDesc will contain wildcards (-1) for any data-dependent dimensions specified through getOutputShapes().

This plugin does not need configurePlugin and onShapeChange to do anything, so they are no-ops:

int32_t PadPlugin::configurePlugin(DynamicPluginTensorDesc const* in, int32_t nbInputs, DynamicPluginTensorDesc const* out, int32_t nbOutputs) noexcept override
{
    return 0;
}

int32_t PadPlugin::onShapeChange(PluginTensorDesc const* in, int32_t nbInputs, PluginTensorDesc const* out, int32_t nbOutputs) noexcept override
{
    return 0;
}

Finally, the override PadPlugin::enqueue has to do the work. Since shapes are dynamic, enqueue is handed a PluginTensorDesc that describes each input and output’s dimensions, type, and format.

int32_t enqueue(PluginTensorDesc const* inputDesc, PluginTensorDesc const* outputDesc, void const* const* inputs,
        void* const* outputs, void* workspace, cudaStream_t stream) noexcept override
{
    // populate outputs and return status code
}
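
For completeness, a rough sketch of what such an enqueue could look like for the linear FP32 case, assuming H <= 32 and W <= 32 (this is not the official sample; it simply zero-fills the output and copies each input plane into the top-left corner of the corresponding 32 x 32 output plane):

int32_t PadPlugin::enqueue(PluginTensorDesc const* inputDesc, PluginTensorDesc const* outputDesc,
    void const* const* inputs, void* const* outputs, void* workspace, cudaStream_t stream) noexcept
{
    int64_t const B = inputDesc[0].dims.d[0];
    int64_t const C = inputDesc[0].dims.d[1];
    int64_t const H = inputDesc[0].dims.d[2];
    int64_t const W = inputDesc[0].dims.d[3];

    auto const* src = static_cast<float const*>(inputs[0]);
    auto* dst = static_cast<float*>(outputs[0]);

    // Zero the whole output first, then copy the input rows into place.
    size_t const outBytes = static_cast<size_t>(B * C) * 32 * 32 * sizeof(float);
    if (cudaMemsetAsync(dst, 0, outBytes, stream) != cudaSuccess)
    {
        return -1;
    }
    // For each (batch, channel) plane, copy W floats per row for H rows.
    for (int64_t i = 0; i < B * C; ++i)
    {
        if (cudaMemcpy2DAsync(dst + i * 32 * 32, 32 * sizeof(float), src + i * H * W,
                W * sizeof(float), W * sizeof(float), H, cudaMemcpyDeviceToDevice, stream)
            != cudaSuccess)
        {
            return -1;
        }
    }
    return 0;
}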

Example: Adding a Custom Layer with Data-Dependent and Shape Input-Dependent Shapes Using C++#

This section shows an example of a plugin with data-dependent and shape-input-dependent shapes. Note that data-dependent output shapes and adding shape inputs to a plugin are new features not present in V2 plugins.

  • Data-dependent Shapes (DDS): The shape of a plugin output could depend on the values of the input tensors.

  • Shape inputs: In addition to device tensor inputs, a plugin can accept shape tensor inputs. These shape inputs are visible to the plugin only as arguments to IPluginV3OneBuild::getOutputShapes(); their sole purpose is to aid the plugin in performing output shape calculations.

For example, BarPlugin is a plugin with one device input X, one shape input S, and an output Y, where:

  • The first dimension of Y depends on the value of S.

  • The second dimension of Y is static.

  • The third dimension of Y depends on the shape of X.

  • The fourth dimension of Y is data-dependent.

Similar to PadPlugin in the prior example, BarPlugin uses multiple inheritance.

To add the plugin to the network, INetworkDefinition::addPluginV3 (C++, Python) can be used similarly. After the device tensor inputs, addPluginV3 takes two additional arguments to specify the shape tensor inputs.

std::vector<ITensor*> inputs{X};
std::vector<ITensor*> shapeInputs{S};

auto pluginLayer = network->addPluginV3(inputs.data(), inputs.size(), shapeInputs.data(), shapeInputs.size(), *plugin);

Note

The TensorRT ONNX parser provides an inbuilt feature to pass shape inputs to custom ops supported by IPluginV3-based plugins. The indices of the inputs to be interpreted as shape inputs must be indicated by a node attribute named tensorrt_plugin_shape_input_indices as a list of integers. For example, if the custom op has four inputs and the second and fourth inputs should be passed as shape inputs to the plugin, add a node attribute named tensorrt_plugin_shape_input_indices of type onnx.AttributeProto.ints containing the value [1, 3].

In the override for getOutputShapes, plugins must declare both the position and the bounds of each data-dependent dimension of each output tensor. The bounds can be expressed using a special output called a size tensor.

A size tensor is a scalar of either INT32 or INT64 data type, expressed through a value for auto-tuning and an upper bound; these values can either be constants or computed in terms of device input shapes or shape input values using IExprBuilder.

In this case, there is a singular data-dependent dimension, which we can represent using one size tensor. Note that any size tensor needed to express a data-dependent dimension counts as an output of the plugin; therefore, the plugin will have two outputs in total.

int32_t getNbOutputs() const noexcept override
{
    return 2;
}

Assume output Y has the same type as the device input X and that the data-dependent dimension size fits in INT32 (that is, the size tensor has type DataType::kINT32). Then BarPlugin expresses the output data types like this:

int32_t getOutputDataTypes(
        DataType* outputTypes, int32_t nbOutputs, DataType const* inputTypes, int32_t nbInputs) const noexcept override
{
    outputTypes[0] = inputTypes[0];
    outputTypes[1] = DataType::kINT32;
    return 0;
}

The method getOutputShapes can build symbolic output shape expressions using the IExprBuilder passed to it. In what follows, note that size tensors must be explicitly declared 0D.

int32_t BarPlugin::getOutputShapes(DimsExprs const* inputs, int32_t nbInputs, DimsExprs const* shapeInputs, int32_t nbShapeInputs, DimsExprs* outputs, int32_t nbOutputs, IExprBuilder& exprBuilder) noexcept
{
    outputs[0].nbDims = 4;
    // The first output dimension depends on the value of S.
    // The value of S is encoded as fictitious dimensions.
    outputs[0].d[0] = shapeInputs[0].d[0];
    // The third output dimension depends on the shape of X
    outputs[0].d[2] = inputs[0].d[0];
    // The second output dimension is static
    outputs[0].d[1] = exprBuilder.constant(3);

    auto upperBound = exprBuilder.operation(DimensionOperation::kPROD, *inputs[0].d[2], *inputs[0].d[3]);
    auto optValue = exprBuilder.operation(DimensionOperation::kFLOOR_DIV, *upperBound, *exprBuilder.constant(2));

    // output at index 1 is a size tensor
    outputs[1].nbDims = 0; // size tensors must be declared as 0-D
    auto sizeTensor = exprBuilder.declareSizeTensor(1, *optValue, *upperBound);

    // The fourth output dimension is data-dependent
    outputs[0].d[3] = sizeTensor;

    return 0;
}

The override of supportsFormatCombination imposes the following conditions:

  • The device input X must have DataType::kFLOAT or DataType::kHALF.

  • The output Y must have the same type as X.

  • The size tensor output has the type DataType::kINT32.

Note

Shape inputs passed to the plugin through addPluginV3 (C++, Python) only appear as arguments to getOutputShapes() and are not counted or included among plugin inputs in any other plugin interface method.

bool BarPlugin::supportsFormatCombination(
        int32_t pos, DynamicPluginTensorDesc const* inOut, int32_t nbInputs, int32_t nbOutputs) noexcept override
    {
        assert(0 <= pos && pos < 3);
        auto const* in = inOut;
        auto const* out = inOut + nbInputs;

        bool typeOk{false};

        switch (pos)
        {
        case 0: typeOk = in[0].desc.type == DataType::kFLOAT || in[0].desc.type == DataType::kHALF; break;
        case 1: typeOk = out[0].desc.type == in[0].desc.type; break;
        case 2: typeOk = out[1].desc.type == DataType::kINT32; break;
        }

        return inOut[pos].desc.format == PluginFormat::kLINEAR && typeOk;
    }

The local variables in and out here allow inspecting inOut by input or output number instead of connection number.

Important

The override may inspect the format/type for a connection with an index less than pos but must never inspect the format/type for a connection with an index greater than pos. The example uses case 1 to check connection 1 against connection 0, and not case 0 to check connection 0 against connection 1.

configurePlugin and onShapeChange would be no-ops here, too; one thing to note is that in onShapeChange, the output’s PluginTensorDesc will contain a wildcard (-1) for the data-dependent dimension.

Implementing enqueue with data-dependent output shapes differs greatly from the static or dynamic shape cases. As with any other output, for an output with a data-dependent dimension, the output buffer passed to enqueue is guaranteed large enough to hold the corresponding output tensor (based on the upper bound specified through getOutputShapes).

Example: Adding a Custom Layer with INT8 I/O Support Using C++#

PoolPlugin is a plugin demonstrating how to add INT8 I/O for a custom pooling layer using IPluginV3. PoolPlugin multiply inherits from IPluginV3, IPluginV3OneCore, IPluginV3OneBuild, and IPluginV3OneRuntime, similar to the PadPlugin and BarPlugin examples above.

The main methods that affect INT8 I/O are:

  • supportsFormatCombination

  • configurePlugin

The override for supportsFormatCombination must indicate which INT8 I/O combinations are allowed. This interface is similar to Example: Adding a Custom Layer with Dynamic Shapes Using C++. In this example, the supported I/O tensor format is linear CHW with FP32, FP16, BF16, FP8, or INT8 data type, and the input and output tensors must have the same data type.

bool PoolPlugin::supportsFormatCombination(
        int32_t pos, DynamicPluginTensorDesc const* inOut, int32_t nbInputs, int32_t nbOutputs)  noexcept override
{
    assert(nbInputs == 1 && nbOutputs == 1 && pos < nbInputs + nbOutputs);
    bool condition = inOut[pos].desc.format == PluginFormat::kLINEAR;
    condition &= (inOut[pos].desc.type == DataType::kFLOAT ||
                inOut[pos].desc.type == DataType::kHALF ||
            inOut[pos].desc.type == DataType::kBF16 ||
                inOut[pos].desc.type == DataType::kFP8 ||
                inOut[pos].desc.type == DataType::kINT8);
    condition &= inOut[pos].desc.type == inOut[0].desc.type;
    return condition;
}

Important

  • If INT8 calibration must be used with a network with INT8 I/O plugins, the plugin must support FP32 I/O, as TensorRT uses FP32 to calibrate the graph.

  • If the FP32 I/O variant is not supported or INT8 calibration is not used, all required INT8 I/O tensor scales must be set explicitly.

  • Calibration cannot determine the dynamic range of a plugin’s internal tensors. Plugins that operate on quantized data must calculate their dynamic range for internal tensors.

  • A plugin can be designed to accept FP8 and INT8 I/O types, although note that in TensorRT 9.0, the builder does not allow networks that mix INT8 and FP8.

Information communicated by TensorRT through configurePlugin or onShapeChange can be used to obtain information about the pooling parameters and the input and output scales. These can be stored as member variables, serialized, and then deserialized to be used during inference.

int32_t PoolPlugin::configurePlugin(DynamicPluginTensorDesc const* in, int32_t nbInputs, DynamicPluginTensorDesc const* out, int32_t nbOutputs) noexcept override
{
    ...
    mPoolingParams.mC = in[0].desc.d[1];
    mPoolingParams.mH = in[0].desc.d[2];
    mPoolingParams.mW = in[0].desc.d[3];
    mPoolingParams.mP = out[0].desc.d[2];
    mPoolingParams.mQ = out[0].desc.d[3];
    mInHostScale = in[0].desc.scale >= 0.0F ? in[0].desc.scale : -1.0F;
    mOutHostScale = out[0].desc.scale >= 0.0F ? out[0].desc.scale : -1.0F;
    return 0;
}

The per-tensor INT8 I/O scales can be obtained from PluginTensorDesc::scale.

Adding Custom Layers using the Python API (TensorRT >= 10.6)#

For most use cases, defining Python plugins with a decorator-based approach is recommended (available starting in TensorRT 10.6). Refer to Writing Custom Operators with TensorRT Python Plugins in the TensorRT Python API documentation for a guide describing the different use cases and best practices. Note that embedding Python-defined plugins into TensorRT engines, such that the engine is independent of Python and the plugin source itself, is only possible with this approach.

Adding Custom Layers using the Python API (Advanced/TensorRT <= 10.5)#

Python plugins can also be implemented with a class-based approach (the only supported approach for TensorRT <= 10.5). In contrast to decorator-based Python plugins (described in the preceding section), class-based plugins offer the following:

  • Statefulness: class-based plugins have state (for example, configured/non-configured) and allow more granular querying by TensorRT for different plugin properties and behaviors.

  • Shape tensor input support.

  • Fine-grained control over the plugin instances TensorRT creates during engine deserialization; this requires custom plugin creator definitions, which are only available with a class-based approach.

  • Manual serialization and deserialization of plugin attributes.

  • Ability to pre-request a device memory scratch space (workspace in addition to input/output buffers) to avoid execution-time device memory allocations.

These often come at the expense of increased implementation complexity and code bloat, which can lead to more bugs. Therefore, a tradeoff analysis is recommended before considering class-based plugin implementations in Python.

Implementing a class-based plugin in Python is similar to C++ in that implementation of IPluginV3 and IPluginCreatorV3One is necessary. Furthermore, interface methods in Python have mostly similar APIs to their C++ counterparts; most differences are minor and self-explanatory.

The following list includes a few selected changes. Subsequent subsections describe the differences involved in more detail.

  • The following plugin APIs have been omitted in favor of reading/writing to an appropriately named attribute.

    Class                 Method                        Replaced with Attribute
    IPluginV3OneCore      getPluginName()               plugin_name [str]
    IPluginV3OneCore      getPluginNamespace()          plugin_namespace [str]
    IPluginV3OneCore      getPluginVersion()            plugin_version [str]
    IPluginV3OneBuild     getNbOutputs()                num_outputs [int]
    IPluginV3OneBuild     getTimingCacheID()            timing_cache_id [str]
    IPluginV3OneBuild     getMetadataString()           metadata_string [str]
    IPluginV3OneBuild     getFormatCombinationLimit()   format_combination_limit [int]
    IPluginCreatorV3One   getPluginNamespace()          plugin_namespace [str]
    IPluginCreatorV3One   getFieldNames()               field_names [PluginFieldCollection]
    IPluginCreatorV3One   getPluginName()               name [str]
    IPluginCreatorV3One   getPluginVersion()            plugin_version [str]

  • Some methods have default implementations; these can be left unimplemented, and the default behaviors outlined below will take effect:

    class trt.IPluginV3:
        def destroy(self):
            pass
    
    class trt.IPluginV3OneBuild:
        def get_valid_tactics(self):
            return []
    
        def get_workspace_size(self, input_desc, output_desc):
            return 0
    
  • Methods that must return integer status codes in IPluginV3OneBuild and IPluginV3OneRuntime should raise exceptions in Python instead. For example:

    int32_t configurePlugin(DynamicPluginTensorDesc const* in, int32_t nbInputs, DynamicPluginTensorDesc const* out, int32_t nbOutputs)

    configure_plugin(self: trt.IPluginV3OneBuild, in: List[trt.DynamicPluginTensorDesc], out: List[trt.DynamicPluginTensorDesc]) -> None
    

    For example, you can raise a ValueError during enqueue if an input has an illegal value.

  • The Python API IPluginV3.destroy() has no direct equivalent in the C++ API. Python plugins are expected to perform any functionality that would be performed in an IPluginV3 C++ destructor within the IPluginV3.destroy() method.

    For full examples demonstrating Python plugins, refer to the python_plugin sample.

Registration of a Python Plugin#

Python plugins must be registered dynamically through the IPluginRegistry.register_creator() API. There is no analog to the REGISTER_TENSORRT_PLUGIN macro available for static registration.

Building and Running TensorRT Engines Containing Python Plugins#

It is possible to build TensorRT engines using Python-based plugins. However, running such engines outside of Python is currently impossible since the plugin must be available in the scope where the engine is being deserialized. For example, you cannot use a tool like trtexec directly.

Implementing enqueue of a Python Plugin#

The API for IPluginV3OneRuntime::enqueue() in C++ and Python are as follows:

int32_t enqueue(PluginTensorDesc const *inputDesc, PluginTensorDesc const *outputDesc, void const *const *inputs, void *const *outputs, void *workspace, cudaStream_t stream)
enqueue(self: trt.IPluginV3OneRuntime, input_desc: List[trt.PluginTensorDesc], output_desc: List[trt.PluginTensorDesc], inputs: List[int], outputs: List[int], workspace: int, stream: int) -> None

Here, inputs, outputs, and workspace are passed in as intptr_t casts of the respective device pointers. Similarly, stream is an intptr_t cast of the CUDA stream handle. There is flexibility within Python on how to read from and write to these buffers, depending on the particular use case. For example, with CUDA Python, this is quite simple since cuda.cuLaunchKernel accepts ints representing the pointers, wrapped in NumPy arrays:

d_input = np.array([inputs[0]], dtype=np.uint64)
d_output = np.array([outputs[0]], dtype=np.uint64)
stream_ptr = np.array([stream], dtype=np.uint64)
args = [d_input,  d_output]
kernel_args = np.array([arg.ctypes.data for arg in args], dtype=np.uint64)
…
checkCudaErrors(cuda.cuLaunchKernel(_float_kernel,
                                        num_blocks, 1, 1,
                                        block_size, 1, 1,
                                        0,
                                        stream_ptr,
                                        kernel_args , 0))

Translating Device Buffers/CUDA Stream Pointers in enqueue to other Frameworks#

Constructing CuPy arrays on top of device buffers is possible using CuPy’s UnownedMemory class.

def enqueue(self, input_desc, output_desc, inputs, outputs, workspace, stream):
    ...
    inp_dtype = trt.nptype(input_desc[0].type)
    inp_mem = cp.cuda.UnownedMemory(
        inputs[0], volume(input_desc[0].dims) * cp.dtype(inp_dtype).itemsize, self
    )
    out_mem = cp.cuda.UnownedMemory(
        outputs[0],
        volume(output_desc[0].dims) * cp.dtype(inp_dtype).itemsize,
        self,
    )

    inp_ptr = cp.cuda.MemoryPointer(inp_mem, 0)
    out_ptr = cp.cuda.MemoryPointer(out_mem, 0)

    inp = cp.ndarray((volume(input_desc[0].dims)), dtype=inp_dtype, memptr=inp_ptr)
    out = cp.ndarray((volume(output_desc[0].dims)), dtype=inp_dtype, memptr=out_ptr)

If needed, torch.as_tensor() can then be used to construct a Torch array:

# inp_d = cp.ndarray(tuple(input_desc[0].dims), dtype=inp_dtype, memptr=inp_ptr)
inp_t = torch.as_tensor(inp_d, device='cuda')

Similarly, CuPy stream handles can be constructed from the passed-in stream pointer through CuPy’s ExternalStream class.

cuda_stream = cp.cuda.ExternalStream(stream)

Automatic Downcasting#

TensorRT Python bindings will do automatic downcasting for custom types written in Python that implement interfaces like IPluginCreatorV3One or IPluginResource. For instance, take the following method from IPluginRegistry:

get_creator(self: trt.IPluginRegistry, name: str, version: str, namespace: str = "") -> trt.IPluginCreatorInterface

The return type is indicated as IPluginCreatorInterface. However, in practice, if you were to write a class MyPluginCreator implementing IPluginCreatorV3One (which in turn implements IPluginCreatorInterface), the get_creator method will return an automatically downcasted type of MyPluginCreator.

This extends to trt.IPluginRegistry.all_creators, which is a List[trt.IPluginCreatorInterface]. If you had registered a plugin creator of type MyPluginCreator and another type MyOtherPluginCreator, both plugin creators will be present as those respective types in the list.

Example: Adding a Custom Layer to a TensorRT Network Using Python#

Using plugin nodes, custom layers can be added to any TensorRT network in Python. The Python API has a function called add_plugin_v3 that enables adding a plugin node to a network. The following example illustrates this. It creates a simple TensorRT network and adds a hypothetical plugin node by looking up the TensorRT plugin registry.

import tensorrt as trt
import numpy as np

TRT_LOGGER = trt.Logger()

trt.init_libnvinfer_plugins(TRT_LOGGER, '')
def get_trt_plugin(plugin_name, plugin_version, plugin_namespace):
    plugin = None
    plugin_creator = trt.get_plugin_registry().get_creator(plugin_name, plugin_version, plugin_namespace)
    # trt will automatically downcast to IPluginCreator or IPluginCreatorInterface
    # Can inspect plugin_creator.interface_info to make sure
    if plugin_creator is not None:
        lrelu_slope_field = trt.PluginField("epsilon", np.array([0.00000001], dtype=np.float32), trt.PluginFieldType.FLOAT32)
        field_collection = trt.PluginFieldCollection([lrelu_slope_field])
        plugin = plugin_creator.create_plugin(name=plugin_name, field_collection=field_collection, phase=trt.TensorRTPhase.BUILD)
    return plugin

def main():
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network()
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2**20)
    input_layer = network.add_input(name="input_layer", dtype=trt.float32, shape=(1, 1))
    plugin = network.add_plugin_v3(inputs=[input_layer], shape_inputs=[], plugin=get_trt_plugin("MY_PLUGIN", "1", ""))
    plugin.get_output(0).name = "outputs"
    network.mark_output(plugin.get_output(0))

Enabling Timing Caching and Using Custom Tactics#

IPluginV3 provides more control over the profiling of custom layers, which was unavailable with V2 plugins and earlier. One such feature is timing caching. If a TensorRT network contains multiple instances of the same plugin, identically configured (for example, same plugin attribute values) and handling identical input-output shapes and types, then it makes sense to time (measure the latency of) only one instance, cache the latency, and skip timing the rest of the instances. This can enable large savings in engine build time.

Timing caching for IPluginV3 plugins is an opt-in feature; to opt-in, the plugin must advertise a non-null timing cache ID.

char const* FooPlugin::getTimingCacheID() noexcept override
{
    // return nullptr to disable timing caching (default behavior)
    // return non-null string to enable timing caching
}
class FooPlugin(trt.IPluginV3, trt.IPluginV3OneBuild, ...):
    def __init__(self):
        # set to None to disable timing caching
        self.timing_cache_id = value

Note the following regarding the timing cache ID:

  • The user-provided timing cache ID should be considered a suffix to a larger cache ID; TensorRT automatically forms a prefix by considering the plugin’s input/output shape and format information. Usually, the user-provided timing cache ID could consist of plugin attributes and their values, as sketched after this list.

  • It must reflect the plugin’s creation state and not evolve after creation.
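
For example, a plugin whose behavior depends on a single pad value attribute might derive its timing cache ID from that attribute. This is only a sketch; the members mTimingCacheId and mPadValue are assumptions:

char const* FooPlugin::getTimingCacheID() noexcept override
{
    // mTimingCacheId is assumed to be a std::string member populated once at
    // construction, for example: mTimingCacheId = "padValue=" + std::to_string(mPadValue);
    return mTimingCacheId.c_str();
}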

For V2 plugins, TensorRT only times the plugin for any (multiple) type/format combinations it claims to support. With IPluginV3, plugins can also ensure custom tactics are timed, and TensorRT uses the fastest tactic. For example, the plugin may have one of two kernels to compute the output, and it may not be possible to predict which one would be fastest on a specific platform and for specific input/output shapes and formats. It is possible to ask TensorRT to time the plugin for each tactic for each format combination, figure out the fastest such configuration, and use that during inference.

Note

  • TensorRT may choose not to time the plugin if it only supports one type/format combination and either does not use custom tactics or only advertises one.

  • For IPluginV3OneBuild, TensorRT times a maximum of getFormatCombinationLimit() type/format combinations for each tactic; override this method to increase/decrease this limit depending on need.

To get started, advertise the custom tactics to TensorRT:

int32_t FooPlugin::getNbTactics() noexcept override
{
    return 2; // return 0 to disable custom tactics (default behavior)
}

int32_t FooPlugin::getValidTactics(int32_t* tactics, int32_t nbTactics) noexcept override
{
    tactics[0] = 1;
    tactics[1] = 2;
    return 0;
}
def get_valid_tactics(self):
    return [1, 2] # return empty vector to disable custom tactics (default behavior)

Any strictly positive integer could be used as a custom tactic value (TensorRT reserves 0 as the default tactic).

When the plugin is timed, configurePlugin() is guaranteed to be called with the current input/output format combination before getValidTactics() is called. Therefore, it is possible to advertise a different set of tactics per input/output format combination. For example, for a plugin that supports both FP32 and FP16, only tactic 1 might be advertised for FP16, while tactics 1 and 2 are advertised for FP32.
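
A sketch of that pattern, assuming configurePlugin() stored the current input type in a member mInputType:

int32_t FooPlugin::getNbTactics() noexcept override
{
    // Only tactic 1 is advertised for FP16; tactics 1 and 2 for FP32.
    return (mInputType == DataType::kHALF) ? 1 : 2;
}

int32_t FooPlugin::getValidTactics(int32_t* tactics, int32_t nbTactics) noexcept override
{
    tactics[0] = 1;
    if (mInputType != DataType::kHALF)
    {
        tactics[1] = 2;
    }
    return 0;
}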

During the engine build, when auto-tuning the plugin, TensorRT will communicate the tactic for the subsequent enqueue() by invoking IPluginV3OneRuntime::setTactic (C++, Python). When an engine is deserialized, TensorRT will invoke setTactic once the plugin has been created to communicate the best tactic chosen for the plugin. Even if custom tactics are not used, setTactic will be called with the default tactic value 0.
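
A minimal setTactic override that simply remembers the chosen tactic for later use in enqueue() (mTactic is an assumed member) could look like this:

int32_t FooPlugin::setTactic(int32_t tactic) noexcept override
{
    mTactic = tactic; // remember the chosen tactic for the subsequent enqueue()
    return 0;
}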

Sharing Custom Resources Among Plugins#

Starting in TensorRT 10.0, a key-value store is associated with the plugin registry. This store can store user-implemented IPluginResource (C++, Python) objects against a string key. This functionality can be used to share state or some resources among different plugins. Note that it is not tied to IPluginV3 (or even to plugin interfaces).

Let us explore an example.

Example: Sharing Weights Downloaded Over a Network Among Different Plugins#

Assume that several plugins need access to the same weights, W. Due to licensing restrictions, you may prefer that these weights be downloaded when the engine runs. However, due to W’s large size, it is also desirable that only one copy is downloaded, which is shared among all plugins needing access.

  1. Implement SharedWeights class, which implements IPluginResource.

  2. Each plugin that requires access to the weights requests an instance of initialized (downloaded) SharedWeights by calling IPluginRegistry::acquirePluginResource(...) (C++, Python).

    IPluginResource* acquirePluginResource(char const* key, IPluginResource* resource)

    acquire_plugin_resource(self: trt.IPluginRegistry, key: str, resource: trt.IPluginResource) -> trt.IPluginResource
    

    The first time acquirePluginResource is called against a particular key, TensorRT registers a clone of the provided plugin resource instead of the object passed as a resource. The registered object is obtained by invoking resource->clone(). Therefore, it is best practice only to initialize clones – in this case, the weight download can be done in IPluginResource::clone().

  3. After each plugin has finished using the weights, it can call IPluginRegistry::releasePluginResource() to signal that it no longer wishes to use them.

    int32_t releasePluginResource(char const* key)

    release_plugin_resource(self: trt.IPluginRegistry, key: str) -> None
    

    TensorRT performs reference counting on the acquirePluginResource and releasePluginResource calls made against a particular key and will call IPluginResource::release() if and when the reference count reaches zero. In this example, this functionality can be leveraged to free up the memory used by the weights when all plugins have finished using it.

  4. Finally, the SharedWeights class can be implemented as follows:

    class SharedWeights : public IPluginResource
    {
    public:
        SharedWeights(bool init = false)
        {
            if(init)
            {
                PLUGIN_CHECK(cudaMalloc((void**) &mWeights, ...));
            }
        }
    
        int32_t release() noexcept override
        {
            TRY
            {
                if (mWeights != nullptr)
                {
                    PLUGIN_CHECK(cudaFree(mWeights));
                    mWeights = nullptr;
                }
            }
            CATCH
            {
                return -1;
            }
            return 0;
        }
    
        IPluginResource* clone() noexcept override
        {
            TRY
            {
                auto cloned = std::make_unique<SharedWeights>(/* init */ true);
                //
                // Download the weights
                //
                // Copy to device memory
                PLUGIN_CHECK(cudaMemcpy(cloned->mWeights, ...));
                return cloned.release();
            }
            CATCH
            {
                return nullptr;
            }
        }
    
        ~SharedWeights() override
        {
            if(mWeights)
            {
                release();
            }
        }
    
        float* mWeights{nullptr};
    };
    

Say FooPlugin needs access to the weights. It can request the weights when it is being made ready for inference. This can be done in IPluginV3OneRuntime::onShapeChange, which will be called at least once for plugins about to be enqueue() during both the build and runtime phases.

int32_t onShapeChange(
    PluginTensorDesc const* in, int32_t nbInputs, PluginTensorDesc const* out, int32_t nbOutputs) noexcept override
{
    SharedWeights w{};
    mW = static_cast<SharedWeights*>(getPluginRegistry()->acquirePluginResource("W", &w))->mWeights;
    return 0;
}

The acquired weights (mW) can then be used in the subsequent enqueue(). To wrap up, the plugin can signal intent to release in its destructor (note that there is no separate release resource routine similar to IPluginV2DynamicExt::terminate() in IPluginV3).

~FooPlugin() override
{
    TRY
    {
        PLUGIN_CHECK(getPluginRegistry()->releasePluginResource("W"));
    }
    CATCH
    {
        // Error handling
    }
}

All plugins requiring access to the weights can use the same code above. The reference counting mechanism will ensure the weights’ availability and proper freeing.

Using Custom Layers When Importing a Model with a Parser#

The ONNX parser automatically attempts to import unrecognized nodes as plugins. If a plugin with the same op_type as the node is found in the plugin registry, the parser forwards the node’s attributes to the plugin creator as plugin field parameters to create the plugin. By default, the parser uses "1" as the plugin version and "" as the plugin namespace. This behavior can be overridden by setting a plugin_version and plugin_namespace string attribute in the corresponding ONNX node.

Sometimes, you may want to modify an ONNX graph before importing it into TensorRT, for example, to replace a set of ops with a plugin node. To accomplish this, you can use the ONNX GraphSurgeon utility. For details on how to use ONNX-GraphSurgeon to replace a subgraph, refer to this example.

For more examples, refer to the onnx_packnet sample.

Plugin API Description#

All new plugins should derive from both IPluginCreatorV3One and IPluginV3 classes. In addition, new plugins should also be registered in the plugin registry, either dynamically by using IPluginRegistry::registerCreator() or statically using the REGISTER_TENSORRT_PLUGIN(...) macro. Custom plugin libraries can also consider implementing an init function equivalent to initLibNvInferPlugins() to perform bulk registration.

Note

Automotive safety users must use the REGISTER_SAFE_TENSORRT_PLUGIN(...) macro instead of REGISTER_TENSORRT_PLUGIN(...).

IPluginV3 API Description#

The following section describes the functions of IPluginV3 and, by extension, IPluginV3OneCore, IPluginV3OneBuild or IPluginV3OneBuildV2, and IPluginV3OneRuntime.

Since an IPluginV3 object consists of different capabilities, IPluginV3::getCapabilityInterface may be called anytime during its lifetime. An IPluginV3 object added for the build phase must return a valid capability interface for all capability types: core, build, and runtime. The build capability may be omitted for objects added for the runtime phase.

There are a few methods used to request identifying information about the plugin. They may also be called during any stage of the plugin’s lifetime.

  • IPluginV3OneCore::getPluginName: Used to query for the plugin’s name

  • IPluginV3OneCore::getPluginVersion: Used to query for the plugin’s version

  • IPluginV3OneCore::getPluginNamespace: Used to query for the plugin’s namespace

  • IPluginV3OneBuild::getMetadataString: Used to query for a string representation of any metadata associated with the plugin, such as the values of its attributes.
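
As a minimal sketch, the core identity methods typically return fixed strings (the values shown are placeholders for the hypothetical PadPlugin from the earlier example):

AsciiChar const* PadPlugin::getPluginName() const noexcept override
{
    return "PadPlugin";
}

AsciiChar const* PadPlugin::getPluginVersion() const noexcept override
{
    return "1";
}

AsciiChar const* PadPlugin::getPluginNamespace() const noexcept override
{
    return "";
}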

To connect a plugin layer to neighboring layers and set up input and output data structures, the builder checks for the number of outputs and their shapes by calling the following plugin methods:

  • IPluginV3OneBuild::getNbOutputs: Used to specify the number of output tensors.

  • IPluginV3OneBuild::getOutputShapes: This function specifies the output shapes as a function of the input shapes or as constants. The exception is data-dependent output dimensions, which are specified in terms of an upper bound and an optimal value for auto-tuning.

  • IPluginV3OneBuild::supportsFormatCombination: Used to check if a plugin supports a given data type and format combination.

  • IPluginV3OneBuild::getOutputDataTypes: This function retrieves the data types of the output tensors. The returned data types must be in a format supported by the plugin.

If the IPluginV3OneBuildV2 build capability is used, the plugin can also communicate to TensorRT that certain input-output pairs are aliased (share the same data buffer). TensorRT will query IPluginV3OneBuildV2::getAliasedInput to determine any such aliasing behavior. To use this feature, PreviewFeature::kALIASED_PLUGIN_IO_10_03 must be enabled.
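
For instance, a hypothetical plugin that operates in place on its first input could alias output 0 to input 0 as follows (a sketch, assuming the IPluginV3OneBuildV2 capability is implemented):

int32_t MyInPlacePlugin::getAliasedInput(int32_t outputIndex) noexcept override
{
    // Output 0 shares its buffer with input 0; -1 indicates no aliasing.
    return outputIndex == 0 ? 0 : -1;
}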

Plugin layers can support the following data formats:

  • LINEAR single-precision (FP32), half-precision (FP16), brain floating-point (BF16), 8-bit floating-point E4M3 (FP8), integer (INT8), and integer (INT32) tensors

  • CHW32 single-precision (FP32) and integer (INT8) tensors

  • CHW2, HWC8, HWC16, and DHWC8 half-precision (FP16) tensors

  • CHW4 half-precision (FP16), and integer (INT8) tensors

  • HWC8, HWC4, NDHWC8, NC2HW brain floating-point (BF16) tensors

The available formats are enumerated in PluginFormat.

Plugins that do not compute all data in place and need memory space in addition to input and output tensors can specify the additional memory requirements with the IPluginV3OneBuild::getWorkspaceSize method, which the builder calls to determine and preallocate scratch space.
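
A sketch of such a workspace request sized from the largest possible input (a hypothetical MyPlugin with a single input, requesting one float of scratch per element):

size_t MyPlugin::getWorkspaceSize(DynamicPluginTensorDesc const* inputs, int32_t nbInputs,
    DynamicPluginTensorDesc const* outputs, int32_t nbOutputs) const noexcept override
{
    // Compute the volume of the input at its maximum (kMAX) dimensions.
    int64_t volume = 1;
    for (int32_t i = 0; i < inputs[0].max.nbDims; ++i)
    {
        volume *= inputs[0].max.d[i];
    }
    return static_cast<size_t>(volume) * sizeof(float);
}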

The layer is configured, executed, and destroyed at build time to discover optimal configurations. After selecting the optimal configuration for a plugin, the chosen tactic and concrete shape/format information (except for data-dependent dimensions) are communicated to the plugin during inference. It is executed as many times as needed for the lifetime of the inference application and finally destroyed when the engine is destroyed.

The builder and runtime control these steps using the following plugin methods. Methods also called during inference are indicated by an asterisk (*); all others are called only by the builder.

  • IPluginV3OneRuntime::attachToContext*: This function requests that a plugin clone be attached to an ExecutionContext, allowing the plugin to access any context-specific resources.

  • IPluginV3OneBuild::getTimingCacheID: This function queries for any timing cache ID that TensorRT may use. If provided, it enables timing caching (it is disabled by default).

  • IPluginV3OneBuild::getNbTactics: Used to query for the number of custom tactics the plugin chooses to use.

  • IPluginV3OneBuild::getValidTactics: This function queries for any custom tactics the plugin may use. The plugin will be profiled for each tactic up to a maximum indicated by IPluginV3OneBuild::getFormatCombinationLimit().

  • IPluginV3OneBuild::getFormatCombinationLimit: This function queries the maximum number of format combinations that may be timed for each tactic (for the default tactic if no custom tactics are advertised).

  • IPluginV3OneRuntime::setTactic*: Communicates the tactic to be used during the subsequent enqueue(). If no custom tactics were advertised, this would always be 0.

  • IPluginV3OneBuild::configurePlugin: Communicates the number of inputs and outputs and their shapes, data types, and formats. The min, opt, and max of each input or output’s DynamicPluginTensorDesc correspond to the kMIN, kOPT, and kMAX values of the optimization profile that the plugin is currently profiled for. The desc.dims field corresponds to the dimensions of plugin inputs specified at network creation. Wildcard dimensions may exist during this phase in the desc.dims field. At this point, the plugin may set up its internal state and select the most appropriate algorithm and data structures for the given configuration.

  • IPluginV3OneRuntime::onShapeChange*: Communicates the number of inputs and outputs and their shapes, data types, and formats. The dimensions are concrete, except if data-dependent dimensions exist, which wildcards will indicate.

  • IPluginV3OneRuntime::enqueue*: Encapsulates the actual algorithm and kernel calls of the plugin and provides pointers to input, output, and scratch space, as well as the CUDA stream to be used for kernel execution.

  • IPluginV3::clone: This is called every time a new builder, network, or engine is created that includes this plugin layer. It must return a new plugin object with the correct parameters.

After the builder completes profiling, before the engine is serialized, IPluginV3OneRuntime::getFieldsToSerialize is called to query for any plugin fields that must be serialized into the engine. These are expected to be data that the plugin needs to function properly during the inference stage once the engine has been deserialized.

IPluginCreatorV3One API Description#

The following methods in the IPluginCreatorV3One class are used to find and create the appropriate plugin from the plugin registry:

  • getPluginName: This returns the plugin name and should match the return value of IPluginV3OneCore::getPluginName.

  • getPluginVersion: Returns the plugin version. For all internal TensorRT plugins, this defaults to 1.

  • getPluginNamespace: Returns the plugin namespace. The default can be "".

  • getFieldNames: To successfully create a plugin, you must know all the plugin’s field parameters. This method returns the PluginFieldCollection struct with the PluginField entries populated to reflect the field name and PluginFieldType (the data should point to nullptr).

  • createPlugin: This method creates a plugin, passing a PluginFieldCollection and a TensorRTPhase argument.

During engine deserialization, TensorRT calls this method with the TensorRTPhase argument set to TensorRTPhase::kRUNTIME and the PluginFieldCollection populated with the same PluginFields as in the one returned by IPluginV3OneRuntime::getFieldsToSerialize(). In this case, TensorRT takes ownership of plugin objects returned by createPlugin.

You may also invoke createPlugin to produce plugin objects to add to a TensorRT network. In this case, setting the phase argument to TensorRTPhase::kBUILD is recommended. The data passed with the PluginFieldCollection should be allocated by the caller and freed by the caller before the program terminates. The ownership of the plugin object returned by createPlugin is passed to the caller, who must destroy it.
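
Putting these together, a minimal creator sketch for the hypothetical PadPlugin might look like the following (the single padValue attribute and the PadPlugin constructor are illustrative assumptions):

class PadPluginCreator : public IPluginCreatorV3One
{
public:
    PadPluginCreator()
    {
        // Advertise one FP32 attribute named "padValue"; data pointers remain null here.
        mFields.emplace_back(PluginField("padValue", nullptr, PluginFieldType::kFLOAT32, 1));
        mFC.nbFields = static_cast<int32_t>(mFields.size());
        mFC.fields = mFields.data();
    }

    AsciiChar const* getPluginName() const noexcept override { return "PadPlugin"; }
    AsciiChar const* getPluginVersion() const noexcept override { return "1"; }
    AsciiChar const* getPluginNamespace() const noexcept override { return ""; }
    PluginFieldCollection const* getFieldNames() noexcept override { return &mFC; }

    IPluginV3* createPlugin(AsciiChar const* name, PluginFieldCollection const* fc, TensorRTPhase phase) noexcept override
    {
        // Parse fc and construct the plugin; the same code path serves kBUILD and kRUNTIME.
        return new PadPlugin(/* attributes parsed from fc */);
    }

private:
    std::vector<PluginField> mFields;
    PluginFieldCollection mFC{};
};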

Migrating V2 Plugins to IPluginV3#

IPluginV2 and IPluginV2Ext have been deprecated since TensorRT 8.5, and IPluginV2IOExt and IPluginV2DynamicExt are deprecated in TensorRT 10.0. Therefore, new plugins should target IPluginV3, and old ones should be refactored.

Keep in mind the following key points when migrating an IPluginV2DynamicExt plugin to IPluginV3:

  • The plugin creator associated with the plugin must be migrated to IPluginCreatorV3One, the factory class for IPluginV3 (IPluginCreator is the factory class for IPluginV2 derivatives). This simply consists of migrating IPluginCreator::deserializePlugin. For more information, refer to the Plugin Serialization and Deserialization section.

  • There is no equivalent to IPluginV2::initialize(), IPluginV2::terminate(), and IPluginV2::destroy() in IPluginV3. For more information, refer to the Plugin Initialization and Termination section.

  • There is no equivalent to IPluginV2Ext::detachFromContext() in IPluginV3. For more information, refer to the Accessing Context-Specific Resources Provided by TensorRT section.

  • IPluginV3OneRuntime::attachToContext() is markedly different from IPluginV2Ext::attachToContext() regarding arguments and behavior. For more information, refer to the Accessing Context-Specific Resources Provided by TensorRT section.

  • In IPluginV3, plugin serialization is through a PluginFieldCollection that gets passed to TensorRT by IPluginV3OneRuntime::getFieldsToSerialize() and deserialization is through the same PluginFieldCollection that gets passed back by TensorRT to IPluginCreatorV3One::createPlugin(...). For more information, refer to the Plugin Serialization and Deserialization section.

  • The IPluginV3 equivalents of void return methods in IPluginV2DynamicExt will expect an integer status code as a return value (for example, configurePlugin).

  • supportsFormatCombination and getWorkspaceSize get dynamic tensor descriptors (DynamicPluginTensorDesc) instead of static descriptors (PluginTensorDesc).

  • IPluginV2DynamicExt::getOutputDimensions() becomes IPluginV3OneBuild::getOutputShapes() and changes to an output parameter signature instead of a return value. It also shifts from per-output index querying to one-shot querying. A similar transition applies from IPluginV2Ext::getOutputDataType to IPluginV3OneBuild::getOutputDataTypes.

Plugin Initialization and Termination#

IPluginV2 provided several APIs for plugin initialization and termination: namely, IPluginV2::initialize(), IPluginV2::terminate(), and IPluginV2::destroy(). In IPluginV3, plugins are expected to be constructed in an initialized state; if your V2 plugin had any lazy initialization in initialize, it can be deferred to onShapeChange or configurePlugin. Any resource release or termination logic in IPluginV2::terminate() or IPluginV2::destroy() can be moved to the class destructor. The exception is in the Python API; IPluginV3.destroy() is provided as an alternative for a C++-like destructor.

Accessing Context-Specific Resources Provided by TensorRT#

IPluginV2Ext::attachToContext() provided plugins access to context-specific resources, namely the GPU allocator and cuDNN and cuBLAS handles. IPluginV3OneRuntime::attachToContext() is meant to provide a similar service to plugins, but it instead provides an IPluginResourceContext, which in turn exposes resources that plugins may request.

In a departure from IPluginV2Ext::attachToContext(), cuDNN and cuBLAS handles are no longer provided by IPluginResourceContext; any plugins that depended on those should migrate to initialize their own cuDNN and cuBLAS resources. If sharing cuDNN/cuBLAS resources among plugins is preferred, you can utilize the functionality provided by IPluginResource and the plugin registry’s key-value store to accomplish this. For more information, refer to the Sharing Custom Resources Among Plugins section.

IPluginV3OneRuntime::attachToContext(...) is a clone-and-attach operation. It is asked to clone the entire IPluginV3 object—not just the runtime capability. Therefore, if implemented as a separate class, the runtime capability object may need to hold a reference to the IPluginV3 object of which it is a part.

Any context-specific resource obtained through IPluginResourceContext may be used until the plugin is destroyed. Therefore, any termination logic implemented in IPluginV2Ext::detachFromContext() may be moved to the plugin destructor.

Plugin Serialization and Deserialization#

For V2 plugins, serialization and deserialization were determined by the implementation of IPluginV2::serialize, IPluginV2::getSerializationSize, and IPluginCreator::deserializePlugin; IPluginV3OneRuntime::getFieldsToSerialize and IPluginCreatorV3One::createPlugin have replaced these. Note that the workflow has shifted from writing to/reading from a raw buffer to constructing and parsing a PluginFieldCollection.

TensorRT handles the serialization of types defined in PluginFieldType. Custom types can be serialized as PluginFieldType::kUNKNOWN. For example:

struct DummyStruct
{
    int32_t a;
    float b;
};

// The members referenced below are declared in the DummyPlugin class:
//   std::vector<nvinfer1::PluginField> mDataToSerialize;
//   nvinfer1::PluginFieldCollection mFCToSerialize;
//   int32_t mIntValue;
//   std::vector<float> mFloatVector;
//   DummyStruct mDummyStruct;
DummyPlugin::DummyPlugin()
{
    mDataToSerialize.clear();
    mDataToSerialize.emplace_back(PluginField("intScalar", &mIntValue, PluginFieldType::kINT32, 1));
    mDataToSerialize.emplace_back(PluginField("floatVector", mFloatVector.data(), PluginFieldType::kFLOAT32, mFloatVector.size()));
    mDataToSerialize.emplace_back(PluginField("dummyStruct", &mDummyStruct, PluginFieldType::kUNKNOWN, sizeof(DummyStruct)));
    mFCToSerialize.nbFields = mDataToSerialize.size();
    mFCToSerialize.fields = mDataToSerialize.data();
}

// Declared 'override' in the DummyPlugin class definition
nvinfer1::PluginFieldCollection const* DummyPlugin::getFieldsToSerialize() noexcept
{
    return &mFCToSerialize;
}
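On the deserialization side, IPluginCreatorV3One::createPlugin() parses the same fields back out of the PluginFieldCollection. The following is a minimal sketch; the creator class DummyPluginCreator and the setters setIntValue and setDummyStruct are hypothetical names.

// Sketch: rebuild the plugin from the serialized PluginFieldCollection.
// setIntValue and setDummyStruct are hypothetical members of DummyPlugin.
IPluginV3* DummyPluginCreator::createPlugin(
    AsciiChar const* name, PluginFieldCollection const* fc, TensorRTPhase phase) noexcept
{
    try
    {
        auto* plugin = new DummyPlugin();
        for (int32_t i = 0; i < fc->nbFields; ++i)
        {
            PluginField const& f = fc->fields[i];
            if (std::strcmp(f.name, "intScalar") == 0)
            {
                plugin->setIntValue(*static_cast<int32_t const*>(f.data));
            }
            else if (std::strcmp(f.name, "dummyStruct") == 0)
            {
                // kUNKNOWN fields are opaque bytes; length is the byte count
                DummyStruct s{};
                std::memcpy(&s, f.data, sizeof(DummyStruct));
                plugin->setDummyStruct(s);
            }
        }
        return plugin;
    }
    catch (...)
    {
        return nullptr; // never return a non-null object on failure
    }
}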

Migrating Older V2 Plugins to IPluginV3#

If migrating from IPluginV2 or IPluginV2Ext to IPluginV3, it is easier to migrate first to IPluginV2DynamicExt and then follow the guidelines above to migrate to IPluginV3. The new features in IPluginV2DynamicExt are as follows:

virtual DimsExprs getOutputDimensions(int outputIndex, const DimsExprs* inputs, int nbInputs, IExprBuilder& exprBuilder) = 0;

virtual bool supportsFormatCombination(int pos, const PluginTensorDesc* inOut, int nbInputs, int nbOutputs) = 0;

virtual void configurePlugin(const DynamicPluginTensorDesc* in, int nbInputs, const DynamicPluginTensorDesc* out, int nbOutputs) = 0;

virtual size_t getWorkspaceSize(const PluginTensorDesc* inputs, int nbInputs, const PluginTensorDesc* outputs, int nbOutputs) const = 0;

virtual int enqueue(const PluginTensorDesc* inputDesc, const PluginTensorDesc* outputDesc, const void* const* inputs, void* const* outputs, void* workspace, cudaStream_t stream) = 0;

Guidelines for migration to IPluginV2DynamicExt are:

  • getOutputDimensions implements the expression for output tensor dimensions given the inputs.

  • supportsFormatCombination checks if the plugin supports the format and datatype for the specified I/O.

  • configurePlugin mimics the behavior of the equivalent configurePlugin in IPluginV2Ext but accepts tensor descriptors.

  • getWorkspaceSize and enqueue mimic the behavior of the equivalent APIs in IPluginV2Ext but accept tensor descriptors.
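For instance, a minimal supportsFormatCombination at this intermediate IPluginV2DynamicExt stage might look like the following sketch, mirroring the signatures listed above and assuming the plugin supports only linear FP32 tensors; MyPlugin is a hypothetical class name.

// Sketch: accept only linear FP32 for every input and output position.
// MyPlugin is a hypothetical IPluginV2DynamicExt-derived class.
bool MyPlugin::supportsFormatCombination(
    int pos, PluginTensorDesc const* inOut, int nbInputs, int nbOutputs)
{
    PluginTensorDesc const& desc = inOut[pos];
    return desc.format == TensorFormat::kLINEAR && desc.type == DataType::kFLOAT;
}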

Coding Guidelines for Plugins#

Memory Allocation

Memory allocated in the plugin must be freed to ensure that the plugin does not leak memory. If resources are acquired in the plugin constructor or at a later stage, such as onShapeChange, they must be released, typically in the plugin class destructor.

Another option is to request any additional workspace memory required through getWorkspaceSize, which will be available during enqueue.
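For example, a plugin needing a temporary device buffer sized to its largest possible first input could request it through getWorkspaceSize rather than allocating in enqueue. The following minimal sketch assumes the IPluginV3OneBuild::getWorkspaceSize() signature with dynamic tensor descriptors and an FP32 input; MyPlugin is a hypothetical class name.

// Sketch: request enough workspace for the largest possible first input (FP32
// assumed). TensorRT hands this memory to the plugin via the workspace pointer
// in enqueue, so no device allocation is needed there.
size_t MyPlugin::getWorkspaceSize(DynamicPluginTensorDesc const* inputs, int32_t nbInputs,
    DynamicPluginTensorDesc const* outputs, int32_t nbOutputs) const noexcept
{
    int64_t count = 1;
    for (int32_t i = 0; i < inputs[0].max.nbDims; ++i)
    {
        count *= inputs[0].max.d[i];
    }
    return static_cast<size_t>(count) * sizeof(float);
}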

Add Checks to Ensure Proper Configuration and Validate Inputs

Common sources of unexpected plugin behavior are improper configuration (for example, invalid plugin attributes) and invalid inputs. As such, it is good practice to add checks/assertions during initial plugin development for cases where the plugin is not expected to work. The following are places where checks could be added:

  • createPlugin: Plugin attributes checks

  • configurePlugin or onShapeChange: Input dimension checks

  • enqueue: Input value checks
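For example, an input dimension check in onShapeChange might look like the following minimal sketch, assuming a plugin that expects a single 4-D input; MyPlugin is a hypothetical class name.

// Sketch: reject unexpected input ranks early instead of failing inside enqueue.
// MyPlugin is hypothetical; a single 4-D input is assumed.
int32_t MyPlugin::onShapeChange(
    PluginTensorDesc const* in, int32_t nbInputs, PluginTensorDesc const* out, int32_t nbOutputs) noexcept
{
    if (nbInputs != 1 || in[0].dims.nbDims != 4)
    {
        return -1; // non-zero status code signals a configuration error to TensorRT
    }
    return 0;
}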

Return Null at Errors for Methods That Create a New Plugin Object

Methods like createPlugin, clone, and attachToContext may be expected to create and return new plugin objects. In these methods, ensure that a null object (nullptr in C++) is returned in case of any error or failed check. This ensures that a non-null plugin object is never returned when the plugin is configured incorrectly.

Avoid Device Memory Allocations in clone()

Since the builder calls clone multiple times, device memory allocations in clone can be significantly expensive. One option is to perform persistent memory allocations in the constructor, copy the data to device memory once the plugin is ready to be used (for example, in configurePlugin), and release the memory during destruction.
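A minimal sketch of this pattern follows, assuming the plugin keeps its weights in a host-side vector; MyPlugin, mHostWeights, and mDeviceWeights are hypothetical members.

// Sketch: clone() stays cheap because the device copy happens only once the
// plugin is about to be used. mHostWeights is a std::vector<float> member and
// mDeviceWeights is a void* member, both hypothetical; mDeviceWeights is freed
// with cudaFree in the destructor.
int32_t MyPlugin::configurePlugin(DynamicPluginTensorDesc const* in, int32_t nbInputs,
    DynamicPluginTensorDesc const* out, int32_t nbOutputs) noexcept
{
    size_t const bytes = mHostWeights.size() * sizeof(float);
    if (mDeviceWeights == nullptr)
    {
        if (cudaMalloc(&mDeviceWeights, bytes) != cudaSuccess)
        {
            return -1;
        }
        if (cudaMemcpy(mDeviceWeights, mHostWeights.data(), bytes, cudaMemcpyHostToDevice) != cudaSuccess)
        {
            return -1;
        }
    }
    return 0;
}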

Serializing Arbitrary Pieces of Data and Custom Types

Plugin authors can utilize PluginField of PluginFieldType::kUNKNOWN to indicate arbitrary pieces of data to be serialized. In this case, the length of the respective PluginField should be the number of bytes corresponding to the buffer pointed to by data. The serialization of non-primitive types can be achieved in this way.

Plugin Shared Libraries#

TensorRT contains built-in plugins that can be loaded statically into your application.

You can explicitly register custom plugins with TensorRT using the REGISTER_TENSORRT_PLUGIN and registerCreator interfaces (refer to Adding Custom Layers). However, you may want TensorRT to manage the registration of a plugin library and, in particular, serialize plugin libraries with the plan file so they are automatically loaded when the engine is created. This can be especially useful when you want to include the plugins in a version-compatible engine so that you do not need to manage them after building the engine. To take advantage of this, you can build shared libraries with specific entry points recognized by TensorRT.

Generating Plugin Shared Libraries#

To create a shared library for plugins, the library must have the following public symbols defined:

extern "C" void setLoggerFinder(ILoggerFinder* finder);
extern "C" IPluginCreator* const* getCreators(int32_t& nbCreators) const;

extern "C" above is only used to prevent name mangling, and the methods should be implemented in C++. Consult your compiler’s ABI documentation for more details.

setLoggerFinder() should store a global pointer to the ILoggerFinder in the library for logging in the plugin code. getCreators() returns the list of plugin creators that your library contains. An example of these entry points can be found in plugin/common/vfcCommon.h/cpp.
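A minimal sketch of these entry points follows, assuming the library exposes a single creator; MyPluginCreator and gLoggerFinder are hypothetical names, and the creator list uses IPluginCreatorInterface so that both V2 and V3 creators can be returned.

// Sketch: library-level entry points recognized by TensorRT.
// MyPluginCreator and gLoggerFinder are hypothetical names.
static ILoggerFinder* gLoggerFinder{nullptr};
static MyPluginCreator gCreator{};
static IPluginCreatorInterface* gCreatorList[] = {&gCreator};

extern "C" void setLoggerFinder(ILoggerFinder* finder)
{
    gLoggerFinder = finder; // used by plugin code to obtain an ILogger
}

extern "C" IPluginCreatorInterface* const* getCreators(int32_t& nbCreators)
{
    nbCreators = 1;
    return gCreatorList;
}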

To serialize your plugin libraries with your engine plan, provide the plugin library paths to TensorRT using setPluginsToSerialize() in BuilderConfig.

You can also package plugins in the plan when building version-compatible engines. The packaged plugins will have the same lifetime as the engine and will be automatically registered/deregistered when running the engine.

Using Plugin Shared Libraries#

After building your shared libraries, you can configure the builder to serialize them with the engine. The next time you load the engine into TensorRT, the serialized plugin libraries will be loaded and registered automatically.

Note

IPluginRegistry::loadLibrary() (C++, Python) supports plugin shared libraries containing both V2 and V3 plugin creators through the getCreators() entry point. The getPluginCreators() entry point is still valid but deprecated. TensorRT first checks whether the getCreators() symbol is available and, if not, falls back to getPluginCreators() for backward compatibility. You can also call these entry points yourself to enumerate each plugin creator and register it manually using IPluginRegistry::registerCreator() (C++, Python).

Load the plugins for use with the builder before building the engine:

for (size_t i = 0; i < nbPluginLibs; ++i)
{
    builder->getPluginRegistry().loadLibrary(pluginLibs[i]);
}

for plugin_lib in plugin_libs:
    builder.get_plugin_registry().load_library(plugin_lib)

Next, decide whether the plugins should be included with the engine or shipped externally. You can serialize the plugins with the plan as follows:

IBuilderConfig *config = builder->createBuilderConfig();
...
config->setPluginsToSerialize(pluginLibs, nbPluginLibs);

config = builder.create_builder_config()
...
config.plugins_to_serialize = plugin_libs

Alternatively, you can keep the plugins external to the engine. You will need to ship these libraries along with the engine when it is deployed and load them explicitly in the runtime before deserializing the engine:

// In this example, getExternalPluginLibs() is a user-implemented method which retrieves the list of libraries to use with the engine
std::vector<std::string> pluginLibs = getExternalPluginLibs();
for (auto const& pluginLib : pluginLibs)
{
    runtime->getPluginRegistry().loadLibrary(pluginLib.c_str());
}

# In this example, get_external_plugin_libs() is a user-implemented method which retrieves the list of libraries to use with the engine
plugin_libs = get_external_plugin_libs()
for plugin_lib in plugin_libs:
    runtime.get_plugin_registry().load_library(plugin_lib)