Enabling Timing Caching and Using Custom Tactics#

IPluginV3 provides more control over the profiling of custom layers than was available with V2 plugins and earlier. One such feature is timing caching. If a TensorRT network contains multiple instances of the same plugin that are identically configured (for example, the same plugin attribute values) and handle identical input/output shapes and types, then it makes sense to time (measure the latency of) only one instance, cache that latency, and skip timing the remaining instances. This can yield large savings in engine build time.

Timing caching for IPluginV3 plugins is an opt-in feature; to opt-in, the plugin must advertise a non-null timing cache ID.

char const* FooPlugin::getTimingCacheID() noexcept override
{
    // Return nullptr to disable timing caching (default behavior);
    // return a non-null string to enable timing caching.
    return "FooPlugin_v1"; // example cache-ID suffix
}
class FooPlugin(trt.IPluginV3, trt.IPluginV3OneBuild, ...):
    def __init__(self):
        # Set to None to disable timing caching (default behavior);
        # set to a string value to enable timing caching.
        self.timing_cache_id = value

Note the following regarding the timing cache ID:

  • The user-provided timing cache ID should be considered a suffix to a larger cache ID; TensorRT automatically forms a prefix by considering the plugin’s input/output shape and format information. Usually, the user-provided timing cache ID could consist of plugin attributes and their values.

  • It must reflect the plugin’s creation state and not evolve after creation.
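As a sketch of the suffix idea, a plugin might derive its user-provided cache ID deterministically from its attribute values at creation time. The helper and attribute names below are illustrative, not part of the TensorRT API:

```python
def make_timing_cache_id(attrs: dict) -> str:
    """Build a deterministic cache-ID suffix from plugin attribute values.

    TensorRT forms its own prefix from input/output shape and format
    information, so this string only needs to capture the plugin's
    creation state.
    """
    # Sort keys so identically configured plugins produce the same ID.
    return ",".join(f"{k}={attrs[k]}" for k in sorted(attrs))

# Two identically configured plugin instances share one cache entry...
a = make_timing_cache_id({"pool_size": 3, "mode": "max"})
b = make_timing_cache_id({"mode": "max", "pool_size": 3})
assert a == b

# ...while a differently configured instance does not.
c = make_timing_cache_id({"pool_size": 5, "mode": "max"})
assert a != c
```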

For V2 plugins, TensorRT only times the plugin for the type/format combinations it claims to support. With IPluginV3, plugins can also ensure that custom tactics are timed, and TensorRT uses the fastest tactic. For example, a plugin may have a choice of two kernels to compute the output, and it may not be possible to predict which one would be fastest on a specific platform and for specific input/output shapes and formats. TensorRT can be asked to time the plugin for each tactic for each format combination, find the fastest such configuration, and use it during inference.
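The selection process can be pictured as an exhaustive search over the advertised combinations. The following is a simplified, hypothetical model of it, with made-up latencies standing in for real kernel timings:

```python
def pick_best(format_combinations, tactics, time_fn):
    """Time every (format combination, tactic) pair and return the fastest.

    time_fn stands in for running the plugin's enqueue() and measuring
    its latency, which is what TensorRT does during the engine build.
    """
    best = None
    for fmt in format_combinations:
        for tactic in tactics:
            latency = time_fn(fmt, tactic)
            if best is None or latency < best[0]:
                best = (latency, fmt, tactic)
    return best

# Hypothetical measured latencies (milliseconds).
timings = {("fp16", 1): 0.8, ("fp16", 2): 1.1,
           ("fp32", 1): 1.5, ("fp32", 2): 1.2}
best = pick_best(["fp16", "fp32"], [1, 2], lambda f, t: timings[(f, t)])
assert best == (0.8, "fp16", 1)
```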

Note

  • TensorRT can choose not to time the plugin if it only supports one type/format combination and either does not use custom tactics or only advertises one.

  • For IPluginV3OneBuild, TensorRT times a maximum of getFormatCombinationLimit() type/format combinations for each tactic; override this method to increase/decrease this limit depending on need.

To get started, advertise the custom tactics to TensorRT:

int32_t FooPlugin::getNbTactics() noexcept override
{
    return 2; // return 0 to disable custom tactics (default behavior)
}

int32_t FooPlugin::getValidTactics(int32_t* tactics, int32_t nbTactics) noexcept override
{
    tactics[0] = 1;
    tactics[1] = 2;
    return 0;
}
def get_valid_tactics(self):
    return [1, 2] # return an empty list to disable custom tactics (default behavior)

Any strictly positive integer could be used as a custom tactic value (TensorRT reserves 0 as the default tactic).

When the plugin is timed, configurePlugin() is guaranteed to be called with the current input/output format combination before getValidTactics() is called. Therefore, it is possible to advertise a different set of tactics per input/output format combination. For example, for a plugin that supports FP32 and FP16, the advertised tactics can be restricted to tactic 1 alone for FP16, while both tactics 1 and 2 are advertised for FP32.
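A sketch of this pattern follows. The class is an illustrative stand-in, not a real IPluginV3 implementation; only the get_valid_tactics and configure_plugin method names mirror the API, and the configured type is remembered from the configure step:

```python
class FooPluginSketch:
    """Illustrative stand-in for a plugin that advertises a tactic set
    depending on the input type it was configured with."""

    def __init__(self):
        self.configured_type = None

    def configure_plugin(self, input_type):
        # TensorRT guarantees the configure step runs with the current
        # format combination before the valid tactics are queried.
        self.configured_type = input_type

    def get_valid_tactics(self):
        if self.configured_type == "fp16":
            return [1]     # FP16: only tactic 1 is advertised
        return [1, 2]      # FP32: both tactics are candidates

plugin = FooPluginSketch()
plugin.configure_plugin("fp16")
assert plugin.get_valid_tactics() == [1]
plugin.configure_plugin("fp32")
assert plugin.get_valid_tactics() == [1, 2]
```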

During the engine build, when auto-tuning the plugin, TensorRT will communicate the tactic for the subsequent enqueue() by invoking IPluginV3OneRuntime::setTactic (C++, Python). When an engine is deserialized, TensorRT will invoke setTactic after the plugin has been created to communicate the best tactic chosen for the plugin. Even if custom tactics are not used, setTactic will be called with the default tactic value 0.

Sharing Custom Resources Among Plugins#

Starting in TensorRT 10.0, a key-value store is associated with the plugin registry. This store can hold user-implemented IPluginResource (C++, Python) objects against a string key, which makes it possible to share state or resources among different plugins. Note that this functionality is not tied to IPluginV3 (or even to plugin interfaces).

Let us explore an example.

Example: Sharing Weights Downloaded Over a Network Among Different Plugins#

Assume that several plugins need access to the same weights, W. Due to licensing restrictions, you might prefer that these weights be downloaded when the engine runs. However, due to W’s large size, it is also desirable that only one copy is downloaded, which is shared among all plugins needing access.

  1. Implement SharedWeights class, which implements IPluginResource.

  2. Each plugin that requires access to the weights requests an instance of initialized (downloaded) SharedWeights by calling IPluginRegistry::acquirePluginResource(...) (C++, Python).

    IPluginResource* acquirePluginResource(char const* key, IPluginResource* resource)
    
    acquire_plugin_resource(self: trt.IPluginRegistry, key: str, resource: trt.IPluginResource) -> trt.IPluginResource
    

    The first time acquirePluginResource is called against a particular key, TensorRT registers a clone of the provided plugin resource instead of the object passed as resource. The registered object is obtained by invoking resource->clone(). Therefore, it is best practice to initialize only the clones; in this case, the weight download can be done in IPluginResource::clone().

  3. After each plugin has finished using the weights, it can call IPluginRegistry::releasePluginResource() to signal that it no longer wishes to use them.

    int32_t releasePluginResource(char const* key)
    
    release_plugin_resource(self: trt.IPluginRegistry, key: str) -> None
    

    TensorRT performs reference counting on the acquirePluginResource and releasePluginResource calls made against a particular key and will call IPluginResource::release() if and when the reference count reaches zero. In this example, this functionality can be leveraged to free up the memory used by the weights when all plugins have finished using it.

  4. Finally, the SharedWeights class can be implemented as follows:

    class SharedWeights : public IPluginResource
    {
    public:
        SharedWeights(bool init = false)
        {
            if (init)
            {
                PLUGIN_CHECK(cudaMalloc((void**) &mWeights, ...));
            }
        }

        int32_t release() noexcept override
        {
            try
            {
                if (mWeights != nullptr)
                {
                    PLUGIN_CHECK(cudaFree(mWeights));
                    mWeights = nullptr;
                }
            }
            catch (...)
            {
                return -1;
            }
            return 0;
        }

        IPluginResource* clone() noexcept override
        {
            try
            {
                auto cloned = std::make_unique<SharedWeights>(/* init */ true);
                //
                // Download the weights
                //
                // Copy to device memory
                PLUGIN_CHECK(cudaMemcpy(cloned->mWeights, ...));
                return cloned.release();
            }
            catch (...)
            {
                return nullptr;
            }
        }

        ~SharedWeights() override
        {
            release();
        }

        float* mWeights{nullptr};
    };
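The registry's bookkeeping can be modeled as a reference-counted key-value store. The following Python sketch mirrors the acquire/release contract (register a clone on first acquire, release when the count hits zero); the classes here are illustrative, not the TensorRT implementation:

```python
class ResourceStore:
    """Models the plugin registry's acquire/release semantics: the first
    acquire registers a clone() of the provided resource; release()
    fires on the resource when the reference count drops to zero."""

    def __init__(self):
        self._entries = {}  # key -> [resource, refcount]

    def acquire(self, key, resource):
        if key not in self._entries:
            # Only clones get initialized (e.g., weights downloaded here).
            self._entries[key] = [resource.clone(), 0]
        entry = self._entries[key]
        entry[1] += 1
        return entry[0]

    def release(self, key):
        entry = self._entries[key]
        entry[1] -= 1
        if entry[1] == 0:
            entry[0].release()
            del self._entries[key]

class SharedWeights:
    def __init__(self, initialized=False):
        self.initialized = initialized
    def clone(self):
        return SharedWeights(initialized=True)  # download happens here
    def release(self):
        self.initialized = False                # free the memory here

store = ResourceStore()
w = SharedWeights()
r1 = store.acquire("W", w)   # registers and initializes a clone of w
r2 = store.acquire("W", w)   # a second plugin sees the same object
assert r1 is r2 and r1 is not w and r1.initialized
store.release("W")
store.release("W")           # refcount hits zero: resource released
assert not r1.initialized
```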
    

Say FooPlugin needs access to the weights. It can request the weights when it is being made ready for inference. This can be done in IPluginV3OneRuntime::onShapeChange, which is called at least once, during both the build and runtime phases, for any plugin whose enqueue() is about to be called.

int32_t onShapeChange(
    PluginTensorDesc const* in, int32_t nbInputs, PluginTensorDesc const* out, int32_t nbOutputs) noexcept override
{
    SharedWeights w{};
    mW = static_cast<SharedWeights*>(getPluginRegistry()->acquirePluginResource("W", &w))->mWeights;
    return 0;
}

The acquired weights (mW) can then be used in the subsequent enqueue(). To wrap up, the plugin can signal intent to release in its destructor (note that there is no separate release resource routine similar to IPluginV2DynamicExt::terminate() in IPluginV3).

~FooPlugin() override
{
    try
    {
        PLUGIN_CHECK(getPluginRegistry()->releasePluginResource("W"));
    }
    catch (...)
    {
        // Error handling
    }
}

All plugins requiring access to the weights can use the same code above. The reference counting mechanism will ensure the weights’ availability and proper freeing.

Using Custom Layers When Importing a Model with a Parser#

The ONNX parser automatically attempts to import unrecognized nodes as plugins. If a plugin with the same op_type as the node is found in the plugin registry, the parser forwards the node’s attributes to the plugin creator as plugin field parameters to create the plugin. By default, the parser uses "1" as the plugin version and "" (the empty string) as the plugin namespace. This behavior can be overridden by setting plugin_version and/or plugin_namespace string attributes in the corresponding ONNX node.

Added in version 10.15.1: Improved handling for TensorRT plugins that share names with standard ONNX operators.

When a TensorRT plugin shares a name with a standard ONNX operator, the ONNX parser now provides better control over operator resolution through two key enhancements:

1. Attribute-Based Prioritization

The ONNX parser determines which implementation to use based on the presence of a plugin_namespace attribute on the graph node:

  • If the attribute is present: The parser prioritizes matching the node to a plugin.

  • If the attribute is absent: The parser prioritizes matching to a standard ONNX operator or function.

This behavior ensures that you can explicitly specify when a plugin should be used by setting the plugin_namespace attribute.

2. Plugin Override Flag

A new parser flag, kENABLE_PLUGIN_OVERRIDE, provides direct control over plugin precedence:

  • C++ API: OnnxParserFlag::kENABLE_PLUGIN_OVERRIDE

  • Python API: OnnxParserFlag.ENABLE_PLUGIN_OVERRIDE

By default, this flag is OFF to prevent unintended overrides of standard ONNX operators. When enabled, the parser behavior changes:

  • The parser directly matches the node type to any loaded plugin name, giving plugins precedence.

  • The plugin_namespace attribute is no longer required for plugin matching.

  • The parser only falls back to a standard ONNX operator or function if no matching plugin is found.
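Taken together, the resolution order can be summarized in a small decision function. This is a sketch of the documented precedence rules, not actual parser code:

```python
def resolve_node(op_type, node_attrs, plugin_names, onnx_ops,
                 enable_plugin_override=False):
    """Decide whether a node maps to a plugin or a standard ONNX
    operator, following the documented precedence rules."""
    is_plugin = op_type in plugin_names
    is_onnx = op_type in onnx_ops
    if enable_plugin_override:
        # Flag on: plugins take precedence; ONNX is only a fallback.
        return "plugin" if is_plugin else ("onnx" if is_onnx else None)
    if "plugin_namespace" in node_attrs:
        # Attribute present: prioritize matching the node to a plugin.
        return "plugin" if is_plugin else ("onnx" if is_onnx else None)
    # Default: prefer the standard ONNX operator or function.
    return "onnx" if is_onnx else ("plugin" if is_plugin else None)

# A hypothetical plugin named "TopK" clashes with the ONNX TopK operator.
plugins, ops = {"TopK"}, {"TopK", "Relu"}
assert resolve_node("TopK", {}, plugins, ops) == "onnx"
assert resolve_node("TopK", {"plugin_namespace": ""}, plugins, ops) == "plugin"
assert resolve_node("TopK", {}, plugins, ops,
                    enable_plugin_override=True) == "plugin"
```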

Note

Use the kENABLE_PLUGIN_OVERRIDE flag with caution. When enabled, plugins take precedence over standard ONNX operators, which can lead to unexpected behavior if plugins are inadvertently loaded with names that conflict with ONNX operators.

Sometimes, you may want to modify an ONNX graph before importing it into TensorRT, for example, to replace a set of ops with a plugin node. To accomplish this, you can use the ONNX GraphSurgeon utility. For details on how to use ONNX-GraphSurgeon to replace a subgraph, refer to this example.

For more examples, refer to the onnx_packnet sample.