Working with Dynamic Shapes#

TensorRT-RTX supports networks with dynamic shapes, that is, the ability to defer specifying some or all tensor dimensions until runtime. Dynamic shapes can be used through both the C++ and Python interfaces.

Overview#

To build a TensorRT-RTX engine with dynamic shapes, you must:

  1. Specify each runtime dimension of an input tensor by using -1 as a placeholder for the dimension.

  2. Specify one or more optimization profiles at build time that specify the permitted range of dimensions for inputs with runtime dimensions. For more information, refer to the Optimization Profiles section.

  3. To use the engine:

    1. Create an execution context from the engine, the same as without dynamic shapes.

    2. Specify one of the optimization profiles from step 2 that covers the input dimensions.

    3. Specify the input dimensions for the execution context. After setting input dimensions, you can get the output dimensions that TensorRT computes for the given input dimensions.

    4. Enqueue work.

To change the runtime dimensions, repeat steps 3b and 3c; these steps do not need to be repeated until the input dimensions change.
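Putting the steps together, here is a minimal, hedged C++ sketch of the runtime flow. The stream, the device buffers, and the output name "bar" are assumptions for illustration; the input "foo" and its shape follow the examples in the next sections:

// Step 3a: create an execution context from a deserialized engine.
nvinfer1::IExecutionContext* context = engine->createExecutionContext();

// Step 3b: select an optimization profile that covers the input dimensions.
context->setOptimizationProfileAsync(0, stream);

// Step 3c: specify the actual input dimensions for this context.
context->setInputShape("foo", nvinfer1::Dims{3, {3, 150, 250}});

// Step 3d: bind tensor addresses and enqueue work.
context->setTensorAddress("foo", inputDeviceBuffer);
context->setTensorAddress("bar", outputDeviceBuffer);
context->enqueueV3(stream);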

Specifying Runtime Dimensions#

When building a network, use -1 to denote a runtime dimension for an input tensor. For example, to create a 3D input tensor named foo where the last two dimensions are specified at runtime, and the first dimension is fixed at build time, issue the following.

networkDefinition.addInput("foo", DataType::kFLOAT, Dims3(3, -1, -1))
network_definition.add_input("foo", trt.float32, (3, -1, -1))

After choosing an optimization profile, you must set the input dimensions at run time (refer to Optimization Profiles). Let the input have dimensions [3,150,250]. After setting an optimization profile for the previous example, you would call:

context.setInputShape("foo", Dims{3, {3, 150, 250}})
context.set_input_shape("foo", (3, 150, 250))

At runtime, asking the engine for binding dimensions returns the same dimensions used to build the network, meaning you get a -1 for each runtime dimension. For example:

engine.getTensorShape("foo")

Returns a Dims with dimensions {3, -1, -1}.

engine.get_tensor_shape("foo")

Returns (3, -1, -1).

To get the actual dimensions, which are specific to each execution context, query the execution context:

context.getTensorShape("foo")

Returns a Dims with dimensions {3, 150, 250}.

context.get_tensor_shape("foo")

Returns (3, 150, 250).

The return value of setInputShape indicates consistency only with respect to the optimization profile set for that input. After all input binding dimensions are specified, you can check whether the entire network is consistent with the dynamic input shapes by querying the dimensions of the output bindings of the network. Here is an example that retrieves the dimensions of an output named bar:

nvinfer1::Dims outDims = context->getTensorShape("bar");

if (outDims.nbDims == -1) {
    gLogError << "Invalid network output, this might be caused by inconsistent input shapes." << std::endl;
    // abort inference
}

If a dimension k is data-dependent, then outDims.d[k] will be -1. For more information on such outputs, refer to the Dynamically Shaped Output section.

Named Dimensions#

Both constant and runtime dimensions can be named. Naming dimensions provides two benefits:

  1. For runtime dimensions, error messages use the dimension’s name. For example, if an input tensor foo has dimensions [n,10,m], it is more helpful to get an error message about m instead of (#2 (SHAPE foo)).

  2. Dimensions with the same name are implicitly equal, which can help the optimizer generate a more efficient engine and diagnose mismatched dimensions at runtime. For example, suppose two inputs have dimensions [n,10,m] and [n,13]. In that case, the optimizer knows the lead dimensions are always equal, and accidentally using the engine with mismatched values for n will be reported as an error.

You can use the same name for constant and runtime dimensions as long as they are always equal.

The following syntax examples set the name of the third dimension of the tensor to m.

tensor.setDimensionName(2, "m")
tensor.set_dimension_name(2, "m")

There are corresponding methods to get a dimension's name:

tensor.getDimensionName(2) // returns the name of the third dimension of the tensor, or nullptr if it does not have a name.
tensor.get_dimension_name(2) # returns the name of the third dimension of the tensor, or None if it does not have a name.
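For example, here is a brief sketch (with illustrative tensor names and shapes) that gives the leading dimension of two inputs the same name n, so the builder knows they are always equal:

auto* input0 = networkDefinition.addInput("foo", DataType::kFLOAT, Dims3(-1, 10, -1));
auto* input1 = networkDefinition.addInput("bar", DataType::kFLOAT, Dims2(-1, 13));
input0->setDimensionName(0, "n");  // leading dimension of foo is named n
input1->setDimensionName(0, "n");  // same name, so it is implicitly equal to foo's leading dimension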

When the input network is imported from an ONNX file, the ONNX parser automatically sets the dimension names using the names in the ONNX file. Therefore, if two dynamic dimensions are expected to be equal at runtime, specify the same name for these dimensions when exporting the ONNX file.

Optimization Profiles#

An optimization profile describes a range of dimensions for each network input. You must create at least one optimization profile at build time when using runtime dimensions. Two profiles can specify disjoint or overlapping ranges.

For example, one profile might specify a minimum size of [3,100,200], a maximum size of [3,200,300], and optimization dimensions of [3,150,250], while another profile might specify min, max, and optimization dimensions of [3,200,100], [3,300,400], and [3,250,250].

Note

The memory usage for different profiles can change dramatically based on the dimensions specified by the min, max, and opt parameters. Some operations have kernels that only work for MIN=OPT=MAX, so when these values differ, the kernel is disabled.

To create an optimization profile, first construct an IOptimizationProfile. Then set the min, optimization, and max dimensions, and add the profile to the builder configuration. The shapes defined by the optimization profile must be valid input shapes for the network. Here are the calls for the first profile mentioned previously for an input foo:

IOptimizationProfile* profile = builder.createOptimizationProfile();
profile->setDimensions("foo", OptProfileSelector::kMIN, Dims3(3, 100, 200));
profile->setDimensions("foo", OptProfileSelector::kOPT, Dims3(3, 150, 250));
profile->setDimensions("foo", OptProfileSelector::kMAX, Dims3(3, 200, 300));

config->addOptimizationProfile(profile);
profile = builder.create_optimization_profile()
profile.set_shape("foo", (3, 100, 200), (3, 150, 250), (3, 200, 300))
config.add_optimization_profile(profile)

At runtime, you must set an optimization profile before setting input dimensions. Profiles are numbered in the order they were added, starting at 0. Note that each execution context must use a separate optimization profile.

To choose the first optimization profile in the example, use:

context.setOptimizationProfileAsync(0, stream)
context.set_optimization_profile_async(0, stream)

The provided stream argument should be the same CUDA stream that will be used for the subsequent enqueue(), enqueueV2(), or enqueueV3() invocation in this context. This ensures that the context executions happen after the optimization profile setup.

If the associated TensorRT-RTX engine has dynamic inputs, the optimization profile must be set at least once with a unique profile index that is not used by other execution contexts that have not been destroyed. For the first execution context created for an engine, profile 0 is implicitly chosen.

setOptimizationProfileAsync() can be called to switch between profiles. It must be called after any enqueue(), enqueueV2(), or enqueueV3() operations finish in the current context. When multiple execution contexts run concurrently, it can switch to a formerly used profile already released by another execution context with different dynamic input dimensions.

The setOptimizationProfileAsync() function replaces the now-deprecated setOptimizationProfile() API. Using setOptimizationProfile() to switch between optimization profiles can cause GPU memory copy operations in the subsequent enqueue() or enqueueV2() operations. To avoid these copies during enqueue, use the setOptimizationProfileAsync() API instead.
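As a hedged illustration, assuming an engine built with the two profiles from the earlier example and two CUDA streams, each concurrently running execution context selects its own profile:

IExecutionContext* context0 = engine->createExecutionContext();
IExecutionContext* context1 = engine->createExecutionContext();

context0->setOptimizationProfileAsync(0, stream0);  // profile 0 for context0
context1->setOptimizationProfileAsync(1, stream1);  // profile 1 for context1

context0->setInputShape("foo", Dims{3, {3, 150, 250}});  // within profile 0's range
context1->setInputShape("foo", Dims{3, {3, 250, 250}});  // within profile 1's range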

Dynamically Shaped Output#

If the output of a network has a dynamic shape, several strategies are available to allocate the output memory.

If the dimensions of the output are computable from the dimensions of inputs, use IExecutionContext::getTensorShape() to get the dimensions of the output after providing the dimensions of the input tensors and input shape tensors. Use the IExecutionContext::inferShapes() method to check if you forgot to supply the necessary information.
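Here is a hedged sketch of this check, assuming the context and the output bar from the earlier example (refer to the API reference for the exact return-value convention of inferShapes()):

// After all known input shapes are set, ask which tensors still need information.
char const* missing[16];
int32_t const nbMissing = context->inferShapes(16, missing);
if (nbMissing > 0)
{
    int32_t const n = nbMissing < 16 ? nbMissing : 16;
    for (int32_t i = 0; i < n; ++i)
    {
        gLogError << "Missing shape or values for tensor " << missing[i] << std::endl;
    }
}
else if (nbMissing == 0)
{
    // All shapes are known; output dimensions can now be queried.
    nvinfer1::Dims outDims = context->getTensorShape("bar");
}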

Otherwise, if the dimensions of the output are not computable in advance of calling enqueueV3(), associate an IOutputAllocator with the output. More specifically:

  1. Derive your allocator class from IOutputAllocator.

  2. Override the reallocateOutput and notifyShape methods. TensorRT calls the first when it needs to allocate the output memory and the second when it knows the output dimensions.

    Here is an example derived class:

    class MyOutputAllocator : public nvinfer1::IOutputAllocator
    {
    public:
        void* reallocateOutput(
            char const* tensorName, void* currentMemory,
            uint64_t size, uint64_t alignment) noexcept override
        {
            // Allocate the output. Remember it for later use.
            outputPtr = /* depends on strategy, as discussed later …*/
            return outputPtr;
        }

        void notifyShape(char const* tensorName, Dims const& dims) noexcept override
        {
            // Remember the output dimensions for later use.
            outputDims = dims;
        }

        // Saved dimensions of the output
        Dims outputDims{};

        // nullptr if memory could not be allocated
        void* outputPtr{nullptr};
    };
    

    Here’s an example of how it might be used:

    std::unordered_map<std::string, MyOutputAllocator> allocatorMap;
    
    for (const char* name : /* names of outputs */)
    {
        Dims extent = context->getTensorShape(name);
        void* ptr = nullptr;  // remains nullptr when an output allocator is used
        if (engine->getTensorLocation(name) == TensorLocation::kDEVICE)
        {
            if (/* extent.d contains -1 */)
            {
                auto allocator = std::make_unique<MyOutputAllocator>();
                context->setOutputAllocator(name, allocator.get());
                allocatorMap.emplace(name, std::move(allocator));
            }
            else
            {
                ptr = /* allocate device memory per extent and format */
            }
        }
        else
        {
            ptr = /* allocate cpu memory per extent and format */
        }
        context->setTensorAddress(name, ptr);
    }
    

Several strategies can be used for implementing reallocateOutput:

A

Defer allocation until the size is known. Do not call IExecutionContext::setTensorAddress, or call it with a nullptr for the tensor address.

B

Preallocate enough memory based on what IExecutionContext::getMaxOutputSize reports as an upper bound (see the sketch after this list). This guarantees that the engine will not fail due to insufficient output memory, but the upper bound may be so high that it is useless.

C

If you have preallocated enough memory based on experience, use IExecutionContext::setTensorAddress to tell TensorRT about it. If the tensor does not fit, make reallocateOutput return nullptr, which will cause the engine to fail gracefully.

D

Preallocate memory as in C, but have reallocateOutput return a pointer to a bigger buffer if the preallocated buffer is too small. This grows the output buffer only as needed.

E

Defer allocation until the size is known, like A. Then, attempt to recycle that allocation in subsequent calls until a bigger buffer is requested, and then increase it like in D.
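For strategy B, here is a minimal hedged sketch, assuming a device output tensor named bar and omitting CUDA error handling:

// Strategy B: preallocate using the upper bound reported by the engine.
int64_t const maxSize = context->getMaxOutputSize("bar");
void* outputPtr{nullptr};
if (maxSize > 0 && cudaMalloc(&outputPtr, static_cast<size_t>(maxSize)) == cudaSuccess)
{
    context->setTensorAddress("bar", outputPtr);
}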

Here’s an example derived class that implements E:

class FancyOutputAllocator : public nvinfer1::IOutputAllocator
{
public:
    void* reallocateOutput(
        char const* tensorName, void* currentMemory,
        uint64_t size, uint64_t alignment) noexcept override
    {
        if (size > outputSize)
        {
            // Need to reallocate
            cudaFree(outputPtr);
            outputPtr = nullptr;
            outputSize = 0;
            if (cudaMalloc(&outputPtr, size) == cudaSuccess)
            {
                outputSize = size;
            }
        }
        // If the cudaMalloc fails, outputPtr=nullptr, and the engine
        // gracefully fails.
        return outputPtr;
    }

    void notifyShape(char const* tensorName, Dims const& dims) noexcept override
    {
        // Remember the output dimensions for later use.
        outputDims = dims;
    }

    // Saved dimensions of the output tensor
    Dims outputDims{};

    // nullptr if memory could not be allocated
    void* outputPtr{nullptr};

    // Size of the allocation pointed to by outputPtr
    uint64_t outputSize{0};

    ~FancyOutputAllocator() override
    {
        cudaFree(outputPtr);
    }
};

TensorRT internally allocates memory asynchronously from the device’s current memory pool for networks with data-dependent shapes. If the current device memory pool does not have a release threshold set, performance can degrade between runs because the memory is returned to the operating system upon stream synchronization. In these cases, it is recommended that you either provide the TensorRT runtime with a custom IGpuAllocator that uses a custom memory pool or experiment with setting the release threshold. More information about setting the release threshold can be found in Retaining Memory in the Pool and the Code Migration Guide.
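As a hedged sketch, the release threshold of the default device memory pool can be raised through the CUDA runtime API (the 64 MiB threshold is illustrative):

// Keep up to 64 MiB of freed memory in the pool instead of returning it to
// the OS at stream synchronization.
cudaMemPool_t pool{};
cudaDeviceGetDefaultMemPool(&pool, /*device=*/0);
uint64_t threshold = 64ull * 1024 * 1024;
cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);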

Restrictions For Dynamic Shapes#

The following layer restrictions arise because the layer’s weights have a fixed size:

  • IConvolutionLayer and IDeconvolutionLayer require that the channel dimension be a build time constant.

  • Int8 requires that the channel dimension be a build time constant.

  • Layers accepting additional shape inputs (IResizeLayer, IShuffleLayer, ISliceLayer) require that the additional shape inputs be compatible with the dimensions of the minimum and maximum optimization profiles as well as with the dimensions of the runtime data input; otherwise, it can lead to either a build time or runtime error.

Not all required build-time constants need to be set manually. TensorRT will infer shapes through the network layers, and only those that cannot be inferred to be build-time constants must be set manually.

For more information regarding layers, refer to the TensorRT-RTX Operator documentation.

Execution Tensors vs Shape Tensors#

TensorRT 8.5 largely erased the distinctions between execution tensors and shape tensors. However, when designing a network or analyzing performance, it can be helpful to understand the internals and where internal synchronization occurs.

Engines using dynamic shapes employ a ping-pong execution strategy:

  1. Compute the shapes of tensors on the CPU until a shape requiring GPU results is reached.

  2. Stream work to the GPU until you run out of work or reach an unknown shape. If the latter, synchronize and return to step 1.

An execution tensor is a traditional TensorRT tensor, while a shape tensor is related to shape calculations. A shape tensor must be of type Int32, Int64, Float, or Bool, its shape must be determinable at build time, and it must have no more than 64 elements. Refer to the Shape Tensor I/O (Advanced) section for additional restrictions on shape tensors at network I/O boundaries. For example, there is an IShapeLayer whose output is a 1D tensor containing the dimensions of the input tensor, making the output a shape tensor. IShuffleLayer accepts an optional second input that can specify reshaping dimensions, which must be a shape tensor.
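As a brief sketch (the tensor data is an assumption), wiring the output of IShapeLayer into the second input of IShuffleLayer makes that tensor a shape tensor:

// The 1D output of IShapeLayer holds the dimensions of `data`, so it is a shape tensor.
ITensor* shape = networkDefinition.addShape(*data)->getOutput(0);

// Used as the reshape-dimensions input of IShuffleLayer, it specifies the new shape
// (here a no-op reshape of `data` to its own shape, purely for illustration).
IShuffleLayer* shuffle = networkDefinition.addShuffle(*data);
shuffle->setInput(1, *shape);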

When TensorRT needs a shape tensor, but the tensor has been classified as an execution tensor, the runtime copies the tensor from the GPU to the CPU, incurring synchronization overhead.

Some layers are polymorphic regarding the kinds of tensors they handle. For example, IElementWiseLayer can sum two INT32 execution tensors or two INT32 shape tensors. The type of tensor depends on its ultimate use. If the sum is used to reshape another tensor, it is a shape tensor.

Shape Tensor I/O (Advanced)#

Sometimes, you need to use a shape tensor as a network I/O tensor. For example, consider a network consisting solely of an IShuffleLayer. TensorRT infers that the second input is a shape tensor, and ITensor::isShapeTensor returns true for it. Because it is an input shape tensor, TensorRT requires two things:

  1. At build time: the optimization profile values of the shape tensor.

  2. At run time: the values of the shape tensor.

The shape of an input shape tensor is always known at build time. The values must be described since they can be used to specify the dimensions of execution tensors.

The optimization profile values can be set using IOptimizationProfile::setShapeValues. Similar to providing min, max, and optimization dimensions for execution tensors with runtime dimensions, you must provide min, max, and optimization values for shape tensors at build time.

The corresponding runtime method is IExecutionContext::setTensorAddress, which informs TensorRT where to find the shape tensor values.

Since the inference of execution tensor versus shape tensor is based on ultimate use, TensorRT cannot infer whether a network output is a shape tensor. You must explicitly indicate this using the method INetworkDefinition::markOutputForShapes.
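Here is a hedged sketch of these calls, assuming an input shape tensor named newShape with two elements and an output tensor barTensor that should be treated as a shape tensor:

// Build time: provide min, opt, and max *values* (not dimensions) for the input shape tensor.
int32_t const minValues[] = {1, 1};
int32_t const optValues[] = {3, 150};
int32_t const maxValues[] = {3, 250};
profile->setShapeValues("newShape", OptProfileSelector::kMIN, minValues, 2);
profile->setShapeValues("newShape", OptProfileSelector::kOPT, optValues, 2);
profile->setShapeValues("newShape", OptProfileSelector::kMAX, maxValues, 2);

// Build time: explicitly mark an output as a shape tensor.
networkDefinition.markOutputForShapes(*barTensor);

// Run time: point TensorRT at the values of the input shape tensor (host memory).
int32_t newShapeValues[] = {3, 200};
context->setTensorAddress("newShape", newShapeValues);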

This feature is useful for debugging shape information and for composing engines. For example, when building three engines for sub-networks A, B, and C, where connections between A to B or B to C might involve a shape tensor, you should build the networks in reverse order: C, B, and then A. After constructing network C, use ITensor::isShapeTensor to determine if an input is a shape tensor and use INetworkDefinition::markOutputForShapes to mark the corresponding output tensor in network B. Then check which inputs of B are shape tensors and mark the corresponding output tensor in network A.

Shape tensors at network boundaries must have the type Int32 or Int64; they cannot be of type Float or Bool. A workaround for Bool is to use Int32 for the I/O tensor, with zeros and ones, and convert to/from Bool using IIdentityLayer.

At runtime, whether a tensor is an I/O shape tensor can be determined via ICudaEngine::isShapeInferenceIO().

Static vs Dynamic Shapes from a JIT Perspective#

In this section and the following ones, we focus on the enhancements of TensorRT-RTX over standard TensorRT.

A major difference between TensorRT-RTX and standard TensorRT is that kernels are selected for optimal performance based on layer/tensor information and compiled in a separate JIT phase. For networks without dynamic shapes, since all layers and shapes are known in advance, this compilation occurs once during the creation of an execution context, allowing multiple inferences to run without triggering re-compilations. However, for dynamic shape networks, kernel selection cannot occur at context creation time because shapes are not known in advance. Instead, kernel selection must wait until inference time when input shapes are known. If the input tensor shape changes, new kernels may need to be selected and compiled, leading to longer inference latencies due to kernel recompilation, which can be a significant bottleneck (sometimes up to 10-100 times the inference latency). A naive JIT implementation would struggle to handle dynamic inference workloads efficiently.

TensorRT-RTX reduces or even eliminates these overheads with its smart caching and selection of the kernels needed to run the network (refer to the Advanced section). TensorRT-RTX compiles two types of kernels: fallback kernels and shape-specialized kernels. Fallback kernels for any layer are compiled once at context creation time; they are guaranteed to be functional for any shape but may not achieve peak inference performance. Shape-specialized kernels can only be compiled at inference time, once the runtime shape is known, and offer the best performance for that layer and shape combination.

By default, TensorRT-RTX combines these two types of kernels to balance compilation costs against inference performance for new shapes, swapping in shape-specialized kernels once they are compiled and available. To control CPU-side compilation costs, you can compile shape-specialized kernels in the background or avoid compiling them altogether. This behavior is controlled by setting the kernel specialization strategy through the TensorRT-RTX API.

Additionally, the kernels used by TensorRT-RTX can be cached to disk and loaded in another inference session via the runtime cache. Refer to the Working with Runtime Cache section for more information and API usage.

Setting the Kernel Specialization Strategy#

The strategy for compiling shape-specialized kernels is set through the API during TensorRT-RTX’s context creation stage.

  1. Create the runtimeConfig object from the engine.

    IRuntimeConfig* runtimeConfig = engine->createRuntimeConfig();
    
    runtime_config = engine.create_runtime_config()
    
  2. Set the strategy for compiling shape-specialized kernels. The strategies are found in the DynamicShapesKernelSpecializationStrategy enum.

    runtimeConfig->setDynamicShapesKernelSpecializationStrategy(nvinfer1::DynamicShapesKernelSpecializationStrategy::kLAZY);
    
    runtime_config.dynamic_shapes_kernel_specialization_strategy = trt.DynamicShapesKernelSpecializationStrategy.LAZY
    

    Choose one of the three valid strategies: kLAZY, kEAGER, or kNONE.

    • kLAZY: Shape-specialized kernels are compiled lazily for new input shapes; a fallback kernel runs inference until compilation is complete. Compilation occurs asynchronously on a separate CPU thread in the background. Once the shape-specialized kernel is compiled, it is used for subsequent inferences. This is the default behavior.

    • kEAGER: Eager compilation of shape-specialized kernels for all shape inputs occurs in a blocking manner, after which the kernels are used for subsequent inferences. This approach is useful for achieving reduced inference latencies on the GPU.

    • kNONE: No shape-specialized kernel compilation; inference always relies on a fallback kernel. This approach is useful for reducing compilation latencies on the CPU.

  3. Optionally, query the DynamicShapesKernelSpecializationStrategy enum.

    auto strategy = runtimeConfig->getDynamicShapesKernelSpecializationStrategy();
    
    strategy = runtime_config.dynamic_shapes_kernel_specialization_strategy
    
  4. Create the execution context with the configured runtimeConfig object.

    IExecutionContext *context = engine->createExecutionContext(runtimeConfig);
    
    context = engine.create_execution_context(runtime_config)
    

    During execution, TensorRT-RTX compiles the necessary kernels for inference and follows the specified specialization strategy.

Dynamic Shape Options via tensorrt_rtx#

The kernel specialization strategy is set in tensorrt_rtx using the --specializeStrategyDS flag, which accepts values lazy, eager, or none, corresponding to the enum values in DynamicShapesKernelSpecializationStrategy.

# sample command on Windows
tensorrt_rtx --onnx=sample_dynamic_shapes.onnx --specializeStrategyDS=lazy

Advanced#

Caching Details#

As mentioned in Static vs Dynamic Shapes from a JIT Perspective, TensorRT-RTX reduces and, based on the workloads, can even eliminate recompilation overhead with its smart caching and selection of necessary kernels to run the network. TensorRT-RTX performs caching on multiple levels: layer-level, kernel-level, and selection-level.

Layer-level Cache: The simplest form of caching, the layer-level cache, works for shapes previously encountered. TensorRT-RTX caches an in-memory representation of the layer that can be loaded on demand and executed.

Kernel-level Cache: This level of caching eliminates compilation hazards for previously unseen shapes. TensorRT-RTX implements a kernel-level cache—if a previously compiled kernel for a different shape (but the same layer) can run the current shape efficiently, it is loaded from the cache, injected with metadata for inference, stored in the layer-level cache, and then used to run inference.

Selection-level Cache: This cache is used when encountering new shapes and when previously compiled kernels cannot run the subgraph efficiently. It works in conjunction with TensorRT-RTX’s kernel selection mechanism. At context creation time, TensorRT-RTX selects and compiles subgraph-specific fallback kernels capable of running any shape. These kernels are functional but may not achieve peak inference performance. At inference time, when the shape is known, TensorRT-RTX constructs metadata for the fallback kernels, launches inference, and simultaneously compiles more efficient kernels in the background. Once these specialized kernels are compiled, they replace the fallback kernels. This approach, determined by the user via the kernel specialization strategy, balances compilation costs for new shapes against inference performance.

All caches work together, and if kernels, metadata, or layers are found in any cache level, TensorRT-RTX uses this information and exits early, minimizing CPU overheads.

However, caching may be imperfect, meaning certain combinations of layers and layer parameters are not suitable for kernel-level and selection-level caching in the current implementation. In these cases, only the layer-level cache will be used, which means a new kernel may get compiled for each new shape passed into the layer. These layers include:

  1. Non-grouped deconvolutions across all data types.

  2. Average pooling layers across all data types.

  3. 1D convolutions across all data types (2D and 3D convolutions are fully supported).

  4. MatMuls with INT4 and FP4 data types; FP8 has limited support, while FP16, BF16, and FP32 are fully supported by these cache levels.

Compilation Considerations with Lazy Specialization#

Lazy specialization relies on background asynchronous compilation and kernel swapping without blocking inference. For specific inference workloads, such as those with low latencies or frequent shape changes, TensorRT-RTX may finish execution or context-switch before all background compilation is completed. In such cases, TensorRT-RTX guarantees a constant-time compilation shutdown operation by completing already-enqueued jobs, deleting jobs that have not yet been enqueued, and then returning.

For these workloads, you may not fully benefit from the inference performance improvements of lazy specialization and may prefer to use eager-mode compilation (i.e., using kEAGER for DynamicShapesKernelSpecializationStrategy), which offers the best inference performance. If compilation times are a concern with eager-mode, consider enabling the runtime cache as well.