Dynamic Shapes: Core Concepts#
Specifying Runtime Dimensions#
When building a network, use -1 to denote a runtime dimension for an input tensor. For example, to create a 3D input tensor named foo where the last two dimensions are specified at runtime, and the first dimension is fixed at build time, issue the following.
networkDefinition.addInput("foo", DataType::kFLOAT, Dims3(3, -1, -1))
network_definition.add_input("foo", trt.float32, (3, -1, -1))
After choosing an optimization profile, you must set the input dimensions at run time (refer to Optimization Profiles). Let the input have dimensions [3,150,250]. After setting an optimization profile for the previous example, you would call:
context.setInputShape("foo", Dims{3, {3, 150, 250}})
context.set_input_shape("foo", (3, 150, 250))
At runtime, asking the engine for binding dimensions returns the same dimensions used to build the network, meaning you get a -1 for each runtime dimension. For example:
engine.getTensorShape("foo")
Returns a Dims with dimensions {3, -1, -1}.
engine.get_tensor_shape("foo")
Returns (3, -1, -1).
To get the actual dimensions, which are specific to each execution context, query the execution context:
context.getTensorShape("foo")
Returns a Dims with dimensions {3, 150, 250}.
context.get_tensor_shape("foo")
Returns (3, 150, 250).
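The relationship between the build-time shape and the concrete runtime shape can be sketched in a few lines of plain Python. This is illustrative only; resolve_shape is a hypothetical helper, not part of the TensorRT API:

```python
# Illustrative sketch: a build-time shape uses -1 as a wildcard for runtime
# dimensions; a runtime shape must match every fixed dimension exactly.
def resolve_shape(build_shape, runtime_shape):
    """Return the concrete shape, checking it against the build-time shape."""
    if len(build_shape) != len(runtime_shape):
        raise ValueError("rank mismatch")
    for built, actual in zip(build_shape, runtime_shape):
        # A fixed build-time dimension must match exactly; -1 accepts any size.
        if built != -1 and built != actual:
            raise ValueError(f"dimension mismatch: {built} vs {actual}")
    return tuple(runtime_shape)

print(resolve_shape((3, -1, -1), (3, 150, 250)))  # -> (3, 150, 250)
```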
The return value of setInputShape for an input only indicates consistency with the optimization profile set for that input. After all input binding dimensions are specified, you can check whether the entire network is consistent with the dynamic input shapes by querying the dimensions of the output bindings of the network. Here is an example that retrieves the dimensions of an output named bar:
nvinfer1::Dims outDims = context->getTensorShape("bar");
if (outDims.nbDims == -1) {
gLogError << "Invalid network output, this might be caused by inconsistent input shapes." << std::endl;
// abort inference
}
If a dimension k is data-dependent, for example, if it depends on the output of INonZeroLayer, outDims.d[k] will be -1. For more information on such outputs, refer to the Dynamically Shaped Output section.
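The two outcomes above (an invalid rank signaling inconsistent inputs, versus a -1 in an individual dimension signaling a data-dependent size) can be modeled in plain Python. This is an illustrative sketch, not the TensorRT API; classify_output_shape and its result strings are invented names:

```python
# Illustrative model of interpreting an output-shape query result.
def classify_output_shape(nb_dims, dims):
    """Classify a queried output shape (sketch, not the TensorRT API)."""
    if nb_dims == -1:
        # Mirrors the outDims.nbDims == -1 check in the C++ example:
        # the input shapes are inconsistent with the network.
        return "inconsistent-inputs"
    if any(d == -1 for d in dims[:nb_dims]):
        # A data-dependent dimension, e.g. the output of INonZeroLayer.
        return "data-dependent"
    return "static"

print(classify_output_shape(3, [3, 150, 250]))  # -> static
```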
Named Dimensions#
Both constant and runtime dimensions can be named. Naming dimensions provides two benefits:
- For runtime dimensions, error messages use the dimension’s name. For example, if an input tensor foo has dimensions [n,10,m], it is more helpful to get an error message about m instead of (#2 (SHAPE foo)).
- Dimensions with the same name are implicitly equal, which can help the optimizer generate a more efficient engine and diagnose mismatched dimensions at runtime. For example, suppose two inputs have dimensions [n,10,m] and [n,13]. In that case, the optimizer knows the lead dimensions are always equal, and accidentally using the engine with mismatched values for n will be reported as an error.
You can use the same name for constant and runtime dimensions as long as they are always equal.
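The implicit-equality rule for named dimensions can be illustrated with a small pure-Python check. This is a sketch under invented names (check_named_dims, the tensor names), not TensorRT behavior verbatim:

```python
# Sketch of the implicit-equality rule: dimensions sharing a name must
# agree at runtime; unnamed dimensions (None) are unconstrained.
def check_named_dims(tensors):
    """tensors: dict of tensor name -> list of (dim_name_or_None, size)."""
    seen = {}
    for tname, dims in tensors.items():
        for dim_name, size in dims:
            if dim_name is None:
                continue
            if dim_name in seen and seen[dim_name] != size:
                raise ValueError(f"mismatched values for dimension '{dim_name}'")
            seen[dim_name] = size
    return seen

# Inputs shaped [n,10,m] and [n,13]: the leading 'n' dimensions must match.
check_named_dims({
    "x": [("n", 4), (None, 10), ("m", 7)],
    "y": [("n", 4), (None, 13)],
})
```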
The following syntax examples set the name of the third dimension of the tensor to m.
tensor.setDimensionName(2, "m")
tensor.set_dimension_name(2, "m")
There are corresponding methods to get a dimension name:
tensor.getDimensionName(2) // returns the name of the third dimension of the tensor, or nullptr if it does not have a name.
tensor.get_dimension_name(2)  # returns the name of the third dimension of the tensor, or None if it does not have a name.
When the input network is imported from an ONNX file, the ONNX parser automatically sets the dimension names using the names in the ONNX file. Therefore, if two dynamic dimensions are expected to be equal at runtime, specify the same name for these dimensions when exporting the ONNX file.
Dimension Constraint using IAssertionLayer#
Sometimes, two dynamic dimensions are not known to be equal statically but are guaranteed equal at runtime. Letting TensorRT know they are equal can help it build a more efficient engine. There are two ways to convey the equality constraint to TensorRT:
- Give the dimensions the same name as described in the Named Dimensions section.
- Use IAssertionLayer to express the constraint. This technique is more general since it can convey trickier equalities.
For example, if the first dimension of tensor A is guaranteed to be one more than the first dimension of tensor B, then the constraint can be established by:
// Assumes A and B are ITensor* and n is an INetworkDefinition&.
auto shapeA = n.addShape(*A)->getOutput(0);
auto firstDimOfA = n.addSlice(*shapeA, Dims{1, {0}}, Dims{1, {1}}, Dims{1, {1}})->getOutput(0);
auto shapeB = n.addShape(*B)->getOutput(0);
auto firstDimOfB = n.addSlice(*shapeB, Dims{1, {0}}, Dims{1, {1}}, Dims{1, {1}})->getOutput(0);
static int32_t const oneStorage{1};
auto one = n.addConstant(Dims{1, {1}}, Weights{DataType::kINT32, &oneStorage, 1})->getOutput(0);
auto firstDimOfBPlus1 = n.addElementWise(*firstDimOfB, *one, ElementWiseOperation::kSUM)->getOutput(0);
auto areEqual = n.addElementWise(*firstDimOfA, *firstDimOfBPlus1, ElementWiseOperation::kEQUAL)->getOutput(0);
n.addAssertion(*areEqual, "oops");
# Assumes `a` and `b` are ITensors and `n` is an INetworkDefinition
shape_a = n.add_shape(a).get_output(0)
first_dim_of_a = n.add_slice(shape_a, (0, ), (1, ), (1, )).get_output(0)
shape_b = n.add_shape(b).get_output(0)
first_dim_of_b = n.add_slice(shape_b, (0, ), (1, ), (1, )).get_output(0)
one = n.add_constant((1, ), np.ones((1, ), dtype=np.int32)).get_output(0)
first_dim_of_b_plus_1 = n.add_elementwise(first_dim_of_b, one, trt.ElementWiseOperation.SUM).get_output(0)
are_equal = n.add_elementwise(first_dim_of_a, first_dim_of_b_plus_1, trt.ElementWiseOperation.EQUAL).get_output(0)
n.add_assertion(are_equal, "oops")
If the dimensions violate the assertion at runtime, TensorRT will throw an error.
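The constraint that the network above builds out of shape, constant, and elementwise layers can be stated directly in plain Python. A minimal sketch; assert_first_dim_offset is a hypothetical name, not a TensorRT call:

```python
# What the IAssertionLayer construction above checks at runtime:
# the first dimension of A must equal the first dimension of B plus one.
def assert_first_dim_offset(shape_a, shape_b, message="oops"):
    if shape_a[0] != shape_b[0] + 1:
        raise AssertionError(message)

assert_first_dim_offset((5, 8), (4, 8))  # passes: 5 == 4 + 1
```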
Optimization Profiles#
An optimization profile describes a range of dimensions for each network input and the dimensions the auto-tuner will use for optimization. You must create at least one optimization profile at build time when using runtime dimensions. Two profiles can specify disjoint or overlapping ranges.
For example, one profile might specify a minimum size of [3,100,200], a maximum size of [3,200,300], and optimization dimensions of [3,150,250], while another profile might specify min, max, and optimization dimensions of [3,200,100], [3,300,400], and [3,250,250].
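The per-dimension range that a profile implies can be sketched in plain Python; shape_in_profile is an invented helper for illustration, not a TensorRT call:

```python
# Illustrative check that a runtime shape lies within a profile's
# [min, max] range, dimension by dimension.
def shape_in_profile(shape, min_shape, max_shape):
    return all(lo <= s <= hi for s, lo, hi in zip(shape, min_shape, max_shape))

# First profile from the text: min [3,100,200], max [3,200,300].
print(shape_in_profile((3, 150, 250), (3, 100, 200), (3, 200, 300)))  # -> True
```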
Note
The memory usage for different profiles can change dramatically based on the dimensions specified by the min, max, and opt parameters. Some operations have tactics that only work for MIN=OPT=MAX, so when these values differ, the tactic is disabled.
To create an optimization profile, first construct an IOptimizationProfile. Then, set the min, optimization, and max dimensions and add them to the network configuration. The shapes defined by the optimization profile must define valid input shapes for the network. Here are the calls for the first profile mentioned previously for an input foo:
IOptimizationProfile* profile = builder.createOptimizationProfile();
profile->setDimensions("foo", OptProfileSelector::kMIN, Dims3(3, 100, 200));
profile->setDimensions("foo", OptProfileSelector::kOPT, Dims3(3, 150, 250));
profile->setDimensions("foo", OptProfileSelector::kMAX, Dims3(3, 200, 300));

config->addOptimizationProfile(profile);
profile = builder.create_optimization_profile()
profile.set_shape("foo", (3, 100, 200), (3, 150, 250), (3, 200, 300))
config.add_optimization_profile(profile)
At runtime, you must set an optimization profile before setting input dimensions. Profiles are numbered in the order they were added, starting at 0. Note that each execution context must use a separate optimization profile.
To choose the first optimization profile in the example, use:
context.setOptimizationProfileAsync(0, stream)
context.set_optimization_profile_async(0, stream)
The provided stream argument should be the same CUDA stream that will be used for the subsequent enqueueV3() invocation in this context. This ensures that the context executions happen after the optimization profile setup.
If the associated CUDA engine has dynamic inputs, the optimization profile must be set at least once, with a profile index that is not used by other execution contexts that have not been destroyed. For the first execution context created for an engine, profile 0 is implicitly chosen.
setOptimizationProfileAsync() can be called to switch between profiles. It must be called after any enqueueV3() operations in the current context have finished. When multiple execution contexts run concurrently, a context may switch to a profile that was formerly used by, and has since been released by, another execution context with different dynamic input dimensions.
setOptimizationProfileAsync() replaces the now deprecated setOptimizationProfile(). Using setOptimizationProfile() to switch between optimization profiles can cause GPU memory copy operations in the subsequent enqueueV3() operations. To avoid these calls during enqueue, use the setOptimizationProfileAsync() API instead.
Dynamically Shaped Output#
If the output of a network has a dynamic shape, several strategies are available to allocate the output memory.
If the dimensions of the output are computable from the dimensions of inputs, use IExecutionContext::getTensorShape() to get the dimensions of the output after providing the dimensions of the input tensors and input shape tensors. Use the IExecutionContext::inferShapes() method to check if you forgot to supply the necessary information.
Otherwise, if the dimensions of the output are not computable in advance or you are calling enqueueV3, associate an IOutputAllocator with the output. More specifically:
- Derive your allocator class from IOutputAllocator.
- Override the reallocateOutput and notifyShape methods. TensorRT calls the first when it needs to allocate the output memory and the second when it knows the output dimensions. For example, the memory for the output of INonZeroLayer is allocated before the layer runs.

Here is an example derived class:
class MyOutputAllocator : public nvinfer1::IOutputAllocator
{
public:
    void* reallocateOutput(
        char const* tensorName, void* currentMemory, uint64_t size, uint64_t alignment) override
    {
        // Allocate the output. Remember it for later use.
        outputPtr = /* depends on strategy, as discussed later ...*/
        return outputPtr;
    }

    void notifyShape(char const* tensorName, Dims const& dims) override
    {
        // Remember output dimensions for later use.
        outputDims = dims;
    }

    // Saved dimensions of the output
    Dims outputDims{};

    // nullptr if memory could not be allocated
    void* outputPtr{nullptr};
};
Here’s an example of how it might be used:
std::unordered_map<std::string, std::unique_ptr<MyOutputAllocator>> allocatorMap;

for (const char* name : /* names of outputs */)
{
    Dims extent = context->getTensorShape(name);
    void* ptr;
    if (engine->getTensorLocation(name) == TensorLocation::kDEVICE)
    {
        if (/* extent.d contains -1 */)
        {
            auto allocator = std::make_unique<MyOutputAllocator>();
            context->setOutputAllocator(name, allocator.get());
            allocatorMap.emplace(name, std::move(allocator));
        }
        else
        {
            ptr = /* allocate device memory per extent and format */
        }
    }
    else
    {
        ptr = /* allocate cpu memory per extent and format */
    }
    context->setTensorAddress(name, ptr);
}
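The dispatch decision made in that loop (a dynamic extent gets an output allocator; a static extent gets a preallocated buffer) can be sketched in plain Python; plan_output_bindings and the strategy strings are invented for illustration:

```python
# Illustrative sketch: choose an allocation strategy per output based on
# whether its extent contains a runtime (-1) dimension.
def plan_output_bindings(output_shapes):
    """output_shapes: dict of name -> shape tuple; returns name -> strategy."""
    plan = {}
    for name, shape in output_shapes.items():
        plan[name] = "output-allocator" if -1 in shape else "preallocate"
    return plan

print(plan_output_bindings({"bar": (3, -1), "baz": (3, 150, 250)}))
# -> {'bar': 'output-allocator', 'baz': 'preallocate'}
```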
Several strategies can be used for implementing reallocateOutput:

- A: Defer allocation until the size is known. Do not call IExecutionContext::setTensorAddress, or call it with a nullptr for the tensor address.
- B: Preallocate enough memory based on what IExecutionContext::getMaxOutputSize reports as an upper bound. This guarantees that the engine will not fail due to insufficient output memory, but the upper bound can be so high that it is useless.
- C: If you have preallocated enough memory based on experience, use IExecutionContext::setTensorAddress to tell TensorRT about it. If the tensor does not fit, make reallocateOutput return nullptr, which will cause the engine to fail gracefully.
- D: Preallocate memory as in C, but have reallocateOutput return a pointer to a bigger buffer if there is a fit problem. This increases the output buffer as needed.
- E: Defer allocation until the size is known, like A. Then, attempt to recycle that allocation in subsequent calls until a bigger buffer is requested, and then increase it like in D.

Here is an example derived class that implements E:

class FancyOutputAllocator : public nvinfer1::IOutputAllocator
{
public:
    void* reallocateOutput(
        char const* tensorName, void* currentMemory, uint64_t size, uint64_t alignment) override
    {
        if (size > outputSize)
        {
            // Need to reallocate
            cudaFree(outputPtr);
            outputPtr = nullptr;
            outputSize = 0;
            if (cudaMalloc(&outputPtr, size) == cudaSuccess)
            {
                outputSize = size;
            }
        }
        // If the cudaMalloc fails, outputPtr=nullptr, and the engine
        // gracefully fails.
        return outputPtr;
    }

    void notifyShape(char const* tensorName, Dims const& dims) override
    {
        // Remember output dimensions for later use.
        outputDims = dims;
    }

    // Saved dimensions of the output tensor
    Dims outputDims{};

    // nullptr if memory could not be allocated
    void* outputPtr{nullptr};

    // Size of allocation pointed to by outputPtr
    uint64_t outputSize{0};

    ~FancyOutputAllocator() override
    {
        cudaFree(outputPtr);
    }
};
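Strategy E can also be sketched in pure Python, with a bytearray standing in for device memory; GrowingOutputBuffer is illustrative only, not a TensorRT class:

```python
# Strategy E sketch: keep the current buffer and recycle it across calls;
# only reallocate (grow) when a request exceeds its size.
class GrowingOutputBuffer:
    def __init__(self):
        self.buffer = None
        self.size = 0

    def reallocate_output(self, size):
        if size > self.size:
            # Grow: drop the old buffer and allocate a bigger one.
            self.buffer = bytearray(size)
            self.size = size
        # Otherwise recycle the existing allocation.
        return self.buffer

alloc = GrowingOutputBuffer()
alloc.reallocate_output(100)
alloc.reallocate_output(50)  # smaller request: buffer is recycled
print(alloc.size)            # -> 100
```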
TensorRT internally allocates memory asynchronously in the device’s current memory pool for networks with data-dependent shapes. Suppose the current device memory pool doesn’t have a release threshold set. In that case, performance degradation between runs can occur as the memory is returned to the operating system upon stream synchronization. In these cases, it’s recommended that you either provide the TensorRT runtime with a custom IGpuAllocator with a custom memory pool or experiment with setting the release threshold. More information about setting the release threshold can be found in Retaining Memory in the Pool and the Code Migration Guide.
Looking up Binding Indices for Multiple Optimization Profiles#
If you use enqueueV3 instead of the deprecated enqueueV2, you can skip this section because name-based methods such as IExecutionContext::setTensorAddress do not expect a profile suffix.
In an engine built from multiple profiles, each profile has separate binding indices. The names of the I/O tensors for the Kth profile have [profile K] appended to them, with K written in decimal. For example, if the INetworkDefinition had the name foo, and bindingIndex refers to that tensor in the optimization profile with index 3, engine.getBindingName(bindingIndex) returns foo [profile 3].
Likewise, if using ICudaEngine::getBindingIndex(name) to get the index for a profile K beyond the first profile (K=0), append [profile K] to the name used in the INetworkDefinition. For example, if the tensor was called foo in the INetworkDefinition, engine.getBindingIndex("foo [profile 3]") returns the binding index of Tensor foo in optimization profile 3.
Always omit the suffix for K=0.
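The naming rule above can be captured in a small hypothetical helper (illustrative, not part of the TensorRT API):

```python
# Profile 0 uses the bare tensor name; profile K > 0 appends " [profile K]".
def binding_name(tensor_name, profile_index):
    if profile_index == 0:
        return tensor_name
    return f"{tensor_name} [profile {profile_index}]"

print(binding_name("foo", 3))  # -> foo [profile 3]
```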
Bindings for Multiple Optimization Profiles#
This section explains the deprecated interface enqueueV2 and its binding indices. The newer interface enqueueV3 does away with binding indices.
Consider a network with four inputs, one output, and three optimization profiles in the IBuilderConfig. The engine has 15 bindings, five for each optimization profile, conceptually organized as a table:

Profile 0:  0  1  2  3  4
Profile 1:  5  6  7  8  9
Profile 2: 10 11 12 13 14

Each row is a profile. Numbers in the table denote binding indices. The first profile has binding indices 0..4, the second has 5..9, and the third has 10..14.
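The row-per-profile layout implies simple arithmetic for computing a binding index from a profile index and a column; this helper is an illustration, not a TensorRT API:

```python
# With bindings_per_profile bindings per profile, profile K's copy of
# column c has index K * bindings_per_profile + c.
def binding_index(profile_index, column, bindings_per_profile=5):
    return profile_index * bindings_per_profile + column

print(binding_index(2, 0))  # -> 10 (first binding of the third profile)
```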
The interfaces have an “auto-correct” in the scenario where the binding belongs to the first profile, but another profile was specified. TensorRT warns about the mistake in this case and then chooses the correct binding index from the same column.