Model Configuration

Each model in a Model Repository must include a model configuration that provides required and optional information about the model. Typically, this configuration is provided in a config.pbtxt file specified as a ModelConfig protobuf. In some cases, discussed in Generated Model Configuration, the model configuration can be generated automatically by the inference server and so does not need to be provided explicitly.

A minimal model configuration must specify name, platform, max_batch_size, input, and output.

As a running example, consider a TensorRT model called mymodel that has two inputs, input0 and input1, and one output, output0, all of which are 16-entry float32 tensors. The minimal configuration is:

name: "mymodel"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "input0"
    data_type: TYPE_FP32
    dims: [ 16 ]
  },
  {
    name: "input1"
    data_type: TYPE_FP32
    dims: [ 16 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 16 ]
  }
]

The name of the model must match the name of the model repository directory containing the model. The platform must be one of tensorrt_plan, tensorflow_graphdef, tensorflow_savedmodel, caffe2_netdef, onnxruntime_onnx, or custom.
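The rules above can be sketched as a small check. This is only an illustration: the dict layout and the check_minimal_config helper are hypothetical stand-ins for a parsed config.pbtxt, not part of the inference server or its client libraries.

```python
# Platform names listed in this section.
ALLOWED_PLATFORMS = {
    "tensorrt_plan",
    "tensorflow_graphdef",
    "tensorflow_savedmodel",
    "caffe2_netdef",
    "onnxruntime_onnx",
    "custom",
}

# Fields that a minimal configuration must specify.
REQUIRED_FIELDS = ("name", "platform", "max_batch_size", "input", "output")


def check_minimal_config(config, directory_name):
    """Return a list of problems; an empty list means the checks passed.

    `config` is a hypothetical dict mirroring config.pbtxt, and
    `directory_name` is the model's directory in the model repository.
    """
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in config:
            problems.append("missing required field: " + field)
    if "name" in config and config["name"] != directory_name:
        problems.append("name does not match the repository directory")
    if "platform" in config and config["platform"] not in ALLOWED_PLATFORMS:
        problems.append("unknown platform: " + str(config["platform"]))
    return problems


# The running example from this section, written as a dict.
mymodel = {
    "name": "mymodel",
    "platform": "tensorrt_plan",
    "max_batch_size": 8,
    "input": [
        {"name": "input0", "data_type": "TYPE_FP32", "dims": [16]},
        {"name": "input1", "data_type": "TYPE_FP32", "dims": [16]},
    ],
    "output": [{"name": "output0", "data_type": "TYPE_FP32", "dims": [16]}],
}

print(check_minimal_config(mymodel, "mymodel"))
```

The helper only covers the checks named in the text; the server itself performs further validation that is not modeled here.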

The datatypes allowed for input and output tensors vary based on the type of the model. Section Datatypes describes the allowed datatypes and how they map to the datatypes of each model type.

Input shape specified by dims indicates the shape of input expected by the inference API. Output shape specified by dims indicates the shape of output returned by the inference API. Both input and output shape must have rank >= 1, that is, the empty shape [ ] is not allowed. The reshape property must be used if the underlying framework model or custom backend requires an input or output with an empty shape.

For models that support batched inputs the max_batch_size value must be >= 1. The TensorRT Inference Server assumes that the batching occurs along a first dimension that is not listed in the inputs or outputs. For the above example, the server expects to receive input tensors with shape [ x, 16 ] and produces an output tensor with shape [ x, 16 ], where x is the batch size of the request.

For models that do not support batched inputs the max_batch_size value must be zero. If the above example specified a max_batch_size of zero, the inference server would expect to receive input tensors with shape [ 16 ], and would produce an output tensor with shape [ 16 ].
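The relationship between max_batch_size, dims, and the shape exchanged with the inference API can be sketched as follows. The full_shape helper is hypothetical, and the check that a request's batch size does not exceed max_batch_size is an assumption inferred from the field name rather than something stated above.

```python
def full_shape(dims, max_batch_size, batch_size=None):
    """Shape the inference API exchanges for a tensor configured with `dims`.

    Hypothetical helper for illustration only.
    """
    if not dims:
        # The empty shape [ ] is not allowed for inputs or outputs.
        raise ValueError("dims must have rank >= 1")
    if max_batch_size == 0:
        # Non-batching model: the configured dims are the whole shape.
        return list(dims)
    # Batching model: an implicit batch dimension comes first.
    # Assumption: the batch size must lie in [1, max_batch_size].
    if batch_size is None or not 1 <= batch_size <= max_batch_size:
        raise ValueError("batch_size must be between 1 and max_batch_size")
    return [batch_size] + list(dims)


print(full_shape([16], max_batch_size=8, batch_size=4))
print(full_shape([16], max_batch_size=0))
```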

For models that support input and output tensors with variable-size dimensions, those dimensions can be listed as -1 in the input and output configuration. For example, if a model requires a 2-dimensional input tensor where the first dimension must be size 4 but the second dimension can be any size, the model configuration for that input would include dims: [ 4, -1 ]. The inference server would then accept inference requests where that input tensor’s second dimension was any value >= 0. The model configuration can be more restrictive than what is allowed by the underlying model. For example, even though the model allows the second dimension to be any size, the model configuration could be specified as dims: [ 4, 4 ]. In this case, the inference server would only accept inference requests where the input tensor’s shape was exactly [ 4, 4 ].
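The matching rule for variable-size dimensions can be sketched as below. The shape_matches function is a hypothetical helper that compares only the non-batch dimensions; it illustrates the rule described above and is not the server's own validation code.

```python
def shape_matches(config_dims, request_shape):
    """Return True if `request_shape` is acceptable for `config_dims`.

    A configured -1 accepts any size >= 0; any other configured value
    must match the request exactly. Hypothetical helper.
    """
    if len(config_dims) != len(request_shape):
        return False
    for configured, actual in zip(config_dims, request_shape):
        if actual < 0:
            return False
        if configured != -1 and configured != actual:
            return False
    return True


print(shape_matches([4, -1], [4, 7]))  # second dimension is free
print(shape_matches([4, 4], [4, 7]))   # configuration is more restrictive
```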

Generated Model Configuration

By default, the model configuration file containing the required settings must be provided with each model. However, if the server is started with the --strict-model-config=false option, then in some cases the required portions of the model configuration file can be generated automatically by the inference server. The required portions of the model configuration are the settings shown in the example minimal configuration above. Specifically:

TensorRT Plan models do not require a model configuration file because the inference server can derive all the required settings automatically.
Some TensorFlow SavedModel models do not require a model configuration file. The models must specify all inputs and outputs as fixed-size tensors (with an optional initial batch dimension) for the model configuration to be generated automatically. The easiest way to determine if a particular SavedModel is supported is to try it with the server and check the console log and Status API to determine if the model loaded successfully.
ONNX Runtime ONNX models do not require a model configuration file because the inference server can derive all the required settings automatically. However, if the model supports batching, the initial batch dimension must be variable-size for all inputs and outputs.

When using --strict-model-config=false you can see the model configuration that was generated for a model by using the Status API.

The TensorRT Inference Server only generates the required portion of the model configuration file. You must still provide the optional portions of the model configuration if necessary, such as version_policy, optimization, scheduling and batching, instance_group, default_model_filename, cc_model_filenames, and tags.

Datatypes

The following table shows the tensor datatypes supported by the TensorRT Inference Server. The first column shows the name of the datatype as it appears in the model configuration file. The other columns show the corresponding datatype for the model frameworks supported by the server and for the Python numpy library. If a model framework does not have an entry for a given datatype, then the inference server does not support that datatype for that model.

Type	TensorRT	TensorFlow	Caffe2	ONNX Runtime	NumPy
TYPE_BOOL		DT_BOOL	BOOL	BOOL	bool
TYPE_UINT8		DT_UINT8	UINT8	UINT8	uint8
TYPE_UINT16		DT_UINT16	UINT16	UINT16	uint16
TYPE_UINT32		DT_UINT32		UINT32	uint32
TYPE_UINT64		DT_UINT64		UINT64	uint64
TYPE_INT8	kINT8	DT_INT8	INT8	INT8	int8
TYPE_INT16		DT_INT16	INT16	INT16	int16
TYPE_INT32	kINT32	DT_INT32	INT32	INT32	int32
TYPE_INT64		DT_INT64	INT64	INT64	int64
TYPE_FP16	kHALF	DT_HALF	FLOAT16	FLOAT16	float16
TYPE_FP32	kFLOAT	DT_FLOAT	FLOAT	FLOAT	float32
TYPE_FP64		DT_DOUBLE	DOUBLE	DOUBLE	float64
TYPE_STRING		DT_STRING		STRING	dtype(object)

For TensorRT each value is in the nvinfer1::DataType namespace. For example, nvinfer1::DataType::kFLOAT is the 32-bit floating-point datatype.

For TensorFlow each value is in the tensorflow namespace. For example, tensorflow::DT_FLOAT is the 32-bit floating-point datatype.

For Caffe2 each value is in the caffe2 namespace and is prepended with TensorProto_DataType_. For example, caffe2::TensorProto_DataType_FLOAT is the 32-bit floating-point datatype.

For ONNX Runtime each value is prepended with ONNX_TENSOR_ELEMENT_DATA_TYPE_. For example, ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT is the 32-bit floating-point datatype.

For NumPy each value is in the numpy module. For example, numpy.float32 is the 32-bit floating-point datatype.
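The NumPy and TensorRT columns of the table can be transcribed into lookup structures, for example when preparing request data on the client side. The names CONFIG_TO_NUMPY, TENSORRT_TYPES, and numpy_dtype_name are hypothetical; the sketch stores dtype names as strings (each is a name that numpy.dtype() accepts) so that it runs without NumPy installed.

```python
# Model configuration datatype -> NumPy dtype name, copied from the table.
CONFIG_TO_NUMPY = {
    "TYPE_BOOL": "bool",
    "TYPE_UINT8": "uint8",
    "TYPE_UINT16": "uint16",
    "TYPE_UINT32": "uint32",
    "TYPE_UINT64": "uint64",
    "TYPE_INT8": "int8",
    "TYPE_INT16": "int16",
    "TYPE_INT32": "int32",
    "TYPE_INT64": "int64",
    "TYPE_FP16": "float16",
    "TYPE_FP32": "float32",
    "TYPE_FP64": "float64",
    "TYPE_STRING": "object",  # the table lists dtype(object)
}

# Datatypes that have an entry in the TensorRT column of the table.
TENSORRT_TYPES = {"TYPE_INT8", "TYPE_INT32", "TYPE_FP16", "TYPE_FP32"}


def numpy_dtype_name(config_type):
    """Look up the NumPy dtype name for a model configuration datatype."""
    try:
        return CONFIG_TO_NUMPY[config_type]
    except KeyError:
        raise ValueError("unsupported datatype: " + str(config_type))


print(numpy_dtype_name("TYPE_FP32"))
print("TYPE_FP64" in TENSORRT_TYPES)
```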

Reshape

The ModelTensorReshape property on a model configuration input or output is used to indicate that the input or output shape accepted by the inference API differs from the input or output shape expected or produced by the underlying framework model or custom backend.

For an input, reshape can be used to reshape the input tensor to a different shape expected by the framework or backend. A common use-case is where a model that supports batching expects a batched input to have shape [ batch-size ], which means that the batch dimension fully describes the shape. For the inference API the equivalent shape [ batch-size, 1 ] must be specified since each input in the batch must specify a non-empty shape. For this case the input should be specified as:

input [
  {
    name: "in"
    dims: [ 1 ]
    reshape: { shape: [ ] }
  }
  ...

For an output, reshape can be used to reshape the output tensor produced by the framework or backend to a different shape that is returned by the inference API. A common use-case is where a model that supports batching produces a batched output with shape [ batch-size ], which means that the batch dimension fully describes the shape. For the inference API the equivalent shape [ batch-size, 1 ] must be specified since each output in the batch must specify a non-empty shape. For this case the output should be specified as:

output [
  {
    name: "out"
    dims: [ 1 ]
    reshape: { shape: [ ] }
  }
  ...
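The effect of reshape on a batched tensor can be sketched as follows. The helpers are hypothetical, handle only fixed-size dimensions, and assume that dims and reshape must describe the same number of elements, which is what "equivalent shape" above suggests.

```python
def element_count(shape):
    """Number of elements in a tensor of the given fixed-size shape."""
    count = 1
    for size in shape:
        count *= size
    return count  # an empty shape [ ] counts as a single element


def api_shape(batch_size, dims):
    """Shape exchanged with the inference API for one batched tensor."""
    return [batch_size] + list(dims)


def framework_shape(batch_size, dims, reshape):
    """Shape seen by the framework or backend after reshape is applied."""
    # Assumption: dims and reshape must hold the same number of elements.
    if element_count(dims) != element_count(reshape):
        raise ValueError("dims and reshape must describe the same number of elements")
    return [batch_size] + list(reshape)


# dims: [ 1 ] with reshape: { shape: [ ] } and a batch of 8 requests.
print(api_shape(8, [1]))
print(framework_shape(8, [1], []))
```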

Version Policy

Each model can have one or more versions available in the model repository. The nvidia::inferenceserver::ModelVersionPolicy schema allows the following policies.

All: All versions of the model that are available in the model repository are available for inferencing.
Latest: Only the latest ‘n’ versions of the model in the repository are available for inferencing. The latest versions of the model are the numerically greatest version numbers.
Specific: Only the specifically listed versions of the model are available for inferencing.

If no version policy is specified, then Latest (with num_versions = 1) is used as the default, indicating that only the most recent version of the model is made available by the inference server. In all cases, the addition or removal of version subdirectories from the model repository can change which model version is used on subsequent inference requests.

Continuing the above example, the following configuration specifies that all versions of the model will be available from the server:

name: "mymodel"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "input0"
    data_type: TYPE_FP32
    dims: [ 16 ]
  },
  {
    name: "input1"
    data_type: TYPE_FP32
    dims: [ 16 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 16 ]
  }
]
version_policy: { all { }}
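The three policies can be sketched as a selection over the version numbers found in the model repository. The available_versions function and its parameter names are hypothetical and only model the behavior described above.

```python
def available_versions(repository_versions, policy="latest", n=1, specific=()):
    """Return the versions made available for inferencing, in ascending order.

    `policy` is one of "all", "latest", or "specific". With no explicit
    policy the default is the single latest version. Hypothetical helper.
    """
    versions = sorted(repository_versions)
    if policy == "all":
        return versions
    if policy == "latest":
        # The latest versions are the numerically greatest version numbers.
        return versions[-n:] if n > 0 else []
    if policy == "specific":
        wanted = set(specific)
        return [v for v in versions if v in wanted]
    raise ValueError("unknown policy: " + str(policy))


repo = [1, 3, 2]
print(available_versions(repo))
print(available_versions(repo, policy="all"))
print(available_versions(repo, policy="specific", specific=[1, 3]))
```

Because the selection is recomputed from the repository contents, adding or removing a version subdirectory changes the result, which mirrors the note above about subsequent inference requests.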

Instance Groups

The TensorRT Inference Server can provide multiple execution instances of a model so that multiple inference requests for that model can be handled simultaneously. The model configuration ModelInstanceGroup is used to specify the number of execution instances that should be made available and what compute resource should be used for those instances.

By default, a single execution instance of the model is created for each GPU available in the system. The instance group setting can be used to place multiple execution instances of a model on every GPU or on only certain GPUs. For example, the following configuration places two execution instances of the model on each system GPU:

instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]

And the following configuration places one execution instance on GPU 0 and two execution instances on each of GPUs 1 and 2:

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 1, 2 ]
  }
]

The instance group setting is also used to enable execution of a model on the CPU. The following places two execution instances on the CPU:

instance_group [
  {
    count: 2
    kind: KIND_CPU
  }
]
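How the instance group examples above expand into per-device instance counts can be sketched as below. The place_instances function and the dict layout are hypothetical; the sketch assumes that count applies to each listed GPU, as in the first example where count: 2 yields two instances on every system GPU.

```python
def place_instances(instance_groups, system_gpus):
    """Return a dict mapping a device label to its number of instances.

    `instance_groups` is a hypothetical list of dicts mirroring the
    instance_group entries; `system_gpus` lists the GPU ids in the system.
    """
    if not instance_groups:
        # Default: a single execution instance for each available GPU.
        instance_groups = [{"count": 1, "kind": "KIND_GPU"}]
    placements = {}
    for group in instance_groups:
        if group["kind"] == "KIND_CPU":
            targets = ["cpu"]
        else:
            # With no gpus listed, the group applies to every system GPU.
            gpus = group.get("gpus") or system_gpus
            targets = ["gpu%d" % gpu for gpu in gpus]
        for target in targets:
            # Assumption: count is applied per device.
            placements[target] = placements.get(target, 0) + group["count"]
    return placements


groups = [
    {"count": 1, "kind": "KIND_GPU", "gpus": [0]},
    {"count": 2, "kind": "KIND_GPU", "gpus": [1, 2]},
]
print(place_instances(groups, system_gpus=[0, 1, 2]))
```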

Scheduling And Batching

The TensorRT Inference Server supports batch inferencing by allowing individual inference requests to specify a batch of inputs. The inferencing for a batch of inputs is performed at the same time, which is especially important for GPUs since it can greatly increase inferencing throughput. In many use-cases the individual inference requests are not batched and therefore do not gain the throughput benefits of batching.

The inference server contains multiple scheduling and batching algorithms that support many different model types and use-cases. More information about model types and schedulers can be found in Models And Schedulers.

Default Scheduler

The default scheduler is used for a model if none of the scheduling_choice configurations are specified. This scheduler distributes inference requests to all instances configured for the model.

Dynamic Batcher

Dynamic batching is a feature of the inference server that allows non-batched inference requests to be combined by the server, so that a batch is created dynamically, resulting in the same increased throughput seen for batched inference requests. The dynamic batcher should be used for stateless models. The dynamically created batches are distributed to all instances configured for the model.

Dynamic batching is enabled and configured independently for each model using the ModelDynamicBatching settings in the model configuration. These settings control the preferred size(s) of the dynamically created batches as well as a maximum time that requests can be delayed in the scheduler to allow other requests to join the dynamic batch.

The following configuration enables dynamic batching with preferred batch sizes of 4 and 8, and a maximum delay time of 100 microseconds:

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
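To make the roles of the two settings concrete, the following is a deliberately simplified decision rule, not the server's actual scheduling algorithm. It assumes that a batch of a preferred size is dispatched as soon as one can be formed, and that otherwise the queued requests are dispatched once the oldest has waited for the maximum delay. The next_batch_size function and its parameters are hypothetical.

```python
def next_batch_size(pending, preferred_batch_sizes, max_batch_size,
                    waited_us, max_queue_delay_us):
    """Return the batch size to dispatch now, or 0 to keep waiting.

    Simplified illustration only; the real scheduler may behave differently.
    `pending` is the number of queued non-batched requests and `waited_us`
    is how long the oldest of them has been queued, in microseconds.
    """
    limit = min(pending, max_batch_size)
    usable = [size for size in preferred_batch_sizes if size <= limit]
    if usable:
        # A preferred batch size can be formed: use the largest one.
        return max(usable)
    if pending > 0 and waited_us >= max_queue_delay_us:
        # The delay has expired: send what is queued, up to the maximum.
        return limit
    return 0


# Settings from the example above, with max_batch_size: 8.
print(next_batch_size(5, [4, 8], 8, waited_us=0, max_queue_delay_us=100))
print(next_batch_size(3, [4, 8], 8, waited_us=50, max_queue_delay_us=100))
print(next_batch_size(3, [4, 8], 8, waited_us=100, max_queue_delay_us=100))
```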

The size of generated batches can be examined in aggregate using Count metrics; see Metrics. Inference server verbose logging can be used to examine the size of individual batches.

Sequence Batcher

Like the dynamic batcher, the sequence batcher combines non-batched inference requests, so that a batch is created dynamically. Unlike the dynamic batcher, the sequence batcher should be used for stateful models where a sequence of inference requests must be routed to the same model instance. The dynamically created batches are distributed to all instances configured for the model.

Sequence batching is enabled and configured independently for each model using the ModelSequenceBatching settings in the model configuration. These settings control the sequence timeout as well as configuring how the inference server will send control signals to the model indicating sequence start and ready. See Models And Schedulers for more information and examples.

The size of generated batches can be examined in aggregate using Count metrics; see Metrics. Inference server verbose logging can be used to examine the size of individual batches.

Ensemble Scheduler

The ensemble scheduler must be used for ensemble models and cannot be used for any other type of model.

The ensemble scheduler is enabled and configured independently for each model using the ModelEnsembleScheduling settings in the model configuration. The settings describe the models that are included in the ensemble and the flow of tensor values between the models. See Ensemble Models for more information and examples.

Optimization Policy

The model configuration ModelOptimizationPolicy is used to specify optimization and prioritization settings for a model. These settings control whether and how a model is optimized by the backend framework and how it is scheduled and executed by the inference server. See the protobuf documentation for the currently available settings.