model_config.proto¶
-
enum
DataType
¶ Data types supported for input and output tensors.
-
enumerator
DataType
::
INVALID
= 0¶
-
enumerator
DataType
::
BOOL
= 1¶
-
enumerator
DataType
::
UINT8
= 2¶
-
enumerator
DataType
::
UINT16
= 3¶
-
enumerator
DataType
::
UINT32
= 4¶
-
enumerator
DataType
::
UINT64
= 5¶
-
enumerator
DataType
::
INT8
= 6¶
-
enumerator
DataType
::
INT16
= 7¶
-
enumerator
DataType
::
INT32
= 8¶
-
enumerator
DataType
::
INT64
= 9¶
-
enumerator
DataType
::
FP16
= 10¶
-
enumerator
DataType
::
FP32
= 11¶
-
enumerator
DataType
::
FP64
= 12¶
-
enumerator
DataType
::
STRING
= 13¶
-
message
ModelInstanceGroup
¶ A group of one or more instances of a model and resources made available for those instances.
-
enum
Kind
¶ Kind of this instance group.
-
enumerator
Kind
::
KIND_AUTO
= 0¶ This instance group represents instances that can run on either CPU or GPU. If all GPUs listed in ‘gpus’ are available then instances will be created on GPU(s), otherwise instances will be created on CPU.
-
enumerator
Kind
::
KIND_GPU
= 1¶ This instance group represents instances that must run on the GPU.
-
enumerator
Kind
::
KIND_CPU
= 2¶ This instance group represents instances that must run on the CPU.
-
enumerator
Kind
::
KIND_MODEL
= 3¶ This instance group represents instances that should run on the CPU and/or GPU(s) as specified by the model or backend itself. The inference server will not override the model/backend settings. Currently, this option is supported only for Tensorflow models.
-
string
name
¶ Optional name of this group of instances. If not specified, the name will be formed as <model name>_<group number>. The name of each individual instance is further formed from a unique instance number and GPU index.
-
Kind
kind
¶ The kind of this instance group. Default is KIND_AUTO. If KIND_AUTO or KIND_GPU, then both ‘count’ and ‘gpus’ are valid and may be specified. If KIND_CPU or KIND_MODEL, only ‘count’ is valid and ‘gpus’ cannot be specified.
-
int32
count
¶ For a group assigned to GPU, the number of instances created for each GPU listed in ‘gpus’. For a group assigned to CPU the number of instances created. Default is 1.
-
int32
gpus
(repeated)¶ GPU(s) where instances should be available. For each GPU listed, ‘count’ instances of the model will be available. Setting ‘gpus’ to empty (or not specifying it at all) is equivalent to listing all available GPUs.
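As a rough illustration of how ‘count’ and ‘gpus’ combine, the sketch below computes how many instances a group would create. It is not the server's code: the function name instances_per_group and the available_gpus argument are hypothetical, and KIND_AUTO is left out because its choice between CPU and GPU depends on what is available at runtime.

```python
def instances_per_group(kind, count=1, gpus=None, available_gpus=()):
    """Return the number of instances a group would create (illustrative only)."""
    if kind in ("KIND_CPU", "KIND_MODEL"):
        # Only 'count' is valid for these kinds.
        if gpus:
            raise ValueError("'gpus' cannot be specified for " + kind)
        return count
    if kind == "KIND_GPU":
        # An empty or missing 'gpus' list means "all available GPUs".
        targets = list(gpus) if gpus else list(available_gpus)
        return count * len(targets)
    raise ValueError("kind not handled in this sketch: " + kind)


print(instances_per_group("KIND_GPU", count=2, gpus=[0, 1]))
print(instances_per_group("KIND_GPU", count=2, available_gpus=[0, 1, 2]))
print(instances_per_group("KIND_CPU", count=4))
```

The second call shows the "empty means all GPUs" rule: with three GPUs assumed to be available and count set to 2, six instances result.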
-
string
profile
(repeated)¶ For TensorRT models using inputs with dynamic shape, this parameter specifies the set of optimization profiles available to this instance group. The inference server will choose the optimal profile based on the shapes of the input tensors. Each value should lie between 0 and <TotalNumberOfOptimizationProfilesInPlanModel> - 1 and should be specified only for the TensorRT backend; otherwise an error will be generated.
-
message
ModelTensorReshape
¶ Reshape specification for input and output tensors.
-
int64
shape
(repeated)¶ The shape to use for reshaping.
-
message
ModelInput
¶ An input required by the model.
-
enum
Format
¶ The format for the input.
-
enumerator
Format
::
FORMAT_NONE
= 0¶ The input has no specific format. This is the default.
-
enumerator
Format
::
FORMAT_NHWC
= 1¶ HWC image format. Tensors with this format require 3 dimensions if the model does not support batching (max_batch_size = 0) or 4 dimensions if the model does support batching (max_batch_size >= 1). In either case the ‘dims’ below should only specify the 3 non-batch dimensions (i.e. HWC or CHW).
-
enumerator
Format
::
FORMAT_NCHW
= 2¶ CHW image format. Tensors with this format require 3 dimensions if the model does not support batching (max_batch_size = 0) or 4 dimensions if the model does support batching (max_batch_size >= 1). In either case the ‘dims’ below should only specify the 3 non-batch dimensions (i.e. HWC or CHW).
-
string
name
¶ The name of the input.
-
int64
dims
(repeated)¶ The dimensions/shape of the input tensor that must be provided when invoking the inference API for this model.
-
ModelTensorReshape
reshape
¶ The shape expected for this input by the backend. The input will be reshaped to this before being presented to the backend. The reshape must have the same number of elements as the input shape specified by ‘dims’. Optional.
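The element-count rule for ‘reshape’ can be checked with a few lines of Python. This is a minimal sketch, assuming fixed-size dimensions only (variable-size dimensions are not handled); the helper name reshape_is_valid is hypothetical.

```python
import math


def reshape_is_valid(dims, reshape_shape):
    """True when 'reshape' holds exactly as many elements as 'dims'."""
    # math.prod of an empty shape is 1, so [1] may be reshaped to [].
    return math.prod(dims) == math.prod(reshape_shape)


print(reshape_is_valid([1], []))
print(reshape_is_valid([2, 3], [6]))
print(reshape_is_valid([2, 3], [4]))
```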
-
bool
is_shape_tensor
¶ Whether or not the input is a shape tensor to the model. This field is currently supported only for TensorRT models. An error will be generated if this specification does not comply with the underlying model.
-
bool
allow_ragged_batch
¶ Whether or not the input is allowed to be “ragged” in a dynamically created batch. Default is false indicating that two requests will only be batched if this tensor has the same shape in both requests. True indicates that two requests can be batched even if this tensor has a different shape in each request. A true value is currently supported only for custom models.
-
message
ModelOutput
¶ An output produced by the model.
-
string
name
¶ The name of the output.
-
int64
dims
(repeated)¶ The dimensions/shape of the output tensor.
-
ModelTensorReshape
reshape
¶ The shape produced for this output by the backend. The output will be reshaped from this to the shape specified in ‘dims’ before being returned in the inference response. The reshape must have the same number of elements as the output shape specified by ‘dims’. Optional.
-
string
label_filename
¶ The label file associated with this output. Should be specified only for outputs that represent classifications. Optional.
-
bool
is_shape_tensor
¶ Whether or not the output is a shape tensor to the model. This field is currently supported only for TensorRT models. An error will be generated if this specification does not comply with the underlying model.
-
message
ModelVersionPolicy
¶ Policy indicating which versions of a model should be made available by the inference server.
-
message
Latest
¶ Serve only the latest version(s) of a model. This is the default policy.
-
uint32
num_versions
¶ Serve only the ‘num_versions’ highest-numbered versions. The default value of ‘num_versions’ is 1, indicating that by default only the single highest-numbered version of a model will be served.
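The selection described for the Latest policy amounts to sorting the version numbers and keeping the top ‘num_versions’. The following is only a sketch of that rule; latest_versions is a hypothetical helper, not part of the server.

```python
def latest_versions(available, num_versions=1):
    """Return the 'num_versions' highest-numbered versions, highest first."""
    return sorted(available, reverse=True)[:num_versions]


print(latest_versions([1, 2, 5, 3]))
print(latest_versions([1, 2, 5, 3], num_versions=2))
```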
-
message
All
¶ Serve all versions of the model.
-
message
ModelOptimizationPolicy
¶ Optimization settings for a model. These settings control if/how a model is optimized and prioritized by the backend framework when it is loaded.
-
message
Graph
¶ Enable generic graph optimization of the model. If not specified, the framework’s default level of optimization is used. Supports TensorFlow graphdef and savedmodel models and ONNX models. For TensorFlow, this setting enables or disables XLA for the model. For ONNX, the default is to enable all optimizations; -1 enables only basic optimizations and +1 enables only basic and extended optimizations.
-
int32
level
¶ The optimization level. Defaults to 0 (zero) if not specified.
-1: Disabled
0: Framework default
1+: Enable optimization level (greater values indicate higher optimization levels)
-
enum
ModelPriority
¶ Model priorities. A model will be given scheduling and execution preference over models at lower priorities. Current model priorities only work for TensorRT models.
-
enumerator
ModelPriority
::
PRIORITY_DEFAULT
= 0¶ The default model priority.
-
enumerator
ModelPriority
::
PRIORITY_MAX
= 1¶ The maximum model priority.
-
enumerator
ModelPriority
::
PRIORITY_MIN
= 2¶ The minimum model priority.
-
enumerator
-
message
Cuda
¶ CUDA-specific optimization settings.
-
bool
graphs
¶ Use CUDA graphs API to capture model operations and execute them more efficiently. Currently only recognized by TensorRT backend.
-
message
ExecutionAccelerators
¶ Specify the preferred execution accelerators to be used to execute the model. Currently only recognized by ONNX Runtime backend and TensorFlow backend.
For the ONNX Runtime backend, the model is deployed with the execution accelerators in priority order. The priority is determined by the order in which they are set, i.e. the provider at the front has the highest priority. Overall, the priority will be in the following order:
<gpu_execution_accelerator> (if instance is on GPU)
CUDA Execution Provider (if instance is on GPU)
<cpu_execution_accelerator>
Default CPU Execution Provider
-
message
Accelerator
¶ Specify the accelerator to be used to execute the model. Accelerators with the same name may accept different parameters depending on the backend.
-
string
name
¶ The name of the execution accelerator.
-
map<string, string>
parameters
¶ Additional parameters used to configure the accelerator.
-
Accelerator
gpu_execution_accelerator
(repeated)¶ The preferred execution provider to be used if the model instance is deployed on GPU.
For ONNX Runtime backend, possible value is “tensorrt” as name, and no parameters are required.
For TensorFlow backend, possible values are “tensorrt”, “gpu_io”.
- For “tensorrt”, the following parameters can be specified:
“precision_mode”: The precision used for optimization. Allowed values are “FP32” and “FP16”. Default value is “FP32”.
“max_cached_engines”: The maximum number of cached TensorRT engines in dynamic TensorRT ops. Default value is 100.
“minimum_segment_size”: The smallest model subgraph that will be considered for optimization by TensorRT. Default value is 3.
“max_workspace_size_bytes”: The maximum GPU memory the model can use temporarily during execution. Default value is 1GB.
For “gpu_io”, no parameters are required. If set, the model will be executed using the TensorFlow Callable API to set input and output tensors in GPU memory if possible, which can reduce data transfer overhead if the model is used in an ensemble. However, the Callable object will be created on model creation and it will request all outputs for every model execution, which may impact performance if a request does not require all outputs. This optimization will only take effect if the model instance is created with KIND_GPU.
-
Accelerator
cpu_execution_accelerator
(repeated)¶ The preferred execution provider to be used if the model instance is deployed on CPU.
For ONNX Runtime backend, possible value is “openvino” as name, and no parameters are required.
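The provider priority order described for the ONNX Runtime backend can be written out as a small function. This is an illustrative sketch only; provider_order is a hypothetical name and the provider labels are taken from the text above rather than from any ONNX Runtime API.

```python
def provider_order(on_gpu, gpu_accelerators=(), cpu_accelerators=()):
    """List execution providers from highest to lowest priority."""
    providers = []
    if on_gpu:
        # GPU accelerators come first, in the order they were configured,
        # followed by the CUDA provider.
        providers.extend(gpu_accelerators)
        providers.append("CUDA Execution Provider")
    providers.extend(cpu_accelerators)
    providers.append("Default CPU Execution Provider")
    return providers


print(provider_order(True, ["tensorrt"], ["openvino"]))
print(provider_order(False, ["tensorrt"], ["openvino"]))
```

For an instance on CPU the GPU accelerator entries are simply ignored, so only the CPU accelerators and the default CPU provider remain.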
-
message
PinnedMemoryBuffer
¶ Specify whether to use a pinned memory buffer when transferring data between non-pinned system memory and GPU memory. Using a pinned memory buffer for system from/to GPU transfers will typically provide increased performance. For example, in the common use case where the request provides inputs and delivers outputs via non-pinned system memory, if the model instance accepts GPU IOs, the inputs will be processed by two copies: from non-pinned system memory to pinned memory, and from pinned memory to GPU memory. Similarly, pinned memory will be used for delivering the outputs.
-
bool
enable
¶ Use pinned memory buffer. Default is true.
-
ModelPriority
priority
¶ The priority setting for the model. Optional.
-
ExecutionAccelerators
execution_accelerators
¶ The accelerators used for the model. Optional.
-
PinnedMemoryBuffer
input_pinned_memory
¶ Use pinned memory buffer when the data transfer for inputs is between GPU memory and non-pinned system memory. Default is true.
-
PinnedMemoryBuffer
output_pinned_memory
¶ Use pinned memory buffer when the data transfer for outputs is between GPU memory and non-pinned system memory. Default is true.
-
message
ModelQueuePolicy
¶ Queue policy for inference requests.
-
enum
TimeoutAction
¶ The action applied to timed-out requests.
-
enumerator
TimeoutAction
::
REJECT
= 0¶ Reject the request and return error message accordingly.
-
enumerator
TimeoutAction
::
DELAY
= 1¶ Delay the request until all other requests at the same (or higher) priority levels that have not reached their timeouts are processed. A delayed request will eventually be processed, but may be delayed indefinitely due to newly arriving requests.
-
TimeoutAction
timeout_action
¶ The action applied to timed-out requests. The default action is REJECT.
-
uint64
default_timeout_microseconds
¶ The default timeout for every request, in microseconds. The default value is 0 which indicates that no timeout is set.
-
bool
allow_timeout_override
¶ Whether individual requests can override the default timeout value. When true, individual requests can set a timeout that is less than the default timeout value but may not increase the timeout. The default value is false.
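One way to read the interaction between ‘default_timeout_microseconds’ and ‘allow_timeout_override’ is sketched below. The function effective_timeout_us is hypothetical, and two points are assumptions not stated in this reference: a request timeout of 0 means the request did not set one, and when the default is 0 (no timeout) any request timeout counts as shorter.

```python
def effective_timeout_us(default_timeout_us, allow_override, request_timeout_us=0):
    """Return the timeout applied to a request; 0 means no timeout."""
    # Assumption: a request timeout of 0 means "not set by the request".
    if not allow_override or request_timeout_us == 0:
        return default_timeout_us
    if default_timeout_us == 0:
        # Assumption: any finite timeout is shorter than "no timeout".
        return request_timeout_us
    # A request may shorten the timeout but never lengthen it.
    return min(default_timeout_us, request_timeout_us)


print(effective_timeout_us(1000, True, 500))
print(effective_timeout_us(1000, True, 2000))
print(effective_timeout_us(1000, False, 500))
```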
-
uint32
max_queue_size
¶ The maximum queue size for holding requests. A request will be rejected immediately if it can’t be enqueued because the queue is full. The default value is 0 which indicates that no maximum queue size is enforced.
-
message
ModelDynamicBatching
¶ Dynamic batching configuration. These settings control how dynamic batching operates for the model.
-
int32
preferred_batch_size
(repeated)¶ Preferred batch sizes for dynamic batching. If a batch of one of these sizes can be formed it will be executed immediately. If not specified a preferred batch size will be chosen automatically based on model and GPU characteristics.
-
uint64
max_queue_delay_microseconds
¶ The maximum time, in microseconds, a request will be delayed in the scheduling queue to wait for additional requests for batching. Default is 0.
-
bool
preserve_ordering
¶ Whether the dynamic batcher should preserve the ordering of responses to match the order of requests received by the scheduler. Default is false. If true, the responses will be returned in the same order as the requests were sent to the scheduler. If false, the responses may be returned in arbitrary order. This option is needed specifically when a sequence of related inference requests (i.e. inference requests with the same correlation ID) is sent to the dynamic batcher, to ensure that the sequence responses are in the correct order.
-
uint32
priority_levels
¶ The number of priority levels to be enabled for the model. Priority levels start from 1, and 1 is the highest priority. Requests are handled in priority order with all priority 1 requests processed before priority 2, all priority 2 requests processed before priority 3, etc. Requests with the same priority level will be handled in the order that they are received.
-
uint32
default_priority_level
¶ The priority level used for requests that don’t specify their priority. The value must be in the range [ 1, ‘priority_levels’ ].
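The ordering rule for ‘priority_levels’ and ‘default_priority_level’ (lower level first, arrival order within a level) can be sketched with a heap. The PriorityScheduler class below is a hypothetical illustration, not the server's scheduler; it ignores batching, timeouts and queue policies.

```python
import heapq
import itertools


class PriorityScheduler:
    def __init__(self, priority_levels, default_priority_level):
        if not 1 <= default_priority_level <= priority_levels:
            raise ValueError("default_priority_level must be in [1, priority_levels]")
        self.priority_levels = priority_levels
        self.default_priority_level = default_priority_level
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker: arrival order

    def enqueue(self, request, priority=None):
        # Requests that do not specify a priority get the default level.
        level = self.default_priority_level if priority is None else priority
        if not 1 <= level <= self.priority_levels:
            raise ValueError("priority must be in [1, priority_levels]")
        heapq.heappush(self._heap, (level, next(self._arrival), request))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]


scheduler = PriorityScheduler(priority_levels=3, default_priority_level=2)
scheduler.enqueue("a")               # default level 2
scheduler.enqueue("b", priority=3)
scheduler.enqueue("c", priority=1)
scheduler.enqueue("d")               # level 2, arrives after "a"
print([scheduler.dequeue() for _ in range(4)])
```

The arrival counter is what keeps requests at the same level in first-in, first-out order.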
-
ModelQueuePolicy
default_queue_policy
¶ The default queue policy used for requests that don’t require priority handling and requests that specify priority levels where there is no specific policy given. If not specified, a policy with default field values will be used.
-
map<uint32, ModelQueuePolicy>
priority_queue_policy
¶ Specify the queue policy for the priority level. The default queue policy will be used if a priority level doesn’t specify a queue policy.
-
message
ModelSequenceBatching
¶ Sequence batching configuration. These settings control how sequence batching operates for the model.
-
message
Control
¶ A control is a signal that the sequence batcher uses to communicate with a backend.
-
enum
Kind
¶ The kind of the control.
-
enumerator
Kind
::
CONTROL_SEQUENCE_START
= 0¶ A new sequence is/is-not starting. If true a sequence is starting, if false a sequence is continuing. Must specify either int32_false_true or fp32_false_true for this control. This control is optional.
-
enumerator
Kind
::
CONTROL_SEQUENCE_READY
= 1¶ A sequence is/is-not ready for inference. If true the input tensor data is valid and should be used. If false the input tensor data is invalid and inferencing should be “skipped”. Must specify either int32_false_true or fp32_false_true for this control. This control is optional.
-
enumerator
Kind
::
CONTROL_SEQUENCE_END
= 2¶ A sequence is/is-not ending. If true a sequence is ending, if false a sequence is continuing. Must specify either int32_false_true or fp32_false_true for this control. This control is optional.
-
enumerator
Kind
::
CONTROL_SEQUENCE_CORRID
= 3¶ The correlation ID of the sequence. The correlation ID is a uint64_t value that is communicated in whole or in part by the tensor. The tensor’s datatype must be specified by data_type and must be TYPE_UINT64, TYPE_INT64, TYPE_UINT32 or TYPE_INT32. If a 32-bit datatype is specified, the correlation ID will be truncated to the low-order 32 bits. This control is optional.
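Truncation to the low-order 32 bits is a bit mask. The sketch below shows the value that would be carried for each permitted datatype; control_corrid_value is a hypothetical helper, and it returns the unsigned bit pattern without reinterpreting it for the signed types.

```python
def control_corrid_value(correlation_id, data_type):
    """Return the correlation ID bits carried by a tensor of 'data_type'."""
    if not 0 <= correlation_id < 2**64:
        raise ValueError("correlation ID must fit in an unsigned 64-bit value")
    if data_type in ("TYPE_UINT64", "TYPE_INT64"):
        return correlation_id
    if data_type in ("TYPE_UINT32", "TYPE_INT32"):
        # Keep only the low-order 32 bits.
        return correlation_id & 0xFFFFFFFF
    raise ValueError("unsupported data_type: " + data_type)


print(control_corrid_value(0x100000005, "TYPE_UINT64"))
print(control_corrid_value(0x100000005, "TYPE_UINT32"))
```

Note that two correlation IDs differing only in their upper 32 bits become indistinguishable to the model once a 32-bit datatype is used.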
-
int32
int32_false_true
(repeated)¶ The control’s true and false setting is indicated by setting a value in an int32 tensor. The tensor must be a 1-dimensional tensor with size equal to the batch size of the request. ‘int32_false_true’ must have two entries: the first the false value and the second the true value.
-
float
fp32_false_true
(repeated)¶ The control’s true and false setting is indicated by setting a value in a fp32 tensor. The tensor must be a 1-dimensional tensor with size equal to the batch size of the request. ‘fp32_false_true’ must have two entries: the first the false value and the second the true value.
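To make the two-entry false/true convention concrete, the sketch below builds the 1-dimensional control tensor (as a plain list) for a batch. The helper control_tensor and its flags argument, one boolean per request in the batch, are hypothetical.

```python
def control_tensor(flags, false_true):
    """Map one boolean per batch entry to the configured false/true values."""
    if len(false_true) != 2:
        raise ValueError("expected two entries: [false value, true value]")
    false_value, true_value = false_true
    return [true_value if flag else false_value for flag in flags]


# int32_false_true: [0, 1]; only the first request in the batch starts a sequence.
print(control_tensor([True, False, False], [0, 1]))
# fp32_false_true: [-1.0, 1.0]
print(control_tensor([True, False], [-1.0, 1.0]))
```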
-
message
ControlInput
¶ The sequence control values to communicate by a model input.
-
string
name
¶ The name of the model input.
-
message
StrategyDirect
¶ The sequence batcher uses a specific, unique batch slot for each sequence. All inference requests in a sequence are directed to the same batch slot in the same model instance over the lifetime of the sequence. This is the default strategy.
-
message
StrategyOldest
¶ The sequence batcher maintains up to ‘max_candidate_sequences’ candidate sequences. ‘max_candidate_sequences’ can be greater than the model’s ‘max_batch_size’. For inferencing the batcher chooses from the candidate sequences up to ‘max_batch_size’ inference requests. Requests are chosen in an oldest-first manner across all candidate sequences. A given sequence is not guaranteed to be assigned to the same batch slot for all inference requests of that sequence.
-
int32
max_candidate_sequences
¶ Maximum number of candidate sequences that the batcher maintains. Excess sequences are kept in an ordered backlog and become candidates when existing candidate sequences complete.
-
int32
preferred_batch_size
(repeated)¶ Preferred batch sizes for dynamic batching of candidate sequences. If a batch of one of these sizes can be formed it will be executed immediately. If not specified a preferred batch size will be chosen automatically based on model and GPU characteristics.
-
uint64
max_queue_delay_microseconds
¶ The maximum time, in microseconds, a candidate request will be delayed in the dynamic batch scheduling queue to wait for additional requests for batching. Default is 0.
-
oneof
strategy_choice
¶ The strategy used by the sequence batcher. Default strategy is ‘direct’.
-
StrategyDirect
direct
¶ StrategyDirect scheduling strategy.
-
StrategyOldest
oldest
¶ StrategyOldest scheduling strategy.
-
uint64
max_sequence_idle_microseconds
¶ The maximum time, in microseconds, that a sequence is allowed to be idle before it is aborted. The inference server considers a sequence idle when it does not have any inference request queued for the sequence. If this limit is exceeded, the inference server will free the sequence slot allocated by the sequence and make it available for another sequence. If not specified (or specified as zero) a default value of 1000000 (1 second) is used.
-
ControlInput
control_input
(repeated)¶ The model input(s) that the server should use to communicate sequence start, stop, ready and similar control values to the model.
-
message
ModelEnsembling
¶ Model ensembling configuration. These settings specify the models that compose the ensemble and how data flows between the models.
-
message
Step
¶ Each step specifies a model included in the ensemble, maps ensemble tensor names to the model input tensors, and maps model output tensors to ensemble tensor names.
-
string
model_name
¶ The name of the model to execute for this step of the ensemble.
-
int64
model_version
¶ The version of the model to use for inference. If -1 the latest/most-recent version of the model is used.
-
map<string, string>
input_map
¶ Map from name of an input tensor on this step’s model to ensemble tensor name. The ensemble tensor must have the same data type and shape as the model input. Each model input must be assigned to one ensemble tensor, but the same ensemble tensor can be assigned to multiple model inputs.
-
map<string, string>
output_map
¶ Map from name of an output tensor on this step’s model to ensemble tensor name. The data type and shape of the ensemble tensor will be inferred from the model output. It is optional to assign all model outputs to ensemble tensors. One ensemble tensor name can appear in an output map only once.
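The two mapping rules (every model input is assigned an ensemble tensor, and an ensemble tensor name appears in an output map only once) can be checked mechanically. The sketch below is hypothetical: check_step_maps is not a server function, steps are represented as plain dicts, and it assumes the "only once" rule applies across all steps of the ensemble.

```python
def check_step_maps(steps, model_inputs):
    """Return a list of problems found in the steps' input and output maps.

    steps: list of dicts with 'model_name', 'input_map' and 'output_map'.
    model_inputs: dict from model name to that model's input tensor names.
    """
    problems = []
    produced = set()
    for step in steps:
        model = step["model_name"]
        # Every model input must be assigned to an ensemble tensor.
        for input_name in model_inputs[model]:
            if input_name not in step["input_map"]:
                problems.append(f"{model}: input '{input_name}' is not mapped")
        # Assumption: an ensemble tensor may be produced by only one output.
        for ensemble_tensor in step["output_map"].values():
            if ensemble_tensor in produced:
                problems.append(f"{model}: '{ensemble_tensor}' is already produced")
            produced.add(ensemble_tensor)
    return problems


steps = [
    {"model_name": "preprocess", "input_map": {"RAW": "IMAGE"},
     "output_map": {"OUT": "features"}},
    {"model_name": "classify", "input_map": {},
     "output_map": {"PROB": "features"}},
]
print(check_step_maps(steps, {"preprocess": ["RAW"], "classify": ["IN"]}))
```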
-
message
ModelWarmup
¶ Settings used to construct the request sample for model warmup.
-
message
Input
¶ Metadata associated with an input.
-
int64
dims
(repeated)¶ The shape of the input tensor, not including the batch dimension.
-
oneof
input_data_type
¶ Specify how the input data is generated. If the input has STRING data type and ‘random_data’ is set, the data generation will fall back to ‘zero_data’.
-
bool
zero_data
¶ The identifier for using zeros as input data. Note that the value of ‘zero_data’ will not be checked, instead, zero data will be used as long as the field is set.
-
bool
random_data
¶ The identifier for using random data as input data. Note that the value of ‘random_data’ will not be checked, instead, random data will be used as long as the field is set.
-
string
input_data_file
¶ The file whose content will be used as raw input data in row-major order. The file must be provided in a sub-directory ‘warmup’ under the model directory.
-
string
name
¶ The name of the request sample.
-
uint32
batch_size
¶ The batch size of the inference request. This must be >= 1. For models that don’t support batching, batch_size must be 1. If batch_size > 1, the ‘inputs’ specified below will be duplicated to match the batch size requested.
-
message
ModelConfig
¶ A model configuration.
-
string
name
¶ The name of the model.
-
string
platform
¶ The framework for the model. Possible values are “tensorrt_plan”, “tensorflow_graphdef”, “tensorflow_savedmodel”, “caffe2_netdef”, “onnxruntime_onnx”, “pytorch_libtorch” and “custom”.
-
ModelVersionPolicy
version_policy
¶ Policy indicating which version(s) of the model will be served.
-
int32
max_batch_size
¶ Maximum batch size allowed for inference. This can only decrease what is allowed by the model itself. A max_batch_size value of 0 indicates that batching is not allowed for the model and the dimension/shape of the input and output tensors must exactly match what is specified in the input and output configuration. A max_batch_size value > 0 indicates that batching is allowed and so the model expects the input tensors to have an additional initial dimension for the batching that is not specified in the input (for example, if the model supports batched inputs of 2-dimensional tensors then the model configuration will specify the input shape as [ X, Y ] but the model will expect the actual input tensors to have shape [ N, X, Y ]). For max_batch_size > 0 returned outputs will also have an additional initial dimension for the batch.
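The relationship between the configured ‘dims’ and the shape the model actually receives can be sketched as follows. The helper expected_tensor_shape is hypothetical, and X and Y from the text are replaced by concrete numbers.

```python
def expected_tensor_shape(dims, max_batch_size, batch_size=1):
    """Return the full tensor shape implied by 'dims' and 'max_batch_size'."""
    if max_batch_size == 0:
        # No batching: the tensor shape must match 'dims' exactly.
        return list(dims)
    if not 1 <= batch_size <= max_batch_size:
        raise ValueError("batch_size must be in [1, max_batch_size]")
    # Batching: an initial batch dimension N is added in front of 'dims'.
    return [batch_size] + list(dims)


print(expected_tensor_shape([16, 32], max_batch_size=0))
print(expected_tensor_shape([16, 32], max_batch_size=8, batch_size=4))
```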
-
ModelInput
input
(repeated)¶ The inputs requested by the model.
-
ModelOutput
output
(repeated)¶ The outputs produced by the model.
-
ModelOptimizationPolicy
optimization
¶ Optimization configuration for the model. If not specified then default optimization policy is used.
-
oneof
scheduling_choice
¶ The scheduling policy for the model. If not specified the default scheduling policy is used for the model. The default policy is to execute each inference request independently.
-
ModelDynamicBatching
dynamic_batching
¶ If specified, enables the dynamic-batching scheduling policy. With dynamic-batching the scheduler may group together independent requests into a single batch to improve inference throughput.
-
ModelSequenceBatching
sequence_batching
¶ If specified, enables the sequence-batching scheduling policy. With sequence-batching, inference requests with the same correlation ID are routed to the same model instance. Multiple sequences of inference requests may be batched together into a single batch to improve inference throughput.
-
ModelEnsembling
ensemble_scheduling
¶ If specified, enables the model-ensembling scheduling policy. With model-ensembling, inference requests will be processed according to the specification, such as an execution sequence of models. The input specified in this model config will be the input for the ensemble, and the output specified will be the output of the ensemble.
-
ModelInstanceGroup
instance_group
(repeated)¶ Instances of this model. If not specified, one instance of the model will be instantiated on each available GPU.
-
string
default_model_filename
¶ Optional filename of the model file to use if a compute-capability specific model is not specified in cc_model_filenames. If not specified, the default name is ‘model.graphdef’, ‘model.savedmodel’, ‘model.plan’ or ‘model.netdef’ depending on the model type.
-
map<string, string>
cc_model_filenames
¶ Optional map from CUDA compute capability to the filename of the model that supports that compute capability. The filename refers to a file within the model version directory.
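The lookup order implied by ‘cc_model_filenames’ and ‘default_model_filename’ can be sketched as a fallback chain. Everything here is illustrative: select_model_filename is a hypothetical helper, the compute capability key format ("7.5") is an assumption, and the pairing of each platform with its default filename is inferred from the names rather than stated in this reference.

```python
# Assumed pairing of platform to default model filename.
DEFAULT_FILENAMES = {
    "tensorflow_graphdef": "model.graphdef",
    "tensorflow_savedmodel": "model.savedmodel",
    "tensorrt_plan": "model.plan",
    "caffe2_netdef": "model.netdef",
}


def select_model_filename(platform, compute_capability=None,
                          default_model_filename="", cc_model_filenames=None):
    """Pick the model file for an instance (illustrative only)."""
    cc_model_filenames = cc_model_filenames or {}
    # 1. A compute-capability specific file wins when one is listed.
    if compute_capability in cc_model_filenames:
        return cc_model_filenames[compute_capability]
    # 2. Otherwise use the configured default filename, if any.
    if default_model_filename:
        return default_model_filename
    # 3. Otherwise fall back to the platform's default name.
    return DEFAULT_FILENAMES[platform]


cc_files = {"7.5": "model_cc75.plan"}
print(select_model_filename("tensorrt_plan", "7.5", cc_model_filenames=cc_files))
print(select_model_filename("tensorrt_plan", "6.1", cc_model_filenames=cc_files))
```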
-
map<string, string>
metric_tags
¶ Optional metric tags. User-specific key-value pairs for metrics reported for this model. These tags are applied to the metrics reported on the HTTP metrics port.
-
map<string, ModelParameter>
parameters
¶ Optional model parameters. User-specified parameter values that are made available to custom backends.
-
ModelWarmup
model_warmup
(repeated)¶ Warmup setting of this model. If specified, all instances will be run with the request samples in sequence before serving the model. This field can only be specified if the model is not an ensemble model.