model_config.proto¶

enum DataType¶

Data types supported for input and output tensors.

enumerator DataType::INVALID = 0¶

enumerator DataType::BOOL = 1¶

enumerator DataType::UINT8 = 2¶

enumerator DataType::UINT16 = 3¶

enumerator DataType::UINT32 = 4¶

enumerator DataType::UINT64 = 5¶

enumerator DataType::INT8 = 6¶

enumerator DataType::INT16 = 7¶

enumerator DataType::INT32 = 8¶

enumerator DataType::INT64 = 9¶

enumerator DataType::FP16 = 10¶

enumerator DataType::FP32 = 11¶

enumerator DataType::FP64 = 12¶

enumerator DataType::STRING = 13¶

message ModelInstanceGroup¶

A group of one or more instances of a model and resources made available for those instances.

enum Kind¶

Kind of this instance group.

enumerator Kind::KIND_AUTO = 0¶: This instance group represents instances that can run on either CPU or GPU. If all GPUs listed in ‘gpus’ are available then instances will be created on GPU(s), otherwise instances will be created on CPU.

enumerator Kind::KIND_GPU = 1¶: This instance group represents instances that must run on the GPU.

enumerator Kind::KIND_CPU = 2¶: This instance group represents instances that must run on the CPU.

enumerator Kind::KIND_MODEL = 3¶: This instance group represents instances that should run on the CPU and/or GPU(s) as specified by the model or backend itself. The inference server will not override the model/backend settings. Currently, this option is supported only for Tensorflow models.

string name¶: Optional name of this group of instances. If not specified the name will be formed as <model name>_<group number>. The name of individual instances will be further formed by a unique instance number and GPU index:

Kind kind¶: The kind of this instance group. Default is KIND_AUTO. If KIND_AUTO or KIND_GPU then both ‘count’ and ‘gpu’ are valid and may be specified. If KIND_CPU or KIND_MODEL only ‘count’ is valid and ‘gpu’ cannot be specified.

int32 count¶: For a group assigned to GPU, the number of instances created for each GPU listed in ‘gpus’. For a group assigned to CPU the number of instances created. Default is 1.

int32 gpus(repeated)¶: GPU(s) where instances should be available. For each GPU listed, ‘count’ instances of the model will be available. Setting ‘gpus’ to empty (or not specifying at all) is eqivalent to listing all available GPUs.

string profile¶: For TensorRT models, using inputs with dynamic shape, this parameter specifies the optimization profile to be set for this instance group. This field should lie between 0 and <TotalNumberOfOptimizationProfilesInPlanModel> - 1 and be specified only for TensorRT backend, otherwise an error will be generated.

message ModelTensorReshape¶

Reshape specification for input and output tensors.

int64 shape(repeated)¶: The shape to use for reshaping.

message ModelInput¶

An input required by the model.

enum Format¶

The format for the input.

enumerator Format::FORMAT_NONE = 0¶: The input has no specific format. This is the default.

enumerator Format::FORMAT_NHWC = 1¶: HWC image format. Tensors with this format require 3 dimensions if the model does not support batching (max_batch_size = 0) or 4 dimensions if the model does support batching (max_batch_size >= 1). In either case the ‘dims’ below should only specify the 3 non-batch dimensions (i.e. HWC or CHW).

enumerator Format::FORMAT_NCHW = 2¶: CHW image format. Tensors with this format require 3 dimensions if the model does not support batching (max_batch_size = 0) or 4 dimensions if the model does support batching (max_batch_size >= 1). In either case the ‘dims’ below should only specify the 3 non-batch dimensions (i.e. HWC or CHW).

string name¶: The name of the input.

DataType data_type¶: The data-type of the input.

Format format¶: The format of the input. Optional.

int64 dims(repeated)¶: The dimensions/shape of the input tensor that must be provided when invoking the inference API for this model.

ModelTensorReshape reshape¶: The shape expected for this input by the backend. The input will be reshaped to this before being presented to the backend. The reshape must have the same number of elements as the input shape specified by ‘dims’. Optional.

message ModelOutput¶

An output produced by the model.

string name¶: The name of the output.

DataType data_type¶: The data-type of the output.

int64 dims(repeated)¶: The dimensions/shape of the output tensor.

ModelTensorReshape reshape¶: The shape produced for this output by the backend. The output will be reshaped from this to the shape specifed in ‘dims’ before being returned in the inference response. The reshape must have the same number of elements as the output shape specified by ‘dims’. Optional.

string label_filename¶: The label file associated with this output. Should be specified only for outputs that represent classifications. Optional.

message ModelVersionPolicy¶

Policy indicating which versions of a model should be made available by the inference server.

message Latest¶

Serve only the latest version(s) of a model. This is the default policy.

uint32 num_versions¶: Serve only the ‘num_versions’ highest-numbered versions. T The default value of ‘num_versions’ is 1, indicating that by default only the single highest-number version of a model will be served.

message All¶: Serve all versions of the model.

message Specific¶

Serve only specific versions of the model.

int64 versions(repeated)¶: The specific versions of the model that will be served.

oneof policy_choice¶

Each model must implement only a single version policy. The default policy is ‘Latest’.

Latest latest¶: Serve only latest version(s) of the model.

All all¶: Serve all versions of the model.

Specific specific¶: Serve only specific version(s) of the model.

message ModelOptimizationPolicy¶

Optimization settings for a model. These settings control if/how a model is optimized and prioritized by the backend framework when it is loaded.

message Graph¶

Enable generic graph optimization of the model. If not specified the framework’s default level of optimization is used. Currently only supported for TensorFlow graphdef and savedmodel models and causes XLA to be enabled/disabled for the model.

int32 level¶

The optimization level. Defaults to 0 (zero) if not specified.

-1: Disabled

0: Framework default

1+: Enable optimization level (greater values indicate higher optimization levels)

enum ModelPriority¶

Model priorities. A model will be given scheduling and execution preference over models at lower priorities. Current model priorities only work for TensorRT models.

enumerator ModelPriority::PRIORITY_DEFAULT = 0¶: The default model priority.

enumerator ModelPriority::PRIORITY_MAX = 1¶: The maximum model priority.

enumerator ModelPriority::PRIORITY_MIN = 2¶: The minimum model priority.

message Cuda¶

CUDA-specific optimization settings.

bool graphs¶: Use CUDA graphs API to capture model operations and execute them more efficiently. Currently only recognized by TensorRT backend.

message ExecutionAccelerators¶

Specify the preferred execution accelerators to be used to execute the model. Currently only recognized by ONNX Runtime backend.

For ONNX Runtime backend, multiple execution accelerators may be set for both GPU and CPU, in such case, the priority will be in the following order:

<gpu_execution_accelerator> (if instance is on GPU) CUDA Execution Provider (if instance is on GPU) <cpu_execution_accelerator> Default CPU Execution Provider

message Accelerator¶

Specify the accelerator to be used to execute the model. Accelerator with the same name may accept different parameters depending on the backends.

string name¶: The name of the execution accelerator.

map<string, string> parameters¶: Additional paremeters used to configure the accelerator.

Accelerator gpu_execution_accelerator(repeated)¶

The preferred execution provider to be used if the model instance is deployed on GPU. The order priority implies priority, i.e. the provider at the front has highest priority.

For ONNX Runtime backend, possible value is “tensorrt” as name, and no parameters are required.

For TensorFlow backend, possible value is “tensorrt” as name, with the following parameters:

“precision_mode” The precission used for optimization. The value can be one of “FP32”, “FP16” and “INT8”. Default value is “FP32”.

“minimum_segment_size” The smallest model subgraph that will be considered for optimization by TensorRT. Default value is 3.

“max_workspace_size_bytes” The maximum GPU memory the model can use temporarily during execution. Default value is 1GB.

Accelerator cpu_execution_accelerator(repeated)¶

The preferred execution provider to be used if the model instance is deployed on CPU, or if the model instance is deployed on GPU but both ‘gpu_execution_accelerator’ and CUDA Execution Provider don’t support the operators. The order priority implies priority, i.e. the provider at the front has highest priority.

For ONNX Runtime backend, possible value is “openvino” as name, and no parameters are required.

Graph graph¶: The graph optimization setting for the model. Optional.

ModelPriority priority¶: The priority setting for the model. Optional.

Cuda cuda¶: CUDA-specific optimization settings. Optional.

ExecutionAccelerators execution_accelerators¶: The accelerators used for the model. Optional.

message ModelDynamicBatching¶

Dynamic batching configuration. These settings control how dynamic batching operates for the model.

int32 preferred_batch_size(repeated)¶: Preferred batch sizes for dynamic batching. If a batch of one of these sizes can be formed it will be executed immediately. If not specified a preferred batch size will be chosen automatically based on model and GPU characteristics.

uint64 max_queue_delay_microseconds¶: The maximum time, in microseconds, a request will be delayed in the scheduling queue to wait for additional requests for batching. Default is 0.

message ModelSequenceBatching¶

Sequence batching configuration. These settings control how sequence batching operates for the model.

uint64 max_sequence_idle_microseconds¶: The maximum time, in microseconds, that a sequence is allowed to be idle before it is aborted. The inference server considers a sequence idle when it does not have any inference request queued for the sequence. If this limit is exceeded, the inference server will free the batch slot allocated by the sequence and make it available for another sequence. If not specified (or specified as zero) a default value of 1000000 (1 second) is used.

message Control¶

A control is a binary signal to a backend.

enum Kind¶

The kind of the control.

enumerator Kind::CONTROL_SEQUENCE_START = 0¶: A new sequence is/is-not starting. If true a sequence is starting, if false a sequence is continuing.

enumerator Kind::CONTROL_SEQUENCE_READY = 1¶: A sequence is/is-not ready for inference. If true the input tensor data is valid and should be used. If false the input tensor data is invalid and inferencing should be “skipped”.

Kind kind¶: The kind of this control.

int32 int32_false_true(repeated)¶: The control’s true and false setting is indicated by setting a value in an int32 tensor. The tensor must be a 1-dimensional tensor with size equal to the batch size of the request. ‘int32_false_true’ must have two entries: the first the false value and the second the true value.

float fp32_false_true(repeated)¶: The control’s true and false setting is indicated by setting a value in a fp32 tensor. The tensor must be a 1-dimensional tensor with size equal to the batch size of the request. ‘fp32_false_true’ must have two entries: the first the false value and the second the true value.

message ControlInput¶

The sequence control values to communicate by a model input.

string name¶: The name of the model input.

Control control(repeated)¶: The control value(s) that should be communicated to the model using this model input.

ControlInput control_input(repeated)¶: The model input(s) that the server should use to communicate sequence start, stop, ready and similar control values to the model.

message ModelEnsembling¶

Model ensembling configuration. These settings specify the models that compose the ensemble and how data flows between the models.

message Step¶: Each step specifies a model included in the ensemble, maps ensemble tensor names to the model input tensors, and maps model output tensors to ensemble tensor names

string model_name¶: The name of the model to execute for this step of the ensemble.

int64 model_version¶: The version of the model to use for inference. If -1 the latest/most-recent version of the model is used.

map<string, string> input_map¶: Map from name of an input tensor on this step’s model to ensemble tensor name. The ensemble tensor must have the same data type and shape as the model input. Each model input must be assigned to one ensemble tensor, but the same ensemble tensor can be assigned to multiple model inputs.

map<string, string> output_map¶: Map from name of an output tensor on this step’s model to ensemble tensor name. The data type and shape of the ensemble tensor will be inferred from the model output. It is optional to assign all model outputs to ensemble tensors. One ensemble tensor name can appear in an output map only once.

Step step(repeated)¶: The models and the input / output mappings used within the ensemble.

message ModelParameter¶

A model parameter.

string string_value¶: The string value of the parameter.

message ModelConfig¶

A model configuration.

string name¶: The name of the model.

string platform¶: The framework for the model. Possible values are “tensorrt_plan”, “tensorflow_graphdef”, “tensorflow_savedmodel”, “caffe2_netdef”, “onnxruntime_onnx”, “pytorch_libtorch” and “custom”.

ModelVersionPolicy version_policy¶: Policy indicating which version(s) of the model will be served.

int32 max_batch_size¶: Maximum batch size allowed for inference. This can only decrease what is allowed by the model itself. A max_batch_size value of 0 indicates that batching is not allowed for the model and the dimension/shape of the input and output tensors must exactly match what is specified in the input and output configuration. A max_batch_size value > 0 indicates that batching is allowed and so the model expects the input tensors to have an additional initial dimension for the batching that is not specified in the input (for example, if the model supports batched inputs of 2-dimensional tensors then the model configuration will specify the input shape as [ X, Y ] but the model will expect the actual input tensors to have shape [ N, X, Y ]). For max_batch_size > 0 returned outputs will also have an additional initial dimension for the batch.

ModelInput input(repeated)¶: The inputs request by the model.

ModelOutput output(repeated)¶: The outputs produced by the model.

ModelOptimizationPolicy optimization¶: Optimization configuration for the model. If not specified then default optimization policy is used.

oneof scheduling_choice¶

The scheduling policy for the model. If not specified the default scheduling policy is used for the model. The default policy is to execute each inference request independently.

ModelDynamicBatching dynamic_batching¶: If specified, enables the dynamic-batching scheduling policy. With dynamic-batching the scheduler may group together independent requests into a single batch to improve inference throughput.

ModelSequenceBatching sequence_batching¶: If specified, enables the sequence-batching scheduling policy. With sequence-batching, inference requests with the same correlation ID are routed to the same model instance. Multiple sequences of inference requests may be batched together into a single batch to improve inference throughput.

ModelEnsembling ensemble_scheduling¶: If specified, enables the model-ensembling scheduling policy. With model-ensembling, inference requests will be processed according to the specification, such as an execution sequence of models. The input specified in this model config will be the input for the ensemble, and the output specified will be the output of the ensemble.

ModelInstanceGroup instance_group(repeated)¶: Instances of this model. If not specified, one instance of the model will be instantiated on each available GPU.

string default_model_filename¶: Optional filename of the model file to use if a compute-capability specific model is not specified in cc_model_filenames. If not specified the default name is ‘model.graphdef’, ‘model.savedmodel’, ‘model.plan’ or ‘model.netdef’ depending on the model type.

map<string, string> cc_model_filenames¶: Optional map from CUDA compute capability to the filename of the model that supports that compute capability. The filename refers to a file within the model version directory.

map<string, string> metric_tags¶: Optional metric tags. User-specific key-value pairs for metrics reported for this model. These tags are applied to the metrics reported on the HTTP metrics port.

map<string, ModelParameter> parameters¶: Optional model parameters. User-specified parameter values that are made available to custom backends.