grpc_service.proto

service InferenceService

Inference Server GRPC endpoints.

rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse)

Check liveness of the inference server.

rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse)

Check readiness of the inference server.

rpc ModelReady(ModelReadyRequest) returns (ModelReadyResponse)

Check readiness of a model in the inference server.

rpc ServerMetadata(ServerMetadataRequest) returns (ServerMetadataResponse)

Get server metadata.

rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse)

Get model metadata.

rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse)

Perform inference using a specific model.

rpc ModelStreamInfer(stream ModelInferRequest) returns (stream ModelStreamInferResponse)

Perform streaming inference.

rpc ModelConfig(ModelConfigRequest) returns (ModelConfigResponse)

Get model configuration.

rpc ModelStatistics(ModelStatisticsRequest) returns (ModelStatisticsResponse)

Get the cumulative inference statistics for a model.

rpc RepositoryIndex(RepositoryIndexRequest) returns (RepositoryIndexResponse)

Get the index of model repository contents.

rpc RepositoryModelLoad(RepositoryModelLoadRequest) returns (RepositoryModelLoadResponse)

Load or reload a model from a repository.

rpc RepositoryModelUnload(RepositoryModelUnloadRequest) returns (RepositoryModelUnloadResponse)

Unload a model.

rpc SystemSharedMemoryStatus(SystemSharedMemoryStatusRequest) returns (SystemSharedMemoryStatusResponse)

Get the status of all registered system-shared-memory regions.

rpc SystemSharedMemoryRegister(SystemSharedMemoryRegisterRequest) returns (SystemSharedMemoryRegisterResponse)

Register a system-shared-memory region.

rpc SystemSharedMemoryUnregister(SystemSharedMemoryUnregisterRequest) returns (SystemSharedMemoryUnregisterResponse)

Unregister a system-shared-memory region.

rpc CudaSharedMemoryStatus(CudaSharedMemoryStatusRequest) returns (CudaSharedMemoryStatusResponse)

Get the status of all registered CUDA-shared-memory regions.

rpc CudaSharedMemoryRegister(CudaSharedMemoryRegisterRequest) returns (CudaSharedMemoryRegisterResponse)

Register a CUDA-shared-memory region.

rpc CudaSharedMemoryUnregister(CudaSharedMemoryUnregisterRequest) returns (CudaSharedMemoryUnregisterResponse)

Unregister a CUDA-shared-memory region.
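
As a concrete illustration, the following Python sketch calls the two health RPCs. It assumes stubs generated from grpc_service.proto with protoc (producing modules grpc_service_pb2 and grpc_service_pb2_grpc) and a server GRPC endpoint on localhost:8001; both are assumptions, not fixed by this reference.

    import grpc
    import grpc_service_pb2
    import grpc_service_pb2_grpc

    # Channel to the server's GRPC endpoint (localhost:8001 is an assumption).
    channel = grpc.insecure_channel("localhost:8001")
    stub = grpc_service_pb2_grpc.InferenceServiceStub(channel)

    # Liveness and readiness are plain request/response calls with
    # empty request messages.
    print("live: ", stub.ServerLive(grpc_service_pb2.ServerLiveRequest()).live)
    print("ready:", stub.ServerReady(grpc_service_pb2.ServerReadyRequest()).ready)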

message ServerLiveRequest

Request message for ServerLive.

message ServerLiveResponse

Response message for ServerLive.

bool live

True if the inference server is live, false if not live.

message ServerReadyRequest

Request message for ServerReady.

message ServerReadyResponse

Response message for ServerReady.

bool ready

True if the inference server is ready, false if not ready.

message ModelReadyRequest

Request message for ModelReady.

string name

The name of the model to check for readiness.

string version

The version of the model to check for readiness. If not given the server will choose a version based on the model and internal policy.

message ModelReadyResponse

Response message for ModelReady.

bool ready

True if the model is ready, false if not ready.

message ServerMetadataRequest

Request message for ServerMetadata.

message ServerMetadataResponse

Response message for ServerMetadata.

string name

The server name.

string version

The server version.

string extensions(repeated)

The extensions supported by the server.

message ModelMetadataRequest

Request message for ModelMetadata.

string name

The name of the model.

string version

The version of the model to get metadata for. If not given the server will choose a version based on the model and internal policy.

message ModelMetadataResponse

Response message for ModelMetadata.

message TensorMetadata

Metadata for a tensor.

string name

The tensor name.

string datatype

The tensor data type.

int64 shape(repeated)

The tensor shape. A variable-size dimension is represented by a -1 value.

string name

The model name.

string versions(repeated)

The versions of the model.

string platform

The model’s platform.

TensorMetadata inputs(repeated)

The model’s inputs.

TensorMetadata outputs(repeated)

The model’s outputs.

message InferParameter

An inference parameter value.

oneof parameter_choice

The parameter value can be a string, an int64, or a boolean.

bool bool_param

A boolean parameter value.

int64 int64_param

An int64 parameter value.

string string_param

A string parameter value.
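
Because parameter_choice is a protobuf oneof, setting one member clears the others. A minimal sketch using the generated Python classes (assuming the stub modules from the liveness example above):

    import grpc_service_pb2

    param = grpc_service_pb2.InferParameter()
    param.string_param = "abc"
    param.int64_param = 42   # overwrites string_param: a oneof holds one value
    # WhichOneof reports the member currently set, or None if none is set.
    assert param.WhichOneof("parameter_choice") == "int64_param"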

message InferTensorContents

The data contained in a tensor, represented by the repeated type that matches the tensor’s data type. Protobuf oneof is not used because oneofs cannot contain repeated fields.

bool bool_contents(repeated)

Representation for BOOL data type. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

int32 int_contents(repeated)

Representation for INT8, INT16, and INT32 data types. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

int64 int64_contents(repeated)

Representation for INT64 data types. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

uint32 uint_contents(repeated)

Representation for UINT8, UINT16, and UINT32 data types. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

uint64 uint64_contents(repeated)

Representation for UINT64 data types. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

float fp32_contents(repeated)

Representation for FP32 data type. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

double fp64_contents(repeated)

Representation for FP64 data type. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

bytes bytes_contents(repeated)

Representation for BYTES data type. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.
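
To make the flattening requirement concrete, here is a short sketch that fills int_contents for a 2x3 INT32 tensor (INT8, INT16, and INT32 all share int_contents; the shape itself is carried on the enclosing tensor message, not here):

    import grpc_service_pb2

    # The 2x3 INT32 tensor [[1, 2, 3], [4, 5, 6]] is sent as its row-major
    # flattening; the shape lives on the enclosing InferInputTensor.
    contents = grpc_service_pb2.InferTensorContents()
    contents.int_contents.extend([1, 2, 3, 4, 5, 6])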

message ModelInferRequest

Request message for ModelInfer.

message InferInputTensor

An input tensor for an inference request.

string name

The tensor name.

string datatype

The tensor data type.

int64 shape(repeated)

The tensor shape.

map<string, InferParameter> parameters

Optional inference input tensor parameters.

InferTensorContents contents

The tensor contents using a data-type format. This field must not be specified if tensor contents are being specified in ModelInferRequest.raw_input_contents.

message InferRequestedOutputTensor

An output tensor requested for an inference request.

string name

The tensor name.

map<string, InferParameter> parameters

Optional requested output tensor parameters.

string model_name

The name of the model to use for inferencing.

string model_version

The version of the model to use for inference. If not given, the latest (most recent) version of the model is used.

string id

Optional identifier for the request. If specified, it will be returned in the response.

map<string, InferParameter> parameters

Optional inference parameters.

InferInputTensor inputs(repeated)

The input tensors for the inference.

InferRequestedOutputTensor outputs(repeated)

The requested output tensors for the inference. Optional; if not specified, all outputs declared in the model config will be returned.

bytes raw_input_contents(repeated)

The data contained in an input tensor can be represented in “raw” bytes form or in the repeated type that matches the tensor’s data type. Using the “raw” bytes form will typically allow higher performance due to the way protobuf allocation and reuse interacts with GRPC. For example, see https://github.com/grpc/grpc/issues/23231.

To use the raw representation ‘raw_input_contents’ must be initialized with data for each tensor in the same order as ‘inputs’. For each tensor, the size of this content must match what is expected by the tensor’s shape and data type. The raw data must be the flattened, one-dimensional, row-major order of the tensor elements without any stride or padding between the elements. Note that the FP16 data type must be represented as raw content as there is no specific data type for a 16-bit float type.

If this field is specified then InferInputTensor::contents must not be specified for any input tensor.
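
Putting the pieces together, a sketch of an inference call using the raw representation; the model name "my_model" and input name "input0" are hypothetical, and the stub is created as in the liveness example:

    import numpy as np
    import grpc, grpc_service_pb2, grpc_service_pb2_grpc

    stub = grpc_service_pb2_grpc.InferenceServiceStub(
        grpc.insecure_channel("localhost:8001"))

    data = np.arange(12, dtype=np.float32).reshape(3, 4)

    request = grpc_service_pb2.ModelInferRequest()
    request.model_name = "my_model"          # hypothetical model
    inp = request.inputs.add()
    inp.name = "input0"                      # hypothetical input name
    inp.datatype = "FP32"
    inp.shape.extend([3, 4])
    # Raw form: one bytes entry per input, in the same order as 'inputs',
    # holding the flattened row-major elements with no padding. When raw
    # bytes are used, InferInputTensor.contents must be left unset.
    request.raw_input_contents.append(data.tobytes())

    response = stub.ModelInfer(request)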

message ModelInferResponse

Response message for ModelInfer.

message InferOutputTensor

An output tensor returned for an inference request.

string name

The tensor name.

string datatype

The tensor data type.

int64 shape(repeated)

The tensor shape.

map<string, InferParameter> parameters

Optional output tensor parameters.

InferTensorContents contents

The tensor contents using a data-type format. This field must not be specified if tensor contents are being specified in ModelInferResponse.raw_output_contents.

string model_name

The name of the model used for inference.

string model_version

The version of the model used for inference.

string id

The id of the inference request if one was specified.

map<string, InferParameter> parameters

Optional inference response parameters.

InferOutputTensor outputs(repeated)

The output tensors holding inference results.

bytes raw_output_contents(repeated)

The data contained in an output tensor can be represented in “raw” bytes form or in the repeated type that matches the tensor’s data type. Using the “raw” bytes form will typically allow higher performance due to the way protobuf allocation and reuse interacts with GRPC. For example, see https://github.com/grpc/grpc/issues/23231.

To use the raw representation ‘raw_output_contents’ must be initialized with data for each tensor in the same order as ‘outputs’. For each tensor, the size of this content must match what is expected by the tensor’s shape and data type. The raw data must be the flattened, one-dimensional, row-major order of the tensor elements without any stride or padding between the elements. Note that the FP16 data type must be represented as raw content as there is no specific data type for a 16-bit float type.

If this field is specified then InferOutputTensor::contents must not be specified for any output tensor.
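
Continuing the ModelInfer sketch above, raw output bytes can be decoded back into an array using the datatype and shape reported in 'outputs' (the FP32 assumption mirrors the request):

    import numpy as np

    # outputs[k] describes the bytes held in raw_output_contents[k].
    out = response.outputs[0]
    array = np.frombuffer(response.raw_output_contents[0],
                          dtype=np.float32).reshape(tuple(out.shape))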

message ModelStreamInferResponse

Response message for ModelStreamInfer.

string error_message

The message describing the error. An empty message indicates the inference was successful without errors.

ModelInferResponse infer_response

Holds the results of the request.
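
Because ModelStreamInfer is bidirectional, the generated Python stub takes an iterator of requests and yields responses as they arrive; per-request errors come back in error_message rather than tearing down the stream. A hedged sketch, reusing the hypothetical model from the ModelInfer example:

    import grpc, grpc_service_pb2, grpc_service_pb2_grpc

    stub = grpc_service_pb2_grpc.InferenceServiceStub(
        grpc.insecure_channel("localhost:8001"))

    def requests():
        for i in range(3):
            req = grpc_service_pb2.ModelInferRequest()
            req.model_name = "my_model"      # hypothetical model
            req.id = str(i)                  # echoed back in the response
            # ... populate inputs as in the ModelInfer example ...
            yield req

    for resp in stub.ModelStreamInfer(requests()):
        if resp.error_message:
            print("request failed:", resp.error_message)
        else:
            print("request", resp.infer_response.id, "succeeded")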

message ModelConfigRequest

Request message for ModelConfig.

string name

The name of the model.

string version

The version of the model. If not given the model version is selected automatically based on the version policy.

message ModelConfigResponse

Response message for ModelConfig.

ModelConfig config

The model configuration.

message ModelStatisticsRequest

Request message for ModelStatistics.

string name

The name of the model. If not given, statistics are returned for all models.

string version

The version of the model. If not given, statistics are returned for all model versions.

message StatisticDuration

Statistic recording a cumulative duration metric.

uint64 count

Cumulative number of times this metric occurred.

uint64 total_time_ns

Total collected duration of this metric in nanoseconds.
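
A common derived quantity is the average duration, total_time_ns / count. A small helper, guarding the zero-count case:

    def average_ms(stat):
        """Average duration of a StatisticDuration in milliseconds."""
        return stat.total_time_ns / stat.count / 1e6 if stat.count else 0.0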

message InferStatistics

Inference statistics.

StatisticDuration success

Cumulative count and duration for successful inference requests.

StatisticDuration fail

Cumulative count and duration for failed inference requests.

StatisticDuration queue

The count and cumulative duration that inference requests wait in scheduling or other queues.

StatisticDuration compute_input

The count and cumulative duration to prepare input tensor data as required by the model framework / backend. For example, this duration should include the time to copy input tensor data to the GPU.

StatisticDuration compute_infer

The count and cumulative duration to execute the model.

StatisticDuration compute_output

The count and cumulative duration to extract output tensor data produced by the model framework / backend. For example, this duration should include the time to copy output tensor data from the GPU.

message InferBatchStatistics

Inference batch statistics.

uint64 batch_size

The size of the batch.

StatisticDuration compute_input

The count and cumulative duration to prepare input tensor data as required by the model framework / backend with the given batch size. For example, this duration should include the time to copy input tensor data to the GPU.

StatisticDuration compute_infer

The count and cumulative duration to execute the model with the given batch size.

StatisticDuration compute_output

The count and cumulative duration to extract output tensor data produced by the model framework / backend with the given batch size. For example, this duration should include the time to copy output tensor data from the GPU.

message ModelStatistics

Statistics for a specific model and version.

string name

The name of the model.

string version

The version of the model.

uint64 last_inference

The timestamp of the last inference request made for this model, as milliseconds since the epoch.

uint64 inference_count

The cumulative count of successful inference requests made for this model. Each inference in a batched request is counted as an individual inference. For example, if a client sends a single inference request with batch size 64, “inference_count” will be incremented by 64. Similarly, if a client sends 64 individual requests each with batch size 1, “inference_count” will be incremented by 64.

uint64 execution_count

The cumulative count of the number of successful inference executions performed for the model. When dynamic batching is enabled, a single model execution can perform inferencing for more than one inference request. For example, if a client sends 64 individual requests each with batch size 1 and the dynamic batcher batches them into a single large batch for model execution, then “execution_count” will be incremented by 1. If, on the other hand, the dynamic batcher is not enabled and each of the 64 individual requests is executed independently, then “execution_count” will be incremented by 64.

InferStatistics inference_stats

The aggregate statistics for the model/version.

InferBatchStatistics batch_stats(repeated)

The aggregate statistics for each different batch size that is executed in the model. The batch statistics indicate how many actual model executions were performed and show differences due to different batch size (for example, larger batches typically take longer to compute).
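
For example, inference_count / execution_count gives the average batch size actually achieved, and the per-batch-size durations show how compute time scales with batch size. A sketch (field names as documented above; model name hypothetical):

    import grpc, grpc_service_pb2, grpc_service_pb2_grpc

    stub = grpc_service_pb2_grpc.InferenceServiceStub(
        grpc.insecure_channel("localhost:8001"))

    resp = stub.ModelStatistics(
        grpc_service_pb2.ModelStatisticsRequest(name="my_model"))
    for ms in resp.model_stats:
        if ms.execution_count:
            print(ms.name, ms.version, "avg batch size:",
                  ms.inference_count / ms.execution_count)
        for bs in ms.batch_stats:
            # Average model-execution time at this batch size, in ms.
            print("  batch", bs.batch_size, "avg compute ms:",
                  bs.compute_infer.total_time_ns
                  / max(bs.compute_infer.count, 1) / 1e6)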

message ModelStatisticsResponse

Response message for ModelStatistics.

ModelStatistics model_stats(repeated)

Statistics for each requested model.

message RepositoryIndexRequest

Request message for RepositoryIndex.

string repository_name

The name of the repository. If empty the index is returned for all repositories.

bool ready

If true, return only models currently ready for inferencing.

message RepositoryIndexResponse

Response message for RepositoryIndex.

message ModelIndex

Index entry for a model.

string name

The name of the model.

string version

The version of the model.

string state

The state of the model.

string reason

The reason, if any, that the model is in the given state.

ModelIndex models(repeated)

An index entry for each model.
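
A sketch listing the repository index; with ready set to true, only models currently servable would be returned:

    import grpc, grpc_service_pb2, grpc_service_pb2_grpc

    stub = grpc_service_pb2_grpc.InferenceServiceStub(
        grpc.insecure_channel("localhost:8001"))

    # Empty repository_name means the index covers all repositories.
    index = stub.RepositoryIndex(
        grpc_service_pb2.RepositoryIndexRequest(repository_name="", ready=False))
    for m in index.models:
        print(m.name, m.version, m.state, m.reason or "-")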

message RepositoryModelLoadRequest

Request message for RepositoryModelLoad.

string repository_name

The name of the repository to load from. If empty the model is loaded from any repository.

string model_name

The name of the model to load or reload.

message RepositoryModelLoadResponse

Response message for RepositoryModelLoad.

message RepositoryModelUnloadRequest

Request message for RepositoryModelUnload.

string repository_name

The name of the repository from which the model was originally loaded. If empty the repository is not considered.

string model_name

The name of the model to unload.

message RepositoryModelUnloadResponse

Response message for RepositoryModelUnload.
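
The load/unload pair is symmetric. A hedged sketch (model name hypothetical; leaving repository_name empty means "any repository" for load, as described above):

    import grpc, grpc_service_pb2, grpc_service_pb2_grpc

    stub = grpc_service_pb2_grpc.InferenceServiceStub(
        grpc.insecure_channel("localhost:8001"))

    # Load (or reload) the model, then unload it again. Both responses are
    # empty messages; failures surface as GRPC status errors (grpc.RpcError).
    stub.RepositoryModelLoad(
        grpc_service_pb2.RepositoryModelLoadRequest(model_name="my_model"))
    stub.RepositoryModelUnload(
        grpc_service_pb2.RepositoryModelUnloadRequest(model_name="my_model"))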

message SystemSharedMemoryStatusRequest

Request message for SystemSharedMemoryStatus.

string name

The name of the region to get status for. If empty the status is returned for all registered regions.

message SystemSharedMemoryStatusResponse

Response message for SystemSharedMemoryStatus.

message RegionStatus

Status for a shared memory region.

string name

The name for the shared memory region.

string shared_memory_key

The key of the underlying memory object that contains the shared memory region.

uint64 offset

Offset, in bytes, within the underlying memory object to the start of the shared memory region.

uint64 byte_size

Size of the shared memory region, in bytes.

map<string, RegionStatus> regions

Status for each of the registered regions, indexed by region name.

message SystemSharedMemoryRegisterRequest

Request message for SystemSharedMemoryRegister.

string name

The name of the region to register.

string shared_memory_key

The key of the underlying memory object that contains the shared memory region.

uint64 offset

Offset, in bytes, within the underlying memory object to the start of the shared memory region.

uint64 byte_size

Size of the shared memory region, in bytes.

message SystemSharedMemoryRegisterResponse

Response message for SystemSharedMemoryRegister.
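
A sketch of creating and registering a POSIX shared-memory region from Python. On Linux, multiprocessing.shared_memory exposes the region under /dev/shm, so the key passed to the server is the name with a leading slash; the region name and size here are hypothetical:

    from multiprocessing import shared_memory
    import grpc, grpc_service_pb2, grpc_service_pb2_grpc

    stub = grpc_service_pb2_grpc.InferenceServiceStub(
        grpc.insecure_channel("localhost:8001"))

    # Create a 4 KiB POSIX shared-memory object named "input_region".
    shm = shared_memory.SharedMemory(create=True, size=4096, name="input_region")

    stub.SystemSharedMemoryRegister(
        grpc_service_pb2.SystemSharedMemoryRegisterRequest(
            name="input_region",                # handle used in later requests
            shared_memory_key="/input_region",  # POSIX key: leading slash
            offset=0,
            byte_size=4096))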

message SystemSharedMemoryUnregisterRequest

Request message for SystemSharedMemoryUnregister.

string name

The name of the system region to unregister. If empty all system shared-memory regions are unregistered.

message SystemSharedMemoryUnregisterResponse

Response message for SystemSharedMemoryUnregister.

message CudaSharedMemoryStatusRequest

Request message for CudaSharedMemoryStatus.

string name

The name of the region to get status for. If empty the status is returned for all registered regions.

message CudaSharedMemoryStatusResponse

Response message for CudaSharedMemoryStatus.

message RegionStatus

Status for a shared memory region.

string name

The name for the shared memory region.

uint64 device_id

The GPU device ID where the cudaIPC handle was created.

uint64 byte_size

Size of the shared memory region, in bytes.

map<string, RegionStatus> regions

Status for each of the registered regions, indexed by region name.

message CudaSharedMemoryRegisterRequest

Request message for CudaSharedMemoryRegister.

string name

The name of the region to register.

bytes raw_handle

The raw serialized cudaIPC handle.

int64 device_id

The GPU device ID on which the cudaIPC handle was created.

uint64 byte_size

Size of the shared memory block, in bytes.

message CudaSharedMemoryRegisterResponse

Response message for CudaSharedMemoryRegister.
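
Registering a CUDA region follows the same pattern, except the region is identified by a serialized CUDA IPC memory handle produced in the process that owns the GPU allocation (for example via cudaIpcGetMemHandle). The handle bytes below are a placeholder this sketch cannot produce on its own:

    import grpc, grpc_service_pb2, grpc_service_pb2_grpc

    stub = grpc_service_pb2_grpc.InferenceServiceStub(
        grpc.insecure_channel("localhost:8001"))

    # 'serialized_handle' stands in for a cudaIpcMemHandle_t serialized to
    # bytes by the process that allocated the GPU buffer.
    serialized_handle = b"..."  # placeholder, obtained elsewhere

    stub.CudaSharedMemoryRegister(
        grpc_service_pb2.CudaSharedMemoryRegisterRequest(
            name="cuda_region",            # hypothetical region name
            raw_handle=serialized_handle,
            device_id=0,
            byte_size=1 << 20))            # hypothetical 1 MiB region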

message CudaSharedMemoryUnregisterRequest

Request message for CudaSharedMemoryUnregister.

string name

The name of the CUDA region to unregister. If empty all CUDA shared-memory regions are unregistered.

message CudaSharedMemoryUnregisterResponse

Response message for CudaSharedMemoryUnregister.